Troubleshooting on NeSI

The NeSI platform occasionally experiences stability issues related to the filesystems, Slurm, networking and Globus. RJM attempts to handle these issues by retrying commands that have failed, but it is not always successful.

If RJM isn't working well, first try running the rjm_health_check program (the -ll debug option prints additional output that can be useful for debugging):

rjm_health_check -ll debug

If this command fails, a good next step is to restart your funcX endpoint on NeSI. The endpoint can sometimes get into a bad state, particularly if there was a network issue on NeSI or one of the login nodes went down:

rjm_restart -ll debug

After running this command, run the rjm_health_check program again. If it still fails, there is likely to be a bigger issue. In that case, please contact NeSI support with the error message and mention that you are using Globus, funcX and the RemoteJobManager tool.
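
The check, restart and re-check sequence described above could be wrapped in a small shell function. This is only a sketch: the function name rjm_recover is hypothetical and not part of RJM, and it assumes that rjm_health_check exits with a non-zero status when the check fails, which should be confirmed before relying on it.

```shell
# Sketch only. rjm_recover is a hypothetical helper, not part of RJM.
# Assumption: rjm_health_check exits with a non-zero status on failure.
rjm_recover() {
    # First attempt: run the health check with debug logging.
    if rjm_health_check -ll debug; then
        echo "RJM health check passed"
        return 0
    fi

    # The check failed, so restart the funcX endpoint and check again.
    echo "RJM health check failed; restarting the funcX endpoint" >&2
    rjm_restart -ll debug

    if rjm_health_check -ll debug; then
        echo "RJM health check passed after restart"
        return 0
    fi

    # Still failing: this is the point at which to contact NeSI support.
    echo "RJM health check still failing; contact NeSI support with the error message" >&2
    return 1
}
```

After defining the function, calling rjm_recover runs the whole sequence. A return status of 1 means the health check failed both before and after the restart.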

Common errors