Troubleshooting on NeSI
The NeSI platform occasionally experiences stability issues related to the filesystems, Slurm, networking and Globus. RJM attempts to handle these issues by retrying failed commands, but it is not always successful.
If RJM isn't working well, first try running the rjm_health_check program (-ll debug will print additional output that can be useful for debugging):
rjm_health_check -ll debug
If this command fails, a good first step is to try resetting your funcX endpoint on NeSI, which can sometimes get into a bad state, particularly if there was a network issue on NeSI or one of the login nodes went down:
rjm_restart -ll debug
After running this command, try the rjm_health_check program again. If it still doesn't work, there is likely to be a bigger issue. In that case, please contact NeSI support with the error message and mention that you are using Globus, funcX and the RemoteJobManager tool.
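The sequence above (health check, restart, health check again) could be wrapped in a small shell function. This is only a sketch: the function name rjm_troubleshoot and the HEALTH_CMD and RESTART_CMD variables are hypothetical, and it assumes that rjm_health_check and rjm_restart exit with a non-zero status on failure, which is not confirmed here.

```shell
#!/bin/sh
# Hypothetical wrapper around the troubleshooting steps described above.
# ASSUMPTION: both RJM commands exit non-zero when they fail.
HEALTH_CMD="${HEALTH_CMD:-rjm_health_check}"
RESTART_CMD="${RESTART_CMD:-rjm_restart}"

rjm_troubleshoot() {
    # Step 1: run the health check with debug logging.
    if $HEALTH_CMD -ll debug; then
        echo "health check passed"
        return 0
    fi

    # Step 2: the check failed, so try resetting the funcX endpoint.
    echo "health check failed; restarting the funcX endpoint"
    $RESTART_CMD -ll debug || echo "restart reported an error" >&2

    # Step 3: run the health check again after the restart.
    if $HEALTH_CMD -ll debug; then
        echo "health check passed after restart"
        return 0
    fi

    echo "still failing; contact NeSI support with the error message" >&2
    return 1
}
```

After sourcing the file, call rjm_troubleshoot and keep the debug output so it can be included in a support request if the final check still fails.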