Requesting 2 GPUs in Lyon returns an error
I'm not very experienced with GPU training and computing in Lyon, but when I try to run an Orcanet training requiring two GPUs (-> sbatch -p gpu --gres=gpu:v100:2), or two parallel jobs requiring one GPU each, I get the following error:
2022-08-02 14:04:17.820335: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain\
: CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered
/sps/km3net/users/ffilippi/GNNs/training_scripts/run_train_time_window.sh: line 3: 226045 Aborted
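In case it helps to narrow this down: an uncorrectable ECC error may point to a problem with one specific GPU rather than with the training code, so it could be useful to record which node and devices the job was given. Below is a minimal sketch of such a check, not part of Orcanet or of my training script. It assumes the standard Slurm variables `SLURM_JOB_ID`, `SLURMD_NODENAME` and `CUDA_VISIBLE_DEVICES` are set inside the job and that `nvidia-smi` is available on the GPU node. The function names `describe_allocation` and `ecc_report` are made up for illustration.

```python
import os
import shutil
import subprocess


def describe_allocation(env=None):
    """Collect the Slurm/CUDA variables that identify where the job ran."""
    env = os.environ if env is None else env
    keys = ("SLURM_JOB_ID", "SLURMD_NODENAME", "CUDA_VISIBLE_DEVICES")
    # Missing variables are reported explicitly instead of raising KeyError.
    return {key: env.get(key, "<unset>") for key in keys}


def ecc_report():
    """Return the output of `nvidia-smi -q -d ECC`, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "ECC"],
        capture_output=True,
        text=True,
        check=False,  # do not raise if nvidia-smi itself reports a failure
    )
    return result.stdout


if __name__ == "__main__":
    for key, value in describe_allocation().items():
        print(f"{key}={value}")
    print(ecc_report() or "nvidia-smi not found on this node")
```

Running this at the start of the batch script, before the training command, would put the node name and the ECC counters in the job log, which could then be passed on to the computing centre support.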
Thanks a lot
Edited by Francesco Filippini