Multiclassification wants to run on multiple devices (GPU and CPU).
Created by: MortenHolmRep
Describe the bug
In an attempt to implement multi-classification #111 (closed), I encountered an issue with calculations happening on both the GPU and the CPU.
To Reproduce
Steps to reproduce the behavior:
- Modification in `../example/train_model.py` with a new `MulticlassificationTask` and `MultiClassificationCrossEntropyLoss`
- Implementing the task:
```python
task = MulticlassificationTask(
    hidden_size=gnn.nb_outputs,
    target_labels=config["target"],
    loss_function=MultiClassificationCrossEntropyLoss(),
)
```
- Implementing the output (see also the post-processing sketch after this list):
```python
results = get_predictions(
    trainer,
    model,
    validation_dataloader,
    [
        config["target"] + "_noise_pred",
        config["target"] + "_muon_pred",
        config["target"] + "_neutrino_pred",
    ],
    additional_attributes=[config["target"], "event_no"],
)
```
- Running the code with `accelerator: "gpu"`, `devices: [0]`, and 10 workers (see the Trainer sketch after this list)
- See error
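For completeness, the post-processing sketch referenced above shows one way the three prediction columns could be collapsed into a single predicted class. It assumes `results` comes back as a pandas DataFrame with one column per prediction label and that those columns hold unnormalised logits; both are assumptions about my setup, not something graphnet guarantees:

```python
import numpy as np

# Hypothetical post-processing of the `results` object returned by
# get_predictions above.
pred_columns = [
    config["target"] + "_noise_pred",
    config["target"] + "_muon_pred",
    config["target"] + "_neutrino_pred",
]

logits = results[pred_columns].to_numpy()

# Row-wise softmax over the three class columns, then pick the most
# probable class per event.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Index follows the order of pred_columns: 0 = noise, 1 = muon, 2 = neutrino.
results["predicted_class"] = probs.argmax(axis=1)
```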
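And a rough Trainer sketch of what the `accelerator`/`devices` settings correspond to in PyTorch Lightning; the real `Trainer` is built inside the training script and the dataloaders there are created with 10 workers, so this is only to make the configuration explicit:

```python
from pytorch_lightning import Trainer

# Equivalent of accelerator: "gpu" and devices: [0] from the config.
# `model`, `training_dataloader` and `validation_dataloader` are the objects
# already built in the training script (see the traceback below).
trainer = Trainer(
    accelerator="gpu",
    devices=[0],
)
trainer.fit(model, training_dataloader, validation_dataloader)
```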
Expected behavior
I expected the computation to run only on the GPU, not on the CPU.
Full traceback
```
Traceback (most recent call last):
File "/lustre/hpc/icecube/qgf305/workspace/analyses/multi_classification_on_stop_and_track_muons/modelling/train_classification_model.py", line 253, in <module>
main()
File "/lustre/hpc/icecube/qgf305/workspace/analyses/multi_classification_on_stop_and_track_muons/modelling/train_classification_model.py", line 234, in main
trainer.fit(model, training_dataloader, validation_dataloader)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
self._call_and_handle_interrupt(
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
results = self._run_stage()
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
return self._run_train()
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _run_train
self._run_sanity_check()
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_sanity_check
val_loop.run()
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.advance(*args, **kwargs)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
output = self._evaluation_step(**kwargs)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 370, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 112, in validation_step
loss = self.shared_step(val_batch, batch_idx)
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 96, in shared_step
loss = self.compute_loss(preds, batch)
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 125, in compute_loss
losses = [
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 126, in <listcomp>
task.compute_loss(pred, data)
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/task/task.py", line 134, in compute_loss
self._loss_function(pred, target, weights=weights)
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/training/loss_functions.py", line 54, in forward
elements = self._forward(prediction, target)
File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/training/loss_functions.py", line 124, in _forward
return cross_entropy(
File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/torch/nn/functional.py", line 2996, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
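The error only says that cuda:0 and cpu meet inside `cross_entropy`; it does not say which tensor is on the CPU. A quick (temporary, hypothetical) check is to print the devices inside `_forward` in `graphnet/training/loss_functions.py`, just before the `cross_entropy` call:

```python
# Temporary debugging aid inside MultiClassificationCrossEntropyLoss._forward
# (remove again afterwards): shows which of the two tensors is still on the CPU.
print("prediction:", prediction.device, "target:", target.device)
```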
Additional context
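My working assumption is that one of the tensors reaching `cross_entropy` (most likely the target, or a class-index tensor built inside the loss) is never moved to the GPU. Below is a minimal, defensive sketch of what a device-safe loss could look like, assuming it subclasses graphnet's `LossFunction` as the `forward`/`_forward` pattern in the traceback suggests; the body is my illustration, not the current implementation:

```python
from torch import Tensor
from torch.nn.functional import cross_entropy

from graphnet.training.loss_functions import LossFunction


class MultiClassificationCrossEntropyLoss(LossFunction):
    """Sketch of a device-safe multi-class cross-entropy loss."""

    def _forward(self, prediction: Tensor, target: Tensor) -> Tensor:
        # `prediction` is assumed to be (N, num_classes) logits and `target`
        # to hold integer class indices. Moving the target onto the same
        # device as the prediction (and doing the same for any lookup tensor
        # created here) avoids mixing cuda:0 and cpu tensors.
        target = target.to(prediction.device).long()
        # reduction="none" returns one loss element per event, matching the
        # `elements = self._forward(prediction, target)` call in the traceback.
        return cross_entropy(prediction, target, reduction="none")
```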