
Multi-classification wants to run on multiple devices (GPU and CPU).

Created by: MortenHolmRep

Describe the bug In an attempt to implement multi-classification #111 (closed), I encountered an issue with calculations happening on both the GPU and the CPU.

To Reproduce Steps to reproduce the behavior:

  1. Modify ../example/train_model.py with a new MulticlassificationTask and MultiClassificationCrossEntropyLoss (see the sketch after this list).
  2. Implement the task:
task = MulticlassificationTask(
        hidden_size=gnn.nb_outputs,
        target_labels=config["target"],
        loss_function=MultiClassificationCrossEntropyLoss(),
    )
  3. Implement the output:
results = get_predictions(
        trainer,
        model,
        validation_dataloader,
        [config["target"] + "_noise_pred", config["target"] + "_muon_pred", config["target"] + "_neutrino_pred"],
        additional_attributes=[config["target"], "event_no"],
    )
  4. Run the code with accelerator: "gpu", devices: [0], and 10 workers.
  5. See error.
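
For reference, the traceback below shows that the loss is evaluated through graphnet's LossFunction.forward, which delegates to a _forward(prediction, target) hook that ends in torch.nn.functional.cross_entropy. A minimal sketch of what such a multi-class cross-entropy _forward could look like, assuming the target column holds integer class indices and using a hypothetical class name (this is not the actual #111 implementation):

import torch
from torch import Tensor
from torch.nn.functional import cross_entropy

from graphnet.training.loss_functions import LossFunction  # base class seen in the traceback


class MultiClassificationCrossEntropyLossSketch(LossFunction):
    """Hypothetical multi-class cross-entropy loss; a sketch, not the #111 code."""

    def _forward(self, prediction: Tensor, target: Tensor) -> Tensor:
        # prediction: [N, C] unnormalised logits, produced on the training device (cuda:0 here).
        # target: [N] integer class indices; cross_entropy requires it to be on the same
        # device as prediction, otherwise it raises the RuntimeError shown in the traceback.
        return cross_entropy(
            prediction,
            target.long().to(prediction.device),
            reduction="none",  # per-event losses; LossFunction.forward handles weighting/reduction
        )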

Expected behavior I expected the computation to run only on the GPU, not partly on the CPU.

Full traceback

Traceback (most recent call last):
  File "/lustre/hpc/icecube/qgf305/workspace/analyses/multi_classification_on_stop_and_track_muons/modelling/train_classification_model.py", line 253, in <module>
    main()
  File "/lustre/hpc/icecube/qgf305/workspace/analyses/multi_classification_on_stop_and_track_muons/modelling/train_classification_model.py", line 234, in main
    trainer.fit(model, training_dataloader, validation_dataloader)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _run_train
    self._run_sanity_check()
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_sanity_check
    val_loop.run()
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 370, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 112, in validation_step
    loss = self.shared_step(val_batch, batch_idx)
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 96, in shared_step
    loss = self.compute_loss(preds, batch)
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 125, in compute_loss
    losses = [
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/standard_model.py", line 126, in <listcomp>
    task.compute_loss(pred, data)
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/models/task/task.py", line 134, in compute_loss
    self._loss_function(pred, target, weights=weights)
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/training/loss_functions.py", line 54, in forward
    elements = self._forward(prediction, target)
  File "/lustre/hpc/icecube/qgf305/workspace/graphnet/src/graphnet/training/loss_functions.py", line 124, in _forward
    return cross_entropy(
  File "/groups/icecube/qgf305/anaconda3/envs/graphnet/lib/python3.8/site-packages/torch/nn/functional.py", line 2996, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
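
The RuntimeError indicates that the logits handed to cross_entropy live on cuda:0 while the target tensor extracted from the batch is still on the CPU. A minimal, self-contained reproduction of the same failure mode, independent of graphnet (tensor names are illustrative):

import torch
from torch.nn.functional import cross_entropy

if torch.cuda.is_available():
    logits = torch.randn(4, 3, device="cuda:0")  # predictions computed on the GPU
    target = torch.tensor([0, 2, 1, 1])          # class indices left on the CPU

    try:
        cross_entropy(logits, target)  # RuntimeError: Expected all tensors to be on the same device ...
    except RuntimeError as err:
        print(err)

    # Moving the target onto the prediction's device avoids the mismatch:
    print(cross_entropy(logits, target.to(logits.device)).item())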
