Use unsupervised learning to learn representations in data and Monte Carlo

Created by: asogaard

Suggested steps:

  • Define unsupervised learning tasks, i.e., learning tasks that don't require truth-level labels but instead rely solely on reconstruction-level data. This is the same principle used for (pre-)training large language models. For instance:
    • Masking and reconstructing the spatial location of a subset of pulses in each event
    • Masking and reconstructing summary statistics (sum of charge, timing of first hit, etc.) for a subset of DOMs in each event.
    • Etc.
  • Pre-train models on such unsupervised learning tasks. This can be done on Monte Carlo data or, crucially, on recorded data.
  • Fine-tune such pre-trained models on supervised learning tasks.
    • Optionally, fine-tune only a subset of the model's parameters.
    • This should make the supervised tasks much easier and quicker to train, and if the model is pre-trained in an unsupervised manner on recorded data, the fine-tuned model should be much less susceptible to data/Monte Carlo discrepancies than a model trained in a fully supervised manner on Monte Carlo data.
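The first two steps (masking pulses and pre-training on the reconstruction task) could be sketched as below. This is a minimal PyTorch illustration, assuming a toy per-pulse model and dummy pulse features (x, y, z, time, charge); the actual architecture and data pipeline are not specified here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a real model: maps corrupted per-pulse features
# (x, y, z, time, charge) to reconstructed spatial coordinates (x, y, z).
model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 3))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

pulses = torch.randn(128, 5)      # dummy event with 128 pulses
mask = torch.rand(128) < 0.15     # mask ~15% of the pulses
corrupted = pulses.clone()
corrupted[mask, :3] = 0.0         # hide the spatial location of masked pulses

# No truth-level labels are needed: the reconstruction targets are the
# original coordinates themselves, so this same loop runs unchanged on
# Monte Carlo or on recorded data.
for _ in range(5):
    optimiser.zero_grad()
    pred = model(corrupted[mask])
    loss = nn.functional.mse_loss(pred, pulses[mask, :3])
    loss.backward()
    optimiser.step()
```

The other suggested tasks (masking per-DOM summary statistics such as summed charge or first-hit timing) would follow the same pattern with different masked features and targets.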
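The fine-tuning step, updating only a subset of the model's parameters, could look like the following sketch: the pre-trained encoder is frozen and only a freshly added task head is trained. The layer sizes and the energy-regression head are hypothetical, chosen for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical pre-trained encoder plus a freshly added task head.
encoder = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 32))
head = nn.Linear(32, 1)  # e.g. an energy-regression head (illustrative)
model = nn.Sequential(encoder, head)

# Fine-tune only a subset of the parameters: freeze the encoder.
for p in encoder.parameters():
    p.requires_grad_(False)

# Give the optimiser only the parameters that remain trainable.
optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

# One illustrative supervised fine-tuning step on dummy data.
x, y = torch.randn(16, 5), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimiser.step()

n_trainable = sum(p.requires_grad for p in model.parameters())
n_frozen = sum(not p.requires_grad for p in model.parameters())
```

Freezing the encoder keeps the representation learned on recorded data intact during supervised fine-tuning, which is what should limit sensitivity to data/Monte Carlo discrepancies.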