Use unsupervised learning to learn representations in data and Monte Carlo
Created by: asogaard
Suggested steps:
- Define unsupervised learning tasks, i.e., learning tasks that don't require truth-level labels but instead rely solely on the reconstruction-level data. This is the same principle used for (pre-)training large language models. For instance (see the masked-reconstruction sketch after this list):
  - Masking and reconstructing the spatial location of a subset of pulses in each event.
  - Masking and reconstructing summary statistics (sum of charge, timing of first hit, etc.) for a subset of DOMs in each event.
  - Etc.
- Pre-train models on such unsupervised learning tasks. This can be done on Monte Carlo data or, crucially, on recorded data (see the pre-training loop sketched after this list).
- Fine-tune such pre-trained models on supervised learning tasks (see the fine-tuning sketch after this list).
  - Optionally, fine-tune only a subset of the model's parameters.
  - This should make the supervised tasks much easier and quicker, and if the model is pre-trained in an unsupervised manner on recorded data, the fine-tuned model should be much less susceptible to data/Monte Carlo discrepancies than a model trained in a fully supervised manner on Monte Carlo data.
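
As an illustration of the first step, here is a minimal sketch (in PyTorch) of a masked-reconstruction task over pulses. The event representation as dense tensors of pulses with features (x, y, z, t, charge), the `MaskedPulseReconstruction` class, and the choice of reconstructing only the spatial location are illustrative assumptions, not an existing GraphNeT API:

```python
import torch
import torch.nn as nn


class MaskedPulseReconstruction(nn.Module):
    """Mask a random subset of pulses and reconstruct their spatial location.

    Assumes events are given as dense tensors of shape (batch, n_pulses, 5)
    with per-pulse features (x, y, z, t, charge). Hypothetical sketch.
    """

    def __init__(self, encoder: nn.Module, hidden_dim: int, mask_fraction: float = 0.15):
        super().__init__()
        self.encoder = encoder  # any per-pulse encoder, e.g. a transformer
        self.head = nn.Linear(hidden_dim, 3)  # predicts the masked (x, y, z)
        self.mask_token = nn.Parameter(torch.zeros(5))  # learned replacement for masked pulses
        self.mask_fraction = mask_fraction

    def forward(self, pulses: torch.Tensor) -> torch.Tensor:
        # Randomly select pulses to mask; the targets come from the data
        # itself, so no truth-level labels are needed.
        mask = torch.rand(pulses.shape[:2], device=pulses.device) < self.mask_fraction
        target_xyz = pulses[..., :3]
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token, pulses)
        latent = self.encoder(corrupted)  # (batch, n_pulses, hidden_dim)
        pred_xyz = self.head(latent)
        # Compute the reconstruction loss only on the masked positions.
        return nn.functional.mse_loss(pred_xyz[mask], target_xyz[mask])
```

The DOM-level variant would be analogous: mask per-DOM summary statistics instead of per-pulse features and regress those.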
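
The corresponding pre-training loop needs no labels, so it runs unchanged on Monte Carlo or recorded data. The `pretrain` helper below is a hypothetical sketch assuming a `DataLoader` that yields unlabelled pulse tensors, reusing the task module from the sketch above:

```python
import torch
from torch.utils.data import DataLoader


def pretrain(task: torch.nn.Module, loader: DataLoader, epochs: int = 10) -> torch.nn.Module:
    """Optimise a masked-reconstruction task on unlabelled events."""
    optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
    for _ in range(epochs):
        for pulses in loader:  # unlabelled batches of shape (batch, n_pulses, 5)
            loss = task(pulses)  # e.g. MaskedPulseReconstruction from above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return task.encoder  # keep the pre-trained encoder for fine-tuning
```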
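
Finally, a minimal fine-tuning sketch: freeze the pre-trained encoder and train only a small supervised head on labelled Monte Carlo. `SupervisedModel`, the mean-pooling over pulses, and the frozen/trainable split are assumptions for illustration; in practice one might instead unfreeze some encoder layers:

```python
import torch
import torch.nn as nn


class SupervisedModel(nn.Module):
    """Pre-trained encoder plus a small task-specific head (hypothetical)."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, n_outputs: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # fine-tune only a subset of parameters
            p.requires_grad = False
        self.head = nn.Linear(hidden_dim, n_outputs)

    def forward(self, pulses: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(pulses)  # (batch, n_pulses, hidden_dim)
        pooled = latent.mean(dim=1)  # simple event-level summary
        return self.head(pooled)


# Usage, assuming `encoder = pretrain(task, loader)` from the sketch above:
# model = SupervisedModel(encoder, hidden_dim=128, n_outputs=1)
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-3
# )
```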