Should we change the way Detector, Coarsening, and geometry tables work?
Created by: RasmusOrsoe
One important part of the philosophy for Models in GraphNeT is that they are self-contained: any functionality that a specific model requires (standardization, transformation, and other auxiliary calculations) is available within the model, such that it is deployable as a single package that depends only on data, i.e.:
event -> Model -> prediction
Where the components that make up `Model` exist in the sequence [`Coarsening`, `Detector`, `GNN`, `Task`].
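To make the self-containment idea concrete, here is a minimal sketch of a model composed of those four stages. All classes below are illustrative stand-ins with toy behavior, not the actual GraphNeT implementations:

```python
# Hypothetical sketch: a self-contained model as a sequence of stages.
# The stage names mirror the text; their bodies are toy stand-ins.

class Coarsening:
    """Aggregates pulses into coarser nodes (identity stand-in)."""
    def __call__(self, x):
        return x

class Detector:
    """Applies detector-specific standardization (stand-in: scale by 0.1)."""
    def __call__(self, x):
        return [v * 0.1 for v in x]

class GNN:
    """Learns a latent representation (stand-in: sum of features)."""
    def __call__(self, x):
        return sum(x)

class Task:
    """Maps the latent representation to a prediction (identity stand-in)."""
    def __call__(self, x):
        return x

class Model:
    def __init__(self, stages):
        self.stages = stages

    def __call__(self, event):
        # event -> Model -> prediction, with no external dependencies
        for stage in self.stages:
            event = stage(event)
        return event

model = Model([Coarsening(), Detector(), GNN(), Task()])
prediction = model([10.0, 20.0, 30.0])  # -> 6.0
```

The point of the sketch is that everything needed to go from raw event to prediction lives inside `Model`, which is exactly what the design questions below put under pressure.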
The `Detector` class is supposed to represent all the detector-specific code - exchanging one `Detector` class for another is supposed to be the only difference between training a `Model` on IceCube and training it on any other experiment. Over the past year, the `Detector` class is the component that has received the least of our attention, and I've noticed a few issues with the current implementation:
- Way back when, we used `sklearn.preprocessing` for standardization of data, but had to let this go for reasons of efficiency and elegance. The need occasionally resurfaces, and the class still contains old lines of code dating back to this. It's a mess (that I made).
- The `Detector` contains hard-coded standardization that is perhaps not ideal for all the data the class represents, sparking the need to create multiple versions of the same detector.
- Each `Detector` class expects a specific set of input features; if they differ in name or number, the code fails. This again sparks the need to create multiple copies of each detector, because if one wants to train on only a subset of the features associated with the detector, the assertion fails.
- Because `Coarsening` can change both the meaning of the input variables and their number, the current placement of `Coarsening` will provoke the errors mentioned above.
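One possible shape for a more flexible feature check is sketched below: rather than asserting an exact feature list, the detector standardizes whatever subset of its known features is present and only rejects truly unknown names. The table contents and function names are hypothetical, not GraphNeT API:

```python
# Hypothetical sketch of a flexible Detector feature check.
# Scaling constants below are made up for illustration.

KNOWN_FEATURES = {
    "dom_x": lambda v: v / 500.0,
    "dom_y": lambda v: v / 500.0,
    "dom_z": lambda v: v / 500.0,
    "charge": lambda v: v,  # left untouched
}

def standardize(event: dict) -> dict:
    """Standardize any subset of the detector's known features."""
    unknown = set(event) - set(KNOWN_FEATURES)
    if unknown:
        raise ValueError(f"Unknown features: {sorted(unknown)}")
    # Only features actually present are transformed, so training on a
    # subset of features no longer requires a separate Detector class.
    return {name: KNOWN_FEATURES[name](value) for name, value in event.items()}

print(standardize({"dom_x": 250.0, "charge": 1.5}))  # {'dom_x': 0.5, 'charge': 1.5}
```

This keeps the "checks and balances" (unknown names still fail loudly) while removing the need for one detector class per feature subset.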
None of the problems above are unsolvable; the checks and balances on input dimensions and feature names exist for a good reason, and we could just redesign the implementation to be more flexible, change the order of `Coarsening` and `Detector`, and those of us requiring external libraries for standardization could just eat the efficiency loss. However, the path ahead becomes more opaque when taking new functionality into account: `Coarsening` & geometry tables (#421). My observations are:
Coarsening
- Our current implementation of `Coarsening` is not easily understood: it has to take in batched data and aggregate it correctly in vectorized form, all of which makes it naturally hard to follow. I fear that it might become difficult to extend much further if future needs arise. But it works and gives self-contained models.
- If one moved the `Coarsening` functionality into `Dataset`, the `Coarsening` would only need to work on single events, potentially making the coarsening much simpler and more flexible. In some cases, it might even be faster (I think). However, since the `Dataset` class is external to `Model`, this would break self-containment and create new issues downstream. Unless, of course, we moved dataloading into `Model` itself ([`DataLoading`, `Coarsening`, `Detector`, `GNN`, `Task`]).
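To illustrate why per-event coarsening is so much simpler than the batched, vectorized version: per event, a group-by aggregation is a few readable lines. This is a toy sketch with made-up data, not the GraphNeT implementation:

```python
# Hypothetical per-event coarsening: average pulse features within each
# group (e.g. all pulses on the same optical module). No batch bookkeeping,
# no vectorized scatter operations - just a plain group-by mean.
from collections import defaultdict

def coarsen_event(pulses, groups):
    """Return one averaged value per group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(pulses, groups):
        sums[group] += value
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Four pulses on two sensors -> two coarsened nodes.
print(coarsen_event([1.0, 3.0, 10.0, 20.0], [0, 0, 1, 1]))  # {0: 2.0, 1: 15.0}
```

The batched equivalent must additionally track which event each pulse belongs to and aggregate across the whole batch at once, which is where the hard-to-follow vectorized code comes from.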
Geometry Tables
- We need geometry tables to add inactive sensors to graphs, which is interesting for several reasons. However, in #421 we're about to add this functionality to `Dataset`, even though it's inherently `Detector`-specific. We add it there because it's per-event data, making the implementation simpler, but it also breaks self-containment and creates issues downstream.
- Our data in its raw form (i3 files and the Kaggle dataset) has geometry and pulse information stored separately for storage reasons; we add these together on a per-event level, for convenience and training efficiency, when we convert the data to "our" format.
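As a rough illustration of the per-event geometry lookup: pulses carry only the sensors that fired, and the geometry table pads the event with the inactive ones. The table values and function name below are invented for the sketch:

```python
# Hypothetical geometry-table lookup. The table maps sensor id -> (x, y, z);
# the coordinates here are made up.
GEOMETRY = {
    0: (0.0, 0.0, 0.0),
    1: (0.0, 0.0, 10.0),
    2: (0.0, 0.0, 20.0),
}

def add_inactive_sensors(active_hits):
    """Return one node per sensor in the detector; inactive sensors get charge 0."""
    nodes = []
    for sensor_id, (x, y, z) in sorted(GEOMETRY.items()):
        charge = active_hits.get(sensor_id, 0.0)
        nodes.append((sensor_id, x, y, z, charge))
    return nodes

event = {1: 2.5}  # only sensor 1 saw light
nodes = add_inactive_sensors(event)
print(len(nodes))  # 3 nodes: sensors 0 and 2 are padded with zero charge
```

Because the lookup is per-event, it is trivial in `Dataset`; doing the same thing batched inside `Detector` means padding variable numbers of sensors per event, which is the cost of keeping it self-contained.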
Our options, as I see them
- Make `Detector` flexible, implement geometry tables in `Dataset` and implement downstream fixes, put `Coarsening` after `Detector` but keep it as-is, and accept that `Detector` is now not the only place for detector-specific code and that `Coarsening` is just a bit hard to read/extend, OR
- Make `Detector` flexible, implement vectorized & batched geometry tables in `Detector`, put `Coarsening` after `Detector` but keep it as-is, and accept that `Coarsening` and geometry tables are just a bit hard to read/extend, OR
- Consider this a sign that we actually need a place inside `Model` where we can do per-event operations efficiently, perhaps by moving dataloading into `Model` itself ([`DataLoading`, `Coarsening`, `Detector`, `GNN`, `Task`]).
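The third option can be sketched as follows: the model owns a per-event stage where simple operations (coarsening, geometry lookup) run on single events before batching. All names and logic here are hypothetical, purely to show the shape of the idea:

```python
# Hypothetical sketch of option 3: dataloading lives inside the model,
# so per-event operations stay simple while the model remains self-contained.

def per_event_transforms(event):
    # Per-event code is plain and readable - no vectorized batch bookkeeping.
    # Stand-in transform: sort the event's features.
    return sorted(event)

class SelfContainedModel:
    def __init__(self, transforms):
        self.transforms = transforms  # the "DataLoading" stage

    def iter_batches(self, events, batch_size=2):
        """Apply per-event transforms, then batch for the network stages."""
        batch = []
        for event in events:
            batch.append(self.transforms(event))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

model = SelfContainedModel(per_event_transforms)
batches = list(model.iter_batches([[3, 1], [2, 0], [5, 4]]))
print(batches)  # [[[1, 3], [0, 2]], [[4, 5]]]
```

The trade-off the sketch makes visible: per-event stages become trivial to write and extend, at the cost of the model absorbing responsibilities (loading, batching) that currently live in `Dataset`.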