Coarsening
Created by: Aske-Rosted
An attempted implementation for solving #431.
Creates a new class of generators, which are called after the extractors in the graphnet/src/graphnet/data/dataconverter.py script.
The idea is that this class can in the future be used more generally if one wishes to create a new feature variable, pulsemap, etc. when reading from the I3 files.
Also introduces the CoarsePulseGenerator as the first example of a generator. Two of the functions called in CoarsePulseGenerator have been left out of the class, as they could in principle be moved to graphnet/src/graphnet/models/coarsening.py, though I did not quite feel that was appropriate either.
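For illustration, below is a minimal sketch of the kind of generator interface described above. The class names follow the wording of this PR, but the signatures, the per-event dictionary format, and the KMeans-based coarsening details are assumptions made for the example, not the actual implementation.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import numpy as np
from sklearn.cluster import KMeans


class Generator(ABC):
    """Hypothetical base class: derives new pulsemaps/features after extraction."""

    def __init__(self, output_name: str) -> None:
        self._output_name = output_name

    @abstractmethod
    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """Add generated entries to the per-event dictionary returned by the extractors."""


class CoarsePulseGenerator(Generator):
    """Illustrative coarsening: cluster pulses in (x, y, z, t) and average each cluster."""

    def __init__(self, pulsemap: str, output_name: str, n_clusters: int = 128) -> None:
        super().__init__(output_name)
        self._pulsemap = pulsemap
        self._n_clusters = n_clusters

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pulses = np.asarray(data[self._pulsemap])  # shape: (n_pulses, n_features)
        k = min(self._n_clusters, len(pulses))
        labels = KMeans(n_clusters=k).fit_predict(pulses[:, :4])
        coarse = np.stack([pulses[labels == i].mean(axis=0) for i in range(k)])
        data[self._output_name] = coarse.tolist()
        return data
```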
Activity
added feature label
assigned to @jprado
Created by: asogaard
Hi @Aske-Rosted,
Thanks for this PR! I think I understand, and empathise with, your motivation, and will probably need a bit of time to digest the proposal, but here are some initial considerations: (Ideally these would have been discussed as part of the motivating issue, here #431, but I think we can keep it here in the PR.)
(I'm just playing back my understanding of the problem here.)
Feature engineering can be done either:
1. when we read data from I3-files and save it in an intermediate format, in `DataConverter`,
2. when we read data from these intermediate files and prepare PyG graphs for processing, in `Dataset`, or
3. in the `Model` itself.
This PR suggests moving (some of) the coarsening, which is otherwise done in the `Coarsening` class as part of (3.) above, into a pre-processing step done as part of (1.). As I understand it, this is motivated by performance, since
- loading events with a large number of pulses using the `Dataset`/`DataLoader` classes is prohibitively slow, due to their vast size if un-preprocessed, and/or
- performing the clustering itself, if done within the `Model`, is prohibitively slow.
Either of these would mean that training runs would be dominated by I/O and/or data transformations, which in turn would starve the GPU and result in very long training times.
Is the above a decent characterisation of the problem you're having, @Aske-Rosted, and the logic behind the proposed solution?
If so, have you checked whether your main bottleneck is problem (1.) or (2.) above, i.e., is it the data loading itself, or the clustering? What processing times per event or batch do you get for the data you're working on? How does that compare to a low-energy sample?
- If it's the data loading, how is it affected by the number of workers? Are there any inefficiencies in the way we're currently querying data that we could mitigate to solve your problem?
- If it's the clustering, could we then speed up the coarsening by using GPU-enabled algorithms instead of, e.g., `sklearn.cluster.KMeans`, which I would expect could ruin your GPU utilisation?
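As a rough illustration of that second point, here is a minimal sketch of what a GPU-resident k-means coarsening could look like, expressed as plain torch tensor operations rather than through the actual graphnet `Coarsening` API. The function name, iteration count, and the choice to cluster on the first four features are assumptions for the example.

```python
import torch


def kmeans_torch(x: torch.Tensor, k: int, n_iter: int = 10) -> torch.Tensor:
    """Minimal Lloyd's algorithm running on whatever device x lives on."""
    # Initialise centroids from k randomly chosen pulses.
    centroids = x[torch.randperm(len(x), device=x.device)[:k]].clone()
    for _ in range(n_iter):
        # Assign each pulse to its nearest centroid.
        labels = torch.cdist(x, centroids).argmin(dim=1)
        # Recompute each centroid as the mean of its assigned pulses.
        for i in range(k):
            mask = labels == i
            if mask.any():
                centroids[i] = x[mask].mean(dim=0)
    return labels


# Usage: coarsen an (n_pulses, n_features) tensor already on the GPU,
# e.g. inside a Coarsening subclass, clustering on (x, y, z, t) only.
# labels = kmeans_torch(pulses[:, :4], k=min(128, len(pulses)))
```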
I'm asking because, all else being equal, I think it is beneficial to do as many transformations as possible within the `Model` itself. This allows us to have a universal definition of what an "event" looks like (i.e., how many pulses does it have? what are the properties of these? etc.), regardless of what any downstream GNN then does with that information. That is the idea (naïve, perhaps) behind feature extractors like `I3FeatureExtractorIceCube86`: all IceCube-86 events ought to have the same format, and allow for analysis using the same set of GNN models, provided the right interface (in this case, the accompanying IceCube86 `Detector` class). This, in turn, has the benefits that:
- we can easily swap out one model for another, and compare performance, because they will all assume the same things about the input event, even if they implement different feature engineering, coarsening methods, etc. as part of their internal logic;
- we can easily deploy models in IceTray, because all we need are some general/universal `I3Extractors` that simply pull out the relevant data, appropriate to each detector, from the `I3Frames`, and a model that "eats" this data raw and applies an entirely self-contained logic to it; and
- we don't need to re-run I3-conversion to change the feature engineering/coarsening logic, as this is done on-the-fly.

The more data processing we move out of the model, the less modular the whole ecosystem becomes: we now need to keep track of not just what data was extracted, but also whether that data was pre-processed using `Generators`, and remember which models used which types of data, even for the same events/datasets.

I think the above points are very appealing from a "software design" perspective. But I understand that you need to be able to train your models before the heat death of the Universe. That is why I'm curious as to whether we can find solutions to your performance problems without adding more bells and whistles than we need to.
unassigned @jprado
Created by: Aske-Rosted
I very much understand the reasoning behind doing the coarsening in the model itself and thereby not having to keep track of changes made to the data beforehand. However, I think the answer to
- loading events with a large number of pulses using the Dataset/DataLoader classes is prohibitively slow, due to their vast size if un-preprocessed, and/or
- performing the clustering itself, if done within the Model, is prohibitively slow.
unfortunately is both. The clustering takes on the order of one to a couple of days for the full dataset, and as far as I understand it, it would have to run once for each epoch. However, I say "I think" because I cannot really check the loading time of the full SRTInIcePulses due to the third issue:
- some of the events are so large that even a single event will not fit on the GPU Memory (or at least 2 events won't fit see #441 (closed)).
I do not personally see a fix to this other than
- Coarsening in some step prior to the training process.
- Doing the coarsening on CPU prior to throwing it to the GPU, which I expect to be too slow from prior experience, but admittedly I have not actually checked. Counting only the time spent in the coarsening script, 133 seconds were spent on 7,384 events. This means that, without any parallelization, the naively expected increase in runtime can be roughly estimated as 500,000 / 10,789 ≈ 46, and 46 × 133 s ≈ 101 minutes. Assuming that doubling the number of CPUs halves the time spent, with the 20 CPUs available to me this means it might only increase the time of one epoch by about 5 minutes. Since this is not unreasonable, I will try to code up the "during runtime coarsening" and see what the actual increase in time is (a rough sketch of this option is included after this list). (However, with my current setup, increasing the number of CPUs only increases runtime, so I do not expect to be near the optimal decrease in time by using multiple CPUs.)
- Upgrading the hardware...
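To make the second option above concrete, here is a minimal sketch of coarsening each event on the CPU inside the `DataLoader` workers before batching. The function names, the cluster count, and the use of `KMeans` on the first four features are assumptions for illustration, not the implementation in this PR.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from torch_geometric.data import Batch, Data


def coarsen_event(graph: Data, n_clusters: int = 128) -> Data:
    """Cluster pulses on the CPU and replace them by their per-cluster means."""
    x = graph.x.numpy()
    k = min(n_clusters, x.shape[0])
    labels = KMeans(n_clusters=k).fit_predict(x[:, :4])
    graph.x = torch.tensor(
        np.stack([x[labels == i].mean(axis=0) for i in range(k)]), dtype=torch.float32
    )
    # NB: any precomputed edge_index would have to be rebuilt after coarsening.
    return graph


def coarsening_collate(graphs):
    """Collate function so the coarsening runs in the CPU DataLoader workers."""
    return Batch.from_data_list([coarsen_event(g) for g in graphs])


# loader = DataLoader(dataset, batch_size=..., num_workers=20, collate_fn=coarsening_collate)
```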
Created by: Aske-Rosted
@Aske-Rosted, I see that you are assigned to this PR. Have there been any recent developments that changed and/or solved the issues in your last post, or should someone else be assigned? :)
I have been looking into alternatives and have cleared the hurdle of running out of memory, however I cannot speak towards performance as I am unfortunately heavily IO-bound using this method. Some form of compression of the SQL files might relieve the stress on the IO... I think I currently need some advice or input as to what solution to work towards.
Created by: MortenHolmRep
I have been looking into alternatives and have cleared the hurdle of running out of memory, however I cannot speak towards performance as I am unfortunately heavily IO-bound using this method. Some form of compression of the SQL files might relieve the stress on the IO... I think I currently need some advice or input as to what solution to work towards.
Thanks for your quick reply @Aske-Rosted! To clarify, you can now fit all events into GPU memory from point 3, but issue #441 (closed) is still unresolved?
Did you manage to get my profiling branch to work in your own setup? It could reveal where the performance is bottlenecked, if at all. Maybe in one of the next meetings or a dedicated meeting, you could present the issue, and we could identify avenues for you to try.
Created by: Aske-Rosted
I have been looking into alternatives and have cleared the hurdle of running out of memory, however I cannot speak towards performance as I am unfortunately heavily IO-bound using this method. Some form of compression of the SQL files might relieve the stress on the IO... I think I currently need some advice or input as to what solution to work towards.
Thanks for your quick reply @Aske-Rosted! To clarify, you can now fit all events into GPU memory from point 3, but issue #441 (closed) is still unresolved?
Did you manage to get my profiling branch to work in your own setup? It could reveal where the performance is bottlenecked, if at all. Maybe in one of the next meetings or a dedicated meeting, you could present the issue, and we could identify avenues for you to try.
Yes, I have not done any further work towards #441 (closed), so I assume that nothing has changed there. I have managed to fit some events into memory (through considerable changes to the model); however, the current training times on my machine make it unfeasible to get to a stage where I can say anything about the performance of the model.
With inspiration from your profiling branch, I got some profiling to work but found the results hard to interpret. However, observing I/O utilization, along with normal CPU and GPU utilization, while running the script made it quite clear that at the current stage the training of the model is I/O bound.
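A simpler cross-check that avoids a full profiler is to split the wall time of a few batches into data-loading time versus model time. The sketch below is a minimal example of that idea (the function name and arguments are placeholders, and it assumes a CUDA device and a PyG-style batch with a `.to(device)` method):

```python
import time

import torch


def profile_loader_vs_model(loader, model, device="cuda", max_batches=100):
    """Rough split of wall time into data loading vs. model forward for a few batches."""
    load_time = step_time = 0.0
    it = iter(loader)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        batch = batch.to(device)
        load_time += time.perf_counter() - t0

        t0 = time.perf_counter()
        model(batch)
        torch.cuda.synchronize()  # include the asynchronous GPU work in the timing
        step_time += time.perf_counter() - t0
    print(f"data loading: {load_time:.1f} s, model forward: {step_time:.1f} s")
```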
I agree that addressing the issue at an upcoming meeting is probably the next step forward. :)
Created by: asogaard
Hi @Aske-Rosted,
I'm just back from parental leave. :) #441 (closed) should have been fixed now — please let me know if you find this is not the case.
Would you be able to provide a small (or large) test dataset (e.g., on Cobalt) and a test script for timing reading your data, possibly also coarsening? Then I would be happy to take a look to help diagnose the problem, such that we can try and reduce the solution space.
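As a possible starting point for such a timing script, here is a minimal sketch that sweeps the number of workers while reading a SQLite sample. The import path, constructor arguments, database path, and column names are placeholders and would need to be adapted to the actual graphnet version and test sample.

```python
import time

from graphnet.data.sqlite import SQLiteDataset  # import path assumed; adapt to your install
from torch_geometric.loader import DataLoader

# Placeholder path/columns; replace with the actual test sample (e.g. on Cobalt).
dataset = SQLiteDataset(
    path="/path/to/test_sample.db",
    pulsemaps="SRTInIcePulses",
    features=["dom_x", "dom_y", "dom_z", "dom_time", "charge"],
    truth=["energy", "zenith"],
)

for num_workers in (0, 4, 16):
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
    t0 = time.perf_counter()
    n_events = sum(batch.num_graphs for batch in loader)
    dt = time.perf_counter() - t0
    print(f"num_workers={num_workers}: {n_events / dt:.1f} events/s")
```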