We all agree that there is no single "best track" algorithm as every reconstruction method can spit out it's own vector of tracks with no constraints on which one is the best track. More over, the single vector can hold different types of reconstruction algorithms by design.
Consider a track reconstruction which always(!) generates a best up-going and a best down-going track, in this particular order. In this case you should never take always the first, but check the first two and decide on additional parameters etc. on top of that, it might be that tracks from other reconstruction algorithms are present, so it's not straight forward.
So we decided to provide a single function with the signature e.g. best_track(tracks, strategy). Both are mandatory and strategy is the algorithm selector.
A very first implementation is best_track(tracks, strategy="first") which will take the first track in each event. This is currently almost always the best strategy.
The implementation is with the upcoming slicing magic something like this:
If mc_tracks would contain proper non-empty rec_stages entries, it would be quite straightforward. Otherwise hmm, maybe some index matching would work.
Yes, I suggest we add some cleaning in the tracks (trivial one: remove non-reconstructed tracks). I think it is also good to make sure that the tracks selected are fully reconstructed (following whatever reconstruction stages the user wants)... But that is under discussion... at the moment, we still need to explore the best reco strategies, so please @pkalaczynski do not expect a fast implementation as we would like to make this best reco method as general as possible (for the time being, since km3net has not yet agreed on a common definition of best reco track!).
Yep, exactly. We want it to be user-friendly, but also let people do complex stuff as well. And once again as a reminder, this should be applicable not just on tracks, but also their respective mc_tracks, mc_hits, hits etc. ;)
Note that we should keep in mind that we can have a mixture of reconstructed tracks inside the same vector from different reconstruction algorithms. At the moment we did not mix different reconstruction algorithms in the same processing pipeline, but this might (and will) be the case in future.
This means that there is a requirement to also always check the rec_type of each entry.
The data structure (flat vector) which is provided by the aanet data format is clearly not ideal for this.
I would like to add something, if I may. Maybe this is already possible with basic numpy, but: At times one might need a mask for the selected "best tracks" as much as the best tracks and their information. My use case for example:
I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.
Is there an easy way to do it?
If a mask is also returned by such a "best track" function (or maybe even an own function) it could be used by the user for the mc_tracks and hits as well.
My use case for example: I have a run processed with the standard reco and one with an alternative one (deep learning in this case). In order to compare the same events to one another I have to extract a "is reconstructed" mask from the standard reco first.
This boils down to our overall goal: providing a natural way of slicing/indexing the KM3NeT datastructures in a whole, just as they were numpy arrays.
Is there an easy way to do it?
Kind of... with a few caveats We are working on this very hard, every day and it's a tough task to implement it so that it's both user-friendly and has maximum performance.
We face multiple problems, but one of the biggest issues is of course that we are not dealing with homogeneous data like 2D matrices (with named columns) but variable lengths of arrays inside variable lengths of arrays inside of variable length arrays.
Here are some examples of the slicing/indexing/masking feature:
>>>importkm3io>>>km3io.version'0.9.1.dev72+g5dee01f'>>>f=km3io.OfflineReader("datav5.40.jorcarec.aanet.00006585.root")>>>tracks=f.events.tracks>>>tracks<Branch[tracks]:141098elements>>>>f.events.n_tracksarray([128,128,128,...,128,128,120])>>>mask=f.events.n_tracks>0# boolean mask for events with at least 1 track>>>mask<ChunkedArray[TrueTrueTrueTrueTrueTrueTrue...]at0x7facfe227c10>>>>pos_x_of_every_first_track=tracks[mask].pos_x[:,0]>>>pos_x_of_every_first_track<ChunkedArray[29.75856308122447344.8793335356463563.474211273184910.348203834859914-10.1863942339137111.57492633136809443.82500580340671...]at0x7facfc1a6d00>>>>pos_x_of_every_first_track.shape(141053,)>>>selected_tracks=tracks[mask]>>>selected_tracks<Branch[tracks]:141053elements>>>>selected_hits=f.events.hits[mask]>>>selected_hits<Branch[hits]:141053elements>
As you can see, the slicing is more natural and can be chained, although we still need to fix issues so don't use this in production yet.
The problem with the best_track() mask is, that it needs to be a two-dimensional mask (event_idx+track_idx) which is not very well handled by these underlying ChunkedArrays in uproot. I opened an issue end of January (https://github.com/scikit-hep/awkward-array/issues/229) and the answer from Jim himself (the creator of uproot) is:
I made a wrong design choice by introducing ObjectArray: it delays evaluation of the (expensive) unsplit-list deserialization by introducing an opaque type that doesn't work like any other data. In Awkward 1.0, the unsplit-list deserialization will be over an order of magnitude faster (no longer the very noticeable performance hit) as shown on page 14/16 of this talk, and the Methods and ObjectArray mechanisms will be replaced by a single non-opaque mechanism as shown in this section of the Coffea demo.
So, not a bug, but exactly the sort of confusing feature that will be eliminated by transitioning to Awkward 1.0.
Recently I also experimented converting these ChunkedArrays (which are ObjectArrays behind the scenes and not "well-sliceable") to AwkwardArrays using the not-yet-released awkward1 library (which is going to replace awkward soon) and there the slicing works as expected. However, we will deal with the issue that those arrays are not lazy, which means that they will eventually need to load all the data in memory. This might not be a problem for high(est) level data though. You can see the issue here: https://github.com/scikit-hep/uproot/issues/461
All in all I can say that we are close to a nice API, but it's still a lot of work. @zaly is also working hard on the test part, to boil the hell out of it by trying to figure out all use-cases which cause problems with our new approach to the indexing/masking/slicing.
Meanwhile, let us know what you think and if you have any ideas, don't hesitate to create issues or comment wherever you like! I am confident that the numpy-style high-level analysis is the way to go.
I'd love to see pandas-like operations and if we had 2D shaped data (matrices) it would be my first choice, as it is for KM3Pipe with the thin kp.Table-class for event-by-event readout, since there you can shape every branch into a 2D columnar data independently (one table for Hits, one for Tracks etc.), but in case of km3io the user wants to deal with the whole file at once, which makes it awkward.
Just a tiny update: @zaly is working on the best_track implementation and meanwhile the slicing has been implemented to work around with the uproot3 limitations using the new awkward1 library.
Feel free to test it!
As always: the latest master is updated in Lyon on every git push (module load python/3.7.5).
Not sure if a dedicated function is available yet, but if you want to just have tracks with all reco stages and with their best fits, you can typically just use masks like in this example:
for f in FNAMES: print("Opening ", f) r = ki.OfflineReader(f) if len(r.events)>1: # only non-empty files premask = [rstgs.size>1 for rstgs in r.events.tracks.id] mask = [[1, 2, 5, 3, 5, 4] == list(s[0]) for s in r.events.tracks.rec_stages[premask]] VARS[0].extend([sum(tots) for tots in r.events.hits.tot[premask][mask]])
Not sure if that's the best way, but it works and it's rather fast.
@lmaderer@pkalaczynski@dguderian
I implemented a prototype some time ago but I have not tested it with heavy files (so I have no idea about the performance). Please go ahead and try it out and let me know your feedback, I will really appreciate it :)
This is how to proceed:
Here is some documentation:
best_track(tracks,strategy="first",rec_type=None,rec_stages=None):"""best track selection based on different strategies Parameters ---------- tracks : class km3io.offline.OfflineBranch the tracks branch. strategy : str the trategy desired to select the best tracks. It is either: - "first" : to select the first tracks. - "rec_stages": to select all tracks with a specific reconstruction stages. This requires the user to specify the reconstruction stages in rec_stages input. Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]). - "default": to select the best tracks (the first ones) corresponding to a specific reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input. Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE"). rec_type : str, optional reconstruction type as defined in the official KM3NeT-Dataformat. rec_stages : list, optional list of the reconstruction stages. Example: [1, 2, 3, 4, 5] Returns ------- class km3io.offline.OfflineBranch tracks class with the desired "best tracks" selection."""
Thanks @zaly I decided to give it a try. There are still some problems with empty tracks which gives an IndexError. So we need to exclude those. Here is an example using a file from the ORCA DU4 production which has 3 events without any reconstructed tracks:
I added a quick fix to mask empty tracks. I performed a quick test to see the performance of best_track on orca4 data as you did, here are the results:
Hum ... maybe we need to add an option to just return the "entire" mask to apply to the tracks/events class (after the best track selection) for those who need the corresponding events ids?
I am not sure I see a trivial solution to this at the moment! I will think about it.
I am not sure that this is as trivial as it sounds like because the shape (and the type) of masks for tracks is completely different from the shape (and the type) of the masks for events (because tracks data is always nested while events data is almost always flat!). So even if we get a mask for tracks using best_track, it won't necessarily be useful for events class.
Yes, the mask for events would be a 1D, but, as you already know, the mask for tracks won't be a 1D .... So I am not sure what we want this function to do precisely ...
So if we say that an option to return masks is needed for best_track, we need to choose one of the following options:
return both masks for tracks and events (no classes are returned), and let the users do whatever they want with them.
return best tracks class and the mask for events (as an option), maybe this is what you have in mind?
return best tracks class, and the masks for tracks as an option and the mask for events as a separate option?
we can also rely on the user to apply the mask events.tracks[events.n_tracks > 0] and use ONLY the reconstructed events as input for best_tracks (since it doesn't make sense to talk about best_track selection if there isn't any reconstruction). When the mask is not applied, there will always be an IndexError raised, but I can raise a custom error to remind the user to apply the mask ...
PS: I am just thinking out loud here. I am personally in favour of doing the minimum operations possible on data and giving more flexibility to the user.
My approach is from the users point of view. If best_track(tracks, ...rectype, etc.) is executed, the user (in my point of view) assumes to retrieve the best track (a single one) for each event. I think that a 1D mask which also shows for which event there actually id a track is very natural and further selections (based on the events) can be made easily.
I don't think that the user is interested in the position of the track inside the track vector, but they definitely need access to the track parameters, which in fact can be quite annoying when there is still a nested structure of tracks per event.
We should probably think about returning a flat "vector of tracks" with all data extracted.
best_track already returns flat data! (except for the strategy "rec_stages" where some nesting is still allowed for those who want to do multiplicity studies, which can be moved to another function btw ...).
Alright, so main conclusion: add an option to return the 1D mask for events to match the tracks with their corresponding event ids!
I am wondering if we could just reuse the dtype of the events.tracks[] entries that would make it quite easy to populate a numpy recarray where the dtype is something like pos_x, pos_y, ..., rec_type etc. (all single values)
so here is a quick implementation of the events mask option:
defbest_track(events,strategy="default",rec_type=None,rec_stages=None,event_mask=False):"""best track selection based on different strategies Parameters ---------- events : class km3io.offline.OfflineBranch the events branch. strategy : str the trategy desired to select the best tracks. It is either: - "first" : to select the first tracks. - "rec_stages": to select all tracks with a specific reconstruction stages. This requires the user to specify the reconstruction stages in rec_stages input. Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]). - "default": to select the best tracks (the first ones) corresponding to a specific reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input. Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE"). rec_type : str, optional reconstruction type as defined in the official KM3NeT-Dataformat. rec_stages : list, optional list of the reconstruction stages. Example: [1, 2, 3, 4, 5] event_mask : bool, optional True: to return the mask to apply to the events class to match the best track selected with their corresponding event ID. Returns ------- class km3io.offline.OfflineBranch tracks class with the desired "best tracks" selection. Raises ------ ValueError ValueError raised when an invalid strategy is requested."""options=['first','rec_stages','default']ifstrategynotinoptions:raiseValueError("{} not in {}".format(strategy,options))mask=events.n_tracks>0tracks=events.tracks[mask]ifstrategy=="first":out=tracks[:,0]elifstrategy=="rec_stages"andrec_stagesisnotNone:out=tracks[mask(tracks.rec_stages,rec_stages)]elifstrategy=="default"andrec_typeisnotNone:out=tracks[tracks.rec_type==reconstruction[rec_type]][tracks.lik==ak1.max(tracks.lik,axis=1)][:,0]ifevent_mask:out=(out,mask)returnout
TBH here, I really don't like the idea of returning the mask, because this mask is really trivial and I think it must be left to the user to ALWAYS provide a subset of the events class to the best_track function where all the non-reconstructed events have been removed (ie: where the mask events.n_tracks >0 have been applied on events). This actually will make the function best_track more general and it can be applied to any subset of events the user would like (with the condition events.n_tracks >a where a>0), in which case the events ids of the subset used and the best tracks selected will ALWAYS match!
Alright, now you might think "yea but in the previous implementation, we could still use best_track with any subset of events and it will still work". To this, I reply YES of course! BUT, if a user is not very experienced, and they use best_track function, they will be returned a subset of their input without necessarily knowing why, unless they look at the source code. While if we impose to the user to construct from the beginning their own subset, the best_track method becomes transparent and there will be no ambiguity on the tracks and their corresponding events IDS.
So here is an alternative:
defbest_track(events,strategy="default",rec_type=None,rec_stages=None):"""best track selection based on different strategies Parameters ---------- events : class km3io.offline.OfflineBranch a subset of reconstructed events where `events.n_tracks > 0` is always true. strategy : str the trategy desired to select the best tracks. It is either: - "first" : to select the first tracks. - "rec_stages": to select all tracks with a specific reconstruction stages. This requires the user to specify the reconstruction stages in rec_stages input. Example: best_track(my_tracks, strategy="rec_stages", rec_stages=[1, 2, 3, 4, 5]). - "default": to select the best tracks (the first ones) corresponding to a specific reconstruction algorithm (JGandalf, Jshowerfit, etc). This requires rec_type input. Example: best_track(my_tracks, strategy="default", rec_type="JPP_RECONSTRUCTION_TYPE"). rec_type : str, optional reconstruction type as defined in the official KM3NeT-Dataformat. rec_stages : list, optional list of the reconstruction stages. Example: [1, 2, 3, 4, 5] Returns ------- class km3io.offline.OfflineBranch tracks class with the desired "best tracks" selection. Raises ------ ValueError ValueError raised when an invalid strategy is requested."""options=['first','rec_stages','default']ifstrategynotinoptions:raiseValueError("{} not in {}".format(strategy,options))ifany(events.n_tracks==0):raiseValueError("'events' should not contain empty tracks. Consider applying the mask: events.n_tracks>0")tracks=events.tracksifstrategy=="first":out=tracks[:,0]elifstrategy=="rec_stages"andrec_stagesisnotNone:out=tracks[mask(tracks.rec_stages,rec_stages)]elifstrategy=="default"andrec_typeisnotNone:out=tracks[tracks.rec_type==reconstruction[rec_type]][tracks.lik==ak1.max(tracks.lik,axis=1)][:,0]returnout
I don't really care about backwards compatibility until we reach 1.0, so we can do whatever we want ;) From 1.0 on we strictly follow SemVer (we actually also do now, since SemVer says that everything is breaking until 1.0 ).
Given that any(events.n_tracks == 0) has no significant impact on the performance, I think it's a good solution. I agree that we should not make best_track too smart...
Alright, this made me realise that I did not delete the km3io.offline.best_track(..), all the functions are in tools. Sorry for that!
So I updated the master branch now, all the functions are in tools.
If we want best_track to work with a single event, I will have to update it. (I didn't think about this case (oups!), because I thought we will process all events in a file in one go).
@lnauta I am thinking about redesigning the processing of the files in your pipeline. I think it would be better to wrap the km3io readout of the whole file in one module and do the iteration per-file, instead of per-event. What do you think?
That sounds good to me. For me it the important part is that I can read it in with MONA at the end, which requires it to be a flat TTree, so as long as the conversion from some format to root in a flat TTree is possible it's good!
But I will need some help with making the pipes, I'm still learning how to do this.
Sure, no problem. I can show you an example. It’s basically the creation of a single kp.Table for a file per iteration and then dumping it into the summary file via the HDFSink.
best_track has been adapted for the case of a tree with one event only. (it is available on the master branch, in km3io.tools)
the strategy "rec_stages" has been removed from the function best_track for consistency. Since the case for strategy='rec_stages' was more for a multiplicity study, a new function was created specifically for this purpose: get_multiplicity(tracks, rec_stages).
Please let me know if you have any requests or if you have noticed other bugs.
I really appreciate all your contributions/feedback and I hope I am providing what you guys need for your analysis :)
Thanks for the update. I use the rec_stages to make sure that all stages have been done in the reconstruction like: km3io.tools.best_track(ev.tracks, strategy="rec_stages", rec_stages=all_gandalf_stages)
Should I switch to this get_multiplicity method? I am still working on rewriting my code to be on file by file level as described above, it's not finished yet.
@lnauta
I am a bit confused here, I interpret your reply in two ways:
1- If you just want to select ALL the tracks in an event, that were reconstructed following certain reconstruction stages of interest (for example [1, 2, 3, 4, 5]), then you can either,
a) use the get_multiplicty method: get_multiplicity(ev.tracks, [1, 2, 3, 4, 5]). Be careful here, the order of the reconstruction stages is important :)
PS: get_multiplicity function already does this for you, but I am just sharing all available options in km3io.tools ;)
2- If after selecting all tracks that were reconstructed following the stages: [1, 2, 3, 4, 5], you want to select the best one among these tracks (either with strategy=default or strategy=first), then you will have to pass fully_reco_tracks to best_track as an input, and use the strategy of your interest.
@lnauta here is an example how to process a file one-by-one as a whole. You can then use the best_track function to get all the tracks in one shot.
The basic idea is to create an event mask (just a boolean numpy array) which is your first cut (and probably only) cut. Then apply this cut mask to extract equal lengths of parameters like the positions of the reconstructed tracks etc.
In addition to that, I'd recommennd to create an extra table where you put the header information. For this, Mona probably needs to be adjusted but you can also create columns of the table with the same values for each file.
So here is an example (I wrote that down blindly, so expect typos and whatnot, but it shows the basic workflow
This worked, however I encountered the following bug using "default" strategy:
Pipeline and module initialisation took 0.005s (CPU 0.003s).2020-06-30 11:53:05 ++ SummaryExtractor: Processing data/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.rootTraceback (most recent call last): File "pipe-tam.py", line 71, in <module> pipe.drain() File "/project/antares/public_student_software/venvs/km3pipe-v9-alpha17/lib/python3.7/site-packages/thepipe/core.py", line 423, in drain return self._drain(cycles) File "/project/antares/public_student_software/venvs/km3pipe-v9-alpha17/lib/python3.7/site-packages/thepipe/core.py", line 372, in _drain new_blob = module(blob_to_send) File "/project/antares/public_student_software/venvs/km3pipe-v9-alpha17/lib/python3.7/site-packages/thepipe/core.py", line 178, in __call__ return self.process(*args, **kwargs) File "pipe-tam.py", line 52, in process strategy="default") File "/data/antares/users/ljnauta/nikhef_mc/py_packages/km3io/km3io/tools.py", line 326, in best_track return outUnboundLocalError: local variable 'out' referenced before assignmentClosing remaining open files:dump.h5...done
Using the "first" strategy, this did not occur and the code provided by @tgal and @zaly runs succesfully and about 10x faster:
Pipeline and module initialisation took 0.004s (CPU 0.002s).2020-06-30 11:55:22 ++ SummaryExtractor: Processing data/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root2020-06-30 11:56:48 ++ km3pipe.io.hdf5.HDF5Sink.HDF5Sink: HDF5 file written to: dump.h5============================================================1 cycles drained in 86.081316s (CPU 85.065551s). Memory peak: 797.94 MB wall mean: 85.133203s medi: 85.133203s min: 85.133203s max: 85.133203s std: 0.000000s CPU mean: 85.017494s medi: 85.017494s min: 85.017494s max: 85.017494s std: 0.000000s
Now this full_reco_tracks is supposed to contain all tracks that were reconstructed following the stages [1, 2, 3, 4, 5]. BUT if these stages don't even exist, you will simply have empty tracks (this is what caused best_track to raise an error, which then made your code fail..)! Here is how you can check how many tracks per event were constructed following these stages:
Now you see that there isn't ANY track that was reconstructed following the stages [1, 2, 3, 4, 5]!
After looking a little bit in your file, I see that the reconstruction stages of most tracks is actually [1, 2, 5, 3, 5, 4] (just a reminder here: the order is really important). So now let's try with these reco stages:
importkm3ioaskifromkm3io.toolsimportbest_track,mask,count_nestedf="/sps/km3net/users/lnauta/mcfspec.gsg_elec-CC_10-100GeV.sirene.jte.jchain.jsh.aanet.12.root"events=ki.OfflineReader(f).eventsfull_events=events[events.n_tracks>0]mask=mask(full_events.tracks.rec_stages,[1,2,5,3,5,4])full_reco_tracks=full_events.tracks[mask]count_nested(full_reco_tracks.rec_stages,axis=1)>>><Array[1,1,1,1,1,1,...1,1,1,1,1,1]type='3248 * int64'># So now we are sure we are not providing empty tracks to `best_track`best_tracks=best_track(full_reco_tracks,strategy="default",rec_type="JPP_RECONSTRUCTION_TYPE")>>><OfflineBranch[tracks]:3248elements>
Concerning the performance, yes I am not surprised at all that strategy="first" is much faster than strategy="default"`, in the first one, we only select the very first track per event, but in the second one, we apply multiple masks (that are not trivial to create), we also play a little bit with data structures (switching from awkward data structures to awkward1.Array when necessary for nested data etc etc) ;)
I was not complete in my reply, sorry for that: I had a private communication with tgal and there my old code which is not included here took ~900s to run. With the code provided in this issue it runs in 85s, which is the 10x speedup mentioned.
For the bug I posted, I used no rec_type, so that defaults to None and combined with the strategy="default" it crashes and gives the error mentioned. So there is no measured runtime in this case. And I use the gandalf stages [1, 2, 5, 3, 5, 4].
Alright, so now the only missing thing is to be able to extract the best track for a single event (plus some improved error messages, see second paste):
Alright, so now the only missing thing is to be able to extract the best track for a single event
I think best_track fails here because track.usr raises an error with this file, it is not read by km3io. I opened #62 (closed) to solve it because I don't think it's related to best_track.
here is an error which is not so user-friendly:
Indeed, this is not user-friendly! will be corrected! Thanks for pointing it out ;)
In[6]:bt=ki.tools.best_track(trk)---------------------------------------------------------------------------KeyErrorTraceback (mostrecentcalllast)<ipython-input-6-7d12b2b91c1e>in<module>---->1bt=ki.tools.best_track(trk)~/km3net/km3net/km3io/km3io/tools.pyinbest_track(tracks,strategy,rec_type)317318ifstrategy=="default"andrec_typeisNone:-->319raiseKeyError(320"rec_type must be provided when the default strategy is used.")321KeyError:'rec_type must be provided when the default strategy is used.'
I hope this helps, let me know if anything else is needed ;)