It seems to me that right now km3io saves trk.E as the corrected (reconstructed) energy; however, we need JENERGY_ENERGY (the uncorrected energy), since the energy correction for e.g. ARCA2 is not working properly. I attach a reco-level plot for ARCA2 where I plot tracks.energy. You can see the "bad stuff" happening at the very low energies. I have encountered this before, and it is definitely due to the bad energy correction.
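For reference, a minimal sketch of what I mean by reading the uncorrected value, assuming the per-track fit information exposed as r.tracks.reco (used in the snippet below) actually carries a JENERGY_ENERGY column; the field name, the access pattern, and the file name are assumptions on my part, not confirmed km3io API:

```python
# Hedged sketch: read the uncorrected JGandalf energy by name from the
# per-track fit record. The "JENERGY_ENERGY" column and this access pattern
# are assumptions; "some_file.aanet.root" is a placeholder.
import km3io as ki

for r in ki.OfflineReader("some_file.aanet.root"):
    if r.tracks.E.size and "JENERGY_ENERGY" in r.tracks.reco.dtype.names:
        uncorrected_E = r.tracks.reco["JENERGY_ENERGY"][0]
        corrected_E = r.tracks.E[0]
        break
```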
```
>>> import km3pipe as kp
>>> from km3pipe.dataclasses import Table
>>> import numpy as np
>>> from glob import glob
>>> import os.path
>>> import km3io as ki
>>> import h5py
>>>
>>> DET = '00000014'
>>> PRIM = 'Fe'
>>> #FOLDER = '/sps/km3net/users/kakiczi/CORSIKA_checks/SIBYLL-2.3/gamma_-1/' # Lyon
... FOLDER = '/mnt/home/pkalaczynski/CORSIKA_production/SIBYLL_2.3/charm/gamma_-1/' # CIŚ
>>> #FOLDERS = '/sps/km3net/users/kakiczi/CORSIKA_checks/SIBYLL-2.3/gamma_-1/*/gSeaGen_processed/mc/manual/KM3NeT_'+DET+'/v5.8/reco'
... FOLDERS = FOLDER+PRIM+'/gSeaGen_processed/mc/manual/KM3NeT_'+DET+'/v5.8/reco' # CIŚ
>>> FNAMES = glob(os.path.join(FOLDERS, "*aanet*root")) # " + i + "
>>> FNAMES.sort() # ensures that we go in alphabetical/numeric order
>>>
>>> VARS = [[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],]
>>> for f in FNAMES:
...     print("Opening ", f)
...     if len(ki.OfflineReader(f)):
...         for r in ki.OfflineReader(f):
...             if r.tracks.E.size: # only taking non-empty tracks
...                 if r.tracks.rec_stages[0]==[1,2,3,4,5]: # only fully-reconstructed
...                     print("corrected: ", r.tracks.E[0])
...                     print("reco keys: ", r.tracks.reco.dtype.names)
...                     print("reco: ", r.tracks.reco)
...
Opening  /mnt/home/pkalaczynski/CORSIKA_production/SIBYLL_2.3/charm/gamma_-1/Fe/gSeaGen_processed/mc/manual/KM3NeT_00000014/v5.8/reco/mcv5.8.DAT000001.gSeaGen.sirene.jte.jchain.aanet.1.root
corrected:  245013.2211937742
Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/mnt/home/pkalaczynski/miniconda3/envs/km3net/lib/python3.7/site-packages/km3io/offline.py", line 608, in reco
    self._reco = np.core.records.fromarrays(fit_data.transpose(), names=keys)
  File "/mnt/home/pkalaczynski/miniconda3/envs/km3net/lib/python3.7/site-packages/numpy/core/records.py", line 592, in fromarrays
    shape = arrayList[0].shape
IndexError: list index out of range
```
Yeah, we already discussed this via RocketChat. It's indeed from empty tracks. It's weird, because they are sometimes just all 0 and sometimes empty lists.
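For what it's worth, the IndexError in the traceback above can be reproduced without km3io at all, which supports the empty-tracks explanation; this is just a sketch of the failure mode, not the km3io internals:

```python
# Rough reproduction of the failure mode (independent of km3io): an event with
# no reconstructed tracks means fromarrays receives an empty list of columns,
# and numpy then fails at arrayList[0].shape.
import numpy as np

np.core.records.fromarrays([], names="a")  # raises IndexError: list index out of range
```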
Well, that's indeed not an MWE, but I think it's still rather clear what's going on. Perhaps I could optimize the code, but I am quite lazy and not sure if it would really be faster if I used np.mask here.
As you can see, I use the .counts attribute to select the events where at least one entry is present; then I use it as a mask for the first dimension (and 0 for the second to select the first track).
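Something like this, as a toy sketch with a hand-made jagged array (awkward 0.x API, as shipped with uproot at the time; on newer installations the import name may be awkward0, and the values and variable names are made up):

```python
# Toy sketch of the .counts masking described above (awkward 0.x API).
import awkward

E = awkward.fromiter([[245013.2, 1.2], [], [57.0]])  # per-event track energies
mask = E.counts > 0            # events with at least one reconstructed track
first_track_E = E[mask][:, 0]  # first track of every non-empty event
print(first_track_E)           # energies of the first track in each non-empty event
```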
Trust me, it can be orders of magnitude faster: it takes just a second to filter and extract some parameters as numpy arrays from hundreds of thousands of events (I am doing RBR analysis and go full whack with numpy).
Different-sized sub-arrays are no problem; uproot uses awkward arrays, which are essentially specialized numpy arrays. You work with them as if they were numpy arrays.
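For example (again awkward 0.x, the version bundled with uproot at the time; toy values):

```python
# Jagged (different-sized) sub-arrays behave much like numpy arrays.
import awkward

jagged = awkward.fromiter([[1.0, 2.0, 3.0], [], [4.0]])
print(jagged * 2)     # broadcasts element-wise, like numpy
print(jagged.counts)  # entries per event: [3 0 1]
```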
OK, I just used our aanet file in the samples folder of km3io. It seems that this doubly-nested thing is a bit "awkward" to deal with. uproot is currently waiting for awkward-array 1.0, which is a complete rewrite of the current awkward library and has full numba support.
With numba, it's a no-brainer to write a dead-simple for loop and make it ultra fast.
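Something along these lines (just a sketch; the flat offsets/content layout is an assumption about how the jagged data would be handed to numba, not km3io or uproot API):

```python
# Sketch of the numba idea: a plain for loop over a jagged layout, compiled
# with @njit. Assumed layout: content holds all track energies back to back,
# and offsets[i]:offsets[i+1] delimits the tracks of event i.
import numba
import numpy as np

@numba.njit
def first_track_energy(offsets, content):
    out = np.full(len(offsets) - 1, np.nan)
    for i in range(len(offsets) - 1):
        if offsets[i + 1] > offsets[i]:   # event has at least one track
            out[i] = content[offsets[i]]  # energy of its first track
    return out

offsets = np.array([0, 2, 2, 3])           # 3 events: 2 tracks, 0 tracks, 1 track
content = np.array([245013.2, 1.2, 57.0])  # all track energies, flattened
print(first_track_energy(offsets, content))  # -> [245013.2, nan, 57.0]
```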
For now, I think I'll just ask the awkward guys how to deal with this specific problem efficiently. There must be a way.
Not sure if they will address it soon, as they are working on awkward 1.0, which will be completely reimplemented, but let's see. Maybe we are doing something wrong.
Alright, I'll stick with the ugly implementation for now then. It's not bad enough to slow me down seriously or anything. When the fix is available, I'll upgrade.
Get the fit data of the best-reconstructed track, "best" being defined as the track with the longest rec_stages (of course this may not always be [1, 2, 3, 4, 5]!).
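In other words, roughly this (a plain-Python illustration of the selection rule, not the km3io implementation):

```python
# Illustration only (not the km3io code): "best" track = the one with the
# longest rec_stages list; an event without tracks has no best track.
def best_track_index(rec_stages_per_track):
    """rec_stages_per_track: one list of rec stages per track in a single event."""
    if not rec_stages_per_track:
        return None
    lengths = [len(stages) for stages in rec_stages_per_track]
    return lengths.index(max(lengths))

# Example: the second track wins because it has all five rec stages.
print(best_track_index([[1, 2], [1, 2, 3, 4, 5], [1]]))  # -> 1
```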
Please do not confuse the two.
You can see that best_E.size is 45 and E.size is 41.
So this means that in your file (which has 50 events) there are 45 reconstructed tracks and 5 empty (non-reconstructed) events. Among these 45 reconstructed tracks, 41 are fully reconstructed (with rec_stages = [1, 2, 3, 4, 5]).
More information is available in the docs + README.
This looks cool; however, I would then need some way to also extract mc_tracks, events, and hits for the corresponding fully-reconstructed events (otherwise the arrays won't match in length).
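What I have in mind is something like this: build one per-event boolean mask and apply it to every branch so everything stays aligned (a sketch with plain awkward 0.x arrays, not the km3io API; the variable names and values are placeholders):

```python
# Sketch of keeping branches aligned with one per-event mask
# (plain awkward 0.x arrays; names and values are placeholders).
import awkward

tracks_E    = awkward.fromiter([[245013.2], [], [57.0, 3.1]])
mc_tracks_E = awkward.fromiter([[1e6], [2e5], [3e4, 5e3]])
hits_t      = awkward.fromiter([[10.0, 11.0], [12.0], [13.0]])

mask = tracks_E.counts > 0        # or any per-event "fully reconstructed" criterion
tracks_E_sel    = tracks_E[mask]
mc_tracks_E_sel = mc_tracks_E[mask]
hits_t_sel      = hits_t[mask]    # all three keep the same number of events
```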
I think I like it. The question is, of course, how performant it is, but I guess for now the biggest thing is that we have a fully working version and a test suite with full coverage. We can make it faster later.