h5extract is too slow
Testing on Orca4 jsieren mupage data shows that the new h5extract is slower than the old tohdf5 by a factor of 20. This makes it practically unusable for large-scale applications, so it needs to be sped up.
Here are the timings of each module when running h5extract:
403 cycles drained in 148.415606s (CPU 148.084299s). Memory peak: 447.19 MB
wall mean: 0.367872s medi: 0.335088s min: 0.173658s max: 12.586694s std: 0.609882s
CPU mean: 0.367136s medi: 0.334728s min: 0.168863s max: 12.566205s std: 0.608898s
OfflinePump - process: 0.104s (CPU 0.105s) - finish: 0.001s (CPU 0.001s)
wall mean: 0.000259s medi: 0.000250s min: 0.000242s max: 0.000690s std: 0.000032s
CPU mean: 0.000259s medi: 0.000251s min: 0.000242s max: 0.000690s std: 0.000032s
StatusBar - process: 0.000s (CPU 0.000s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.000026s medi: 0.000025s min: 0.000025s max: 0.000030s std: 0.000002s
CPU mean: 0.000041s medi: 0.000040s min: 0.000039s max: 0.000045s std: 0.000002s
OfflineHeaderTabulator - process: 0.099s (CPU 0.100s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.000247s medi: 0.000236s min: 0.000229s max: 0.001621s std: 0.000072s
CPU mean: 0.000247s medi: 0.000237s min: 0.000229s max: 0.001622s std: 0.000072s
EventInfoTabulator - process: 1.929s (CPU 1.928s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.004785s medi: 0.004584s min: 0.004533s max: 0.007842s std: 0.000543s
CPU mean: 0.004785s medi: 0.004583s min: 0.004533s max: 0.007838s std: 0.000543s
Offline - process: 1.731s (CPU 1.716s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.004296s medi: 0.004004s min: 0.003961s max: 0.027509s std: 0.001283s
CPU mean: 0.004257s medi: 0.004003s min: 0.003962s max: 0.012373s std: 0.000684s
MC - process: 1.987s (CPU 1.986s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.004931s medi: 0.004744s min: 0.004704s max: 0.010127s std: 0.000496s
CPU mean: 0.004929s medi: 0.004743s min: 0.004705s max: 0.010110s std: 0.000495s
MCTracksTabulator - process: 5.193s (CPU 5.191s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.012886s medi: 0.009579s min: 0.009454s max: 0.429661s std: 0.022169s
CPU mean: 0.012882s medi: 0.009576s min: 0.009452s max: 0.429575s std: 0.022164s
RecoTracksTabulator - process: 131.721s (CPU 131.673s) - finish: 0.000s (CPU 0.000s)
wall mean: 0.326852s medi: 0.297362s min: 0.132525s max: 12.078222s std: 0.586519s
CPU mean: 0.326731s medi: 0.297258s min: 0.132490s max: 12.073480s std: 0.586288s
HDF5Sink - process: 5.463s (CPU 5.235s) - finish: 0.077s (CPU 0.070s)
wall mean: 0.013556s medi: 0.013056s min: 0.009579s max: 0.030919s std: 0.002225s
CPU mean: 0.012991s medi: 0.012871s min: 0.009580s max: 0.030428s std: 0.001594s
RecoTracksTabulator takes by far the longest (131.7 s of the 148.4 s total, roughly 89%), so it is the obvious place to start.
The double Python loop is especially expensive, so I replaced it with vectorized operations. This gives a speed-up of a factor of ~6.5: better, but still not as fast as tohdf5.
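To illustrate the kind of change meant here, below is a minimal sketch of replacing a nested Python loop with NumPy operations. It does not reproduce the actual RecoTracksTabulator code: the function names `flatten_loop` and `flatten_vectorized` and the input layout (one list of track values per event) are assumptions made purely for illustration.

```python
import numpy as np


def flatten_loop(tracks_per_event):
    """Double Python loop: one iteration per event and per track (slow)."""
    group_id, values = [], []
    for i, tracks in enumerate(tracks_per_event):
        for v in tracks:
            group_id.append(i)
            values.append(v)
    return np.array(group_id, dtype=np.int64), np.array(values, dtype=np.float64)


def flatten_vectorized(tracks_per_event):
    """Same result, with the inner per-track loop pushed into NumPy."""
    n_events = len(tracks_per_event)
    if n_events == 0:
        return np.empty(0, dtype=np.int64), np.empty(0, dtype=np.float64)
    # Number of tracks in each event
    counts = np.fromiter(
        (len(t) for t in tracks_per_event), dtype=np.int64, count=n_events
    )
    # Repeat each event index once per track in that event
    group_id = np.repeat(np.arange(n_events), counts)
    # Concatenate all per-event values into one flat array
    values = np.concatenate(
        [np.asarray(t, dtype=np.float64) for t in tracks_per_event]
    )
    return group_id, values


# Hypothetical input: three events with 2, 0 and 1 tracks
data = [[1.0, 2.0], [], [3.0]]
loop_ids, loop_vals = flatten_loop(data)
vec_ids, vec_vals = flatten_vectorized(data)
```

Both functions return the same flat arrays; the vectorized version still iterates once over the events but no longer touches individual tracks in Python, which is where a nested loop typically spends most of its time.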