Skip to content
Snippets Groups Projects
Commit e7e92063 authored by ViaFerrata's avatar ViaFerrata
Browse files

Large Update to OrcaSong, mostly usability improvements.

- Implemented a better parser and implemented config files as an input to OrcaSong.
- Got rid of the ORCA_Geo_115lines.txt file, the detector geometry is now read from a .detx file that needs to be supplied with the input file.
- Improved general readability of the code
- fixed run_id readout
- added option to make no timecut
parent 234a4a98
No related branches found
No related tags found
No related merge requests found
This diff is collapsed.
......@@ -18,10 +18,13 @@ OrcaSong: Main Framework
.. autosummary::
:toctree: api
parse_input
parser_check_input
calculate_bin_edges
calculate_bin_edges_test
skip_event
data_to_images
main
parse_input
``orcasong.file_to_hits``: Extracting event information
......
......@@ -28,7 +28,7 @@ In order to use :code:`tohdf5`, you need to have loaded a jpp version first. A r
/sps/km3net/users/mmoser/setenvAA_jpp9_cent_os7.sh
Additionally, you need to have a python environment on Lyon, where you have installed km3pipe (e.g. use a pyenv).
Then, the usage of :code:`tohdf5` is quite easy::
~$: tohdf5 -o testfile.h5 /sps/km3net/users/kmcprod/JTE_NEMOWATER/withMX/muon-CC/3-100GeV/JTE.KM3Sim.gseagen.muon-CC.3-100GeV-9.1E7-1bin-3.0gspec.ORCA115_9m_2016.99.root
......@@ -138,41 +138,106 @@ After pulling the OrcaSong repo to your local harddisk you first need to install
~/orcasong$: pip install .
Before you can start to use OrcaSong, you need to provide a file that contains the XYZ positions of the DOMs to OrcaSong.
Before you can start to use OrcaSong, you need a .detx detector geometry file that corresponds to your input files.
OrcaSong is currently producing event "images" based on a 1 DOM / XYZ-bin assumption. This image generation is done
automatically, based on the number of bins (n_bins) for each dimension XYZ that you supply as an input and based on the
DOM XYZ position file. An examplary file for the DOM positions can be found in the folder /orcasong of the OrcaSong
repo, "ORCA_Geo_115lines.txt". Currently, this file is hardcoded as an input for OrcaSong, so if you want to use another
detector geometry, you should include your .txt file in the main() function in "data_to_images.py".
You can generate this .txt file by taking the .det (not .detx!) file, e.g.::
.detx file which contains the DOM positions.
If your .detx file is not contained in the OrcaSong/detx_files folder, please add it to the repository!
Currently, only the 115l ORCA 2016 detx file is available.
/afs/in2p3.fr/throng/km3net/detectors/orca_115strings_av23min20mhorizontal_18OMs_alt9mvertical_v1.det
At this point, you're finally ready to use OrcaSong, it can be executed as follows::
Then, you need to look for the :code:`OM_cluster_data` lines::
~/orcasong$: python data_to_images.py testfile.h5 geofile.detx
OM_cluster_data: 1 -86.171 116.880 196.500 -1.57080 0.00000 1.57080
OM_cluster_data: 2 -86.171 116.880 187.100 -1.57080 0.00000 1.57080
...
OrcaSong will then generate a hdf5 file with images into the Results folder, e.g. Results/4dTo4d/h5/xyzt.
Here, the first column is the dom_id and the second, third and fourth column is the XYZ position.
You need to copy this information into a .txt file, such that it can be read by OrcaSong. One could automate this such
that OrcaSong looks for the correct lines in the .det file automatically, however, multiple (old) conventions exist for
the .det file structure, so it may be a bit tedious. Nevertheless, contributions are very welcome! :)
The configuration options of OrcaSong can be found by calling the help::
~/orcasong$: python data_to_images.py -h
Main OrcaSong code which takes raw simulated .h5 files and the corresponding .detx detector file as input in
order to generate 2D/3D/4D histograms ('images') that can be used for CNNs.
The input file can be calibrated or not (e.g. contains pos_xyz of the hits).
Makes only 4D histograms ('images') by default.
Usage:
data_to_images.py [options] FILENAME DETXFILE
data_to_images.py (-h | --help)
Options:
-h --help Show this screen.
-c CONFIGFILE Load all options from a config file (.toml format).
--n_bins N_BINS Number of bins that are used in the image making for each dimension, e.g. (x,y,z,t).
[default: 11,13,18,60]
--det_geo DET_GEO Which detector geometry to use for the binning, e.g. 'Orca_115l_23m_h_9m_v'.
[default: Orca_115l_23m_h_9m_v]
--do2d If 2D histograms, 'images', should be created.
--do2d_plots If 2D pdf plot visualizations of the 2D histograms should be created, cannot be called if do2d=False.
--do2d_plots_n N For how many events the 2D plot visualizations should be made.
OrcaSong will exit after reaching N events. [default: 10]
--do3d If 3D histograms, 'images', should be created.
--dont_do4d If 4D histograms, 'images', should NOT be created.
--do4d_mode MODE What dimension should be used in the 4D histograms as the 4th dim.
Available: 'time', 'channel_id'. [default: time]
--timecut_mode MODE Defines what timecut mode should be used in hits_to_histograms.py.
At the moment, these cuts are only optimized for ORCA 115l neutrino events!
Currently available:
'timeslice_relative': Cuts out the central 30% of the snapshot.
'trigger_cluster': Cuts based on the mean of the triggered hits.
The timespan for this cut can be chosen in --timecut_timespan.
'None': No timecut.
[default: trigger_cluster]
--timecut_timespan TIMESPAN Only used with timecut_mode 'trigger_cluster'.
Defines the timespan of the trigger_cluster cut.
Currently available:
'all': [-350ns, 850ns] -> 20ns / bin (60 bins)
'tight-1': [-250ns, 500ns] -> 12.5ns / bin
'tight-2': [-150ns, 200ns] -> 5.8ns / bin
[default: tight-1]
--do_mc_hits If only the mc_hits (no BG hits!) should be used for the image processing.
--data_cut_triggered If non-triggered hits should be thrown away for the images.
--data_cut_e_low E_LOW Cut events that are lower than the specified threshold value in GeV.
--data_cut_e_high E_HIGH Cut events that are higher than the specified threshold value in GeV.
--data_cut_throw_away FRACTION Throw away a random fraction (percentage) of events. [default: 0.00]
--prod_ident PROD_IDENT Optional int identifier for the used mc production.
This is useful, if you use events from two different mc productions,
e.g. the 1-5GeV & 3-100GeV Orca 2016 MC. The prod_ident int will be saved in
the 'y' dataset of the output file of OrcaSong. [default: 1]
At this point, you're finally ready to use OrcaSong, it can be executed as follows::
python data_to_images.py testfile.h5
OrcaSong will then generate a hdf5 file with images into the Results folder, e.g. Results/4dTo4d/h5/xyzt.
The configuration options of OrcaSong can be found in the :code:`main()` function.
Alternatively, they can also be found in the docs of the :code:`main()` function:
.. currentmodule:: orcasong.data_to_images
.. autosummary::
main
In the future, these configurations should probably be relocated to a dedicated .config file. Again, contributions and
thoughts are very welcome!
Other than parsing every information to orcasong via the console, you can also load a .toml config file::
~/orcasong$: python data_to_images.py -c config.toml testfile.h5 geofile.detx
Please checkout the config.toml file in the main folder of the OrcaSong repo in order to get an idea about
the structure of the config file.
If anything is still unclear after this introduction just tell me in the deep_learning channel on chat.km3net.de or
write me an email at michael.m.moser@fau.de, such that I can improve this guide!
......
This diff is collapsed.
# filename: config.toml
# A config file for OrcaSong with multiple configurations.
# Outcomment the config that you want to use!
# More info about the .toml format at https://github.com/toml-lang/toml
### All available options with some dummy values
# --n_bins = '11,13,18,60'
# --det_geo = 'Orca_115l_23m_h_9m_v'
# --do2d = false
# --do2d_plots = false
# --do2d_plots_n = 10
# --do3d = false
# --do4d = true
# --do4d_mode = 'time'
# --timecut_mode = 'trigger_cluster'
# --timecut_timespan = 'tight_1'
# --do_mc_hits = false
# --data_cut_triggered = false
# --data_cut_e_low = 3
# --data_cut_e_high = 100
# --data_cut_throw_away = 0.00
# --prod_ident = 1
####----- Collection of configurations for ORCA 115l -----####
###--- 3-100GeV ---###
## XYZ-T
--n_bins = '11,13,18,60'
--det_geo = 'Orca_115l_23m_h_9m_v'
--do4d_mode = 'time'
--timecut_mode = 'trigger_cluster'
--timecut_timespan = 'tight_1'
--prod_ident = 1
## XYZ-C
#--n_bins = '11,13,18,31'
#--det_geo = 'Orca_115l_23m_h_9m_v'
#--do4d_mode = 'channel_id'
#--timecut_mode = 'trigger_cluster'
#--timecut_timespan = 'tight_1'
#--prod_ident = 1
###--- 3-100GeV ---###
###--- 1-5GeV ---###
### only take < 3GeV events, info throw away: elec-CC 0.25, muon-CC 0.25, elec-NC: 0.00
## XYZ-T
#--n_bins = '11,13,18,60'
#--det_geo = 'Orca_115l_23m_h_9m_v'
#--do4d_mode = 'time'
#--timecut_mode = 'trigger_cluster'
#--timecut_timespan = 'tight_1'
#--data_cut_e_high = 3
#--data_cut_throw_away = 0.25 # elec-CC 0.25, muon-CC 0.25, elec-NC: 0.00
#--prod_ident = 2
## XYZ-C
#--n_bins = '11,13,18,31'
#--det_geo = 'Orca_115l_23m_h_9m_v'
#--do4d_mode = 'channel_id'
#--timecut_mode = 'trigger_cluster'
#--timecut_timespan = 'tight_1'
#--data_cut_e_high = 3
#--data_cut_throw_away = 0.25 # elec-CC 0.25, muon-CC 0.25, elec-NC: 0.00
#--prod_ident = 2
###--- 1-5GeV ---###
####----- Collection of configurations for ORCA 115l -----####
This diff is collapsed.
......@@ -45,7 +45,6 @@ def get_time_residual_nu_interaction_mean_triggered_hits(time_interaction, hits_
Array with trigger flags that specifies if the hit is triggered or not.
"""
hits_time_triggered = hits_time[triggered == 1]
t_mean_triggered = np.mean(hits_time_triggered, dtype=np.float64)
time_residual_vertex = t_mean_triggered - time_interaction
......@@ -53,7 +52,7 @@ def get_time_residual_nu_interaction_mean_triggered_hits(time_interaction, hits_
return time_residual_vertex
def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts, do4d, prod_ident):
def get_event_data(event_blob, geo, do_mc_hits, data_cuts, do4d, prod_ident):
"""
Reads a km3pipe blob which contains the information for one event and returns a hit array and a track array
that contains all relevant information of the event.
......@@ -68,9 +67,6 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
do_mc_hits : bool
Tells the function of the hits (mc_hits + BG) or the mc_hits only should be parsed.
In the case of mc_hits, the dom_id needs to be calculated thanks to the jpp output.
use_calibrated_file : bool
Specifies if a calibrated file is used as an input for the event_blob.
If False, the hits of the event_blob are calibrated based on the geo parameter.
data_cuts : dict
Specifies if cuts should be applied.
Contains the keys 'triggered' and 'energy_lower/upper_limit' and 'throw_away_prob'.
......@@ -79,6 +75,9 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
In the case of 'channel_id', this information needs to be included in the event_hits as well.
prod_ident : int
Optional int that identifies the used production, more documentation in the docs of the main function.
geo : None/kp.Geometry
If none, the event_blob should already contain the calibrated hit information (e.g. pos_xyz).
Else, a km3pipe Geometry instance that contains the geometry information of the detector.
Returns
-------
......@@ -91,11 +90,16 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
vertex_pos_x, vertex_pos_y, vertex_pos_z, time_residual_vertex, (prod_ident)].
"""
p = get_primary_track_index(event_blob)
p = get_primary_track_index(event_blob) # TODO fix for mupage/random_noise
# parse track, contains event information like mc_energy
if 'Header' in event_blob: # if Header exists in file, take run_id from it. Else take it from RawHeader.
run_id = event_blob['Header'].start_run.run_id.astype('float32')
else:
run_id = event_blob['RawHeader'][0][0].astype('float32')
# parse tracks [event_id, particle_type, energy, isCC, bjorkeny, dir_x/y/z, time]
event_id = event_blob['EventInfo'].event_id[0]
run_id = event_blob['Header'].start_run.run_id.astype('float32') # todo check if already f32
particle_type = event_blob['McTracks'][p].type
energy = event_blob['McTracks'][p].energy
is_cc = event_blob['McTracks'][p].is_cc
......@@ -106,12 +110,9 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
time_interaction = event_blob['McTracks'][p].time
# parse hits [x,y,z,time]
if do_mc_hits is True:
hits = event_blob["McHits"]
else:
hits = event_blob["Hits"]
hits = event_blob['Hits'] if do_mc_hits is False else event_blob['McHits']
if use_calibrated_file is False:
if 'pos_x' not in event_blob['Hits'].dtype.names: # check if blob already calibrated
hits = geo.apply(hits)
if data_cuts['triggered'] is True:
......@@ -127,7 +128,7 @@ def get_event_data(event_blob, geo, do_mc_hits, use_calibrated_file, data_cuts,
# save collected information to arrays event_track and event_hits
track = [event_id, particle_type, energy, is_cc, bjorkeny, dir_x, dir_y, dir_z, time_track, run_id,
vertex_pos_x, vertex_pos_y, vertex_pos_z, time_residual_vertex]
if prod_ident is not None: track.append(np.full(event_id.shape, prod_ident))
if prod_ident is not None: track.append(prod_ident)
event_track = np.array(track, dtype=np.float64)
......
......@@ -65,12 +65,16 @@ def get_time_parameters(event_hits, mode=('trigger_cluster', 'all'), t_start_mar
t_start = t_mean - t_start_margin * t_diff
t_end = t_mean + t_end_margin * t_diff
elif mode[0] is None:
t_start = np.amin(t)
t_end = np.amax(t)
else: raise ValueError('Time cut modes other than "first_triggered" or "timeslice_relative" are currently not supported.')
return t_start, t_end
def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edges, n_bins, all_4d_to_2d_hists, timecut, event_track, do2d_pdf, pdf_2d_plots):
def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edges, n_bins, all_4d_to_2d_hists, timecut, event_track, do2d_plots, pdf_2d_plots):
"""
Computes 2D numpy histogram 'images' from the 4D data and appends the 2D histograms to the all_4d_to_2d_hists list,
[xy, xz, yz, xt, yt, zt].
......@@ -89,7 +93,7 @@ def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edge
Tuple that defines what timecut should be used in hits_to_histograms.
event_track : ndarray(ndim=2)
Contains the relevant mc_track info for the event in order to get a nice title for the pdf histos.
do2d_pdf : bool
do2d_plots : bool
If True, generate 2D matplotlib pdf histograms.
pdf_2d_plots : mpl.backends.backend_pdf.PdfPages/None
Either a mpl PdfPages instance or None.
......@@ -117,7 +121,7 @@ def compute_4d_to_2d_histograms(event_hits, x_bin_edges, y_bin_edges, z_bin_edge
np.array(hist_yt[0], dtype=np.uint8),
np.array(hist_zt[0], dtype=np.uint8)))
if do2d_pdf:
if do2d_plots:
# Format in classical numpy convention: x along first dim (vertical), y along second dim (horizontal)
# Need to take that into account in convert_2d_numpy_hists_to_pdf_image()
# transpose to get typical cartesian convention: y along first dim (vertical), x along second dim (horizontal)
......
......@@ -2,3 +2,5 @@ numpy
h5py
matplotlib
km3pipe
docopt
toml
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment