#### Archiving
Both PyPI and Zenodo copy the relevant software to their platforms. Since copies of the source code are thus stored at multiple sites, archiving is straightforward.
## Data Quality assessment
The processes involved in the KM3NeT data processing chain can be grouped into a few main categories. Although the ordering of these categories is not strictly hierarchical from the point of view of data generation and processing, for the purpose of a general discussion one can safely assume that a hierarchical relation exists between them. From bottom to top, these categories are data acquisition, detector calibration, event reconstruction, simulations and, finally, scientific analyses based on the data processed in the previous categories. The quality of the scientific results produced by KM3NeT is affected by the performance of the processes involved at the lower levels of the data processing chain. Implementing a complete and consistent set of data quality control procedures that spans the whole data processing chain requires a complete, unambiguous and documented strategy for data processing in each of the aforementioned process categories. This includes the setting of data quality criteria, which should be initiated at the highest level of the data processing chain and propagated towards the lowest levels. For each of these categories there exists a working group within the KM3NeT collaboration, and it falls to each working group to develop its objectives and procedures according to the scientific needs of KM3NeT. Such a documented strategy does not currently exist for any working group, so it has not yet been possible to develop a full strategy for data quality control. Nevertheless, numerous software developments have been devoted to quality control along the different stages of the data processing chain. In the following, a description of some of the existing quality control tools and procedures is given; it can be regarded as an incomplete prototype for a data quality plan. The implementation of these procedures into an automated workflow requires the design and implementation of a standardised data processing workflow that meets software quality standards, which does not exist either. Some of the figures and results shown here have been produced ad hoc, and not as the output of any working system.
### Data quality control procedures
#### Online Monitor
During the data acquisition process, the online monitoring software presents real-time plots that allow the shifters to promptly identify problems with the data acquisition. It includes an alert system that sends notifications to the shifters if problems requiring human intervention appear during data taking. The online monitor uses the same data that are stored for offline analyses (this is actually not true, and should be changed). This implies that any anomaly observed during the detector operation can be reproduced offline. {HERE, A FIGURE COULD BE GIVEN AS AN EXAMPLE. FOR INSTANCE, THE MESSAGES ON THE CHAT WHEN THE TRIGGER RATE IS 0.}
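As a minimal sketch of the kind of check behind such alerts, the following Python fragment flags a vanishing trigger rate. The function names and the threshold are hypothetical; the actual alert system is part of the KM3NeT online monitoring software.

```python
TRIGGER_RATE_THRESHOLD_HZ = 0.0  # hypothetical: alert when the rate drops to zero


def notify_shifters(message: str) -> None:
    """Stand-in for the chat channel used to notify the shifters."""
    print(f"[ALERT] {message}")


def check_trigger_rate(rate_hz: float) -> None:
    """Flag a trigger rate that requires human intervention."""
    if rate_hz <= TRIGGER_RATE_THRESHOLD_HZ:
        notify_shifters(f"Trigger rate is {rate_hz} Hz: data taking appears to have stopped.")


# In the real system this check runs continuously on the live monitoring
# stream; here it is exercised once with a made-up value.
check_trigger_rate(0.0)
```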
#### Detector Operation
As explained in XX.YY, the optical data obtained from the detector operation are stored in ROOT files and moved to a high-performance storage environment. The offline data quality control procedures start with a first analysis of these files, which is performed daily. It mainly focuses on, but is not restricted to, the summary data stored in the ROOT files. The summary data contain information related to the performance of the data acquisition procedures for each optical module in the detector. As a result of this first analysis, a set of key-value pairs is produced, where each key corresponds to a parameter representing a given dimension of data quality and the value is the evaluation of this parameter over the livetime of the analysed data. The results are tagged with a unique identifier corresponding to the analysed data set and uploaded to the database. In the present implementation the analysis is performed for each available file, where each file corresponds to a data-taking run; this may change in the future, as the data volume generated per run will increase with the detector size.
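The following Python sketch illustrates the shape of this step under simplified assumptions: a handful of illustrative quality parameters (not the collaboration's actual set) are computed from per-PMT rates, tagged with a run identifier and handed to a stand-in for the database upload.

```python
import json


def evaluate_quality_parameters(pmt_rates_khz: list[float], livetime_s: float) -> dict:
    """Reduce one run's summary data to key-value quality parameters.

    The parameter names and definitions are illustrative, not the
    collaboration's actual set.
    """
    mean_rate = sum(pmt_rates_khz) / len(pmt_rates_khz)
    return {
        "livetime_s": livetime_s,
        "mean_pmt_rate_khz": mean_rate,
        "max_pmt_rate_khz": max(pmt_rates_khz),
        "active_pmt_fraction": sum(r > 0 for r in pmt_rates_khz) / len(pmt_rates_khz),
    }


def upload_to_database(run_id: int, parameters: dict) -> None:
    """Tag the result with the run identifier and hand it to the database.

    The payload format is an assumption; the real upload goes through the
    KM3NeT database interface rather than a plain JSON dump.
    """
    payload = json.dumps({"run": run_id, "parameters": parameters})
    print(payload)  # stand-in for the actual database call


params = evaluate_quality_parameters([7.1, 6.8, 0.0, 7.4], livetime_s=21600.0)
upload_to_database(run_id=1234, parameters=params)
```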
A further analysis of the results stored in the database includes the comparison of the values of the different parameters to reference values, allowing data periods to be classified according to their quality. The reference values are typically set according to the accuracy with which the current detector simulations reproduce the different quality parameters. In addition, the evolution of the different quality parameters can be monitored and made available to the full collaboration as reports. Currently this is done every week by the shifters, and the reports are posted on an electronic log book (ELOG). Figure {FIG} shows an example of the time evolution of two quality parameters during the period corresponding to the data sample provided together with this report. The selected runs correspond to a period of stable rates during which the different quality parameters were within the allowed tolerance.
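A minimal sketch of such a classification, assuming illustrative reference values and tolerances; the actual parameter set and its tolerances are defined by the detector simulations as described above.

```python
# Illustrative (reference, allowed tolerance) pairs; not the real values.
REFERENCES = {
    "mean_pmt_rate_khz": (7.0, 1.0),
    "active_pmt_fraction": (1.0, 0.05),
}


def classify_run(parameters: dict) -> str:
    """Label a run 'good' only if every monitored parameter is within tolerance."""
    for name, (reference, tolerance) in REFERENCES.items():
        if abs(parameters[name] - reference) > tolerance:
            return "bad"
    return "good"


print(classify_run({"mean_pmt_rate_khz": 7.3, "active_pmt_fraction": 0.99}))  # good
```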
#### Calibration
The first step in the data processing chain is to determine the detector calibration parameters using the data obtained from the detector operation. These parameters include the time offsets of the PMTs, their gains and efficiencies, and the positions and orientations of the optical modules. The PMT time offsets and the positions of the optical modules are used in later stages of the data processing chain for event reconstruction, as well as by the real-time data filter during the detector operation. While the event reconstruction requires an accurate knowledge of these parameters, the algorithms used by the real-time data filter depend only loosely on them, and its performance is not affected by variations occurring on a timescale of the order of months. Nevertheless, it is still necessary to monitor these parameters and, if necessary, correct the values used by the data filter. The performance of the detector operation also depends on the response of the PMTs, which is partly determined by their gains. These evolve over time and can be reset to their nominal values by tuning the high voltage applied to each PMT. Monitoring the PMT gains is therefore also necessary to maximise the detector performance. Additionally, the PMT gains and efficiencies are used offline by the detector simulation. Within the context of data quality assessment, software tools have been developed by KM3NeT that allow the parameters described above to be monitored and compared to reference values, raising alerts when necessary. The reference values should be determined by the impact of miscalibrations on the scientific goals of KM3NeT; this work has not yet been addressed. The arrangement of these tools into a workflow requires the elaboration of an underlying calibration strategy. This has not been done, and the work is therefore on hold.
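As an illustration, the following sketch compares fitted PMT gains to a nominal value and reports the PMTs that would call for a high-voltage retuning. The nominal gain, the tolerance and the identifier scheme are assumptions made for this example.

```python
NOMINAL_GAIN = 1.0     # hypothetical nominal PMT gain (arbitrary units)
GAIN_TOLERANCE = 0.15  # hypothetical allowed relative deviation


def check_pmt_gains(gains: dict[str, float]) -> list[str]:
    """Return the identifiers of PMTs whose gain deviates from nominal.

    In practice the gains would be fitted from single-photoelectron
    spectra; here they are taken as given.
    """
    return [
        pmt_id
        for pmt_id, gain in gains.items()
        if abs(gain - NOMINAL_GAIN) / NOMINAL_GAIN > GAIN_TOLERANCE
    ]


drifting = check_pmt_gains({"DOM1-PMT00": 0.97, "DOM1-PMT01": 1.31})
if drifting:
    print(f"Gain out of tolerance, HV retuning suggested for: {drifting}")
```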
#### Simulations and Event reconstruction
Once the calibration constants have been determined, the data processing chain continues with the event reconstruction, and with the simulation and reconstruction of an equivalent set of events in which the information in the summary data is used to simulate the data taking conditions. The simulation of particle interactions and propagation is done by dedicated software, while the detector simulation and the event reconstruction are done by Jpp. The simulation chain yields a ROOT file with the same format as the one produced by the data acquisition system, containing events obtained after the simulation of the detector trigger. The resulting file and the corresponding data taking file are processed identically by the reconstruction software, which produces ROOT-formatted files with the results of reconstructing the real and the simulated events, respectively. The agreement between data and simulations is an important measure of the quality of the data, and it can be evaluated at trigger level and at reconstruction level. In both cases, the comparison follows the same strategy: the ROOT files are used to produce histograms of different observables, and these histograms are saved into new ROOT files. A set of libraries and applications devoted to histogram comparisons has been developed in Jpp. These implement multiple statistical tests that can be used to determine whether two histograms are compatible, as well as the degree of incompatibility between them. Additionally, tools have been developed that summarise the results into a single number per file, representing the average result of comparing all the observables. For the example provided here, the discrepancy between data and Monte Carlo is measured through the reduced χ² of each observable, and the summary is given as the average reduced χ² of all the compared observables for each file. Figures {X} and {Y} show the value of this parameter as a function of the run number.
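The per-file summary can be sketched as follows, assuming pairs of already-filled data and simulation histograms with identical binning. The reduced χ² definition used here (Poisson variances on both histograms) is one common convention and not necessarily the one implemented in Jpp.

```python
import numpy as np


def reduced_chi2(data_counts: np.ndarray, mc_counts: np.ndarray) -> float:
    """Reduced chi-square between two histograms with identical binning.

    Poisson variances are assumed for both histograms; bins that are
    empty in both are skipped, and their number is used as the degrees
    of freedom. This is one common convention, not necessarily Jpp's.
    """
    variance = data_counts + mc_counts
    mask = variance > 0
    chi2 = np.sum((data_counts[mask] - mc_counts[mask]) ** 2 / variance[mask])
    return float(chi2 / mask.sum())


def file_summary(histogram_pairs: list[tuple[np.ndarray, np.ndarray]]) -> float:
    """Average reduced chi-square over all compared observables in one file."""
    return float(np.mean([reduced_chi2(d, m) for d, m in histogram_pairs]))


# Example with two toy observables (e.g. number of hits, reconstructed zenith).
rng = np.random.default_rng(0)
pairs = [(rng.poisson(100, 20).astype(float), rng.poisson(100, 20).astype(float)),
         (rng.poisson(50, 30).astype(float), rng.poisson(50, 30).astype(float))]
print(f"average reduced chi2 for this file: {file_summary(pairs):.2f}")
```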
...
...
## Copyright and Licensing
More information can be found [here](https://open-data.pages.km3net.de/licensing/).