Title: Metadata generation and datamodels
Author: Jutta
Topics:
  - data models
  - configurations for software
status: review

Datamodels

Metadata definition lies at the core of FAIR data, as it governs both the understanding of the data and its interoperability through access protocols. While some software can be used almost as-is, especially where the well-developed interfaces of the Virtual Observatory are concerned, the different data types and science fields that KM3NeT can link into require a flexible approach and a diverse application of software. To meet these various requirements, metadata and class definitions are developed within KM3NeT, drawing on well-established standards, e.g. those of the World Wide Web Consortium (W3C), scientific repositories or the IVOA.

Data published via the KM3NeT Open Data Center (ODC) is annotated as KM3OpenResource, which includes basic metadata for resource content, accessibility and identification. Resources can be provided either as part of a collection, e.g. a data set or multiple resources related to an analysis, or as part of a stream of similar objects, e.g. alert data. They are therefore grouped on the server as KM3ResourceCollection or KM3ResourceStream to facilitate findability. Further details on these first data classes are documented in a Git project that is still under development. In the future, further classes will be introduced and adapted, governing e.g. the scientific workflow, as discussed in the corresponding section.

Resource description

The KM3OpenResource class serves as the base class to describe any KM3NeT open resource, be it a plot, dataset or publication. The information gathered here should be easy to transform in order to publish the resource to repositories like the Virtual Observatory or Zenodo, whose metadata is based on DataCite. As resource description metadata is widely based on standardized formats like the Dublin Core standard, the KM3OpenResource class picks the relevant entries from existing resource metadata schemas: the VO Observation Data Model Core Components for metadata specific to the scientific target, and the VOResource description and Zenodo resource description for general resource metadata.

Identifiers and content description

Identifiers serve to uniquely address digital objects. While Digital Object Identifiers (DOIs) have long been in use in the scientific community, these public identifiers have to link to a KM3NeT-internal identification scheme that makes it possible to back-track the data generation and to link between various data products related to a scientific target or publication. In addition, an ordering schema for class definitions and content descriptors helps in the interpretation of a specific digital object. To this end, the ktype and kid have been introduced.

kid

The kid is a unique identifier which follows the uuid schema. Where possible, the uuid is assigned at the generation of the digital object and stored in the metadata set or header of the digital object. The goal is to use kid assignment at all steps of data processing; it has already been implemented for all open science products.
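
As an illustration, a kid could be generated and checked with Python's standard uuid module. This is a minimal sketch: the function names new_kid and is_valid_kid and the metadata dictionary are hypothetical, and the use of random (version 4) UUIDs is an assumption, since the text above only states that the kid follows the uuid schema.

```python
import uuid


def new_kid() -> str:
    """Return a fresh kid as a random UUID string (version 4 is an assumption)."""
    return str(uuid.uuid4())


def is_valid_kid(value: str) -> bool:
    """Check that a string is a UUID in canonical hyphenated form."""
    try:
        # uuid.UUID accepts several spellings; comparing against the
        # canonical string form rejects anything but 8-4-4-4-12 hex groups.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


# Hypothetical metadata header of a digital object, created at generation time.
metadata = {"kid": new_kid()}
```

Storing the kid in the metadata at the moment the object is created means that later processing steps can refer to it without having to assign a new identifier.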

ktype

The ktype serves as a content descriptor and is defined as a string with a controlled vocabulary of words separated by ".", starting with "km3.". The selected vocabulary comprises domain names, class and sub-class names and, in some cases, identifiers for class instances, like

km3.{domain}.{subdomains}.{class}.{subclasses}.{instance}

e.g. "km3.data.d3.optic.events.simulation" for a data set of processed optic event data (data level d3) from Monte Carlo simulation, indicating a file class, or "km3.params.physics.event.reco.reconame.E" indicating the parameter definition of the reconstructed energy of particle events from a reconstruction algorithm named "reconame".
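
The dotted structure of a ktype lends itself to simple hierarchical handling. The following sketch splits a ktype into its components and tests whether one ktype falls under a more general one. The helper names parse_ktype and ktype_matches are hypothetical, and the code only checks the "km3." prefix and the separator; it does not validate the words against the controlled vocabulary, which is not reproduced here.

```python
from typing import List


def parse_ktype(ktype: str) -> List[str]:
    """Split a ktype into its components after the leading 'km3'."""
    parts = ktype.split(".")
    if len(parts) < 2 or parts[0] != "km3":
        raise ValueError(f"ktype must start with 'km3.': {ktype!r}")
    if any(not part for part in parts):
        raise ValueError(f"ktype contains an empty component: {ktype!r}")
    return parts[1:]


def ktype_matches(ktype: str, prefix: str) -> bool:
    """True if ktype equals prefix or lies below it in the hierarchy."""
    return ktype == prefix or ktype.startswith(prefix + ".")


components = parse_ktype("km3.data.d3.optic.events.simulation")
domain = components[0]  # first component after 'km3' is the domain
```

Matching on whole components rather than on raw string prefixes avoids false positives, for instance treating a hypothetical "km3.database" as part of "km3.data".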

Particle event identifiers

For various elements of data taking, identifiers are used to uniquely label e.g. different settings of software and hardware or annotate data streams. At the data aggregation level, an identifier therefore has to be introduced to uniquely identify a particle detection in one of the KM3NeT detectors.

Due to the design of the data acquisition process, these events can be uniquely identified by

  • the detector in which they were measured, assigned a detector id,
  • the run, i.e. the data taking period during which they were detected, assigned a run id,
  • the frame index, indicating the numbering of the data processing package in the DAQ system on which the triggering algorithms are performed and
  • the trigger counter, i.e. the number of successes of the application of the set trigger algorithms.

The internal KM3NeT event identifier is therefore defined as

km3.{detector_id}.{run_id}.{frame_index}.{trigger_counter}
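
A minimal sketch of how such an event identifier could be built and parsed is given below. The class EventId and the function parse_event_id are hypothetical names, and treating all four components as integers is an assumption that the definition above does not state explicitly.

```python
from typing import NamedTuple


class EventId(NamedTuple):
    """The four components that identify an event (integer types assumed)."""
    detector_id: int
    run_id: int
    frame_index: int
    trigger_counter: int

    def __str__(self) -> str:
        return (f"km3.{self.detector_id}.{self.run_id}"
                f".{self.frame_index}.{self.trigger_counter}")


def parse_event_id(text: str) -> EventId:
    """Parse 'km3.{detector_id}.{run_id}.{frame_index}.{trigger_counter}'."""
    parts = text.split(".")
    if len(parts) != 5 or parts[0] != "km3":
        raise ValueError(f"not a KM3NeT event identifier: {text!r}")
    # int() raises ValueError for non-numeric components.
    return EventId(*(int(part) for part in parts[1:]))


# Example values are made up for illustration only.
event = EventId(detector_id=49, run_id=7000, frame_index=123, trigger_counter=4)
identifier = str(event)
```

Because the identifier is a plain tuple of its components, two events compare equal exactly when detector, run, frame index and trigger counter all agree.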