I don't understand how this should/could work. How are we supposed to use a JSON schema to describe HDF5, ROOT, ASCII, binary formats and protocols in DAQ, CLB etc.? Is there any concrete example of how it should look?
If we want to build a kind of catalogue of the data formats we have in KM3NeT, I would propose to do this in a uniform way, from which we can create our pdf or whichever output we want. For each data format, that would include:
1. The metadata for the data format, like description, filetype, optional naming convention, version, software to use it with, link to an example file ... and a link to the schema
2. The content of the data format (schema) with field names, types, formatting or value constraints
I think a structured approach would be good for both parts. For 1., the metadata, e.g. a yaml or json file could be used (or anything else that allows reading the fields), and for 2., jsonschema works at least for the more high-level filetypes. For filetypes where it is just not nice to use, e.g. ASCII, one can still use jsonschema and live with its limitations, or stick with a lower-level description, for which we should however still make sure that it contains the same information.
For example, for the detx file it could look like this:
Metadata file
Here already a partially extended version, containing fields that are perhaps not necessary at the outset but could still be useful:
```yaml
file_format:
  name: Detector Data Format
  description: "The KM3NeT detector description is a text file in simple ASCII text format which contains all information about the position of the active elements (anchors, DOMs and PMTs) of the detector and the timing information."
  version: "v5"
  extension: ".detx"
  mime_type: "text/plain"
  creation_date: "2014-08-01"
  last_updated: "2020-10-01"

general_metadata:
  owner: "KM3NeT"
  license: "none"
  domain:
    - "Datataking"
  size_constraints: "< 1MB"
  example_link: ""
  standards:
    - ...

software_compatibility:
  supported_software:
    - "JPP"
    - "km3io"
    - ...
  import_export_capability:
    - importable_into:
        - "JPP"
        - ...
    - exportable_from:
        - ...

technical_metadata:
  encoding: "UTF-16"
  schema_structure: "Flat table with comma-separated fields"
  schema_link:
  compression_support:
    methods:
      - "GZIP"
      - "ZIP"
  encryption_support: "None natively; external tools like PGP can be used"
  checksum_support:
    enabled: true
    methods:
      - "MD5"
      - "SHA-1"
  validation_tools:
    - "CSVLint"
    - "pandas (Python)"
  compliance_requirements:
    - "Make sure that ..."

provenance_metadata:
  source: "Data originates from ..."
  modification_history: "Not tracked natively"
  ownership: "KM3NeT from data taking"
  data_lineage: "Requires external tracking for transformations"
```
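As a minimal sketch of how such a file could be consumed (assuming PyYAML, a hypothetical file name detx.metadata.yaml, and that the elided "..." entries are filled in), generating a catalogue entry would then be trivial:

```python
import yaml  # PyYAML

# Hypothetical file name; the idea is one such metadata file per data format.
with open("detx.metadata.yaml") as fobj:
    meta = yaml.safe_load(fobj)

fmt = meta["file_format"]
# Render a one-line catalogue entry; pdf or web output would use the same
# fields, only with different templating.
print(f"{fmt['name']} ({fmt['extension']}, {fmt['version']}): {fmt['description']}")
```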
Schema file
And for the schema file, in jsonschema it could look either like this:
{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Detector Data Formatt", "description": "Schema describing the ASCII file format for detector data.", "type": "object", "properties": { "global_det_id": { "type": "string", "description": "Global detector ID, the first field in the file.", "examples": ["12345"] }, "format_version": { "type": "string", "description": "Format version of the data.", "examples": ["v1.0"] }, "UTC_validity": { "type": "object", "description": "Start and end of the UTC validity period.", "properties": { "validity_from": { "type": "string", "format": "date-time", "description": "Start of the validity period in UTC.", "examples": ["2024-01-01T00:00:00Z"] },... "ndoms": { "type": "integer", "description": "Number of DOMs (Digital Optical Modules).", "examples": [3] }, "modules": { "type": "array", "description": "List of DOMs with details about their components.", "items": { "type": "object", "properties": { "module_id": { "type": "string", "description": "Unique identifier for the module.", "examples": ["module123"] },...
Other information like value limits, enumerators and similar could also be included. But yes, it is bulky.
or like this, if we use a simple text description, with the field descriptions extended to cover the information that is also in the JSON schema (so at minimum type, constraints, standards or entry schemas where applicable):
```markdown
# Detector Data ASCII File Format

- Comment lines start with `#`.
- Each line contains specific data fields as described below:

### Header
- **global_det_id**: Global detector ID (string) - ID schema: XXX-YYY with XXX
- **format_version**: Format version (string)

### UTC Validity
- **UTC_validity_from**: Start of validity (ISO 8601 format)
- **UTC_validity_to**: End of validity (ISO 8601 format)

### UTM Reference
- **UTM_ref_grid**: UTM reference grid (string) - minimal value: ...

### DOMs (Digital Optical Modules)
Each module contains:
- **module_id**: Module identifier (string)
- **line_id**: Line identifier (string)
- **floor_id**: Floor identifier (integer)
- **npmts**: Number of PMTs (integer)

Each PMT entry contains:
- **pmt_id_global**: PMT global ID (string)
- **x, y, z**: Position coordinates (float) - unit: ... - accuracy: ...
- **dx, dy, dz**: Displacement (float)
- **t0**: Time offset (float)
- **PMT_STATUS**: Status of the PMT (string)
```
So in the latter case you almost end up with the version from your LaTeX file, but you gain the ordering by metadata, can create a web catalogue and, for those file formats where it is possible, can more easily create validators (see the sketch below). It also makes changes more traceable, and we have something nice to report in D4.2 of INFRADEV2.
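To make the validator point concrete: once the (completed) JSON schema above is stored in a file, the standard Python jsonschema package can check a parsed detx dict against it. A minimal sketch, with a hypothetical schema file name and a hand-written instance:

```python
import json
import jsonschema  # pip install jsonschema

# Hypothetical schema file name; assumes the truncated schema above has been
# completed into valid JSON.
with open("detx.schema.json") as fobj:
    schema = json.load(fobj)

# In practice this dict would come from a parser (km3pipe, Jpp, ...);
# here it is written out by hand for illustration.
detector = {
    "global_det_id": "12345",
    "format_version": "v1.0",
    "ndoms": 3,
    "modules": [{"module_id": "module123"}],
}

# Raises jsonschema.ValidationError if the instance violates the schema.
jsonschema.validate(instance=detector, schema=schema)
```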
The schema above suggests that there are fields with specific names, but there are no named fields. The format is also not a comma-separated-fields thing, but a very specific structure, which we need to display like this:

```
# comment line - meta data; started by '#' in the first line(s) of the file
global_det_id format_version
UTC_validity_from UTC_validity_to
UTM_ref_grid UTM_ref_easting UTM_ref_northing UTM_ref_z
ndoms
module_id line_id floor_id x y z q0 qx qy qz t0 COMPONENT_STATUS npmts
  pmt_id_global x y z dx dy dz t0 PMT_STATUS
  pmt_id_global x y z dx dy dz t0 PMT_STATUS
  ...
  pmt_id_global x y z dx dy dz t0 PMT_STATUS
# repeat for `npmts` in a module
# repeat for each DOM in a DU
# repeat for each DU in a detector
```
There are lots of special meanings, and we can of course partly map some fields to a JSON structure, but I fear that this will only (partly) work for this specific example. I have no idea how this would look for a ROOT Tree. I have the feeling that the JSON schema approach only covers an extremely small part and everything else is "living with the limitations".
What I also don't understand is what you mean by "validation tools". The tool itself would have to implement the full parsing logic, like km3pipe or Jpp does. Parsing is usually what you want, because it is way more powerful than "just" validating. And the parsing is already implemented and tested in the supported software.
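To illustrate that point: a parser necessarily validates as a side effect. A minimal, hypothetical sketch (this is not the km3pipe or Jpp logic; the function name and all field handling are assumptions based on the structure shown above):

```python
def parse_detx_header(path):
    """Parse (and thereby validate) the DETX header fields shown above."""
    with open(path) as fobj:
        # drop the leading '#' comment lines carrying the meta data
        lines = [line.strip() for line in fobj if not line.startswith("#")]

    global_det_id, format_version = lines[0].split()
    utc_validity_from, utc_validity_to = lines[1].split()
    utm_ref = lines[2].split()  # grid identifier plus easting/northing/z
    ndoms = int(lines[3])       # a ValueError here already *is* validation
    if ndoms < 0:
        raise ValueError("ndoms must be non-negative")
    return {
        "global_det_id": int(global_det_id),
        "format_version": format_version,
        "UTC_validity": (utc_validity_from, utc_validity_to),
        "UTM_ref": utm_ref,
        "ndoms": ndoms,
    }
```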
So in the case of the DETX, the only thing we can do is display this:

```
# comment line - meta data; started by '#' in the first line(s) of the file
global_det_id format_version
UTC_validity_from UTC_validity_to
UTM_ref_grid UTM_ref_easting UTM_ref_northing UTM_ref_z
ndoms
module_id line_id floor_id x y z q0 qx qy qz t0 COMPONENT_STATUS npmts
  pmt_id_global x y z dx dy dz t0 PMT_STATUS
  pmt_id_global x y z dx dy dz t0 PMT_STATUS
  ...
  pmt_id_global x y z dx dy dz t0 PMT_STATUS
# repeat for `npmts` in a module
# repeat for each DOM in a DU
# repeat for each DU in a detector
```
and then provide a JSON which explains the placeholder text in more detail. But we also need to add a lot of other information which targets multiple "fields" at once. I don't see how this is possible in JSON.
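For the part that is expressible, a flat placeholder-explanation mapping would do (a sketch, shown here as a Python dict, with wording condensed from the datafield list below); the cross-field information mentioned above is exactly what such a flat mapping cannot capture:

```python
# Hypothetical per-placeholder explanations for the structure above; this is
# the easy part -- constraints spanning several fields do not fit this shape.
placeholders = {
    "global_det_id": "The global detector identifier (int).",
    "ndoms": "Number of optical modules (unsigned int); 0 means end of file.",
    "pmt_id_global": "The unique global PMT ID (int).",
    "t0": "Calibration time offset added to the detected hit times (float).",
}
```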
This is what's currently in my local TeX file (followed by text explanations of various other caveats and information):
```latex
\datafield[int]{global_det_id}{The global detector identifier. Negative values between -100 and 0 indicate simulation detectors. It is 1 for the PPM-DU and follows the KM3NeT serial number for all detectors in testing and the sea.}
\datafield[unsigned int]{ndoms}{Number of optical modules, can be 0, which automatically means the ``end of the file''}
\datafield[int]{dom_id}{The unique optical module ID. For real detectors, the number is part of the product number and is usually the last 9 digits of the CLB's MAC address}
\datafield[int]{line_id}{The string number}
\datafield[int]{floor_id}{The floor number starting at 1. In some older DETX files, this might be -1. In this case, the floors are assumed to be numbered strictly monotonically increasing as they appear in the file.}
\datafield[unsigned int]{npmts}{The number of PMTs. Can be 0.}
\datafield[int]{pmt_id_global}{The unique global PMT ID. When applicable, the ID corresponds to the ID in the MC files or the KM3NeT product number.}
\datafield[int or float]{x, y, z}{The position of the PMT}
\datafield[int or float]{dx, dy, dz}{The direction the PMT is pointing at}
\datafield[int or float]{t0}{The calibration time offset which has to be added to the detected hit times}
```
This can certainly be pressed into the JSON schema. However, I don't really see the benefit of doing so. Nobody will see a global_det_id in any code, nor an npmts or line_id. These have different names (or no names at all, just some helper values used to parse the file) in different frameworks.
Thanks for that input, I also like your version with the "datafields" - especially with the detx, we also need the information about the specific formatting. And yes, validation probably works through the parsers there. Again, jsonschema is no must, but I would propose that for documenting the values we use a schema-like approach. Your "datafield" approach works just as well, since it is structured; I would then propose to extend it per datafield with information like format, example, enumeration, unit, restrictions ... where applicable, instead of documenting this in continuous text.
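A minimal sketch of what one such extended datafield entry could look like as a structured record; the attribute names (unit, enumeration, restrictions, ...) are assumptions, not an agreed interface:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataField:
    """One documented field of a data format (hypothetical structure)."""
    name: str
    dtype: str                        # e.g. "int", "unsigned int", "float"
    description: str
    unit: Optional[str] = None        # e.g. "m" for positions
    fmt: Optional[str] = None         # formatting hint, e.g. "ISO 8601"
    example: Optional[str] = None
    enumeration: list = field(default_factory=list)  # allowed values, if any
    restrictions: Optional[str] = None                # value constraints

# Example entry, condensed from the datafield list above:
floor_id = DataField(
    name="floor_id",
    dtype="int",
    description="The floor number, starting at 1.",
    example="1",
    restrictions="-1 in some older DETX files (floors then numbered "
                 "monotonically in file order)",
)
```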
And, for the overall structure of the catalogue, work with one metadata file per data format. Whether we end up producing a pdf, a webpage or a warehouse catalogue should then only be a matter of formatting.
I have already been working on the automated generation from such a structure in the branch; you can also just dump your approach into main and I will try to integrate the two in my branch basic_setup.