OPTIMADE JSON Lines specification appendix #531

Open
wants to merge 12 commits into
base: develop
from
59 changes: 59 additions & 0 deletions optimade.rst
@@ -4543,3 +4543,62 @@ Implementations that do not produce errors in this situation are RECOMMENDED to
* XML Schema appears to use a compatible regex format, except it is implicitly anchored: i.e., the beginning-of-input ``^`` and end-of-input ``$`` anchors must be removed, and missing anchors replaced by ``.*``.
* POSIX Extended regexes (and their extended GNU implementations) are incompatible because ``\`` is not a special character in character classes.
POSIX Basic regexes also have further differences, e.g., the meaning of some escaped syntax characters is reversed.


The OPTIMADE JSON Lines Format for Database Exchange
----------------------------------------------------

There are many use cases for which it is beneficial to share all of the data served by an OPTIMADE API as a single file, for example, archival, transfer of entire databases and local-first clients.
This appendix describes a lightweight standardization for doing this via the `JSON Lines <https://jsonlines.org/>`__ (JSONL) format, with some additional OPTIMADE-specific conventions.

The `JSON Lines <https://jsonlines.org/>`__ format enforces the following rules:

- each line is a valid JSON value,
- each line is separated by a newline character (``\n``), optionally ending the file with a newline,
- each file must be UTF-8 encoded,
- the recommended file extension is ``.jsonl``, with natural extensions to ``.jsonl.gz`` and ``.jsonl.bz2`` for ``gzip`` and ``bzip2`` compressed files, respectively.
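
The generic JSON Lines rules above can be illustrated with a minimal Python sketch. The helper names ``dump_jsonl`` and ``load_jsonl`` are hypothetical and not part of any OPTIMADE tooling; the reader is deliberately lenient and skips empty lines.

```python
import json


def dump_jsonl(values):
    """Serialize an iterable of JSON-compatible values, one value per line."""
    # json.dumps never emits a raw newline, so each value stays on one line.
    # ensure_ascii=False keeps non-ASCII characters as-is, so the text can be
    # written out directly as UTF-8.
    return "".join(json.dumps(value, ensure_ascii=False) + "\n" for value in values)


def load_jsonl(text):
    """Parse JSON Lines text into a list of values.

    Empty lines are skipped (an assumption of this sketch), which also covers
    the empty string left after an optional trailing newline.
    """
    return [json.loads(line) for line in text.split("\n") if line]


if __name__ == "__main__":
    text = dump_jsonl([{"a": 1}, [1, 2], "é"])
    print(text, end="")
    print(load_jsonl(text))
```

When writing such text to disk, opening the file with an explicit ``encoding="utf-8"`` satisfies the encoding rule regardless of platform defaults.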

The OPTIMADE JSON Lines format then extends these rules with the following conventions:

- The first line of the file is a JSON object that contains metadata about the file.
It MUST be a dictionary with the key ``x-optimade``, under which the following key MUST be defined:

- ``api_version``: The OPTIMADE API version used when generating the file, as described in the ``meta`` member in `JSON Response Schema: Common Fields`_.

- The next line MAY contain a standard OPTIMADE ``meta`` object, following the same rules described in `JSON Response Schema: Common Fields`_, where every MUST and SHOULD rule can be reinterpreted as a MAY rule.

Review comment (Member):

Should we specify what meaning fields like meta.data_returned take in this context?

- The next block of lines provides the ``info`` endpoint responses.
- First the base info response MUST be provided, following the description at `Base Info Endpoint`_.
ml-evs marked this conversation as resolved.

Review comment (Member):

this sublist is not rendering correctly to me

- The next lines MUST contain the entry info endpoint responses for all entry types present later in the file, as described in `Entry Listing Info Endpoints`_.
- The remaining lines of the file contain data entries themselves, described in `Entry Listing JSON Response Schema`_.
These entries can be provided in any order.

Review comment (Member):

The order of the data entries is not clear to me. Below, we say <entries block ordered by entry type>, while here they can be in any order. I don't have strong feelings about this, but it should be clear. If we want to mandate block order (e.g., all structures first, then all references, etc.), I guess we should say something about both: 1) the order of the blocks (should they match the info endpoint order?); 2) the order of the entries within the blocks.

My own gut feeling is that perhaps the order of blocks is not important, but maybe it's good to have one type of entries all together in a block.

- Finally, any custom extension endpoints (see `Custom Extension Endpoints`_), if present and desirable, MUST appear at the end of the file.

This leaves the following overall file structure:

.. code :: txt

    <header>
    <optional metadata>
    <base info response>
    <entry info responses>
    <entries block ordered by entry type>
    <optional custom extension endpoints>
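
As an illustration of how a client might consume this layout, the following Python sketch groups the lines of a file by their role. It is only a sketch under stated assumptions: the function name ``split_optimade_jsonl`` is hypothetical, the optional metadata line is recognized by a top-level ``meta`` key (as in the example file), every line after the header is assumed to be a JSON object, and custom extension endpoints are not handled separately because their line format is not fixed by the text above.

```python
import json


def split_optimade_jsonl(lines):
    """Group the lines of an OPTIMADE JSONL file by role (hypothetical helper)."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        raise ValueError("empty file: the header line is required")

    # The first line MUST be an object with x-optimade.api_version.
    header = records[0]
    x_optimade = header.get("x-optimade") if isinstance(header, dict) else None
    if not isinstance(x_optimade, dict) or "api_version" not in x_optimade:
        raise ValueError("first line must define x-optimade.api_version")

    sections = {"header": header, "meta": None, "info": [], "entries": {}}
    rest = records[1:]

    # Assumption: the optional metadata line is an object with a "meta" key
    # and no "type" key.
    if rest and "meta" in rest[0] and "type" not in rest[0]:
        sections["meta"] = rest[0]["meta"]
        rest = rest[1:]

    for record in rest:
        if record.get("type") == "info":
            sections["info"].append(record)
        else:
            # Entries are collected per entry type, whatever their order.
            sections["entries"].setdefault(record.get("type"), []).append(record)
    return sections
```

Because entries are collected per type, this reader works whether or not the entries in the file are grouped into blocks; it does not check that the info lines precede the entries.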


This JSONL format can also be used to share provider-specific properties.
These should be consistent with any external definitions, and where appropriate, prefixes tied to the tools used to generate the file should be used.

Review comment (Member):

I do not quite get the "prefixes tied to the tools used to generate the file should be used" part. Are there any examples for such cases?

Review comment (Member Author):

Hmm, yeah, this has lost its meaning a bit. The intention was that a tool like aiida or ASE could support OPTIMADE JSONL as an export format, and any "tool-specific definitions" like _aiida_node_id (or whatever) should be defined and use the tool prefix. I wouldn't be against removing this sentence as the format just follows the normal OPTIMADE rules about defining properties.

Review comment (Member):

Right, thanks for the explanation. Since "MUST" is not used in this paragraph, I gather its main purpose is to recommend best practice. We probably want all OPTIMADE JSONL files to bear some sort of producer identification, but we probably do not want to standardize it at this point. But maybe we can reuse the meta.implementation dictionary from the main specification?

Maybe it is worth stressing that OPTIMADE JSONL files have the same requirements ("MUST/SHOULD/MAY") as OPTIMADE responses, in terms of custom properties having prefixes and metadata descriptions. But maybe it is enough just to say that?

Also, what about sharing provider-specific entry types? If we mention provider-specific properties, should we allow entry types as well?

Review comment (Member):

Regarding non-optimade-spec properties, doesn't the spec say they must have either

  • the provider specific prefix; or
  • a definition-provider-specific prefix?

I think we can just point to section 3.5 here, no?

It is RECOMMENDED that custom properties are either defined in full within the JSONL file or point to a specific versioned property definition.

Example OPTIMADE JSON Lines File

Review comment (Member):

Suggested change: replace "Example OPTIMADE JSON Lines File" with "Example file".

Considering there are now two OPTIMADE JSON Lines file standards (the other being 9.4), it may be better to be vaguer here (e.g., I already mixed these up when looking through the table of contents).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code :: jsonc

    {"x-optimade": {"api_version": "1.2.0"}}
    {"meta": {"time_stamp": "2024-07-19T11:47:10Z", "data_returned": 6, "provider": {"name": "Example JSONL", "description": "An example JSONL file.", "prefix": "_exmpl"}}}
    {"type": "info", "id": "/", "attributes": {"api_version": "1.2.0", "available_api_versions": ["1.2.0"], "formats": ["json"], "entry_types_by_format": {"json": ["references", "structures"]}, "license": "https://example.com/licenses/example_license.html", "homepage": "https://example.com", "name": "Example API", "provider": {"description": "A simple example provider", "name": "Example Provider"}}}
    {"type": "info", "id": "references", ...}
    {"type": "info", "id": "structures", ...}
    {"type": "references", "id": "2", "attributes": {...}}
    {"type": "structures", "id": "1", "attributes": {...}, "relationships": {"references": {"data": [{"id": "2", "type": "references"}]}}}
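
A file with the same layout as the example could be produced along the lines of the following Python sketch. The function ``build_optimade_jsonl`` and its parameters are hypothetical; the sketch assumes the caller already holds the info responses and entries as dictionaries, and it groups entries by type, which is one possible choice since the text above allows any order.

```python
import json


def build_optimade_jsonl(api_version, base_info, entry_infos, entries, meta=None):
    """Assemble OPTIMADE JSONL text from already-prepared dictionaries."""
    # Header line with the mandatory x-optimade.api_version key.
    lines = [{"x-optimade": {"api_version": api_version}}]
    # Optional metadata line.
    if meta is not None:
        lines.append({"meta": meta})
    # Base info response first, then the entry info responses.
    lines.append(base_info)
    lines.extend(entry_infos)
    # sorted() is stable, so entries are grouped by type while keeping
    # their relative order within each type.
    lines.extend(sorted(entries, key=lambda entry: entry["type"]))
    return "".join(json.dumps(line) + "\n" for line in lines)


if __name__ == "__main__":
    print(
        build_optimade_jsonl(
            "1.2.0",
            {"type": "info", "id": "/", "attributes": {}},
            [{"type": "info", "id": "structures", "attributes": {}}],
            [{"type": "structures", "id": "1", "attributes": {}}],
        ),
        end="",
    )
```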