CMIP7 requirements: "branded variable" and new mip_table specification #762

Open
taylor13 opened this issue Oct 4, 2024 · 14 comments

@taylor13
Collaborator

taylor13 commented Oct 4, 2024

(FYI @sashakames, @durack1,@matthew-mizielinski, @wolfiex even though this is primarily for Chris)

It looks likely that some changes to the output requirements for CMIP7 will be agreed shortly and that "branded variables" will be relied on in identifying variables in the cmor output files. It would be good to now consider how this might impact CMOR, so I'll raise this issue now:

How difficult would it be to implement the following?

  1. The user specifies “frequency” as one of the entries in the CMIP6_input.json file rather than it being specified in a CMOR variable table. CMOR then handles “frequency” in the same way it handles, for example, “experiment_id”, and writes it as a global attribute. (We would also remove “frequency” and “approx_interval” from the CMOR variable tables.) I know that CMOR checks that users have sent a time coordinate that is approximately consistent with "approx_frequency", but that check could be dropped if it impairs implementation of this new approach.
  2. The user specifies “region” as one of the entries in the CMIP6_input.json file and then CMOR handles it in the same way as, for example, “experiment_id” and writes it as a global attribute?
  3. CMOR writes as a global attribute the "branding suffix", which it would need to obtain by extracting the suffix in the "table_entry" (i.e., the part following the underscore). See below for an example.
  4. CMOR writes as global attributes the values of the elements comprising the branding suffix: temporal_sampling, vertical_sampling, horizontal_sampling, and area_sampling? These would either be extracted, along with other metadata, from the CMOR variable table (as shown in the table example below), or could be obtained from a look-up table given the branding suffix.
  5. In constructing file names and directory structure, rely on a somewhat different set of global attributes than in CMIP6. For example instead of including “table name” in the file name, include instead the “branded variable suffix”. (My guess is that this is trivially done by simply specifying a different template in the CMIP6_input.json file.)

To implement the above, new CMOR variable tables will need to be generated with the following changes (which could be implemented by someone other than Chris):

  1. Remove "approx_interval" from the header of each table.
  2. Remove “frequency” from each entry in the variable tables.
  3. Replace all the variable “table_entries” with branded variable names.
  4. Make sure out_name is set to the root name prefix of the branded variable (i.e., the part of the branded variable name preceding the underscore).
  5. Add 5 new attributes to each variable in the tables: branding_suffix, temporal_type, vertical_type, horizontal_type, and area_type. These will be written by CMOR as global attributes in the netCDF files.
  6. Reorganize and rename tables to group the variables more rationally and independently of frequency and region.

I should think most of the above changes to the variable tables should have little impact on the CMOR code itself.

A new CMOR7 table variable entry would include 5 new attributes (the first 5 lines below), and the "frequency" would be removed from the table (in CMIP6 it appeared just before the "long_name" attribute), resulting in the following:

"tas_tavg-z0-hxy-x": {

      "branding_suffix": "tavg-z0-hxy-x",
      "temporal_type": "mean",
      "vertical_type": "no vertical dimension",
      "horizontal_type": "gridded",
      "area_type": "unmasked",

      "cell_measures": "area: areacella",
      "cell_methods": "area: time: mean",
      "comment": "near-surface (usually, 2 meter) air temperature",
      "dimensions": [
        "longitude",
        "latitude",
        "time",
        "height2m"
      ],

      "long_name": "Near-Surface Air Temperature",
      "modeling_realm": [
        "atmos"
      ],
      "ok_max_mean_abs": "",
      "ok_min_mean_abs": "",
      "out_name": "tas",
      "positive": "",
      "standard_name": "air_temperature",
      "type": "real",
      "units": "K",
      "valid_max": "",
      "valid_min": ""
    },

Note that the table_entry has been changed from "tas" to the branded variable name: "tas_tavg-z0-hxy-x". Also note that the "out_name" will now without exception be just the root name (in this case tas) appearing before the underscore in the branded variable name. In CMIP6, sometimes the out_name differed from the table_entry.

We could elect to have CMOR generate "temporal_type", "vertical_type", "horizontal_type", and "area_type" by parsing the elements comprising the branding_suffix and then looking up in CVs the associated short text descriptions. That would mean these 4 global attributes would not have to be added to the existing tables.
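A minimal sketch of that look-up approach, assuming the CVs are simple label-to-description mappings. The dictionary names and the `describe_suffix` helper are hypothetical; only the labels and descriptions from the `tas_tavg-z0-hxy-x` example above are filled in.

```python
# Hypothetical CV fragments: only the labels used in the example above are
# filled in, with the descriptions taken from that example.
TEMPORAL_CV = {"tavg": "mean"}
VERTICAL_CV = {"z0": "no vertical dimension"}
HORIZONTAL_CV = {"hxy": "gridded"}
AREA_CV = {"x": "unmasked"}


def describe_suffix(branding_suffix):
    """Map a branding suffix to the four descriptive global attributes."""
    labels = branding_suffix.split("-")
    if len(labels) != 4:
        raise ValueError(
            f"expected 4 hyphen-separated labels, got {branding_suffix!r}"
        )
    temporal, vertical, horizontal, area = labels
    return {
        "temporal_type": TEMPORAL_CV[temporal],
        "vertical_type": VERTICAL_CV[vertical],
        "horizontal_type": HORIZONTAL_CV[horizontal],
        "area_type": AREA_CV[area],
    }


attrs = describe_suffix("tavg-z0-hxy-x")
print(attrs)
```

With this approach only the branding suffix has to appear in the table; the four descriptions live in one place and cannot drift out of sync with it.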

@mauzey1
Collaborator

mauzey1 commented Oct 4, 2024

Is this what the mip-cmor-tables will look like? Would the removal of "frequency" reduce the number of tables since they are currently grouped by modeling realm and frequency?

Are users supposed to select which "branded variables" from a table they are going to use instead of "variable_id"?

I assume "region" is going to be like "realm" in global attributes where its valid entries will be found in the CV, correct?

Will the "approx_interval" come from the CV or some other table? CMOR currently uses this value for a test.

@mauzey1 mauzey1 added this to the 4.0/Future milestone Oct 4, 2024
@taylor13
Collaborator Author

taylor13 commented Oct 4, 2024

The tables will be structured the same as old tables with the changes I enumerated above. But we can group variables into tables any way we like (even placing them all into a single table, if we like), and instead of having a total of 2062 table entries (across tables), we’ll have about 1600 (because the same variable sampled at multiple frequencies will be found in only one table).

As I understand it, “variable_id” records the “out_name” found in the table, which is also the actual name of the variable array written to the netCDF file. That won’t change. As I noted, the out_name in the new tables will be the root name (i.e., prefix) of the branded variable name (e.g., “tas”, which is the prefix appearing in “tas_tavg-z0-hxy-x”)

As for realm, experiment_id, institute_id, etc., the valid regions will be found in a CV (and for CMIP7, there may only be a few options: “global”, “Antarctica”, “Greenland”, and a couple more perhaps).

We might decide to turn off the frequency check in CMIP7, which, as you say, is based on "approx_interval". Or we could provide a CV with "frequency" as the key, and the approximate interval as the value. The user would specify the "frequency" in the input table (as described above), and then CMOR would go to the frequency table and extract the approx_interval so it could perform its check. The frequency CV might look like:

"frequency": {
      "mon": {
            "label": "monthly",
            "approx_interval": "30"
      },
      "day": {
            "label": "daily",
            "approx_interval": "1"
      },
      etc.
}
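A rough sketch of how such a frequency CV could drive the time-spacing check. The `check_time_spacing` helper and the 20% tolerance are assumptions for illustration and are not the test CMOR actually performs; time values are taken to be in days.

```python
import statistics

# Hypothetical frequency CV following the layout sketched above;
# approx_interval is in days.
FREQUENCY_CV = {
    "mon": {"label": "monthly", "approx_interval": "30"},
    "day": {"label": "daily", "approx_interval": "1"},
}


def check_time_spacing(frequency, time_values, rel_tol=0.2):
    """Return True if the median time step is within rel_tol of approx_interval.

    time_values are time coordinate values in days; rel_tol is an assumed
    tolerance, not the threshold CMOR actually uses.
    """
    approx = float(FREQUENCY_CV[frequency]["approx_interval"])
    steps = [b - a for a, b in zip(time_values, time_values[1:])]
    if not steps:
        return True  # a single time value cannot be checked
    return abs(statistics.median(steps) - approx) <= rel_tol * approx


monthly_midpoints = [15.5, 45.0, 74.5, 105.0, 135.5]
print(check_time_spacing("mon", monthly_midpoints))
print(check_time_spacing("day", monthly_midpoints))
```

Using the median step keeps the check tolerant of the varying month lengths while still catching a user who declares the wrong frequency.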

@durack1 durack1 changed the title: potential requirements for CMIP7 → CMIP7 requirements: "branded variable" and new mip_table specification, Feb 12, 2025
@durack1 durack1 modified the milestones: 4.0/Future, 3.10.0 Feb 12, 2025
@taylor13
Collaborator Author

taylor13 commented Feb 13, 2025

Please provide feedback and questions on the following. I've prepared an update of my earlier enumeration of possible changes to CMOR. A nicely-formatted version can be found at https://docs.google.com/document/d/1Hyv87wh0BS9dI0hSOydYubrsdpMe23qw3kCj1kLVuSo/edit?tab=t.0 , but I'll copy and paste here:

CMOR changes that are needed to handle “branded variables”:

Changes needed in user_input file:

  • User must define in this file the “frequency”, and CMOR must include “frequency” as a global attribute (drawn from a “frequency” CV).
  • User must define in this file the “region”, and CMOR must include “region” as a global attribute (drawn from a “region” CV).
  • Modify templates for filename and directory structure (not sure about the underscores):
output_file_template:
<variable_id>_<branding_suffix>_<frequency>_<region>_<grid_label>_<source_id>_<experiment_id>_<member_id>_<time_range>.nc

output_path_template:
<activity_id>_<source_id>_<experiment_id>_<member_id>_<region>_<variable_id>_<branding_suffix>_<grid_label>_<version>

Changes needed in CMOR table:

  • Use the full branded variable name as “entry” for each variable listed.
  • Add “brand_description” to each variable’s list of attributes.
  • Remove “out_name” and “frequency” from each variable in table.
  • Remove “approx_interval” and “mip_era” from table header. (mip_era is implied by data_specs_version.)
  • If not essential, remove “realm” from table header (but keep modeling_realm as variable attribute).
  • Update in header: Conventions (= "CF-1.11 CMIP-7alpha"), data_specs_version (= "CMIP7.0.0.0-alpha"), cmor_version, table_id, and table_date

Changes needed in the CMOR code:

  • Remove check on time coordinate spacing, which relied on “approx_interval”. The value of approximate interval will be unknown to CMOR, so CMOR must not require it in any part of the code.
  • Read from input table and write as a variable attribute the "brand_description", "frequency", and "region".
  • Parse the new branded variable table entries (relying on the underscore and hyphens) as follows (for sample entry "tas_tavg-h2m-hxy-u"):
      branded_variable="tas_tavg-h2m-hxy-u"
      out_name="tas"  (this gets stored as the global attribute variable_id)
      branding_suffix="tavg-h2m-hxy-u"
      temporal_label="tavg"
      vertical_label="h2m"
      horizontal_label="hxy"
      area_label="u"

Store each of the above as global attributes.
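The parsing step could look something like the following sketch. `parse_branded_variable` is a hypothetical helper, not an existing CMOR function; it assumes exactly one underscore-delimited root name followed by four hyphen-separated labels.

```python
def parse_branded_variable(entry):
    """Split a branded variable name into the attributes listed above."""
    out_name, sep, branding_suffix = entry.partition("_")
    if not sep:
        raise ValueError(
            f"{entry!r} has no underscore, so it is not a branded variable"
        )
    labels = branding_suffix.split("-")
    if len(labels) != 4:
        raise ValueError(f"expected 4 labels in suffix {branding_suffix!r}")
    temporal, vertical, horizontal, area = labels
    return {
        "branded_variable": entry,
        "variable_id": out_name,  # the out_name, stored as variable_id
        "branding_suffix": branding_suffix,
        "temporal_label": temporal,
        "vertical_label": vertical,
        "horizontal_label": horizontal,
        "area_label": area,
    }


parsed = parse_branded_variable("tas_tavg-h2m-hxy-u")
for name, value in parsed.items():
    print(f"{name} = {value}")
```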

Sample new CMOR (or MIP) table:

{
    "Header": {
        "data_specs_version": "CMIP_specs7.0.0.0-alpha", 
        "cmor_version": "3.11???", 
        "table_id": "APmon???", 
        **** DELETE: "realm": "atmos atmosChem", 
        "table_date": "???", 
        "missing_value": "1e20", 
        "int_missing_value": "-999", 
        "product": "model-output", 
        **** DELETE: "approx_interval": "30.00000", 
        "generic_levels": "alevel alevhalf", 
        **** DELETE: "mip_era": "CMIP6", 
        "Conventions": "CF-1.11 CMIP-7alpha???"
    }, 
    "variable_entry": {
        "hfss_tavg-u-hxy-u": {              [NOTE: OLD "ENTRY" HAS BEEN REPLACED  WITH BRANDED VARIABLE.]
            "brand_description": "surface upward sensible heat flux: time means reported on a 2-d 
                        horizontal grid"          [NOTE: THIS IS A NEW ATTRIBUTE.]
            **** DELETE: "frequency": "mon", 
            "modeling_realm": "atmos", 
            "standard_name": "surface_upward_sensible_heat_flux", 
            "units": "W m-2", 
            "cell_methods": "area: time: mean", 
            "cell_measures": "area: areacella", 
            "long_name": "Surface Upward Sensible Heat Flux", 
            "comment": "The surface sensible heat flux, also called turbulent 
                    heat flux, is the exchange of heat between the surface 
                    and the air by motion of air.", 
            "dimensions": "longitude latitude time", 
            **** DELETE: "out_name": "hfss", 
            "type": "real", 
            "positive": "up", 
            "valid_min": "", 
            "valid_max": "", 
            "ok_min_mean_abs": "", 
            "ok_max_mean_abs": ""
        },

Note: all variable entries will be similar, but there may be one or two cases where attributes “flag_values” and “flag_meanings” are defined in addition to the above.

Implications for data request:

If the branded variable names and the new MIP table names are not provided by the data request, then whatever variable labels are provided (e.g., root name and CMIP6 table name) will need to be translated into branded variable names and new MIP table names. This presumably could be done by relying on a look-up table.

@durack1
Contributor

durack1 commented Feb 13, 2025

@taylor13 @mauzey1 we'll need to think about how best to enable (if possible) backward compatibility. The comments in #771 are relevant here, particularly the use of the _cmip6_option optional argument to CMOR.

@taylor13
Collaborator Author

I just noticed that the table entries that had been shown to be deleted in the original google doc lost the "strike through" marks when I copied into this issue. I've now edited the sample CMOR table segment above indicating which entries in the current CMIP table should be deleted.

@taylor13
Collaborator Author

taylor13 commented Feb 21, 2025

I've reviewed #762 (comment) and found it needs to be tweaked. Again, a nicely-formatted version can be found at https://docs.google.com/document/d/1Hyv87wh0BS9dI0hSOydYubrsdpMe23qw3kCj1kLVuSo/edit?tab=t.0 .

For CMIP7 we expect to define 8 MIP tables, one for each realm. Here is a sample header and single entry from the "atmos" table.

{
    "Header": {
        **** MOVE TO input.json FILE: "data_specs_version": "CMIP_specs7.0.0.0-alpha", 
        "checksum":"",   **** This is a new entry to the header and will normally contain a checksum value
        "cmor_version": "3.10???", 
        "table_id": "atmos", 
        "realm": "atmos",   **** This sets a realm default value that can get overridden for individual variables.
        "table_date":"2025-02-14", 
        "missing_value": "1e20", 
        "int_missing_value": "-999", 
        "product": "model-output", 
        **** DELETE: "approx_interval": "30.00000", 
        "generic_levels": "alevel alevhalf", 
        **** MOVE TO input.json FILE: "mip_era": "CMIP6", 
        "Conventions": "CF-1.11 CMIP-7alpha???",
        "type":"real",     **** This and the following 5 attributes are default values that can be overridden for individual variables.
        "positive":"",
        "valid_min":"",
        "valid_max":"",
        "ok_min_mean_abs":"",
        "ok_max_mean_abs":""
    }, 
    "variable_entry": {
        "hfss_tavg-u-hxy-u": {              [NOTE: OLD "ENTRY" HAS BEEN REPLACED  WITH BRANDED VARIABLE.]
            "long_name": "surface upward sensible heat flux: time means reported on a 2-d 
                        horizontal grid",
            **** DELETE: "frequency": "mon", 
            **** DELETE: "modeling_realm": "atmos", 
            "standard_name": "surface_upward_sensible_heat_flux", 
            "units": "W m-2", 
            "cell_methods": "area: time: mean", 
            "cell_measures": "area: areacella", 
            "variable_title": "Surface Upward Sensible Heat Flux",   ****THIS IS A NEW ATTRIBUTE, but I'm not sure it will actually get written to the file;  can it be ignored?
            "comment": "The surface sensible heat flux, also called turbulent 
                    heat flux, is the exchange of heat between the surface 
                    and the air by motion of air.", 
            "dimensions": "longitude latitude time", 
            "out_name": "hfss", 
            "positive": "up", 
        },

QUESTIONS ABOUT CMOR (I've asked "yes" or "no" questions, but the real question is "how difficult would it be to make the suggested changes?"):

  1. Can we move "data_specs_version" and "mip_era" global attributes from the table header to the CMOR "CMIP7_input.json" file? When this is done, we need to check that the dataset "
  2. Currently the "realm" is given in the header and "modeling_realm" is given for each variable. How do these differ, and how are they treated by CMOR? This is a global attribute that usually will have a single value for all variables in the table, but there might be some exceptions. Can we specify in the header a default value and possibly override it (or not) under some individual variables?
  3. We specify an "approx_interval" (for time-step) in the header so that CMOR can check whether the time-coordinate values are approximately correct. Can we remove this and eliminate this capability from CMOR?
  4. There are currently 6 variable attributes that for most variables are set to a single value ("real" for "type" and "" for "positive", "valid_max", "valid_min", "ok_max_mean_abs", and "ok_min_mean_abs"). Can we specify these default values in the header and allow them to be overridden for an individual variable?
  5. There are at least two options for handling the new table entry (e.g., tas_tavg-h2m-hxy-u):
  • Preferred option: Parse the elements separated by "_" or "-" and store as global attributes:
          branding_suffix="tavg-h2m-hxy-u"
          temporal_label="tavg"
          vertical_label="h2m"
          horizontal_label="hxy"
          area_label="u"
          variable_id="tas"  (Alternatively, this might be named "out_name" and handled as before, I think.)
  • Other option: Put the parsed elements defined above directly into the cmor tables, but that increases file size by about 50% and makes it harder for humans to browse it quickly.
  6. We need to handle "frequency" differently in CMIP7. We need to eliminate it from the CMOR table. We need to enable the user to specify "frequency" and another attribute, "region". There are two options:
  • Preferred option: When calling "cmor_variable", user passes the "frequency" and "region" along with the required variable name (key to the cmor table definition of a variable). These attributes would be stored as global attributes and also be used in constructing filenames and directory structures.
  • Other option: Add "frequency" and "region" to the input.json file and handle like other global attributes. Data providers would not like this though because in processing a single simulation, they would have to alter the input.json file several times; in previous phases, the same input.json file table would serve all variables from a single simulation.
  7. Can we add "variable_title" as new attribute and have CMOR write it as a variable attribute? Can we add it and not have CMOR write it?
  8. Certain global and variable attributes should be stored as floats or integers, not "strings". Is that currently possible with CMOR? (I think at least for variables CMOR stores missing_value as a non-text-string.)
  9. Are flag_values and flag_meanings needed by any variables? Are they needed by coordinate variables?
  10. Can we modify the templates for filename and directory structure in the input.json file and then populate it from user input and cmor table information:
  • output_file_template:
      <variable_id><branding_suffix><grid_label><source_id><experiment_id><member_id><time_range>.nc
  • output_path_template:
      <activity_id>_<source_id>_<experiment_id>_<member_id>_<region>_<variable_id>_<branding_suffix>_<grid_label>_<version>
  11. Include in the header a checksum value. In a future version of CMOR, we might record the checksum in the files written by CMOR, and perhaps also ask CMOR to check whether the value in the header is consistent with a value CMOR obtains by performing checksum on the cmor table. For now, CMOR can completely ignore checksum, but it should not mind that it appears in the header.

@taylor13
Collaborator Author

As far as priority for the above, the following are essential for CMIP7: 3, 5, 6, and 10.

@taylor13
Collaborator Author

taylor13 commented Feb 22, 2025

I thought of another approach for addressing 5 and 6 that would not involve modifying existing cmor functions.

For item 5, we could require the data provider (user) to call a new cmor function, which we could name "cmor_treat_brand". We would call it right after function "cmor_variable". The only argument of the function would be:

var_id = integer returned by cmor_variable identifying the variable of interest

The function would

  • use the var_id to extract the brand name for the variable (e.g., "tas_tavg-h2m-hxy-u")
  • parse the brand to obtain:
          branding_suffix="tavg-h2m-hxy-u"
          temporal_label = "tavg"
          vertical_label="h2m"
          horizontal_label="hxy"
          area_label="u"
          variable_id="tas"

For item 6, after a call to "cmor_variable", we would require the user to call cmor function "cmor_set_variable_attribute" twice:

  • cmor_set_variable_attribute(var_id, "frequency", "c", value), where value is taken from the frequency CV (e.g., "mon", "day", "6hr", ...)
  • cmor_set_variable_attribute(var_id, "region", "c", value), where value is taken from the region CV (e.g., "glb", "ant", "grn")

This is really no different than doing these things inside "cmor_variable", as I suggested in the earlier comment, but this would not modify any of the existing cmor functions.

@mauzey1
Collaborator

mauzey1 commented Feb 24, 2025

@taylor13 Answering your questions from #762 (comment)

Can we move "data_specs_version" and "mip_era" global attributes from the table header to the CMOR "CMIP7_input.json" file? When this is done, we need to check that the dataset "

data_specs_version is meant to be the version of the CMOR MIP tables being used so it should be part of the table header rather than the user input. I can see mip_era becoming a user input parameter that is checked by the CV.

Currently the "realm" is given in the header and "modeling_realm" is given for each variable. How do these differ, and how are they treated by CMOR? This is a global attribute that usually will have a single value for all variables in the table, but there might be some exceptions. Can we specify in the header a default value and possibly override it (or not) under some individual variables?

When defining the realm attribute, CMOR will first check if the variable entry provides a realm value from the modeling_realm attribute. If CMOR doesn't find one, then it will get the value from the realm attribute in the current table's header.

We specify an "approx_interval" (for time-step) in the header so that CMOR can check whether the time-coordinate values are approximately correct. Can we remove this and eliminate this capability from CMOR?

Yes.

There are currently 6 variable attributes that for most variables are set to a single value ("real" for "type" and "" for "positive", "valid_max", "valid_min", "ok_max_mean_abs", and "ok_min_mean_abs"). Can we specify these default values in the header and allow them to be overridden for an individual variable?

Yes. We can follow a similar approach that CMOR takes with realm.
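A sketch of that fallback logic, written against a plain dictionary rather than CMOR's internal table structures. The `resolve_attributes` helper is hypothetical, and it assumes the variable-level key uses the same name as the header key (so "realm" rather than "modeling_realm").

```python
# Attributes that may carry a table-wide default in the header.
DEFAULTABLE = ("realm", "type", "positive", "valid_min", "valid_max",
               "ok_min_mean_abs", "ok_max_mean_abs")


def resolve_attributes(table, entry_name):
    """Return the variable entry with header defaults filled in where missing."""
    header = table["Header"]
    entry = table["variable_entry"][entry_name]
    resolved = dict(entry)  # copy so the table itself is left untouched
    for name in DEFAULTABLE:
        if name not in resolved and name in header:
            resolved[name] = header[name]
    return resolved


table = {
    "Header": {"table_id": "atmos", "realm": "atmos", "type": "real", "positive": ""},
    "variable_entry": {
        "hfss_tavg-u-hxy-u": {"units": "W m-2", "positive": "up"},
    },
}

resolved = resolve_attributes(table, "hfss_tavg-u-hxy-u")
print(resolved["positive"], resolved["type"], resolved["realm"])
```

Here the entry's own "positive" value wins over the header default, while "type" and "realm" are inherited from the header.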

There are at least two options for handling the new table entry (e.g., tas_tavg-h2m-hxy-u):

Preferred option: Parse the elements separated by "_" or "-" and store as global attributes:

     branding_suffix="tavg-h2m-hxy-u"
     temporal_label = "tavg"
     vertical_label="h2m"
     horizontal_label="hxy"
     area_label="u"
     variable_id="tas"  (Alternatively, this might be named "out_name" and handled as before, I think.)

Other option: Put the parsed elements defined above directly into the cmor tables, but that increases file size by about 50% and makes it harder for humans to browse it quickly.

How about we just have the attribute branding_suffix in the variable's table entry? This will allow for backwards compatibility in CMOR by checking for the attribute before proceeding with parsing the elements within the suffix. The CMIP6 tables will skip this check since they won't have the attribute.

We need to handle "frequency" differently in CMIP7. We need to eliminate it from the CMOR table. We need to enable the user to specify "frequency" and another attribute, "region". There are two options:

  • Preferred option: When calling "cmor_variable", user passes the "frequency" and "region" along with the required variable name (key to the cmor table definition of a variable). These attributes would be stored as global attributes and also be used in constructing filenames and directory structures.
  • Other option: Add "frequency" and "region" to the input.json file and handle like other global attributes. Data providers would not like this though because in processing a single simulation, they would have to alter the input.json file several times; in previous phases, the same input.json file table would serve all variables from a single simulation.

We can do what you suggested in #762 (comment) and use cmor_set_variable_attribute to set the frequency and region for a variable. If frequency and region are not defined when you use cmor_write, then an error message should be raised.
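To illustrate the kind of pre-write check this implies, here is a small standalone sketch covering "frequency" and "region" as in the earlier proposal. It does not call CMOR; `check_before_write` and the CV sets are hypothetical, with the values taken from the examples given earlier in this thread.

```python
# Hypothetical CV contents; the example values come from this thread.
FREQUENCY_CV = {"mon", "day", "6hr"}
REGION_CV = {"glb", "ant", "grn"}


def check_before_write(variable_attributes):
    """Raise if frequency or region is missing or not in its CV."""
    for name, cv in (("frequency", FREQUENCY_CV), ("region", REGION_CV)):
        value = variable_attributes.get(name)
        if value is None:
            raise ValueError(f"{name!r} must be set before writing the variable")
        if value not in cv:
            raise ValueError(f"{name}={value!r} is not in the {name} CV")


check_before_write({"frequency": "mon", "region": "glb"})  # passes silently
```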

Can we add "variable_title" as new attribute and have CMOR write it as a variable attribute? Can we add it and not have CMOR write it?

We can make CMOR add the attribute if we want. CMOR will ignore the attribute in the variable's table entry if it is not programmed to find it.

Certain global and variable attributes should be stored as floats or integers, not "strings". Is that currently possible with CMOR? (I think at least for variables CMOR stores missing_value as a non-text-string.)

Yes, purely numeric values (i.e. numbers without units) for attributes are stored in netCDF files as floats or integers.

Are flag_values and flag_meanings needed by any variables? Are they needed by coordinate variables?

The "basin" variable in the CMIP6_Ofx.json table is the only variable that I know that has the flag_values and flag_meanings attributes.

Can we modify the templates for filename and directory structure in the input.json file and then populate it from user input and cmor table information:

  • output_file_template:

    <variable_id><branding_suffix><grid_label><source_id><experiment_id><member_id><time_range>.nc

  • output_path_template:

    <activity_id><source_id><experiment_id><member_id><variable_id><branding_suffix><grid_label>

Yes, we can modify the filename and directory templates to use the branding_suffix attribute value.
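As an illustration of the substitution involved, the sketch below fills the underscore-separated file template proposed in the Feb 13 comment. `fill_template` is a hypothetical helper rather than CMOR's own template code, and the attribute values are placeholders.

```python
import re


def fill_template(template, attributes):
    """Replace each <name> placeholder with attributes[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in attributes:
            raise KeyError(f"template needs attribute {name!r}")
        return str(attributes[name])

    return re.sub(r"<(\w+)>", substitute, template)


output_file_template = (
    "<variable_id>_<branding_suffix>_<frequency>_<region>_<grid_label>_"
    "<source_id>_<experiment_id>_<member_id>_<time_range>.nc"
)

# Illustrative attribute values only.
attributes = {
    "variable_id": "tas",
    "branding_suffix": "tavg-h2m-hxy-u",
    "frequency": "mon",
    "region": "glb",
    "grid_label": "gn",
    "source_id": "MODEL-X",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "time_range": "185001-201412",
}

filename = fill_template(output_file_template, attributes)
print(filename)
```

Because the branding suffix uses hyphens internally, the underscores in the template still separate the filename into unambiguous fields.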

Include in the header a checksum value. In a future version of CMOR, we might record the checksum in the files written by CMOR, and perhaps also ask CMOR to check whether the value in the header is consistent with a value CMOR obtains by performing checksum on the cmor table. For now, CMOR can completely ignore checksum, but it should not mind that it appears in the header.

CMOR currently creates an MD5 checksum of the variable table used to write a netCDF file. This checksum is stored in the attribute table_info along with the table file's creation date. This value is currently not checked when running a file through PrePARE. Do we want to create a new checksum attribute? Perhaps we could make a SHA256 sum attribute (just the checksum, no date) for the MIP table's checksum. We can make PrePARE only check this parameter if it is present in the netCDF file.
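One wrinkle with a checksum stored inside the table header is that the file would have to hash itself. A possible way around it, sketched below purely as an assumption, is to blank the header's "checksum" field and hash a canonical JSON serialization; `table_sha256` is a hypothetical helper, not what CMOR or PrePARE does today.

```python
import hashlib
import json


def table_sha256(table):
    """SHA-256 of a MIP table, computed with Header.checksum blanked out."""
    header = dict(table.get("Header", {}))
    header["checksum"] = ""
    normalized = {**table, "Header": header}
    # Sorted keys and fixed separators make the digest independent of
    # key order and whitespace in the table file.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


table = {"Header": {"table_id": "atmos", "checksum": ""}, "variable_entry": {}}
digest = table_sha256(table)

# Storing the digest in the header does not change the result.
table["Header"]["checksum"] = digest
assert table_sha256(table) == digest
```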

@taylor13
Collaborator Author

Thanks for clarifying everything. A few follow-up questions/remarks:

  1. Regarding

data_specs_version is meant to be the version of the CMOR MIP tables being used so it should be part of the table header rather than the user input. I can see mip_era becoming a user input parameter that is checked by the CV.

CMOR MIP tables are no longer relied on for identification of datasets or files, so their contents can be modified under the same "data specifications", and I don't think it is important that we define a version of the tables. What is essential is to record the entire set of data specifications that govern the metadata in the netCDF files, the templates for constructing paths and filenames, and CVs relied on by those using CMOR. I think "data_specs_version" is an appropriate name to describe the overarching data specifications, so I thought we could just repurpose it.

I know in the past data_specs_version would change if the contents of the tables changed, but is there any reason for that now? Maybe others have an opinion about this.

  2. Regarding "realm" and "modeling_realm", can we make the names consistent? So if "realm" isn't found under the variable entry, then the "realm" specified in the header is used?

  3. Regarding

How about we just have the attribute branding_suffix in the variable's table entry? This will allow for backwards compatibility in CMOR by checking for the attribute before proceeding with parsing the elements within the suffix. The CMIP6 tables will skip this check since they won't have the attribute.

In my example, I use the full branded variable name (root name + branding suffix) as the table entry: "hfss_tavg-u-hxy-u", where "hfss" is the root name (variable_id) and "tavg-u-hxy-u" is the branding suffix. As a service for down-stream users of the data, I wanted to separately store as global attributes these two elements and then parse the suffix and also separately store temporal_label, vertical_label, horizontal_label, and area_label.

As for backward compatibility, if you wanted to handle the old CMIP6 tables with this version of the code, you would need to check whether an underscore were found in the variable entry. If not, then you would skip any parsing or storing of the elements.

I've probably misunderstood something, so will be interested in whether this seems like a good course or not.

  4. Regarding

We can make CMOR add the attribute if we want. CMOR will ignore the attribute in the variable's table entry if it is not programmed to find it.

Yes, that is clear. Does the input file determine which attributes CMOR looks for, or is that hardwired inside the code?

  5. Regarding checksums, I guess the immediate question is "can you include a checksum (or any other attribute) in the header and have CMOR just ignore it"? Or will CMOR error exit if it finds something in the header it doesn't know about?

@mauzey1
Collaborator

mauzey1 commented Feb 25, 2025

In my example, I use the full branded variable name (root name + branding suffix) as the table entry: "hfss_tavg-u-hxy-u", where "hfss" is the root name (variable_id) and "tavg-u-hxy-u" is the branding suffix. As a service for down-stream users of the data, I wanted to separately store as global attributes these two elements and then parse the suffix and also separately store temporal_label, vertical_label, horizontal_label, and area_label.

As for backward compatibility, if you wanted to handle the old CMIP6 tables with this version of the code, you would need to check whether an underscore were found in the variable entry. If not, then you would skip any parsing or storing of the elements.

I've probably misunderstood something, so will be interested in whether this seems like a good course or not.

As long as we don't need to worry about variable names containing _ or -, then it should be easy to parse the suffix and its components.

From the Naming Conventions section of CF-Conventions:

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores.

Are there any cases of variable names with underscores?

@taylor13
Collaborator Author

No, for CMIP the only characters allowed in variable names are alphanumeric characters; no punctuation, underscores, or hyphens.
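Given that constraint, old and new table entries can be told apart by pattern alone. The sketch below is one way to do it; `classify_entry` is hypothetical, and it additionally assumes that each of the four suffix labels is itself alphanumeric, which holds for the labels shown in this thread.

```python
import re

# Root name: alphanumeric only. Suffix: four hyphen-separated alphanumeric labels.
BRANDED = re.compile(r"^[A-Za-z0-9]+_[A-Za-z0-9]+(?:-[A-Za-z0-9]+){3}$")
PLAIN = re.compile(r"^[A-Za-z0-9]+$")


def classify_entry(entry):
    """Return 'branded', 'cmip6', or raise ValueError for anything else."""
    if BRANDED.match(entry):
        return "branded"
    if PLAIN.match(entry):
        return "cmip6"
    raise ValueError(f"unrecognized table entry {entry!r}")


for entry in ("tas_tavg-h2m-hxy-u", "hfss_tavg-u-hxy-u", "tas"):
    print(entry, "->", classify_entry(entry))
```

An entry classified as "cmip6" would skip suffix parsing entirely, which is the backward-compatible path discussed above.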

@taylor13
Collaborator Author

Chris, from the above, it appears that the changes we're contemplating would not render CMOR unable to process CMIP6 and CMIP6Plus data. That would be great, but is it true?

@durack1
Contributor

durack1 commented Feb 26, 2025

@mauzey1 the placeholder MIP table files, following the new format can be found in #778

durack1 added a commit that referenced this issue Feb 27, 2025
taylor13 added a commit that referenced this issue Feb 27, 2025