[ntuple] fix type name normalization #17637

jblomer · 2025-02-05T18:49:21Z

Fixes type name and type alias normalization, especially for types that involve meta (TClass etc.). Introduces patch version 1.0.0.1 of the specification.

Fixes #17570

tree/ntuple/v7/doc/BinaryFormatSpecification.md

pcanal · 2025-02-05T19:00:50Z

tree/ntuple/v7/doc/BinaryFormatSpecification.md

+  - Integer template arguments are written in standard decimal notation; only if necessary (long integer types),
+    a suffix `l` or `ul` is added according to the template parameter type.
+
+The type name of a field has typedefs and usings fully resolved (except for standard integer typedefs).


This may belong sooner in the document as the above mentioned rules are applying to the name with fully resolved typedefs.

Hm, I'm not sure I understand. The normalization rules, as far as I understand, should apply to both the original type name and the resolved type name.

The replacement of the typedef is an integral type of the 'normalization' of the name. Without that part, the name are not unique and thus not normalized. I.e. if you have both const vector<vector<Int_t> > and vector<vector<int32_t, typedef_to_regular_allocator>> "only" pretty printing the name does not help much. Similar if the typedef are not replaced, neither would/should the Integer template arguments be calculated/compacted. In the second example, without resolving the typedef we can not even remove the default template argument.

If the user passes us vector<vector<int32_t, typedef_to_regular_allocator>>, its "pretty printed" version will be registered as the field's alias type, i.e. the "original spelling" from the user. Otherwise (vector<vector<int>>), the alias type is empty.

In any case, the type that is relevant is the resolved type name. The alias one is just for information, except for Double32_t.

For the original spelling, I guess we ought to decide to either really put it as-is or we need to spell out, separately from the normalization, what applies there.
For example, it is clear that the 'whitespace' rule would apply, maybe the const rule and what else? (Not the typedef for sure, what about the std::[u]int(8|16|32|64)_t or even the fully qualified)

The proposal is to apply all the normalization rules to the original spelling except typedef resolution.

tree/ntuple/v7/src/RFieldBase.cxx

tree/ntuple/v7/doc/BinaryFormatSpecification.md

pcanal · 2025-02-05T21:17:15Z

tree/ntuple/v7/doc/BinaryFormatSpecification.md

+    the corresponding (at the time of writing) `std::[u]int(8|16|32|64)_t` standard integer typedef.
+  - Type names used as template arguments are themselves normalized.
+  - For user-defined types, default-initialized template arguments are explicitly spelled out;
+    optional template arguments of stdlib types such as allocators are omitted.


Note that for the I/O normalization we have elected to keep the stdlib optional argument whose value is not the default value. ie.:

TClass::GetClass("vector<int, other_alloc>")->GetName() (const char *) "vector<int,other_alloc>"

I do not recall the detailed argumentation, but at the very least they 'may' be semantically relevant to the user.

Ok, I think that should be looked / discussed in #16754.

github-actions · 2025-02-06T04:56:20Z

Test Results

18 files 18 suites 4d 7h 14m 40s ⏱️
2 717 tests 2 716 ✅ 0 💤 1 ❌
47 208 runs 47 207 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit 3187473.

♻️ This comment has been updated with latest results.

hahnjo

Some comments inline

tree/ntuple/v7/doc/BinaryFormatSpecification.md

tree/ntuple/v7/src/RFieldBase.cxx

tree/ntuple/v7/test/ntuple_types.cxx

hahnjo · 2025-02-10T09:13:57Z

I'm still trying to wrap my head around the type alias commit. I think we should test it more with class templates, like

diff --git a/tree/ntuple/v7/test/rfield_class.cxx b/tree/ntuple/v7/test/rfield_class.cxx
index 02cbc93ed6..52e1bc3ba4 100644
--- a/tree/ntuple/v7/test/rfield_class.cxx
+++ b/tree/ntuple/v7/test/rfield_class.cxx
@@ -278,6 +278,8 @@ TEST(RNTuple, TClassDefaultTemplateParameter)
       model->MakeField<DataVector<int, float>>("f2");
       model->AddField(RFieldBase::Create("f3", "DataVector<int>").Unwrap());
       model->AddField(RFieldBase::Create("f4", "struct DataVector<bool,vector<unsigned>>").Unwrap());
+      model->AddField(RFieldBase::Create("f5", "DataVector<Double32_t>").Unwrap());
+      model->AddField(RFieldBase::Create("f6", "DataVector<int, double>").Unwrap());
       auto writer = RNTupleWriter::Recreate(std::move(model), "ntpl", fileGuard.GetPath());
    }
 
@@ -286,10 +288,18 @@ TEST(RNTuple, TClassDefaultTemplateParameter)
 
    const auto &desc = reader->GetDescriptor();
    EXPECT_EQ("DataVector<std::int32_t,double>", desc.GetFieldDescriptor(desc.FindFieldId("f1")).GetTypeName());
+   EXPECT_EQ("", desc.GetFieldDescriptor(desc.FindFieldId("f1")).GetTypeAlias());
    EXPECT_EQ("DataVector<std::int32_t,float>", desc.GetFieldDescriptor(desc.FindFieldId("f2")).GetTypeName());
+   EXPECT_EQ("", desc.GetFieldDescriptor(desc.FindFieldId("f2")).GetTypeAlias());
    EXPECT_EQ("DataVector<std::int32_t,double>", desc.GetFieldDescriptor(desc.FindFieldId("f3")).GetTypeName());
+   EXPECT_EQ("", desc.GetFieldDescriptor(desc.FindFieldId("f3")).GetTypeAlias());
    EXPECT_EQ("DataVector<bool,std::vector<std::uint32_t>>",
              desc.GetFieldDescriptor(desc.FindFieldId("f4")).GetTypeName());
+   EXPECT_EQ("", desc.GetFieldDescriptor(desc.FindFieldId("f4")).GetTypeAlias());
+   EXPECT_EQ("DataVector<double,double>", desc.GetFieldDescriptor(desc.FindFieldId("f5")).GetTypeName());
+   EXPECT_EQ("DataVector<Double32_t,double>", desc.GetFieldDescriptor(desc.FindFieldId("f5")).GetTypeAlias());
+   EXPECT_EQ("DataVector<std::int32_t,double>", desc.GetFieldDescriptor(desc.FindFieldId("f6")).GetTypeName());
+   EXPECT_EQ("", desc.GetFieldDescriptor(desc.FindFieldId("f6")).GetTypeAlias());
 
    auto v1 = reader->GetView<DataVector<int>>("f1");
    auto v3 = reader->GetView<DataVector<int>>("f3");

However, that doesn't seem to work right now:

553: [ RUN      ] RNTuple.TClassDefaultTemplateParameter
553: /home/jhahnfel/ROOT/src/tree/ntuple/v7/test/rfield_class.cxx:295: Failure
553: Expected equality of these values:
553:   ""
553:   desc.GetFieldDescriptor(desc.FindFieldId("f3")).GetTypeAlias()
553:     Which is: "DataVector<std::int32_t>"
553: /home/jhahnfel/ROOT/src/tree/ntuple/v7/test/rfield_class.cxx:299: Failure
553: Expected equality of these values:
553:   "DataVector<double,double>"
553:   desc.GetFieldDescriptor(desc.FindFieldId("f5")).GetTypeName()
553:     Which is: "DataVector<Double32_t,double>"
553: /home/jhahnfel/ROOT/src/tree/ntuple/v7/test/rfield_class.cxx:300: Failure
553: Expected equality of these values:
553:   "DataVector<Double32_t,double>"
553:   desc.GetFieldDescriptor(desc.FindFieldId("f5")).GetTypeAlias()
553:     Which is: "DataVector<Double32_t>"
553: [  FAILED  ] RNTuple.TClassDefaultTemplateParameter (10 ms)

I believe Double32_t as template argument was already wrong before, but the type alias for f3 looks wrong to me...

Following clarifications and fixes: - Strings in the meta-data are UTF-8 encoded (instead of ASCII), matching the section "Naming specification" - Extend restrictions on user-defined classes regarding bit-fields and template arguments - The type name normalization is specified in a dedicated section

Normalize and resolve type name only in the innermost overload.

For types that require meta (TClass, TEnum, etc.), apply the following normalization steps: - Use TClass::GetName() to get the type name normalization from meta - Renormalize that name according to the RNTuple specs

Compares a normalized original type name with the normalized resolved type name to determine if a field type alias (i.e. the use of a typedef) should be set. Previously, the logic was buggy comparing only partially normalized types.

…TEnum

hahnjo

LGTM, some last minor comments from my side

hahnjo · 2025-02-13T09:26:13Z

tree/ntuple/v7/test/ntuple_types.cxx

@@ -37,7 +37,10 @@ TEST(RNTuple, TypeNameNormalization)
   EXPECT_EQ("std::int32_t", RFieldBase::Create("f", "signed").Unwrap()->GetTypeName());
   EXPECT_EQ("std::map<std::int32_t,std::int32_t>", RFieldBase::Create("f", "map<int, int>").Unwrap()->GetTypeName());

-   EXPECT_TRUE(RFieldBase::Create("f", "std::int32_t").Unwrap()->GetTypeAlias().empty());


You probably want to keep this newly added test that std::int32_t has an empty type alias

hahnjo · 2025-02-13T09:42:39Z

tree/ntuple/v7/src/RFieldBase.cxx

-         auto arrayLength = std::stoi(arrayDef[1]);
+         auto arrayLength = std::stoull(arrayDef[1]);


We probably want more error checking here, for example std::stoull("123junk") = 123 without error reporting. std::stoull takes a second pos argument that stores the number of processed characters, or we can use the underlying std::strtoull where str_end should point to NULL after successful conversion.

hahnjo · 2025-02-13T09:44:45Z

tree/ntuple/v7/src/RFieldUtils.cxx

+      return GetNormalizedInteger(std::stoll(intTemplateArg));
+   return GetNormalizedInteger(std::stoull(intTemplateArg));


jblomer added the in:RNTuple label Feb 5, 2025

jblomer requested review from hahnjo, pcanal, silverweed, vepadulano and enirolf February 5, 2025 18:49

jblomer self-assigned this Feb 5, 2025