We should be using XPath map/array handling rather than serializing to/from XML #239

joeytakeda · 2022-03-17T17:16:37Z

In json.xsl, we create stash the created map in a variable that are then serialized using xml-to-json(). E.g.:

Lines 204 to 209 in c1dbd48

    
           <xsl:variable name="map" as="element(j:map)"> 
        
               <xsl:call-template name="makeMap"/> 
        
           </xsl:variable> 
        
           <xsl:result-document href="{$outDir}/stems/{$stem}{$versionString}.json" method="text"> 
        
               <xsl:sequence select="xml-to-json($map)"/> 
        
           </xsl:result-document>

That map is not a map{} in the XPath sense, but an XML structure in the JSON namespace:

staticSearch/xsl/json.xsl

Lines 322 to 354 in c1dbd48

    
                               <map xmlns="http://www.w3.org/2005/xpath-functions"> 
        
                                   <!--Now the document ID, which we've created (if necessary) in the 
        
                                   tokenization step --> 
        
                                   <string key="docId"> 
        
                                       <xsl:value-of select="$thisDoc/@id"/> 
        
                                   </string> 
        
                                   <!--And the relative URI from the document, which is to be used 
        
                                   for linking from the KWIC to the document. We've created this 
        
                                   already in the tokenization stage and stored it in a custom 
        
                                   data-attribute--> 
        
                                   <string key="docUri"> 
        
                                       <xsl:value-of select="$thisDoc/@data-staticSearch-relativeUri"/> 
        
                                   </string> 
        
                                   <!--The document's score, forked depending on configured 
        
                                       algorithm --> 
        
                                   <number key="score"> 
        
                                       <xsl:choose> 
        
                                           <xsl:when test="$scoringAlgorithm = 'tf-idf'"> 
        
                                               <xsl:sequence select="hcmc:returnTfIdf($rawScore, $stemDocsCount, $currDocUri)"/> 
        
                                           </xsl:when> 
        
                                           <xsl:otherwise> 
        
                                               <xsl:sequence select="$rawScore"/> 
        
                                           </xsl:otherwise> 
        
                                       </xsl:choose> 
        
                                   </number> 
        
                                   <!--Now add the contexts array, if specified to do so --> 
        
                                   <xsl:if test="$phrasalSearch or $createContexts"> 
        
                                       <xsl:call-template name="returnContextsArray"/> 
        
                                   </xsl:if> 
        
                               </map>

However, there's no need to do that since we could use <xsl:map> and set the output method to json:

 <xsl:result-document href="{$outDir}/stems/{$stem}{$versionString}.json" method="json">
       <xsl:call-template name="makeMap"/>
</xsl:result-document>

<xsl:map>
     <xsl:map-entry key="'docId'" select="string($thisDoc/@id)"/>
     <xsl:map-entry key="'docUri'" select="string($thisDoc/@data-staticSearch-relativeUri)"/>
<!--[...]-->
</xsl:map>

I'm not sure if there was a reason why we didn't do this in the first place, but I can't think of any reason not to do so at this point. It may improve performance slightly as well as possibly lessen memory use. (There's some indication that constructing sequences in the result-document allows Saxon more leverage when it comes to releasing memory; I don't imagine it would be significant, but worth testing anyway.) In any case, it's a bit cleaner and feels a bit better to use built-in functions.

The only annoying bit is that there is no such thing (yet) as an <xsl:array> instruction[1], so we would still need to stash all of the context arrays in a variable in order to create an array using the array{} constructor, but that's simple enough.

--
[1]: They are proposing such a thing for XSLT 4.0: https://www.saxonica.com/html/documentation11/xsl-elements/array.html

The text was updated successfully, but these errors were encountered:

martindholmes · 2022-03-17T18:22:32Z

I'll be intrigued to see if this makes any difference. I must admit that I've never really understood what happens when you use method="json" and what the consequences of @json-node-output-method actually are. I would guess that by this time, xml-to-json() is very solid and well-optimized, and the construction of an in-memory XML tree precursor is also pretty optimal so there would be little to gain, but it would certainly be nice to reduce the memory requirement here if it works.

joeytakeda · 2022-03-17T18:37:49Z

I'm going to experiment with this in branch iss239-json

joeytakeda · 2022-03-18T19:55:31Z

I have now done this in the iss239-json branch, but my tests are so far inconclusive.

In my local copy, I've added the repeat option and the -t option to the JSON step:

    <java classpath="${ssSaxon}" classname="net.sf.saxon.Transform" failonerror="true">
      <arg value="-xsl:${ssBaseDir}/xsl/json.xsl"/>
      <arg value="-s:${ssBaseDir}/xsl/json.xsl"/>
      <arg value="-repeat:6"/>
      <arg value="-t"/>
      <arg value="--suppressXsltNamespaceCheck:on"/>
    </java>

I've attached the report for the test set to this ticket (with the respective branch names in the file). I'm going to test now with just the English TEI guidelines to see if those results are more conclusive.

json_dev.txt
json_iss239-json.txt

martindholmes · 2022-03-18T20:06:20Z

It's not making much difference. Does it make for more human-readable, maintainable code in your view?

joeytakeda · 2022-03-19T00:06:07Z

I think it is, but at the same time, <j:map> has the advantage of having the j:array[@key], which is much nicer than stuffing a bunch of stuff in a variable to put in the array{} constructor. So I think it may be worthwhile to hold off on this (switching the code was fairly trivial, so it's fine to throw it away) until <xsl:array> is part of the spec.

joeytakeda added this to the Blue sky milestone May 30, 2022

joeytakeda self-assigned this May 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

We should be using XPath map/array handling rather than serializing to/from XML #239

We should be using XPath map/array handling rather than serializing to/from XML #239

joeytakeda commented Mar 17, 2022

martindholmes commented Mar 17, 2022

joeytakeda commented Mar 17, 2022

joeytakeda commented Mar 18, 2022

martindholmes commented Mar 18, 2022

joeytakeda commented Mar 19, 2022

We should be using XPath map/array handling rather than serializing to/from XML #239

We should be using XPath map/array handling rather than serializing to/from XML #239

Comments

joeytakeda commented Mar 17, 2022

martindholmes commented Mar 17, 2022

joeytakeda commented Mar 17, 2022

joeytakeda commented Mar 18, 2022

martindholmes commented Mar 18, 2022

joeytakeda commented Mar 19, 2022