Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We should be using XPath map/array handling rather than serializing to/from XML #239

Open
joeytakeda opened this issue Mar 17, 2022 · 5 comments
Assignees
Milestone

Comments

@joeytakeda
Copy link
Contributor

In json.xsl, we create stash the created map in a variable that are then serialized using xml-to-json(). E.g.:

staticSearch/xsl/json.xsl

Lines 204 to 209 in c1dbd48

<xsl:variable name="map" as="element(j:map)">
<xsl:call-template name="makeMap"/>
</xsl:variable>
<xsl:result-document href="{$outDir}/stems/{$stem}{$versionString}.json" method="text">
<xsl:sequence select="xml-to-json($map)"/>
</xsl:result-document>

That map is not a map{} in the XPath sense, but an XML structure in the JSON namespace:

staticSearch/xsl/json.xsl

Lines 322 to 354 in c1dbd48

<map xmlns="http://www.w3.org/2005/xpath-functions">
<!--Now the document ID, which we've created (if necessary) in the
tokenization step -->
<string key="docId">
<xsl:value-of select="$thisDoc/@id"/>
</string>
<!--And the relative URI from the document, which is to be used
for linking from the KWIC to the document. We've created this
already in the tokenization stage and stored it in a custom
data-attribute-->
<string key="docUri">
<xsl:value-of select="$thisDoc/@data-staticSearch-relativeUri"/>
</string>
<!--The document's score, forked depending on configured
algorithm -->
<number key="score">
<xsl:choose>
<xsl:when test="$scoringAlgorithm = 'tf-idf'">
<xsl:sequence select="hcmc:returnTfIdf($rawScore, $stemDocsCount, $currDocUri)"/>
</xsl:when>
<xsl:otherwise>
<xsl:sequence select="$rawScore"/>
</xsl:otherwise>
</xsl:choose>
</number>
<!--Now add the contexts array, if specified to do so -->
<xsl:if test="$phrasalSearch or $createContexts">
<xsl:call-template name="returnContextsArray"/>
</xsl:if>
</map>

However, there's no need to do that since we could use <xsl:map> and set the output method to json:

 <xsl:result-document href="{$outDir}/stems/{$stem}{$versionString}.json" method="json">
       <xsl:call-template name="makeMap"/>
</xsl:result-document>
<xsl:map>
     <xsl:map-entry key="'docId'" select="string($thisDoc/@id)"/>
     <xsl:map-entry key="'docUri'" select="string($thisDoc/@data-staticSearch-relativeUri)"/>
<!--[...]-->
</xsl:map>

I'm not sure if there was a reason why we didn't do this in the first place, but I can't think of any reason not to do so at this point. It may improve performance slightly as well as possibly lessen memory use. (There's some indication that constructing sequences in the result-document allows Saxon more leverage when it comes to releasing memory; I don't imagine it would be significant, but worth testing anyway.) In any case, it's a bit cleaner and feels a bit better to use built-in functions.

The only annoying bit is that there is no such thing (yet) as an <xsl:array> instruction[1], so we would still need to stash all of the context arrays in a variable in order to create an array using the array{} constructor, but that's simple enough.

--
[1]: They are proposing such a thing for XSLT 4.0: https://www.saxonica.com/html/documentation11/xsl-elements/array.html

@martindholmes
Copy link
Collaborator

I'll be intrigued to see if this makes any difference. I must admit that I've never really understood what happens when you use method="json" and what the consequences of @json-node-output-method actually are. I would guess that by this time, xml-to-json() is very solid and well-optimized, and the construction of an in-memory XML tree precursor is also pretty optimal so there would be little to gain, but it would certainly be nice to reduce the memory requirement here if it works.

@joeytakeda
Copy link
Contributor Author

I'm going to experiment with this in branch iss239-json

@joeytakeda
Copy link
Contributor Author

I have now done this in the iss239-json branch, but my tests are so far inconclusive.

In my local copy, I've added the repeat option and the -t option to the JSON step:

    <java classpath="${ssSaxon}" classname="net.sf.saxon.Transform" failonerror="true">
      <arg value="-xsl:${ssBaseDir}/xsl/json.xsl"/>
      <arg value="-s:${ssBaseDir}/xsl/json.xsl"/>
      <arg value="-repeat:6"/>
      <arg value="-t"/>
      <arg value="--suppressXsltNamespaceCheck:on"/>
    </java>

I've attached the report for the test set to this ticket (with the respective branch names in the file). I'm going to test now with just the English TEI guidelines to see if those results are more conclusive.

json_dev.txt
json_iss239-json.txt

@martindholmes
Copy link
Collaborator

It's not making much difference. Does it make for more human-readable, maintainable code in your view?

@joeytakeda
Copy link
Contributor Author

I think it is, but at the same time, <j:map> has the advantage of having the j:array[@key], which is much nicer than stuffing a bunch of stuff in a variable to put in the array{} constructor. So I think it may be worthwhile to hold off on this (switching the code was fairly trivial, so it's fine to throw it away) until <xsl:array> is part of the spec.

@joeytakeda joeytakeda added this to the Blue sky milestone May 30, 2022
@joeytakeda joeytakeda self-assigned this May 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants