diff --git a/build.xml b/build.xml
index a9f9fa0..2185658 100644
--- a/build.xml
+++ b/build.xml
@@ -253,7 +253,7 @@
-
+
TARGET convertConfigFile This runs an identity transform on the supplied config
diff --git a/configTest.xml b/configTest.xml
index b19cbd5..9960a46 100644
--- a/configTest.xml
+++ b/configTest.xml
@@ -1,5 +1,5 @@
-
+
test/search.html test/VERSION
diff --git a/docs/staticSearch.html b/docs/staticSearch.html
index a7f3734..a274914 100644
--- a/docs/staticSearch.html
+++ b/docs/staticSearch.html
@@ -168,37 +168,30 @@
  • Appendix A.1.2 <context>
  • Appendix A.1.3 <contexts>
  • Appendix A.1.4 <createContexts>
  • -
  • Appendix A.1.5 <dictionaryFile>
  • +
  • Appendix A.1.5 <dictionary>
  • Appendix A.1.6 <exclude>
  • Appendix A.1.7 <excludes>
  • Appendix A.1.8 <filter>
  • Appendix A.1.9 <filters>
  • -
  • Appendix A.1.10 <kwicTruncateString>
  • -
  • Appendix A.1.11 <linkToFragmentId>
  • -
  • Appendix A.1.12 <maxKwicsToHarvest>
  • -
  • Appendix A.1.13 <maxKwicsToShow>
  • -
  • Appendix A.1.14 <minWordLength>
  • -
  • Appendix A.1.15 <outputFolder>
  • -
  • Appendix A.1.16 <params>
  • -
  • Appendix A.1.17 <phrasalSearch>
  • -
  • Appendix A.1.18 <recurse>
  • -
  • Appendix A.1.19 <resultsLimit>
  • -
  • Appendix A.1.20 <resultsPerPage>
  • -
  • Appendix A.1.21 <rule>
  • -
  • Appendix A.1.22 <rules>
  • -
  • Appendix A.1.23 <scoringAlgorithm>
  • -
  • Appendix A.1.24 <searchFile>
  • -
  • Appendix A.1.25 <span>
  • -
  • Appendix A.1.26 <stemmerFolder>
  • -
  • Appendix A.1.27 <stopwordsFile>
  • -
  • Appendix A.1.28 <totalKwicLength>
  • -
  • Appendix A.1.29 <versionFile>
  • -
  • Appendix A.1.30 <wildcardSearch>
  • +
  • Appendix A.1.10 <index>
  • +
  • Appendix A.1.11 <output>
  • +
  • Appendix A.1.12 <params>
  • +
  • Appendix A.1.13 <results>
  • +
  • Appendix A.1.14 <rule>
  • +
  • Appendix A.1.15 <rules>
  • +
  • Appendix A.1.16 <scoringAlgorithm>
  • +
  • Appendix A.1.17 <searchPage>
  • +
  • Appendix A.1.18 <span>
  • +
  • Appendix A.1.19 <stemmer>
  • +
  • Appendix A.1.20 <stopwords>
  • +
  • Appendix A.1.21 <tokenizer>
  • +
  • Appendix A.1.22 <version>
  • Appendix A.2 Attribute classes
  • @@ -522,7 +515,7 @@

    8.5 Creating a configu
  • params (Element containing most of the settings which enable the Generator to find the target website content and process it appropriately.)
  • rules (The set of rules that control weighting of search terms found in specific contexts.)
  • -
  • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
  • +
  • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
  • Only the <params> element is necessary, but, as we discuss shortly, we highly suggest taking advantage of the <rules> (see 8.5.3 Specifying rules (optional)) and <contexts> (8.5.4 Specifying contexts (optional)) for the best results.

    @@ -553,37 +546,28 @@

    8.5.1 The 8.5.2 Specifying parameters

    8.5.2.1 Required parameters

    -

    The <params> element has four required elements for determining the resource collection that you - wish to index, and controlling the indexing process:

    +

    The <params> element has only one required element, which is used for determining the resource + collection that you wish to index:

      -
    • searchFile (The search file (aka page) that will be the primary access point for the staticSearch. - Note that this page must be at the root of the collection directory.)
    • -
    • recurse (Whether to recurse into subdirectories of the collection directory or not.)
    • -
    • stopwordsFile (The relative path (from the config file) to a text file containing a list of stopwords - (words to be ignored when indexing).)
    • -
    • dictionaryFile (The relative path (from the config file) to a dictionary file (one word per line) - which will be used to check tokens when indexing.)
    • +
    • searchPage (The search page that will be the primary access point for staticSearch. This page + may or may not exist, but its location is used for determining the collection that + will be indexed, so it must be at the root of the collection directory.) +
      + + + + + +
      file [ss.atts.file](A pointer to a local file.)
      +
      +

    The search page is a regular HTML page which forms part of your site. The only important - characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page. A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the - Web, but it is probably a good idea to take one of these and customize it for your - project, since there will be words in the website which are so common that it makes - no sense to index them, but they are not ordinary stopwords. For example, in a Website - dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include - it, and searching for it will be pointless. The project has a built-in set of common - stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then - search for the largest JSON index files that are generated, to see if they might be - too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report). The indexing process checks each word as it builds the index, and keeps a record of - all words which are not found in the configured dictionary. Though this does not have - any direct effect in the indexing process, all words not found in the dictionary are - listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language - of the dictionary) or perhaps misspelled (in which case they may not be correctly - stemmed and index, and should be corrected). 
There is a default dictionary in xsl/english_words.txt which you might copy and adapt if you're working in English; lots of dictionaries - for other languages are available on the Web.

    -

    The <searchFile> element is a relative URI (resolved, like all URIs specified in the config file, - against the configuration file location) that points directly to the search page that - will be the primary access point for the search. Since the search file must be at - the root of the directory that you wish to index (i.e. the directory that contains + characteristic it must have is a <div> element with id=staticSearch, whose contents will be rewritten by the staticSearch build process. See 8.6 Creating a search page.

    +

    The <searchPage> element's file attribute specifies a relative URI (resolved, like all URIs specified in the config + file, against the configuration file location) that points directly to the search + page that will be the primary access point for the search. Since the search file must + be at the root of the directory that you wish to index (i.e. the directory that contains all of the XHTML you want the search to index), the searchFile parameter provides the necessary information for knowing what document collection to index and where to put the output JSON. In other words, in specifying the location of your search @@ -612,49 +596,107 @@

    8.5.2.1 Required param -

    We also require the <recurse> element in the case where the document collection may be nested (as is common with - static sites generated from Jekyll or Wordpress). The <recurse> element is a boolean (true or false) that determines whether or not to recurse into - the subdirectories of the collection and index those files.

    -

    Finally, in order to support stemming and phrasal search effectively, it is important - to specify a <stopwordsFile> (a file containing words that will be ignored at index time) and a <dictionaryFile> (also used for indexing). Default files for English and French are supplied in the - xsl folder, but you will probably want to create or customize the stopword list for your - own project. You may also supply empty text files for these parameters if for example - you donʼt want to use a stoplist at all.

    8.5.2.2 Optional parameters

    The following parameters are optional, but most projects will want to specify some of them:

      -
    • versionFile (The relative path to a text file containing a single version identifier (such as 1.5, - 123456, or 06ad419). This will be used to create unique filenames for JSON resources, - so that the browser will not use cached versions of older index files.)
    • +
    • index (Configures options relating to indexing.) +
      + + + + + +
      recurse(Determines whether or not to recurse into the subdirectories of the collection and + index those files.)
      +
      +
    -

    <versionFile> enables you to specify the path to a plain-text file containing a simple version - number for the project. This might take the form of a software-release-style version - number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not - contain any spaces or punctuation. If you provide a version file, the version string - will be used as part of the filenames for all the JSON resources created for the search. - This is useful because it allows the browser to cache such resources when users repeatedly - visit the search page, but if the project is rebuilt with a new version, those cached - files will not be used because the new version will have different filenames. The - path specified is relative to the location of the configuration file (or absolute, - if you wish).

    +

    +
      +
    • stopwords (Specifies a list of stopwords--that is, words to be ignored when indexing.) +
      + + + + + +
      file(The path (relative to the config file) to a text file containing a list of words to + be ignored by the indexer (one word per line).)
      +
      +
    • +
    +

    A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the + Web, but it is probably a good idea to take one of these and customize it for your + project, since there will be words in the website which are so common that it makes + no sense to index them, but they are not ordinary stopwords. For example, in a website + dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include + it, and searching for it will be pointless. staticSearch provides a default set of common stopwords for English, which you'll + find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then + search for the largest JSON index files that are generated, to see if they might be + too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report).
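
    As a minimal sketch of the setting described above, a project-specific stopword list could be referenced as follows. The file name my_stopwords.txt is hypothetical, and the fragment is assumed to sit inside the <params> element of an existing configuration file (namespace declarations are omitted here).

    ```xml
    <!-- Hypothetical path, resolved relative to the config file. -->
    <!-- The file holds one stopword per line, for example a copy of
         xsl/english_stopwords.txt with "keats" appended. -->
    <stopwords file="my_stopwords.txt"/>
    ```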

    +
      +
    • dictionary (Specifies a dictionary against which tokens may be checked during indexing.) +
      + + + + + +
      file(The relative path (from the config file) to a dictionary file (one word per line).)
      +
      +
    • +
    +

    The indexing process checks each word as it builds the index, and keeps a record of + all words which are not found in the configured dictionary. Though this does not have + any direct effect on the indexing process, all words not found in the dictionary are + listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language + of the dictionary) or perhaps misspelled (in which case they may not be correctly + stemmed and indexed, and should be corrected). staticSearch provides a default dictionary in xsl/english_words.txt that can be copied and adapted if working in English; lots of dictionaries for other + languages are available on the Web.

    +
      +
    • tokenizer (Configures options for the tokenizing process.) +
      + + + + + +
      minWordLength(Specifies the minimum length in characters of a sequence of text that will be considered + to be a word worth indexing.)
      +
      +
    • +
    +

      -
    • phrasalSearch (Whether or not to support phrasal searches. If this is true, then the maxContexts - setting will be ignored, because all contexts are required to properly support phrasal - search.)
    • +
    • scoringAlgorithm (The scoring algorithm to use for ranking keyword results.) +
      + + + + + +
      name(Specifies the name of the scoring algorithm to use.)
      +
      +
    -

    Phrasal search functionality enables your users to search for specific phrases by - surrounding them with quotation marks ("), as in many search engines. In order to support this kind of search, <createContexts> must also be set to true as we store contexts for all hits for each token in each - document. Turning this on will make the index larger, because all contexts must be - stored, but once the index is built, it has very little impact on the speed of searches, - so we recommend turning this on. The default value is true. However, if your site is very large and your user base is unlikely to use phrasal - searching, it may not be worth the additional build time and increased index size.

    +

    <scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating + the score of a term and thus the order in which the results from a search are sorted.
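
    A hedged sketch of how the algorithm might be selected with the name attribute listed above. It assumes that name accepts the same two values (raw and tf-idf) that were documented for the earlier form of this element; check the schema before relying on either value.

    ```xml
    <!-- Assumption: "tf-idf" is a permitted value of @name, carried over
         from the options documented for the previous element-content form. -->
    <!-- Omitting the element altogether is assumed to leave raw
         (weighted count) scoring in effect. -->
    <scoringAlgorithm name="tf-idf"/>
    ```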

      -
    • stemmerFolder (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript - and XSLT implementations of stemmers can be found. If left blank, then the staticSearch - default English stemmer (en) will be used.)
    • +
    • stemmer (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript + and XSLT implementations of stemmers can be found. If not specified, then the staticSearch + default English stemmer (en) will be used.) +
      + + + + + +
      dir(The path (relative to the config file) of the directory to use for stemming.)
      +
      +

    The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and an implementation of the French Snowball @@ -682,101 +724,116 @@

    8.5.2.2 Optional param cases where there are mixed languages so a single stemmer will not do. To use this option, specify the value stripDiacritics in your configuration file.

      -
    • scoringAlgorithm (The scoring algorithm to use for ranking keyword results. Default is "raw" (i.e. weighted - counts))
    • -
    -

    <scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating - the score of a term and thus the order in which the results from a search are sorted. - There are currently two options:

    -
      -
    • raw: This is the default option (and so does not need to be set explicitly). The raw - score is simply the sum of all instances of a term (optionally multipled by a configured - weight via the <rule>/weight configuration) in a document. This will usually provide good results for most document collections.
    • -
    • tf-idf: The tf-idf algorithm (term frequency-inverse document frequency) computes the mathematical - relevance of a term within a document relative to the rest of the document collection. - The staticSearch implementation of tf-idf basically follows the textbook definition - of tf-idf: tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log( $allDocumentsCount - / $docsWithThisTermCount ) This is fairly crude compared to other search engines, - like Lucene, but it may provide useful results for document collections of varying lengths or - in instances where the raw score may be insufficient or misleading. There are a number - of resources on tf-idf scoring, including: Wikipedia and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
    • -
    -

    -
      -
    • createContexts (Whether to include keyword-in-context extracts in the index.)
    • -
    -

    <createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context - extracts for each of the hits in a document. This increases the size of the index, - but of course it makes for much more user-friendly search results; instead of seeing - just a score for each document found, the user will see a series of short text strings - with the search keyword(s) highlighted. Note that contexts are necessary for phrasal searching or wildcard searching.

    -
      -
    • minWordLength (The minimum length of a term to be indexed. Default is 3 characters.)
    • -
    -

    <minWordLength> specifies the minimum length in characters of a sequence of text that will be considered - to be a word worth indexing. The default is 3, on the basis that in most European - languages, words of one or two letters are typically not worth indexing, being articles, - prepositions and so on. If you set this to a lower limit for reasons specific to your - project, you should ensure that your stopword list excludes any very common words - that would otherwise make the indexing process lengthy and increase the index size.

    -
      -
    • maxKwicsToHarvest (This controls the maximum number of keyword-in-context extracts that will be stored - for each term in a document.)
    • -
    -

    <maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the - data for each term in a document. For example, if a user searches for the word ‘elephant’, and it occurs 27 times in a document, but the <maxKwicsToHarvest> value is set to 5, then only the first five (sorted in document order) of these keyword-in-context - strings will be stored in the index. (This does not affect the score of the document - in the search results, of course.) If you set this to a low number, the size of the - JSON files will be constrained, but of course the user will only be able to see the - KWICs that have been harvested in their search results. If <phrasalSearch> is set to true, the <maxKwicsToHarvest> setting is ignored, because phrasal searches will only work properly if all contexts - are stored.

    -
      -
    • maxKwicsToShow (This controls the maximum number of keyword-in-context extracts that will be shown - in the search page for each hit document returned.)
    • -
    -

    A user may search for multiple common words, so hundreds of hits could be found in - a single document. If the keyword-in-context strings for all these hits are shown - on the results page, it would be too long and too difficult to navigate. This setting - controls how many of those hits you want to show for each document in the result set.

    -
      -
    • totalKwicLength (If createContexts is set to true, then this parameter controls the length (in words) - of the harvested keyword-in-context string.)
    • +
    • createContexts (Whether to include keyword-in-context extracts in the index.) +
      + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
      create(Specifies whether the indexer stores keyword-in-context extracts for each hit in a + document.)
      phrasalSearch(Whether or not to support phrasal searches. If this is true, then the maxContexts + setting will be ignored, because all contexts are required to properly support phrasal + search.)
      wildcardSearch(Whether or not to support wildcard searches.)
      maxKwicsToHarvest(Controls the number of keyword-in-context extracts that will be harvested from the + data for each term in a document.)
      maxKwicLength(Sets the maximum length (in words) of a keyword-in-context result.)
      kwicTruncateString(The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context + extract. Conventionally three periods, or an ellipsis character (which is the default + value).)
      +
      +
    -

    Obviously, the longer the keyword-in-context strings are, the larger the individual - index files will be, but the more useful the KWICs will be for users looking at the - search results. Note that the phrasal searching relies on the KWICs and thus longer - KWICs allow for longer phrasal searches.

    +

    Note that contexts are necessary for phrasal searching or wildcard searching.
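
    The attributes listed above might be combined as in the following sketch. All values are illustrative rather than recommended defaults, and the fragment is assumed to sit inside <params>.

    ```xml
    <!-- Illustrative values only. -->
    <!-- create must be true for phrasal or wildcard searching to work,
         since both rely on stored contexts. -->
    <createContexts create="true"
                    phrasalSearch="true"
                    wildcardSearch="true"
                    maxKwicLength="10"
                    kwicTruncateString="..."/>
    ```

    maxKwicsToHarvest is left out of this sketch because the attribute description above indicates that the harvesting limit is ignored when phrasalSearch is true.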

      -
    • kwicTruncateString (The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context - extract. Conventionally three periods, or an ellipsis character (which is the default - value).)
    • +
    • results (Controls the configuration of the results page.) +
      + + + + + + + + + +
      resultsPerPage(The maximum number of document results to be displayed per page. All results are displayed + by default; setting resultsPerPage to a positive integer creates a Show More/Show + All widget at the bottom of the batch of results.)
      maxKwicsToShow(Controls the maximum number of keyword-in-context extracts that will be shown in the + search page for each hit document returned.)
      +
      +
    -

    The only reason you might need to specify a value for this parameter is if the language - of your search page conventionally uses a different ellipsis character. Japanese, - for example, uses the 3-dot-leader character.

    +

      -
    • resultsPerPage (The maximum number of document results to be displayed per page. All results are displayed - by default; setting resultsPerPage to a positive integer creates a Show More/Show - All widget at the bottom of the batch of results.)
    • +
    • version (Specifies the unique version to append to the index, so that the browser will not + use cached versions of older index files.) +
      + + + + + +
      file(The path (relative to the config file) to a text file containing a single version + identifier (such as 1.5, 123456, or 06ad419).)
      +
      +
    -

    For most sites, where the number of results is likely to be in the low thousands, - it's perfectly practical to show all the results at once, because the staticSearch - processor is so fast. However, if you have tens of thousands of documents, and it's - possible that users will do (for example) filter-only searches that retrieve a large - proportion of them, you can constrain the number of results which are shown initially - using this setting. All the results are still generated and output to the page, but - since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, - the browser will render them much more quickly.

    +

    <version> enables you to specify the path to a plain-text file containing a simple version + number for the project. This might take the form of a software-release-style version + number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not + contain any spaces or punctuation. If you provide a version file, the version string + will be used as part of the filenames for all the JSON resources created for the search. + This is useful because it allows the browser to cache such resources when users repeatedly + visit the search page, but if the project is rebuilt with a new version, those cached + files will not be used because the new version will have different filenames. The + path specified is relative to the location of the configuration file (or absolute, + if you wish).
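
    For illustration, a version file could be wired in as below. The file name VERSION is hypothetical; the file is assumed to contain nothing but the identifier itself (for example 1.5 or 06ad419).

    ```xml
    <!-- Hypothetical path, resolved relative to the config file. -->
    <!-- The file content is a single identifier with no spaces or
         punctuation, which becomes part of the JSON resource filenames. -->
    <version file="VERSION"/>
    ```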

      -
    • outputFolder (The name of the output folder into which the index data and JavaScript will be placed - in the site search. This should conform with the XML Name specification.)
    • +
    • output (Sets the folder into which the index data and JavaScript will be placed.) +
      + + + + + +
      dir(A pointer to a local directory.)
      +
      +

    When the staticSearch build process creates its output, many files need to be added to the website for which an index is being created. For convenience, all of these files are stored in a single folder. This element is used to specify the name of that folder. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use this element if you are defining two different searches within the same site, so that - their files are kept in different locations.

    + their files are kept in different locations.

    @@ -810,8 +867,8 @@

    8.5.3 Specifying rules
    The <rules> element specifies a list of conditions (using the <rule> element) that tell the parser, using XPath statements in the match attribute, specific weights to assign to particular parts of each document. For instance, if you want all heading elements (<h1>, <h2>, etc.) in documents to be given a greater weight and thus receive a higher score in the results, you can do so using a rule like so: -
    <rules>
    <rule weight="2"
    match="h1 | h2 | h3 | h4 | h5 | h6"/>

    </rules>
    Since we're using XPath 3.0 and XSLT 3.0, this can also be simplified to: -
    <rules>
    <rule weight="2"
    match="*[matches(local-name(),'^h\d+$')]"/>

    </rules>
    (It is worth noting, however, the above example is unnecessary: all heading elements +
    <rules>
    <rule weight="2"
    match="h1 | h2 | h3 | h4 | h5 | h6"/>

    </rules>
    Since we're using XPath 3.0 and XSLT 3.0, this can also be simplified to: +
    <rules>
    <rule weight="2"
    match="*[matches(local-name(),'^h\d+$')]"/>

    </rules>
    (It is worth noting, however, the above example is unnecessary: all heading elements are given a weight of 2 by default, which is the only preconfigured weight in staticSearch.)

    The value of the match attribute is transformed into an XSLT template match attribute, and thus must follow the same rules (i.e. no complex rules like p/ancestor::div). See the W3C XSLT Specification for further details on allowable pattern rules.

    @@ -820,8 +877,8 @@

    8.5.3 Specifying rules index its contents on every page. These elements can be ignored simply by using a <rule> and setting its weight to 0. For instance, if you want to remove the header and the footer from the search indexing process, you could write something like: -
    <rule weight="0" match="footer | header"/>
    Or if you want to remove XHTML anchor tags (<a>) whose text is identical to the URL specified in its href, you could do something like: -
    <rule weight="0" match="a[@href=./text()]"/>
    +
    <rule weight="0" match="footer | header"/>
    Or if you want to remove XHTML anchor tags (<a>) whose text is identical to the URL specified in its href, you could do something like: +
    <rule weight="0" match="a[@href=./text()]"/>

    Note that the indexer does not tokenize any content in the <head> of the document (but as noted in 8.1 Configuring your site: search filters, metadata can be configured into filters) and that all elements in the <body> of a document are considered tokenizable. However, common elements that you might want to exclude include:

    @@ -834,7 +891,7 @@

    8.5.3 Specifying rules

    8.5.4 Specifying contexts (optional)

      -
    • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
    • +
    • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
    • context (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.)
      @@ -858,22 +915,22 @@

      8.5.4 Specifying conte
      When staticSearch creates the keyword-in-context strings (the "KWICs" or "snippets") for each token, it does so by looking for the nearest block-level element that it can use as its context. Take, for instance, this unordered list: -
      <ul>
      <li>Keyword-in-context search results. This is also configurable, since including contexts
      increases the size of the index.</li>
      <li>Search filtering using any metadata you like, allowing users to limit their search +
      <ul>
      <li>Keyword-in-context search results. This is also configurable, since including contexts
      increases the size of the index.</li>
      <li>Search filtering using any metadata you like, allowing users to limit their search to specific
      document types.</li>
      </ul>
      Each <li> element is, by default, a context element, meaning that the snippet generated for each token will not extend beyond the <li> element boundaries; in this case, if the <li> were not a context element, the term ‘search’ would produce a context that looks something like: -
      "...the size of the index.Search filtering using any metadata you like,..."
      +
      "...the size of the index.Search filtering using any metadata you like,..."
      Using the <contexts> element, you can control what elements operate as contexts. For instance, say a page contained a marginal note, encoded as a <span> in your document beside its point of attachment:1 -
      <p>About that program I shall have nothing to say here,<span class="sidenote">Some information on this subject can be found in "Second Thoughts"</span> [...]
      </p>
      Using CSS, the footnote might be alongside the text of the document in margin, or +
      <p>About that program I shall have nothing to say here,<span class="sidenote">Some information on this subject can be found in "Second Thoughts"</span> [...]
      </p>
      Using CSS, the footnote might be alongside the text of the document in margin, or made into a clickable object using Javascript. However, since the tokenizer is unaware of any server-side processing, it understands the <span> as an inline element and assumes the <p> constitutes the context of the element. A search for ‘information’ might then return: -
      "...nothing to say here,Some information on this subject can be found...
      To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with a match pattern: -
      <contexts>
      <context match="span[contains-token(@class,'sidenote')]"/>
      </contexts>
      +
      "...nothing to say here,Some information on this subject can be found...
      To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with a match pattern: +
      <contexts>
      <context match="span[contains-token(@class,'sidenote')]"/>
      </contexts>
      You can also configure it the other way: if a <div>, which is by default a context block, should not be understood as a context block, then you can tell the parser to not consider it as such using context set to false: -
      <contexts>
      <context match="div" context="false"/>
      </contexts>
      +
      <contexts>
      <context match="div" context="false"/>
      </contexts>

      The default context elements are:

        @@ -907,7 +964,7 @@

        8.5.5 Specifying searc
        The <context> mechanism provides a way to specify particular components of a page that can be searched within using the label attribute.
          -
        • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
        • +
        • contexts (The set of context elements that identify contexts for keyword-in-context fragments.)
        • context (A context definition, providing a match attribute that identifies the context, allowing keyword-in-context fragments to be bounded by a specific context.)
          @@ -930,7 +987,7 @@

          8.5.5 Specifying searc users to perform a search within only a particular component of the page. For instance, for a page structured like the journal article mentioned above, we could specify the abstract, the notes, and the document’s body like so: -
          <contexts>
          <context match="article[@id='article_content']"
          label="Article text only"/>

          <context match="div[contains-token(@class,'footnote')]"
          label="Notes only"/>

          <context match="section[@id='abstract']"
          label="Abstracts only"/>

          <context match="span[contains-token(@class,'inline-note')]"
          label="Notes only"/>

          </contexts>
          The generated search page will then contain a set of checkboxes derived from the +
          <contexts>
          <context match="article[@id='article_content']"
          label="Article text only"/>

          <context match="div[contains-token(@class,'footnote')]"
          label="Notes only"/>

          <context match="section[@id='abstract']"
          label="Abstracts only"/>

          <context match="span[contains-token(@class,'inline-note')]"
          label="Notes only"/>

          </contexts>
          The generated search page will then contain a set of checkboxes derived from the distinct label values. There is no requirement for the label values to be distinct, but any identical labels will be treated as identical contexts (i.e. in the example above, searching for a string within "Notes only" will return all results found within both the div elements with a class="footnote" and the span elements with class="inline-note".)
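The way identical labels collapse into one searchable context can be pictured with a small JavaScript sketch. This is not staticSearch's own code: the `contexts` array and the `groupByLabel` function are hypothetical names used only to illustrate the rule described above.

```javascript
// Hypothetical input: one entry per labelled <context> element in the config.
const contexts = [
  { match: "article[@id='article_content']", label: "Article text only" },
  { match: "div[contains-token(@class,'footnote')]", label: "Notes only" },
  { match: "section[@id='abstract']", label: "Abstracts only" },
  { match: "span[contains-token(@class,'inline-note')]", label: "Notes only" },
];

// Group match patterns under their label, keeping first-seen label order.
// Each distinct label corresponds to one checkbox on the search page.
function groupByLabel(entries) {
  const groups = new Map();
  for (const { match, label } of entries) {
    if (!groups.has(label)) {
      groups.set(label, []);
    }
    groups.get(label).push(match);
  }
  return groups;
}

const checkboxes = groupByLabel(contexts);
```

With the four entries above, the map holds three labels, and "Notes only" covers both the footnote `div` pattern and the inline-note `span` pattern.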

          @@ -965,17 +1022,17 @@

          8.5.6 Specifying exclu ignore filter controls (HTML <meta> elements, as described in 5 Search facet features) which are provided to support other search pages.

          A complex site may have two or more search pages targeting specific types of document or content, each of which may need its own particular search controls and indexes. - This can easily be achieved by specifying a different <searchFile> and <outputFolder> in the configuration file for each search.

          + This can easily be achieved by specifying a different <searchPage> and <output> in the configuration file for each search.

          For these searches to be different from each other, they will also probably have different contexts and rules. For example, imagine that you are creating a special search page that focuses only on the text describing images or figures in your documents. You might do it like this: -
          <rules>
          <rule match="text()[not(ancestor::div[@class='figure']or ancestor::title)]"
          weight="0"/>

          </rules>
          This specifies that all text nodes which are not part of the document title or descendants +
          <rules>
          <rule match="text()[not(ancestor::div[@class='figure']or ancestor::title)]"
          weight="0"/>

          </rules>
          This specifies that all text nodes which are not part of the document title or descendants of <div class="figure"> should be ignored (weight=0), so only your target nodes will be indexed.

          However, it's also likely that you will want to exclude certain features or documents from a specialized search page, and this is done using the <excludes> section and its child <exclude> elements.

          Here is an example: -
          <excludes>
          <!-- We only index files which have illustrations in them. -->
          <exclude type="index"
          match="html [not( descendant::meta [@name='Has illustration(s)'] [@content='true'] +
          <excludes>
          <!-- We only index files which have illustrations in them. -->
          <exclude type="index"
          match="html [not( descendant::meta [@name='Has illustration(s)'] [@content='true'] )]"/>

          <!-- We ignore the document type filter, because we are only indexing one type of document anyway. -->
          <exclude type="filter"
          match="meta[ @name='Document type' ]"/>

          <!-- We exclude the filter that specifies @@ -993,12 +1050,12 @@

          8.6 Creating a search also be well-formed XML, so it can be processed), containing all the site components you need, and then the search build process will insert all the necessary components into that file. The only requirement is that the page contains one <div> element with the correct id attribute: -
          <div id="staticSearch">
          [...content will be supplied by the build process...]
          </div>
          This <div> will be empty initially. The build process will insert the search controls, +
          <div id="staticSearch">
          [...content will be supplied by the build process...]
          </div>
          This <div> will be empty initially. The build process will insert the search controls, scripts and results <div> into this container. Then whenever you rebuild the search for your site, the contents will be replaced. There is no need to make sure it's empty every time.

          The search process will also add a link to the staticSearch CSS file to the <head> of the document: -
          <link rel="stylesheet"
          href="staticSearch/ssSearch.css" id="ssCss"/>
          You can customize this CSS by providing your own CSS that overrides it, using <style>, or <link>, placed after it in the <head> element, or by replacing the inserted CSS after the build process. Note that some - features, like <resultsPerPage> or the ‘Searching’ loading dialog, rely on rules included in the base staticSearch CSS; if you do remove +
          <link rel="stylesheet"
          href="staticSearch/ssSearch.css" id="ssCss"/>
          You can customize this CSS by providing your own CSS that overrides it, using <style>, or <link>, placed after it in the <head> element, or by replacing the inserted CSS after the build process. Note that some + features, like the ‘Show More’ widget or the ‘Searching’ loading dialog, rely on rules included in the base staticSearch CSS; if you do remove or disable the CSS, then some features may not work properly.

          Note that once your file has been processed and all this content has been added, you can process it again at any time; there is no need to start every time with a clean, @@ -1017,7 +1074,7 @@

          8.7 Running the search

          Before running the search on your own site, you can test that your system is able to do the build by doing the (very quick) build of the test materials. If you simply run the ant command, like this:

          -
          mholmes@linuxbox:~/Documents/staticSearch$ ant
          +
          mholmes@linuxbox:~/Documents/staticSearch$ ant

          you should see a build process proceed using the small test collection of documents, and at the end, a results page should open up giving you a report on what was done. If this fails, then you'll need to troubleshoot the problem based on any error messages @@ -1027,7 +1084,7 @@

          8.7 Running the search

          If the tests all work, then you're ready to build a search for your own site. Now you need to run the same command, but this time, tell the build process where to find your custom configuration file:2

          -
          ant -DssConfigFile=/home/mholmes/mysite/config_staticSearch.xml
          +
          ant -DssConfigFile=/home/mholmes/mysite/config_staticSearch.xml

          The same process should run, and if it's successful, you should have a modified search.html page as well as a lot of index files in JSON format in your site HTML folder. Now you can test your own search in the same ways suggested above.

          @@ -1039,35 +1096,35 @@

          8.8 Running staticSear using ssConfig or an absolute path using ssConfigFile). Assuming that the build file, your config file, and your staticSearch directory are all at the root of the project, you could call the staticSearch build in ant like so: -
          <ant antfile="${basedir}/staticSearch"
          inheritall="false">

          <property name="ssConfig"
          value="staticSearch_config.xml"/>

          </ant>
          +
          <ant antfile="${basedir}/staticSearch"
          inheritall="false">

          <property name="ssConfig"
          value="staticSearch_config.xml"/>

          </ant>

          Note that any arguments passed to ant at the command line will be passed on to the staticSearch build. This can cause issues when the main build requires the use of the -lib parameter (since the project's version of Saxon may conflict, for instance, with the version used by staticSearch). If your build requires the use of the -lib parameter, then an alternative approach for calling staticSearch from your build is to use the exec task like so: -
          <exec executable="ant" dir="staticSearch">
          <arg value="-DssConfig=../config_staticSearch.xml"/>
          </exec>
          +
          <exec executable="ant" dir="staticSearch">
          <arg value="-DssConfig=../config_staticSearch.xml"/>
          </exec>

        8.9 Generated report

        After indexing your HTML files, the staticSearch build then generates an HTML report of helpful statistics and diagnostics about your document collection, which can be - found in the directory specified by <outputFolder>. We recommend looking at this file regularly, especially if you're encountering unexpected + found in the directory specified by <output>. We recommend looking at this file regularly, especially if you're encountering unexpected behaviour by the staticSearch engine, as it contains information that can often help diagnose issues with configured filters or the HTML document collection that, if fixed, can improve staticSearch results.

        By default, the report includes only basic information about the number of stem files created, the filters used, and any problems encountered. However, if you run the build process using the additional parameter ssVerboseReport:

        -
        ant -DssVerboseReport=true -DssConfigFile=...
        +
        ant -DssVerboseReport=true -DssConfigFile=...

        then the report will also include a number of tables that outline some statistics about your project. However, please note that compiling these statistics is very memory-intensive and if your site is large, it may cause the build process to run out of memory.

        As of version 1.4, the word frequency table is a separate document and is no longer included as part of the verbose report. Instead, after running a build, you can then build just the word frequency table with the special concordance target:

        -
        ant -DssConfigFile=path/to/your/config.xml concordance
        +
        ant -DssConfigFile=path/to/your/config.xml concordance

        While the chart itself is not necessary for the core functionality of staticSearch, it is particularly useful during the initial development of a project’s search engine; it can be used to create and fine-tune the project-specific stopword list (i.e. if @@ -1091,18 +1148,18 @@

        8.10.1 Custom attribut result string (which is in the form of an HTML <li> element).

        Imagine that some of the paragraphs in your documents are special in some way. You could add an attribute whose name begins with data-ss- to each of those paragraphs, like this: -
        <p data-ss-type="special">This paragraph is special for some reason or other...</p>
        When the staticSearch indexer creates KWIC extracts, it automatically harvests any +
        <p data-ss-type="special">This paragraph is special for some reason or other...</p>
        When the staticSearch indexer creates KWIC extracts, it automatically harvests any attribute whose name begins with data-ss- from the containing element or its ancestors, and adds them to the keyword-in-context record in the index. Then when that KWIC string is displayed as the result of a search, the attribute will be added to the HTML <li> element on the page: -
        <li data-ss-type="special">[KWIC with marked search hit, link, etc.]</li>
        This means that you can add your own CSS or JavaScript to make that KWIC appear distinct +
        <li data-ss-type="special">[KWIC with marked search hit, link, etc.]</li>
        This means that you can add your own CSS or JavaScript to make that KWIC appear distinct from other KWICs which come from non-special paragraphs.
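As a rough illustration of the JavaScript route, the sketch below adds a CSS class to result items that carry data-ss-type="special". The function name `markSpecialKwics`, the class name `specialKwic`, and the selector in the usage comment are all assumptions made for this example; none of them is part of staticSearch.

```javascript
// Add a CSS class to every result item whose data-ss-type is "special".
// `specialKwic` is a hypothetical class name; define it in your own CSS.
// Returns the number of items that were marked.
function markSpecialKwics(items) {
  let count = 0;
  for (const li of items) {
    if (li.getAttribute('data-ss-type') === 'special') {
      li.classList.add('specialKwic');
      count++;
    }
  }
  return count;
}

// In a browser, once results have been rendered (selector is an assumption):
// markSpecialKwics(document.querySelectorAll('li[data-ss-type]'));
```

A CSS-only alternative is an attribute selector such as `li[data-ss-type="special"]` in your own stylesheet, which avoids scripting altogether.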

        You can add as many custom attributes as you like (although bear in mind that they increase the size of the index JSON files slightly and may add to the build time).

        One specific custom attribute has built-in handling that you may find useful. If you add the attribute data-ss-img with a value that points to an image, that image will be displayed to the left of the KWIC string. For example, if you do this: -
        <p data-ss-img="images/elephant.png">This paragraph is all about elephants...</p>
        then any KWIC results from that paragraph will show the elephant.png image to the left of the KWIC text. This can be especially useful if your site contains +
        <p data-ss-img="images/elephant.png">This paragraph is all about elephants...</p>
        then any KWIC results from that paragraph will show the elephant.png image to the left of the KWIC text. This can be especially useful if your site contains large documents which are broken into sections, and those sections can be helpfully represented by images; the search results will be easier for the user to understand by virtue of the associated images. Image URLs should be relative to the location @@ -1127,7 +1184,7 @@

        8.10.2 Highlighting se

        Those links are also provided with a search string, like this: https://example.com/egPage.html?ssMark=elephant#animals This link points to the section of the document which has id=animals, but it also says ‘the hit text is the word elephant.’ Some JavaScript that runs on the target page, egPage.html (which you control) will be able to parse the value of the query parameter ssMark in order to find the hit text, and highlight it in some way.

        Obviously you can implement this any way you like (or just ignore it), but we also supply a small demonstration JavaScript library which implements this functionality, - called ssHighlight.js. This JS file is included into the staticSearch output folder (see <outputFolder>) by default, and if you include it into the header of your own pages, it will probably + called ssHighlight.js. This JS file is included into the staticSearch output folder (see <output>) by default, and if you include it into the header of your own pages, it will probably do the highlighting without further intervention. If, however, you have lots of existing JavaScript that runs when the page loads, there may be some interference between this library and your own code, so you may have to make some adjustments to the code.
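If you prefer to write your own handler rather than use ssHighlight.js, the first step is reading the hit text from the link. The sketch below shows only that step, using the standard URL API; the function names `getHitText` and `getFragmentId` are hypothetical, and this is not the code used in ssHighlight.js.

```javascript
// Return the value of the ssMark query parameter, or null if it is absent.
function getHitText(href) {
  const url = new URL(href);
  return url.searchParams.get('ssMark');
}

// Return the fragment identifier without the leading "#", or null if none.
function getFragmentId(href) {
  const hash = new URL(href).hash;
  return hash ? hash.slice(1) : null;
}

const link = 'https://example.com/egPage.html?ssMark=elephant#animals';
const hitText = getHitText(link);
const fragmentId = getFragmentId(link);
```

On the target page you would pass `window.location.href` instead of a literal string, then search the element with the returned id for the hit text and wrap matches in whatever markup you want to style.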

        @@ -1200,7 +1257,7 @@

        9.2 The search page9.3 JavaScript compilation

        The search page created for your website is entirely driven by JavaScript. The JavaScript source code can be found in a number of .js files inside the repository js folder. At build time, these files (with the exception of ssHighlight.js and ssInitialize.js) are first concatenated into a single large file called ssSearch-debug.js. This file is then optimized using the Google Closure Compiler, to create a smaller file called ssSearch.js which should be faster for the browser to download and parse. Both of these output - files are provided in your project <outputFolder>; ssSearch.js is linked in your search page, but if you're having problems and would like to debug + files are provided in your project <output>; ssSearch.js is linked in your search page, but if you're having problems and would like to debug with more human-friendly JavaScript, you can switch that link to point to ssSearch-debug.js.

        We are still experimenting with the options and affordances of the Closure compiler, in the interests of finding the best balance between file size and performance.

        @@ -1227,7 +1284,7 @@

        10 ‘ How do I get staticSearch to ignore large chunks of my document? Any element with a weight of 0 is ignored completely by the indexer, so add a <rule> for the element. So to ignore all elements with the class ignoreThisElement, you could do something like: -
        <rule weight="0"
        match="div[contains-token(@class, 'ignoreThisElement')]"/>
        +
        <rule weight="0"
        match="div[contains-token(@class, 'ignoreThisElement')]"/>
        @@ -1238,31 +1295,28 @@

        10 ‘ How do I get staticSearch to ignore an element, but retain its text in the KWIC? Here, you'll want to use the <exclude> function, which excludes the element from indexing, but doesn't remove it from the document itself. So, if you wanted to exclude all keyboard entry items (<xh:kbd>), but still have them in the KWIC, you could do something like: -
        <exclude match="kbd"
        type="index"/>
        +
        <exclude match="kbd"
        type="index"/>
        How can I get staticSearch to show debugging messages? Set the ssVerbose property to true at the command line: -
        ant -DssConfig=cfg.xml -DssVerbose=true 
        Note that verbosity settings persist after creating the initial config; so, if you +
        ant -DssConfig=cfg.xml -DssVerbose=true 
        Note that verbosity settings persist after creating the initial config; so, if you are trying to debug just the tokenization process, you must make sure to run the config target beforehand: -
        ant config tokenize -DssConfig=cfg.xml -DssVerbose=true 
        +
        ant config tokenize -DssConfig=cfg.xml -DssVerbose=true 
        How can I get staticSearch to highlight the found text in a target document? - There are two approaches to this: you could implement a JavaScript solution as explained - in Highlighting search hits on target pages, or you could turn on the <scrollToFragmentId> experimental feature supported by some Chromium-based browsers. The former requires - some modification to your site pages to add some JavaScript, while the latter is non-standard - and not really reliable or consistent. + See Highlighting search hits on target pages. How do I prevent staticSearch from encountering out of memory errors? If you are indexing a very large collection of files, you may need to provide ant with additional memory by configuring the ANT_OPTS system property. To provide ant with 4GB of memory, you could do something like so: -
        export ANT_OPTS="-Xmx4g"; ant -DjavaFork=false -DssConfigFile=/absolute/path/to/your/config.xml
        How much memory you can and should provide to Ant depends on your particular system +
        export ANT_OPTS="-Xmx4g"; ant -DjavaFork=false -DssConfigFile=/absolute/path/to/your/config.xml
        How much memory you can and should provide to Ant depends on your particular system and the size of the document collection. See Ant's documentation for some further examples and explanation. The javaFork parameter prevents calls to Java processes (such as Saxon) from forking into a new Java VM, which allows them to take advantage of the expanded memory you have assigned to Ant. @@ -1295,7 +1349,7 @@

        11.1 Changes in versio <verbose> The verbose option has been removed and replaced by the ssVerbose property in ant. To get debugging messages, set the ssVerbose parameter to true (other accepted values: t, yes, y, 1) -
        ant -DssConfig=cfg.xml -DssVerbose=true 
        +
        ant -DssConfig=cfg.xml -DssVerbose=true 
        @@ -1305,12 +1359,12 @@

        11.1 Changes in versio tool jq). - <linkToFragmentId> - <linkToFragmentId>, which controlled whether the search results should link back to the nearest document + <linkToFragmentId> + <linkToFragmentId>, which controlled whether the search results should link back to the nearest document fragment with an id, has been made part of the default behaviour of staticSearch and is no longer configurable. However, if you do not want the link to the nearest fragment to appear in the results, you can visually hide the link element with the fidLink class in your site's CSS: -
        .fidLink{ display:none; } 
        +
        .fidLink{ display:none; } 
        @@ -1377,7 +1431,7 @@

        11.5 Changes in versio
      • The staticSearch report has been simplified and no longer produces a concordance of stems by default. The concordance can be built at the command line by calling the concordance target in ant: -
        ant concordance -DssConfig=cfg.xml
        +
        ant concordance -DssConfig=cfg.xml
      • The version attribute has been added to the root <config> element to better future-proof the alignment of configuration files and the staticSearch codebase. See 8.5.1 The config element for more details.
      • @@ -1460,13 +1514,13 @@

        11.6 Changes in versio
      • Documentation has been significantly improved with additional explanatory remarks for many elements, and the staticSearch build of the documentation now includes hit highlighting (the feature described above).
      • -
      • Only ancestor ids are indexed when <linkToFragmentId> is enabled; formerly, any preceding id value was used.
      • -
      • Results can now optionally be viewed in batches by setting the new <resultsPerPage> configuration option.
      • +
      • Only ancestor ids are indexed when <linkToFragmentId> is enabled; formerly, any preceding id value was used.
      • +
      • Results can now optionally be viewed in batches by setting the new <resultsPerPage> configuration option.
      • The maximum number of results that a search can return has been set to 2000 results - by default and can be changed using the new <resultsLimit> element. If a search returns a set of results that exceeds this limit, staticSearch + by default and can be changed using the new <resultsLimit> element. If a search returns a set of results that exceeds this limit, staticSearch does not render the results and advises the user to try a more precise search.
      • The minimum length of a word to be indexed is now configurable, so in unusual circumstances - you can now enable searching for 1- or 2-letter words using the <minWordLength> parameter.
      • + you can now enable searching for 1- or 2-letter words using the <minWordLength> parameter.
      • The staticSearch report is now discussed in the documentation (see 8.9 Generated report) and the "Not in Dictionary" and "Foreign Words" reports have been improved.
      • The filter creation process has been rationalized such that all filter processing happens in json.xsl, which has also improved the build performance slightly.
      • @@ -1475,7 +1529,7 @@

        11.6 Changes in versio
        • The encoding structure for docImage, docSortKey, and docTitle has been constrained such that each doc* <meta> must include both a name and class value: -
          <meta name="docTitle"
          class="staticSearch_docTitle" content="My custom document title"/>
          +
          <meta name="docTitle"
          class="staticSearch_docTitle" content="My custom document title"/>
        • Temporary XML files from dictionaries are now removed during the clean step of the build process.
        • All HTML characters are properly escaped in context snippets.
        • @@ -1487,7 +1541,7 @@

          11.7 Changes in versio
          • Search results can now be sorted using a user-supplied sort key. This is useful when searching only with filters (so all documents have the same relevance score) or where many results have the same relevance score.
          • -
          • Using the new <linkToFragmentId> parameter, keyword-in-context results can now have individual links to nearest ancestor +
          • Using the new <linkToFragmentId> parameter, keyword-in-context results can now have individual links to nearest ancestor fragment id, so the searcher can go directly to the relevant section of a document.
          • The order of parameter elements in the configuration file is no longer fixed. The schema now allows elements to appear in any order.
          • @@ -1653,7 +1707,7 @@

            Appendix A.1.1 <con -
            <config version="1">
            <params>
            <!--Config options-->
            </params>
            </config>
            +
            <config version="1">
            <params>
            <!--Config options-->
            </params>
            </config>
            @@ -1691,7 +1745,7 @@

            Appendix A.1.1 <con Content model -
            +                              
             <content>
              <sequence minOccurs="1" maxOccurs="1">
               <elementRef key="params"/>
            @@ -1701,18 +1755,18 @@ 

            Appendix A.1.1 <con <elementRef key="filters" minOccurs="0"/> </sequence> </content> -

            +
            Schema Declaration -
            +                              
             element config
             {
                attribute version { text }?,
                ( params, rules?, contexts?, excludes?, filters? )
            -}
            +}
            @@ -1814,24 +1868,24 @@

            Appendix A.1.2 <con Content model -
            +                              
             <content>
              <empty/>
             </content>
            -    
            +
            Schema Declaration -
            +                              
             element context
             {
                att.match.attributes,
                att.labelled.attributes,
                attribute context { text }?,
                empty
            -}
            +}
            @@ -1842,7 +1896,7 @@

            Appendix A.1.3 <con
            - + @@ -1873,19 +1927,19 @@

            Appendix A.1.3 <con

            <contexts> (The set of context elements that identify contexts for keyword-in-context fragments.)<contexts> (The set of context that identify contexts for keyword-in-context fragments.)
            Namespace
            Content model -
            +                              
             <content>
              <elementRef key="context" minOccurs="1"
               maxOccurs="unbounded"/>
             </content>
            -    
            +
            Schema Declaration -
            -element contexts { context+ }
            +
            +element contexts { context+ }
            @@ -1906,6 +1960,14 @@

            Appendix A.1.4 <cre Module ss — Schema specification and tag documentation + + Attributes + +
            +
            +
            + + Contained by @@ -1918,50 +1980,54 @@

            Appendix A.1.4 <cre May contain - -
            -
            XSD boolean
            -
            - + Empty element Note -

            <createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context - extracts for each of the hits in a document. This increases the size of the index, - but of course it makes for much more user-friendly search results; instead of seeing - just a score for each document found, the user will see a series of short text strings - with the search keyword(s) highlighted.

            Note that contexts are necessary for phrasal searching or wildcard searching.

            Content model -
            +                              
             <content>
            - <dataRef name="boolean"/>
            + <empty/>
             </content>
            -    
            +
            Schema Declaration -
            -element createContexts { xsd:boolean }
            +
            +element createContexts
            +{
            +   (
            +      ( attribute create { "false" }? )
            +    | (
            +         attribute create { "true" }?,
            +         attribute phrasalSearch { text }?,
            +         attribute wildcardSearch { text }?,
            +         attribute maxKwicsToHarvest { text }?,
            +         attribute maxKwicLength { text }?,
            +         attribute kwicTruncateString { text }?
            +      )
            +   ),
            +   empty
            +}

        -
        -

        Appendix A.1.5 <dictionaryFile>

        +
        +

        Appendix A.1.5 <dictionary>

        - + @@ -1971,6 +2037,36 @@

        Appendix A.1.5 <dic

        + + + + - + @@ -1997,25 +2089,26 @@

        Appendix A.1.5 <dic any direct effect in the indexing process, all words not found in the dictionary are listed in the staticSearch report (see 8.9 Generated report). This can be very useful: all words listed are either foreign (not part of the language of the dictionary) or perhaps misspelled (in which case they may not be correctly - stemmed and indexed, and should be corrected). There is a default dictionary in xsl/english_words.txt which you might copy and adapt if you're working in English; lots of dictionaries - for other languages are available on the Web.

        + stemmed and indexed, and should be corrected).

        +

        staticSearch provides a default dictionary in xsl/english_words.txt that can be copied and adapted if working in English; lots of dictionaries for other + languages are available on the Web.

        <dictionaryFile> (The relative path (from the config file) to a dictionary file (one word per line) - which will be used to check tokens when indexing.)<dictionary> (Specifies a dictionary against which tokens may be checked during indexing.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        file(The relative path (from the config file) to a dictionary file (one word per line).) +
        + + + + + + + + + + + + + +
        Derived fromss.atts.file
        StatusRequired
        DatatypeanyURI
        +
        +
        +
        +
        Contained by @@ -1983,11 +2079,7 @@

        Appendix A.1.5 <dic

        May contain -
        -
        XSD anyURI
        -
        -
        Empty element
        Note
        Content model -
        +                              
         <content>
        - <dataRef name="anyURI"/>
        + <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element dictionaryFile { xsd:anyURI }
        +
        +element dictionary { attribute file { text }, empty }
        @@ -2106,23 +2199,23 @@

        Appendix A.1.6 <exc Content model -
        +                              
         <content>
          <empty/>
         </content>
        -    
        +
        Schema Declaration -
        +                              
         element exclude
         {
            att.match.attributes,
            attribute type { "index" | "filter" },
            empty
        -}
        +}
        @@ -2165,19 +2258,19 @@

        Appendix A.1.7 <exc Content model -
        +                              
         <content>
          <elementRef key="exclude" minOccurs="1"
           maxOccurs="unbounded"/>
         </content>
        -    
        +
        Schema Declaration -
        -element excludes { exclude+ }
        +
        +element excludes { exclude+ }
        @@ -2249,19 +2342,19 @@

        Appendix A.1.8 <fil Content model -
        +                              
         <content>
          <elementRef key="span" minOccurs="1"
           maxOccurs="unbounded"/>
         </content>
        -    
        +
        Schema Declaration -
        -element filter { attribute filterName { text }, span+ }
        +
        +element filter { attribute filterName { text }, span+ }
        @@ -2305,32 +2398,30 @@

        Appendix A.1.9 <fil Content model -
        +                              
         <content>
          <elementRef key="filter" minOccurs="1"
           maxOccurs="unbounded"/>
         </content>
        -    
        +
        Schema Declaration -
        -element filters { filter+ }
        +
        +element filters { filter+ }

        -
        -

        Appendix A.1.10 <kwicTruncateString>

        +
        +

        Appendix A.1.10 <index>

        - + @@ -2340,6 +2431,44 @@

        Appendix A.1.10 <kw

        + + + + - - - - - - - - - - - - - +
        <kwicTruncateString> (The string that will be used to signal ellipsis at the beginning and end of a keyword-in-context - extract. Conventionally three periods, or an ellipsis character (which is the default - value).)<index> (Configures options relating to indexing.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        recurse(Determines whether or not to recurse into the subdirectories of the collection and + index those files.) +
        + + + + + + + + + + + + + + + + + +
        StatusRecommended
        Datatypeboolean
        Defaultfalse
        Note +

        This is useful for static sites that create nested directory structures (such as those + generated from Jekyll or Wordpress).

        +
        +
        +
        +
        +
        Contained by @@ -2352,43 +2481,17 @@

        Appendix A.1.10 <kw

        May containCharacter data only
        Note -

        The only reason you might need to specify a value for this parameter is if the language - of your search page conventionally uses a different ellipsis character. Japanese, - for example, uses the 3-dot-leader character.

        -
        Content model -
        -<content>
        - <textNode/>
        -</content>
        -    
        -
        Schema Declaration -
        -element kwicTruncateString { text }
        -
        Empty element
        -
        -

        Appendix A.1.11 <linkToFragmentId>

        +
        +

        Appendix A.1.11 <output>

        - + @@ -2399,261 +2502,45 @@

        Appendix A.1.11 <li

        - - - - - + - - - - - - - - - - - - - -
        <linkToFragmentId> (Whether to link keyword-in-context extracts to the nearest id in the document. Default - is true.)<output> (Sets the folder into which the index data and JavaScript will be placed.)
        Namespacess — Schema specification and tag documentation
        Contained by -
        -
        May containAttributes -
        -
        XSD boolean
        -
        -
        Note -

        <linkToFragmentId> is a boolean parameter that specifies whether you want the search engine to link - each keyword-in-context extract with the closest element that has an id. If the element has an ancestor with an id, then the indexer will associate that keyword-in-context extract with that id; if there are no suitable ancestor elements that have an id, then the extract is associated with first preceding element with an id.

        -

        We strongly recommend that you ensure your target documents have id attributes for - any significant divisions so that this parameter can be used effectively. With lots - of ids throughout your documents, and this parameter turned on, each keyword-in-context - in the results page will be linked directly to the section of the document in which - the hit appears, making the search results much more useful.

        -
        Content model -
        -<content>
        - <dataRef name="boolean"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element linkToFragmentId { xsd:boolean }
        -
        -
        -
        -
        -

        Appendix A.1.12 <maxKwicsToHarvest>

        -
        - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        <maxKwicsToHarvest> (This controls the maximum number of keyword-in-context extracts that will be stored - for each term in a document.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May contain -
        -
        XSD nonNegativeInteger
        -
        -
        Note -

        <maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the - data for each term in a document. For example, if a user searches for the word ‘elephant’, and it occurs 27 times in a document, but the <maxKwicsToHarvest> value is set to 5, then only the first five (sorted in document order) of these keyword-in-context - strings will be stored in the index. (This does not affect the score of the document - in the search results, of course.) If you set this to a low number, the size of the - JSON files will be constrained, but of course the user will only be able to see the - KWICs that have been harvested in their search results.

        -

        If <phrasalSearch> is set to true, the <maxKwicsToHarvest> setting is ignored, because phrasal searches will only work properly if all contexts - are stored.

        -
        Content model -
        -<content>
        - <dataRef name="nonNegativeInteger"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element maxKwicsToHarvest { xsd:nonNegativeInteger }
        -
        -
        -
        -
        -

        Appendix A.1.13 <maxKwicsToShow>

        -
        - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        <maxKwicsToShow> (This controls the maximum number of keyword-in-context extracts that will be shown - in the search page for each hit document returned.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May contain -
        -
        XSD nonNegativeInteger
        -
        -
        Note -

        A user may search for multiple common words, so hundreds of hits could be found in - a single document. If the keyword-in-context strings for all these hits are shown - on the results page, it would be too long and too difficult to navigate. This setting - controls how many of those hits you want to show for each document in the result set.

        -
        Content model -
        -<content>
        - <dataRef name="nonNegativeInteger"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element maxKwicsToShow { xsd:nonNegativeInteger }
        -
        -
        -
        -
        -

        Appendix A.1.14 <minWordLength>

        -
        - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        <minWordLength> (The minimum length of a term to be indexed. Default is 3 characters.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May contain -
        -
        XSD nonNegativeInteger
        +
        + + + + + +
        dir(A pointer to a local directory.) +
        + + + + + + + + + + + + + + + + + + + + + +
        Derived fromss.atts.dir
        StatusRequired
        DatatypeNCName
        DefaultstaticSearch
        Note +

        This should conform with the XML Name specification.

        +
        +
        +
        Note -

        <minWordLength> specifies the minimum length in characters of a sequence of text that will be considered - to be a word worth indexing. The default is 3, on the basis that in most European - languages, words of one or two letters are typically not worth indexing, being articles, - prepositions and so on. If you set this to a lower limit for reasons specific to your - project, you should ensure that your stopword list excludes any very common words - that would otherwise make the indexing process lengthy and increase the index size.

        -
        Content model -
        -<content>
        - <dataRef name="nonNegativeInteger"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element minWordLength { xsd:nonNegativeInteger }
        -
        -
        -
        -
        -

        Appendix A.1.15 <outputFolder>

        -
        - - - - - - - - - - - - - + @@ -2686,25 +2569,25 @@

        Appendix A.1.15 <ou

        <outputFolder> (The name of the output folder into which the index data and JavaScript will be placed - in the site search. This should conform with the XML Name specification.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by @@ -2666,11 +2553,7 @@

        Appendix A.1.15 <ou

        May contain -
        -
        XSD NCName
        -
        -
        Empty element
        Note
        Content model -
        +                              
         <content>
        - <dataRef name="NCName"/>
        + <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element outputFolder { xsd:NCName }
        +
        +element output { attribute dir { text }, empty }
        -

        Appendix A.1.16 <params>

        +

        Appendix A.1.12 <params>

        @@ -2733,193 +2616,48 @@

        Appendix A.1.16 <pa

        - - - - - -
        May contain
        Content model -
        +                              
         <content>
        - <elementRef key="searchFile"/>
        - <elementRef key="versionFile"
        -  minOccurs="0"/>
        - <elementRef key="stemmerFolder"
        -  minOccurs="0"/>
        - <elementRef key="recurse"/>
        - <elementRef key="minWordLength"
        -  minOccurs="0"/>
        + <elementRef key="searchPage"/>
        + <elementRef key="index" minOccurs="0"/>
        + <elementRef key="stopwords" minOccurs="0"/>
        + <elementRef key="dictionary" minOccurs="0"/>
        + <elementRef key="tokenizer" minOccurs="0"/>
          <elementRef key="scoringAlgorithm"
           minOccurs="0"/>
        - <elementRef key="phrasalSearch"
        -  minOccurs="0"/>
        - <elementRef key="wildcardSearch"
        -  minOccurs="0"/>
        + <elementRef key="stemmer" minOccurs="0"/>
          <elementRef key="createContexts"
           minOccurs="0"/>
        - <elementRef key="maxKwicsToHarvest"
        -  minOccurs="0"/>
        - <elementRef key="maxKwicsToShow"
        -  minOccurs="0"/>
        - <elementRef key="totalKwicLength"
        -  minOccurs="0"/>
        - <elementRef key="kwicTruncateString"
        -  minOccurs="0"/>
        - <elementRef key="stopwordsFile"
        -  minOccurs="1" maxOccurs="1"/>
        - <elementRef key="dictionaryFile"
        -  minOccurs="1" maxOccurs="1"/>
        - <elementRef key="outputFolder"
        -  minOccurs="0"/>
        - <elementRef key="resultsPerPage"
        -  minOccurs="0"/>
        - <elementRef key="resultsLimit"
        -  minOccurs="0"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element params {  }
        -
        -
        -
        -
        -

        Appendix A.1.17 <phrasalSearch>

        -
        - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
        <phrasalSearch> (Whether or not to support phrasal searches. If this is true, then the maxContexts - setting will be ignored, because all contexts are required to properly support phrasal - search.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May contain -
        -
        XSD boolean
        -
        -
        Note -

        Phrasal search functionality enables your users to search for specific phrases by - surrounding them with quotation marks ("), as in many search engines. In order to support this kind of search, <createContexts> must also be set to true as we store contexts for all hits for each token in each - document. Turning this on will make the index larger, because all contexts must be - stored, but once the index is built, it has very little impact on the speed of searches, - so we recommend turning this on. The default value is true.

        -

        However, if your site is very large and your user base is unlikely to use phrasal - searching, it may not be worth the additional build time and increased index size.

        -
        Content model -
        -<content>
        - <dataRef name="boolean"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element phrasalSearch { xsd:boolean }
        -
        -
        -
        -
        -

        Appendix A.1.18 <recurse>

        -
        - - - - - - - - - - - - - - - - - - - - - - -
        <recurse> (Whether to recurse into subdirectories of the collection directory or not.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May contain -
        -
        XSD boolean
        -
        -
        Content model -
        -<content>
        - <dataRef name="boolean"/>
        + <elementRef key="results" minOccurs="0"/>
        + <elementRef key="version" minOccurs="0"/>
        + <elementRef key="output"/>
         </content>
        -    
        +
        Schema Declaration -
        -element recurse { xsd:boolean }
        +
        +element params {  }
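The new content model above can be sketched as a config fragment. This is an assumption-laden example, not a canonical configuration: it assumes the child elements appear in the order listed in the content model, reuses the documented defaults where the attribute tables give them, and uses a hypothetical dictionary path. The optional `<createContexts>` and `<version>` elements are left out because their attributes are not shown in this section, and the enclosing root element is omitted.

```xml
<!-- Sketch only: attribute names come from the documentation in this diff. -->
<params xmlns="http://hcmc.uvic.ca/ns/staticSearch">
  <!-- Required: must sit at the root of the collection directory. -->
  <searchPage file="test/search.html"/>
  <!-- Optional: documented default for recurse is false. -->
  <index recurse="false"/>
  <!-- Optional: path is relative to the config file. -->
  <stopwords file="xsl/english_stopwords.txt"/>
  <!-- Optional: hypothetical path, one word per line. -->
  <dictionary file="dictionary.txt"/>
  <!-- Optional: documented default for minWordLength is 2. -->
  <tokenizer minWordLength="2"/>
  <!-- Optional: "raw" is the documented default. -->
  <scoringAlgorithm name="raw"/>
  <!-- Optional: one of the suggested values. -->
  <stemmer dir="stemmers/en/"/>
  <!-- Optional: 50 per page is an arbitrary choice; the other two are the defaults. -->
  <results resultsPerPage="50" maxKwicsToShow="25" maxResults="2000"/>
  <!-- Required: documented default directory name. -->
  <output dir="staticSearch"/>
</params>
```

Compared with the removed model, settings that used to be separate elements (for example `<recurse>`, `<minWordLength>`, `<maxKwicsToShow>` and `<resultsPerPage>`) now appear as attributes on `<index>`, `<tokenizer>` and `<results>`.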
        -
        -

        Appendix A.1.19 <resultsLimit>

        +
        +

        Appendix A.1.13 <results>

        - + @@ -2930,71 +2668,117 @@

        Appendix A.1.19 <re

        - - - - - + - - - - - - - - - - - - -
        <resultsLimit> (The maximum number of results that can be returned for any search before returning - an error; if the number of documents in a result set exceeds this number, then staticSearch - will not render the results and will provide a message saying that the search returned - too many results. This is usually set to 2000 by default, but you may want to have - a higher or lower limit, depending on the specific structure of your project.)<results> (Controls the configuration of the results page.)
        Namespacess — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May containAttributes -
        -
        XSD nonNegativeInteger
        +
        + + + + + + + + + + + + + +
        resultsPerPage(The maximum number of document results to be displayed per page. All results are displayed + by default; setting resultsPerPage to a positive integer creates a Show More/Show + All widget at the bottom of the batch of results.) +
        + + + + + + + + + + + + + + + + + +
        StatusOptional
        DatatypenonNegativeInteger
        Default0
        Note +

        For most sites, where the number of results is likely to be in the low thousands, + it's perfectly practical to show all the results at once, because the staticSearch + processor is so fast. However, if you have tens of thousands of documents, and it's + possible that users will do (for example) filter-only searches that retrieve a large + proportion of them, you can constrain the number of results which are shown initially + using this setting. All the results are still generated and output to the page, but + since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, + the browser will render them much more quickly.

        +
        +
        +
        maxKwicsToShow(Controls the maximum number of keyword-in-context extracts that will be shown in the + search page for each hit document returned.) +
        + + + + + + + + + + + + + + + + + +
        StatusOptional
        DatatypenonNegativeInteger
        Default25
        Note +

        maxKwicsToShow is useful for avoiding situations where a given query may result in hundreds of results + (especially when searching for common words, et cetera) and make the results page + difficult to navigate.

        +
        +
        +
        maxResults(The maximum number of results that can be returned for any search before returning + an error; if the number of documents in a result set exceeds this number, then staticSearch + will not render the results and will provide a message saying that the search returned + too many results.) +
        + + + + + + + + + + + + + + + + + +
        StatusOptional
        DatatypenonNegativeInteger
        Default2000
        Note +

        This configuration option is meant to prevent errors for sites where a given set of + filters or search terms can return a set of documents that can cause a browser's rendering + engine to fail. For smaller collections, it's unlikely that this limit would ever + be reached, but setting a limit may be helpful for large document collections, projects + that want to constrain the number of possible results, or projects with memory-intensive + or complex rendering.

        +

        This is set to 2000 by default, but you may want to have a higher or lower limit, + depending on the specific structure of your project.

        +
        +
        +
        Note -

        This configuration option is meant to prevent errors for sites where a given set of - filters or search terms can return a set of document that can cause a browser's rendering - engine to fail. For smaller collections, it's unlikely that this limit would ever - be reached, but setting a limit may be helpful for large document collections, projects - that want to constrain the number of possible results, or projects with memory-intensive - or complex rendering.

        -
        Content model -
        -<content>
        - <dataRef name="nonNegativeInteger"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element resultsLimit { xsd:nonNegativeInteger }
        -
        -
        -
        -
        -

        Appendix A.1.20 <resultsPerPage>

        -
        - - - - - - - - - - - - - - - - - - - - - - - - - +
        <resultsPerPage> (The maximum number of document results to be displayed per page. All results are displayed - by default; setting resultsPerPage to a positive integer creates a Show More/Show - All widget at the bottom of the batch of results.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by @@ -3007,47 +2791,13 @@

        Appendix A.1.20 <re

        May contain -
        -
        XSD nonNegativeInteger
        -
        -
        Note -

        For most sites, where the number of results is likely to be in the low thousands, - it's perfectly practical to show all the results at once, because the staticSearch - processor is so fast. However, if you have tens of thousands of documents, and it's - possible that users will do (for example) filter-only searches that retrieve a large - proportion of them, you can constrain the number of results which are shown initially - using this setting. All the results are still generated and output to the page, but - since most of them are hidden until the ‘Show More’ or ‘Show All’ button is clicked, - the browser will render them much more quickly.

        -
        Content model -
        -<content>
        - <dataRef name="nonNegativeInteger"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element resultsPerPage { xsd:nonNegativeInteger }
        -
        Empty element
        -

        Appendix A.1.21 <rule>

        +

        Appendix A.1.14 <rule>

        @@ -3123,25 +2873,25 @@

        Appendix A.1.21 <ru

        Content model -
        +                              
         <content>
          <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element rule { att.match.attributes, attribute weight { text }, empty }
        +
        +element rule { att.match.attributes, attribute weight { text }, empty }
        -

        Appendix A.1.22 <rules>

        +

        Appendix A.1.15 <rules>

        @@ -3176,31 +2926,30 @@

        Appendix A.1.22 <ru

        Content model -
        +                              
         <content>
          <elementRef key="rule" minOccurs="1"
           maxOccurs="unbounded"/>
         </content>
        -    
        +
        Schema Declaration -
        -element rules { rule+ }
        +
        +element rules { rule+ }
        -

        Appendix A.1.23 <scoringAlgorithm>

        +

        Appendix A.1.16 <scoringAlgorithm>

        - + @@ -3210,6 +2959,40 @@

        Appendix A.1.23 <sc

        + + + +
        <scoringAlgorithm> (The scoring algorithm to use for ranking keyword results. Default is "raw" (i.e. weighted - counts))<scoringAlgorithm> (The scoring algorithm to use for ranking keyword results.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        name(Specifies the name of the scoring algorithm to use.) +
        + + + + + + + + + +
        StatusRequired
        Legal values are: +
        +
        raw
        +
        (Default: Calculate the score based off of the weighted number of instances of a term + in a text.) raw score
        +
        tf-idf
        +
        (Calculate the score based off of the tf-idf scoring algorithm.) tf-idf (term frequency-inverse document frequency)
        +
        +
        +
        +
        +
        +
        Contained by @@ -3228,73 +3011,37 @@

        Appendix A.1.23 <sc

        Note

        <scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating - the score of a term and thus the order in which the results from a search are sorted. - There are currently two options:

        -
          -
        • raw: This is the default option (and so does not need to be set explicitly). The raw - score is simply the sum of all instances of a term (optionally multipled by a configured - weight via the <rule>/weight configuration) in a document. This will usually provide good results for most document collections.
        • -
        • tf-idf: The tf-idf algorithm (term frequency-inverse document frequency) computes the mathematical - relevance of a term within a document relative to the rest of the document collection. - The staticSearch implementation of tf-idf basically follows the textbook definition - of tf-idf: tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log( $allDocumentsCount - / $docsWithThisTermCount ) This is fairly crude compared to other search engines, - like Lucene, but it may provide useful results for document collections of varying lengths or - in instances where the raw score may be insufficient or misleading. There are a number - of resources on tf-idf scoring, including: Wikipedia and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
        • -
        + the score of a term and thus the order in which the results from a search are sorted.

        Content model -
        +                              
         <content>
        - <valList type="closed">
        -  <valItem ident="raw">
        -   <desc>raw score</desc>
        -   <gloss>Default: Calculate the score based off of the weighted number of
        -       instances of a term in a text.</gloss>
        -  </valItem>
        -  <valItem ident="tf-idf">
        -   <gloss>Calculate the score based off of the tf-idf scoring algorithm.</gloss>
        -  </valItem>
        - </valList>
        + <empty/>
         </content>
        -    
        Legal values are: -
        -
        raw
        -
        (Default: Calculate the score based off of the weighted number of instances of a term - in a text.) raw score
        -
        tf-idf
        -
        (Calculate the score based off of the tf-idf scoring algorithm.)
        -
        +
        Schema Declaration -
        -element scoringAlgorithm { "raw" | "tf-idf" }
        Legal values are: -
        -
        raw
        -
        (Default: Calculate the score based off of the weighted number of instances of a term - in a text.) raw score
        -
        tf-idf
        -
        (Calculate the score based off of the tf-idf scoring algorithm.)
        -
        +
        +element scoringAlgorithm { attribute name { "raw" | "tf-idf" }, empty }
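For comparison, a minimal sketch of how selecting the tf-idf algorithm might look before and after this change. Both forms are inferred from the two schema declarations above, and the namespace declaration is omitted.

```xml
<!-- Old form (removed): the algorithm name was the element's text content. -->
<scoringAlgorithm>tf-idf</scoringAlgorithm>

<!-- New form: an empty element whose required name attribute is "raw" or "tf-idf". -->
<scoringAlgorithm name="tf-idf"/>
```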
        -
        -

        Appendix A.1.24 <searchFile>

        +
        +

        Appendix A.1.17 <searchPage>

        - + @@ -3304,6 +3051,17 @@

        Appendix A.1.24 <se

        + + + + - + @@ -3332,25 +3086,25 @@

        Appendix A.1.24 <se

        <searchFile> (The search file (aka page) that will be the primary access point for the staticSearch. - Note that this page must be at the root of the collection directory.)<searchPage> (The search page that will be the primary access point for staticSearch. This page + may or may not exist, but its location is used for determining the collection that + will be indexed, so it must be at the root of the collection directory.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes + +
        Contained by @@ -3316,11 +3074,7 @@

        Appendix A.1.24 <se

        May contain -
        -
        XSD anyURI
        -
        -
        Empty element
        Note
        Content model -
        +                              
         <content>
        - <dataRef name="anyURI"/>
        + <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element searchFile { xsd:anyURI }
        +
        +element searchPage { ss.atts.file.attributes, empty }
        -

        Appendix A.1.25 <span>

        +

        Appendix A.1.18 <span>

        @@ -3409,7 +3163,7 @@

        Appendix A.1.25 <sp

        Content model -
        +                              
         <content>
          <alternate minOccurs="1"
           maxOccurs="unbounded">
        @@ -3418,26 +3172,26 @@ 

        Appendix A.1.25 <sp <textNode/> </alternate> </content> -

        +
        Schema Declaration -
        -element span { attribute lang { text }?, ( anyElement_span_1* | text )+ }
        +
        +element span { attribute lang { text }?, ( anyElement_span_1* | text )+ }
        -
        -

        Appendix A.1.26 <stemmerFolder>

        +
        +

        Appendix A.1.19 <stemmer>

        - @@ -3448,6 +3202,51 @@

        Appendix A.1.26 <st

        + + + + - + @@ -3499,30 +3294,39 @@

        Appendix A.1.26 <st

        <stemmerFolder> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript - and XSLT implementations of stemmers can be found. If left blank, then the staticSearch + <stemmer> (The name of a folder inside the staticSearch /stemmers/ folder, in which the JavaScript + and XSLT implementations of stemmers can be found. If not specified, then the staticSearch default English stemmer (en) will be used.)
        Module ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        dir(The path (relative to the config file) of the directory to use for stemming.) +
        + + + + + + + + + + + + + + + + + +
        Derived fromss.atts.dir
        StatusRequired
        DatatypeanyURI
        Suggested values include: +
        +
        stemmers/en/
        +
        English stemmer
        +
        stemmers/fr/
        +
        French stemmer
        +
        stemmers/identity
        +
        Identity stemmer
        +
        stemmers/stripDiacritics
        +
        Diacritic stripping stemmer
        +
        +
        +
        +
        +
        +
        Contained by @@ -3460,11 +3259,7 @@

        Appendix A.1.26 <st

        May contain -
        -
        XSD NCName
        -
        -
        Empty element
        Note
        Content model -
        +                              
         <content>
        - <dataRef name="NCName"/>
        + <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element stemmerFolder { xsd:NCName }
        +
        +element stemmer
        +{
        +   attribute dir
        +   {
        +      "stemmers/en/"
        +    | "stemmers/fr/"
        +    | "stemmers/identity"
        +    | "stemmers/stripDiacritics"
        +   },
        +   empty
        +}
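A short sketch of the corresponding config change, assuming a project that uses the French stemmer. The old value `fr` is an assumed folder name under the staticSearch /stemmers/ folder; the new value is one of the suggested values listed for the `dir` attribute.

```xml
<!-- Old form (removed): an NCName naming a folder inside /stemmers/. -->
<stemmerFolder>fr</stemmerFolder>

<!-- New form: an empty element with a required dir attribute. -->
<stemmer dir="stemmers/fr/"/>
```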
        -
        -

        Appendix A.1.27 <stopwordsFile>

        +
        +

        Appendix A.1.20 <stopwords>

        - + @@ -3532,6 +3336,37 @@

        Appendix A.1.27 <st

        + + + + - + @@ -3556,10 +3387,11 @@

        Appendix A.1.27 <st

        A stopword is a word that will not be indexed, because it is too common (the, a, you and so on). There are common stopwords files for most languages available on the Web, but it is probably a good idea to take one of these and customize it for your project, since there will be words in the website which are so common that it makes - no sense to index them, but they are not ordinary stopwords. For example, in a Website + no sense to index them, but they are not ordinary stopwords. For example, in a website dedicated to the work of John Keats, the name keats should probably be added to the stopwords file, since almost every page will include - it, and searching for it will be pointless. The project has a built-in set of common - stopwords for English, which you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then + it, and searching for it will be pointless.

        +

        staticSearch provides a default set of common stopwords for English, which you'll + find in xsl/english_stopwords.txt. One way to find appropriate stopwords for your site is to generate your index, then search for the largest JSON index files that are generated, to see if they might be too common to be useful as search terms. You can also use the Word Frequency table in the generated staticSearch report (see 8.9 Generated report).

        @@ -3567,30 +3399,29 @@

        Appendix A.1.27 <st

        <stopwordsFile> (The relative path (from the config file) to a text file containing a list of stopwords - (words to be ignored when indexing).)<stopwords> (Specifies a list of stopwords--that is, words to be ignored when indexing.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        file(The path (relative to the config file) to a text file containing a list of words to + be ignored by the indexer (one word per line).) +
        + + + + + + + + + + + + + +
        Derived fromss.atts.file
        StatusRequired
        DatatypeanyURI
        +
        +
        +
        +
        Contained by @@ -3544,11 +3379,7 @@

        Appendix A.1.27 <st

        May contain -
        -
        XSD anyURI
        -
        -
        Empty element
        Note
        Content model -
        +                              
         <content>
        - <dataRef name="anyURI"/>
        + <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element stopwordsFile { xsd:anyURI }
        +
        +element stopwords { attribute file { text }, empty }
        -
        -

        Appendix A.1.28 <totalKwicLength>

        +
        +

        Appendix A.1.21 <tokenizer>

        - + @@ -3600,6 +3431,46 @@

        Appendix A.1.28 <to

        + + + + - - - - - - - - - - - - - +
        <totalKwicLength> (If createContexts is set to true, then this parameter controls the length (in words) - of the harvested keyword-in-context string.)<tokenizer> (Configures options for the tokenizing process.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        minWordLength(Specifies the minimum length in characters of a sequence of text that will be considered + to be a word worth indexing.) +
        + + + + + + + + + + + + + + + + + +
        Status Recommended
        Datatype nonNegativeInteger
        Default 2
        Note +

        Values of 3 or above may be useful for European languages to exclude common prepositions, + articles, et cetera. If you set this to a lower limit for reasons specific to your + project, you should ensure that your stopword list excludes any very common words + that would otherwise make the indexing process lengthy and increase the index size.

        +
        +
        +
        +
        +
        Contained by @@ -3612,49 +3483,18 @@

        Appendix A.1.28 <to

        May contain -
        -
        XSD nonNegativeInteger
        -
        -
        Note -

        Obviously, the longer the keyword-in-context strings are, the larger the individual - index files will be, but the more useful the KWICs will be for users looking at the - search results. Note that the phrasal searching relies on the KWICs and thus longer - KWICs allow for longer phrasal searches.

        -
        Content model -
        -<content>
        - <dataRef name="nonNegativeInteger"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element totalKwicLength { xsd:nonNegativeInteger }
        -
        Empty element
        -
        -

        Appendix A.1.29 <versionFile>

        +
        +

        Appendix A.1.22 <version>

        - + @@ -3664,6 +3504,37 @@

        Appendix A.1.29 <ve

        + + + + - + - - - - - -
        <versionFile> (The relative path to a text file containing a single version identifier (such as 1.5, - 123456, or 06ad419). This will be used to create unique filenames for JSON resources, - so that the browser will not use cached versions of older index files.)<version> (Specifies the unique version to append to the index, so that the browser will not + use cached versions of older index files.)
        NamespaceModule ss — Schema specification and tag documentation
        Attributes +
        + + + + + +
        file(The path (relative to the config file) to a text file containing a single version + identifier (such as 1.5, 123456, or 06ad419).) +
        + + + + + + + + + + + + + +
        Derived from ss.atts.file
        Status Required
        Datatype anyURI
        +
        +
        +
        +
        Contained by @@ -3676,16 +3547,12 @@

        Appendix A.1.29 <ve

        May contain -
        -
        XSD anyURI
        -
        -
        Empty element
        Note -

        <versionFile> enables you to specify the path to a plain-text file containing a simple version +

        <version> enables you to specify the path to a plain-text file containing a simple version number for the project. This might take the form of a software-release-style version number such as 1.5, or it might be a Subversion revision number or a Git commit hash. It should not contain any spaces or punctuation. If you provide a version file, the version string @@ -3700,83 +3567,18 @@

        Appendix A.1.29 <ve

        Content model -
        -<content>
        - <dataRef name="anyURI"/>
        -</content>
        -    
        -
        Schema Declaration -
        -element versionFile { xsd:anyURI }
        -
        -
        -
        -
        -

        Appendix A.1.30 <wildcardSearch>

        -
        - - - - - - - - - - - - - - - - - - - - - - - - - - -
        <wildcardSearch> (Whether or not to support wildcard searches. Note that wildcard searches are more - effective when phrasal searching is also turned on, because the contexts available - for phrasal searches are also used to provide wildcard results.)
        Namespacehttp://hcmc.uvic.ca/ns/staticSearch
        Moduless — Schema specification and tag documentation
        Contained by -
        -
        -
        ss: params
        -
        -
        -
        May contain -
        -
        XSD boolean
        -
        -
        Note -

        Wildcard searching can coexist with stemmed searching, but it is especially useful - when stemming is not available, either because there is no available stemmer for the - language of the site, or because the site contains multiple languages. Unless your - site is particularly large, we recommend turning on wildcard searching, and therefore - also phrasal searching (<phrasalSearch>).

        -
        Content model -
        +                              
         <content>
        - <dataRef name="boolean"/>
        + <empty/>
         </content>
        -    
        +
        Schema Declaration -
        -element wildcardSearch { xsd:boolean }
        +
        +element version { attribute file { text }, empty }
        @@ -3880,6 +3682,50 @@

        Appendix A.2.2 att.mat

        +
        +

        Appendix A.2.3 ss.atts.file

        +
        + + + + + + + + + + + + + + + + +
        ss.atts.file (A class providing a file attribute that can be used to specify a file path.)
        Module ss — Schema specification and tag documentation
        Members dictionary searchPage stopwords version
        Attributes +
        + + + + + +
        file(A pointer to a local file.) +
        + + + + + + + + + +
        Status Required
        Datatype anyURI
        +
        +
        +
        +
        +
        +
        @@ -3896,7 +3742,7 @@

        Appendix A.2.2 att.mat

        diff --git a/schema/staticSearch.odd b/schema/staticSearch.odd index 3fc5b54..f3f87ed 100644 --- a/schema/staticSearch.odd +++ b/schema/staticSearch.odd @@ -492,21 +492,24 @@ Specifying parameters
        Required parameters -

        The params element has four required elements for determining the resource collection that you wish to index, - and controlling the indexing process: +

        The params element has only one required element, which is used + for determining the resource collection that you wish to index: - - - - +

        -

        The searchFile element is a relative URI (resolved, like all URIs specified in the config file, against the configuration file location) that points - directly to the search page that will be the primary access point for the search. Since the search file must be at the root of the directory that you wish to index - (i.e. the directory that contains all of the XHTML you want the search to index), the searchFile parameter provides the necessary information for knowing - what document collection to index and where to put the output JSON. In other words, in specifying the location of your search page, you are also - specifying the location of your document collection. See Creating a search page for more information on how to configure this file.

        -

        Note that all output files will be in a directory that is a sibling to the search page. For instance, in a document collection that looks something like: +

        The searchPage element's file attribute specifies a relative URI (resolved, like all URIs specified in the config file, against the configuration file location) that points + directly to the search page that will be the primary access point for the search. + Since the search page must be at the root of the directory that you wish to index + (i.e. the directory that contains all of the XHTML you want the search to index), + the searchPage element provides the necessary information for knowing + what document collection to index and where to put the output JSON. In other words, + in specifying the location of your search page, you are also specifying the location + of your document collection. See Creating a search page for more + information on how to configure this file.

        + +

        Note that all output files will be in a directory that is a sibling to the search page. + For instance, in a document collection that looks something like: myProject @@ -530,15 +533,6 @@

        -

        We also require the recurse element in the case where the document collection may be nested (as is common with static sites generated from Jekyll or Wordpress). The recurse element is a boolean (true or false) that determines whether or not to recurse into the subdirectories of the collection and index those files.

        - -

        Finally, in order to support stemming and phrasal search effectively, it is important - to specify a stopwordsFile (a file containing words that will be ignored at index time) - and a dictionaryFile (also used for indexing). Default files for English and French are - supplied in the xsl folder, but you will probably want to create or customize the - stopword list for your own project. You may also supply empty text files for these parameters - if for example you donʼt want to use a stoplist at all.

        -
        @@ -549,52 +543,59 @@ directly after each one. --> - + - + - + - + - + - + - + - + - + - + - - - - - + + - - - + + + + +

        @@ -748,8 +749,8 @@

        A complex site may have two or more search pages targeting specific types of document or content, each of which may need its own particular search controls and indexes. This can easily - be achieved by specifying a different searchFile and - outputFolder in the configuration file for each search.

        + be achieved by specifying a different searchPage and + output in the configuration file for each search.

        For these searches to be different from each other, they will also probably have different contexts and rules. For @@ -838,7 +839,7 @@ You can customize this CSS by providing your own CSS that overrides it, using style, or link, placed after it in the head element, or by replacing - the inserted CSS after the build process. Note that some features, like resultsPerPage or + the inserted CSS after the build process. Note that some features, like the Show More widget or the Searching loading dialog, rely on rules included in the base staticSearch CSS; if you do remove or disable the CSS, then some features may not work properly.

        Note that once your file has been processed and all this content has been added, @@ -915,7 +916,7 @@

        Generated report

        After indexing your HTML files, the staticSearch build then generates an HTML report of helpful statistics - and diagnostics about your document collection, which can be found in the directory specified by outputFolder. + and diagnostics about your document collection, which can be found in the directory specified by output. We recommend looking at this file regularly, especially if you're encountering unexpected behaviour by the staticSearch engine, as it contains information that can often help diagnose issues with configured filters or the HTML document collection that, if fixed, can improve staticSearch results.

        @@ -1027,7 +1028,7 @@

        Obviously you can implement this any way you like (or just ignore it), but we also supply a small demonstration JavaScript library which implements this functionality, called ssHighlight.js. This JS file is included - into the staticSearch output folder (see outputFolder) by default, and + into the staticSearch output folder (see output) by default, and if you include it into the header of your own pages, it will probably do the highlighting without further intervention. If, however, you have lots of existing JavaScript that runs when the page loads, there may be some @@ -1126,7 +1127,7 @@ Google Closure Compiler, to create a smaller file called ssSearch.js which should be faster for the browser to download and parse. Both of these output files are provided in your - project outputFolder; ssSearch.js is linked in your search page, + project output; ssSearch.js is linked in your search page, but if you're having problems and would like to debug with more human-friendly JavaScript, you can switch that link to point to ssSearch-debug.js.

        @@ -1186,11 +1187,7 @@ How can I get staticSearch to highlight the found text in a target document? - There are two approaches to this: you could implement a JavaScript solution as explained in - Highlighting search hits on target pages, or you could turn on the - scrollToFragmentId experimental feature supported by some Chromium-based browsers. The former requires - some modification to your site pages to add some JavaScript, while the latter is non-standard and not really reliable or - consistent. + See Highlighting search hits on target pages. How do I prevent staticSearch from encountering out of memory errors? @@ -1573,7 +1570,8 @@
        Schema specification and tag documentation - @@ -1620,226 +1618,249 @@ find the target website content and process it appropriately. - - - - - + + + + + - - + - - - - - - - - - + + + - - - The set of rules that control weighting of search terms - found in specific contexts. - - - - - - A rule that specifies a document path as XPath in the - match attribute, and provides weighting for search - terms found in that context. + + The search page that will be the primary access point for staticSearch. This page may or may not exist, but its location is used for determining the collection that will be indexed, so it must be at the root of the collection directory. - + + +

        The search page is a regular HTML page which forms part of your site. The only + important characteristic it must have is a div element with + id=staticSearch, whose contents will be rewritten by + the staticSearch build process. See .

        +
        +
        + + + Configures options relating to indexing. - - The weighting to give to a search token found in the context specified by the - match attribute. Set to 0 to completely suppress indexing for a - specific context, or greater than 1 to give stronger weighting. + + Determines whether or not to recurse into the subdirectories of the collection and index those files. - + + false + +

        This is useful for static sites that create nested + directory structures (such as those generated from Jekyll or Wordpress).

        +
        - -

        The rule element is used to identify nodes in the XHTML document collection which should be - treated in a special manner when indexed; either they might be ignored (if weight=0), - or any words found in them might be given greater weight than words in normal contexts - weight>1. Words appearing in headings or titles, for example, might - be weighted more heavily, while navigation menus, banners, or footers might be ignored completely.

        -
        - - - The set of context elements that identify - contexts for keyword-in-context fragments. - - - - - - - A context definition, providing a match attribute that identifies the context, - allowing keyword-in-context fragments to be bounded by a specific context. + + + Specifies the unique version to append to the index, so that the browser + will not use cached versions of older index files. - - + - - - - - - - ERROR: If a context has a label, it must be a context for the purposes of indexing. - - - - - - - - - - + + The path (relative to the config file) to a text file + containing a single version identifier (such as + 1.5, 123456, or 06ad419). + -

        When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense - approach based on common element definitions, so that for example when it reaches the end of a paragraph, - it will not continue into the next paragraph to get more context words. You may have special runs of - text in your document collection which do not appear to be bounding contexts, but actually are; for - example, you may have span elements with class=note that appear in the middle - of sentences but are not actually part of them. Use context elements to identify these - special contexts so that the indexer knows the right boundaries from which to retrieve its - keyword-in-context strings.

        +

        version enables you to specify the path to a plain-text file + containing a simple version number for the project. This might take the form of + a software-release-style version number such as 1.5, or it might be + a Subversion revision number or a Git commit hash. It should not contain any + spaces or punctuation. If you provide a version file, the version string will + be used as part of the filenames for all the JSON resources created for the + search. This is useful because it allows the browser to cache such resources + when users repeatedly visit the search page, but if the project is rebuilt with + a new version, those cached files will not be used because the new version will + have different filenames. The path specified is relative to the location of the + configuration file (or absolute, if you wish).
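
As a hedged illustration of the version element described above, a config fragment might look like the following. The file name VERSION is a hypothetical example, and the fragment is shown without the surrounding configuration and its namespace declaration.

```xml
<!-- Hypothetical path, resolved relative to the config file.
     The file is assumed to hold a single identifier such as 1.5,
     with no spaces or punctuation. -->
<version file="VERSION"/>
```

Rebuilding with a different identifier in that file is what causes the JSON resources to receive new filenames.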

        - - The set of exclusions, expressed as exclude elements, that control the subset of documents - or filters used for a particular search. + + Specifies a list of stopwords--that is, words to be ignored when indexing. + + + - + + + + The path (relative to the config file) to a text file + containing a list of words to be ignored by the indexer (one word per line). + + + +

        A stopword is a word that will not be indexed, because it is too + common (the, a, you + and so on). There are common stopwords files for most languages available on the Web, but + it is probably a good idea to take one of these and customize it for your project, since + there will be words in the website which are so common that it makes no sense to index + them, but they are not ordinary stopwords. For example, in a website dedicated to the + work of John Keats, the name keats should probably be added + to the stopwords file, since almost every page will include it, and searching for it + will be pointless.

        +

        staticSearch provides a default set of common stopwords for English, which + you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords + for your site is to generate your index, then search for the largest JSON index files that are + generated, to see if they might be too common to be useful as search terms. You can also use the + Word Frequency table in the generated staticSearch report + (see ).
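
A minimal sketch of pointing the indexer at a customized stopword list, assuming a hypothetical file named my_stopwords.txt stored beside the config file (for instance, a copy of xsl/english_stopwords.txt with project-specific words such as keats appended, one word per line):

```xml
<!-- Hypothetical file name; the path is resolved relative to the config file. -->
<stopwords file="my_stopwords.txt"/>
```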

        +

        +
        - - An exclusion definition, which excludes either documents or filters - as defined by an XPath in the match attribute. + + Specifies a dictionary against which tokens may be checked during indexing. - + - - - - Index exclusion - An exclusion that specifies HTML fragment (which itself can be the root HTML element) to exclude from the document index. - - - Filter exclusion - An exclusion that matches an HTML meta tag to exclude from the filter controls on the search page. - - + + The relative path (from the config file) to a dictionary file (one word per line). -

        exclude can be used to identify documents or parts of documents that are to be omitted from - indexing, but, unlike setting weight to zero, should be retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for multiple search engines/pages that each have their own special features; in this case, you may want one specific search index/page to ignore filter controls (HTML meta elements, as - described in ) which are provided to support other search pages.

        +

        The indexing process checks each word as it builds the index, and keeps a record + of all words which are not found in the configured dictionary. Though this does not have + any direct effect on the indexing process, all words not found in the dictionary are listed + in the staticSearch report (see ). This can be very useful: + all words listed are either foreign (not part of the language of the dictionary) or perhaps + misspelled (in which case they may not be correctly stemmed and indexed, and should be + corrected).

        +

        staticSearch provides a default dictionary in xsl/english_words.txt that + can be copied and adapted if working in English; lots of dictionaries for other + languages are available on the Web.

        - - - A class providing attributes that enable specification of document locations. - - - An XPath equivalent to the @match attribute of an xsl:template, which - specifies a context in a document. - - - - - - - - A class providing a label attribute that can be used to identify/describe contexts. + + Configures options for the tokenizing process. - - A string identifier specifying the name for a given context. + + Specifies the minimum length in + characters of a sequence of text that will be considered to + be a word worth indexing. - + + 2 -

        When describing a context, the label attribute names a component of the page that - can be searched within (see ).

        +

        Values of 3 or above may be useful for European languages to exclude + common prepositions, articles, et cetera. If you set this to a lower + limit for reasons specific to your project, you should ensure that your + stopword list excludes any very common words that would otherwise make + the indexing process lengthy and increase the index size.
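
For example, a project that wants to skip one- and two-letter words could raise the limit from the default of 2. This fragment is a sketch only and is shown without the surrounding configuration:

```xml
<!-- Index only character sequences of three or more characters. -->
<tokenizer minWordLength="3"/>
```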

        -
        - - - The search file (aka page) that will be the primary access point for the staticSearch. Note - that this page must be at the root of the collection directory. - - - - -

        The search page is a regular HTML page which forms part of your site. The only - important characteristic it must have is a div element with - id=staticSearch, whose contents will be rewritten by - the staticSearch build process. See .

        -
        - - The relative path to a text file containing a single version identifier (such as - 1.5, 123456, or 06ad419). This will be used to create - unique filenames for JSON resources, so that the browser - will not use cached versions of older index files. + + The scoring algorithm to use for ranking keyword results. - + + + + Specifies the name of the scoring algorithm to use. + + + raw score + Default: Calculate the score based off of the weighted number of + instances of a term in a text. + +

        The raw score is simply the sum of all instances of a term + (optionally multiplied by a configured weight via the + rule/weight configuration) in a document. This will usually provide good + results for most document collections.

        +
        +
        + + tf-idf (term frequency-inverse document frequency) + Calculate the score based off of the tf-idf scoring algorithm. + +

        The tf-idf algorithm (term frequency-inverse document frequency) + computes the mathematical relevance of a term within a document relative to the rest + of the document collection. The staticSearch implementation of tf-idf basically follows the textbook definition of tf-idf: + + tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log( $allDocumentsCount / $docsWithThisTermCount ) + + This is fairly crude compared to other search engines, like + Lucene, but it may provide useful results + for document collections of varying lengths or in instances where the raw score may be + insufficient or misleading. There are a number of resources on tf-idf + scoring, including: Wikipedia and + Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction + to Information Retrieval, Cambridge University Press. 2008.
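
As a worked illustration of the formula above, suppose (hypothetically) that a term occurs 3 times in a document of 100 terms, and that 10 of the 1000 documents in the collection contain it. The source does not state the base of the logarithm; the natural logarithm is assumed here, and a different base would only rescale every score by the same constant.

```latex
\[
\text{tf-idf}
  = \frac{3}{100} \times \ln\!\left(\frac{1000}{10}\right)
  = 0.03 \times \ln(100)
  \approx 0.03 \times 4.605
  \approx 0.138
\]
```

A term that appeared in every document would score 0 under this formula, since ln(1) = 0, which is why very common words contribute little to the ranking.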

        +
        +
        +
        +
        +
        -

        versionFile enables you to specify the path to a plain-text file - containing a simple version number for the project. This might take the form of - a software-release-style version number such as 1.5, or it might be - a Subversion revision number or a Git commit hash. It should not contain any - spaces or punctuation. If you provide a version file, the version string will - be used as part of the filenames for all the JSON resources created for the - search. This is useful because it allows the browser to cache such resources - when users repeatedly visit the search page, but if the project is rebuilt with - a new version, those cached files will not be used because the new version will - have different filenames. The path specified is relative to the location of the - configuration file (or absolute, if you wish).

        +

        scoringAlgorithm is an optional element that specifies which + scoring algorithm to use when calculating the score of a term and thus the order + in which the results from a search are sorted.

        - + + + The name of a folder inside the staticSearch /stemmers/ folder, - in which the JavaScript and XSLT implementations - of stemmers can be found. If left blank, then the staticSearch default English - stemmer (en) will be used. + in which the JavaScript and XSLT implementations of stemmers can be found. + If not specified, then the staticSearch default English stemmer (en) + will be used. + + + - + + + + The path (relative to the config file) of the directory to use for stemming. + + + English stemmer + + + French stemmer + + + Identity stemmer + + + Diacritic stripping stemmer + + + +

        The staticSearch project currently has only two real stemmers: an implementation of the Porter 2 algorithm for modern English, and @@ -1850,7 +1871,6 @@ be adding more stemmers as the project develops. However, if your document collection is not English or French, you have a couple of options, one hard and one easy. - Hard option: implement your own stemmers. You will need to write two implementations of the stemmer algorithm, one in XSLT (which @@ -1890,307 +1910,399 @@ - - The scoring algorithm to use for ranking keyword results. Default is "raw" (i.e. weighted counts) - - - - raw score - Default: Calculate the score based off of the weighted number of - instances of a term in a text. - - - Calculate the score based off of the tf-idf scoring algorithm. - - - - -

        - scoringAlgorithm is an optional element that specifies which - scoring algorithm to use when calculating the score of a term and thus the order - in which the results from a search are sorted. There are currently two options: - - raw: This is the default option (and so does not need - to be set explicitly). The raw score is simply the sum of all instances of a term - (optionally multipled by a configured weight via the - rule/weight configuration) in a document. This will usually provide good - results for most document collections. - tf-idf: The tf-idf algorithm (term frequency-inverse document frequency) - computes the mathematical relevance of a term within a document relative to the rest - of the document collection. The staticSearch implementation of tf-idf basically follows the textbook definition of tf-idf: - - tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log( $allDocumentsCount / $docsWithThisTermCount ) - - This is fairly crude compared to other search engines, like - Lucene, but it may provide useful results - for document collections of varying lengths or in instances where the raw score may be - insufficient or misleading. There are a number of resources on tf-idf - scoring, including: Wikipedia and - Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction - to Information Retrieval, Cambridge University Press. 2008. - -

        -
        -
        - - - Whether to recurse into subdirectories of the collection directory or - not. - - - - - Whether to include keyword-in-context extracts in the index. - + + + + + Specifies whether the indexer stores keyword-in-context extracts + for each hit in a document. + + + + + + + + Specifies whether the indexer stores keyword-in-context extracts + for each hit in a document. + + + +

        Setting create=true increases the size of the index, but makes for much more user-friendly search results; instead of seeing just a score for each document found, the user will see a series of short text strings with the search keyword(s) highlighted.

        +
        + + Whether or not to support phrasal searches. + If this is true, then the maxContexts + setting will be ignored, + because all contexts are required to properly support phrasal search. + + + + true + +

        Phrasal search functionality enables your users to search for specific phrases + by surrounding them with quotation marks ("), as in many search engines. In order + to support this kind of search, createContexts must also be set to true as we store + contexts for all hits for each token in each document. + Setting this to true will make the index larger, because all + contexts must be stored, but once the index is built, it has very little + impact on the speed of searches, so we recommend turning this on. The + default value is true.

        +

        However, if your site is very large and your user base is unlikely to + use phrasal searching, it may not be worth the additional build time and + increased index size.

        +
        +
        + + Whether or not to support wildcard searches. + + + + true + +

        Wildcard searches are + more effective when phrasal searching is also turned on, because the contexts + available for phrasal searches are also used to provide wildcard results.

        +

        Wildcard searching can coexist with stemmed searching, but it is especially + useful when stemming is not available, either because there is no available stemmer + for the language of the site, or because the site contains multiple languages. + Unless your site is particularly large, we recommend turning on wildcard searching, + and therefore also phrasal searching (phrasalSearch).

        +
        +
        + + Controls the number of keyword-in-context extracts + that will be harvested from the data for each term in a document. + + + + 5 + +

        For example, if a user + searches for the word elephant, and it occurs 27 times in a document, but the + maxKwicsToHarvest value is set to 5, then only the first five (sorted in document order) + of these keyword-in-context strings will be stored in the index. (This does not affect the + score of the document in the search results.) If you set this to + a low number, the size of the JSON files will be constrained, but the + user will only be able to see the KWICs that have been harvested in their search results.

        +

        If phrasalSearch is set to true, the maxKwicsToHarvest setting is + ignored, because phrasal searches will only work properly if all contexts are + stored.

        +
        +
        + + Sets the maximum length (in words) of a keyword-in-context result. + + + + 15 + +

        The longer the keyword-in-context strings are, the larger the individual index + files will be, but the more useful the KWICs will be for users looking at the search results. + Note that phrasal searching relies on the KWICs and thus longer KWICs allow for longer + phrasal searches.

        +
        +
        + + The string that will be used to signal ellipsis at the beginning and end of a + keyword-in-context extract. Conventionally three periods, or an ellipsis + character (which is the default value). + + + + + +

        This parameter is particularly useful + if the language of your search page conventionally uses a different ellipsis + character. Japanese, for example, uses the 3-dot-leader character.

        +
        +
        +
        +
        -

        - createContexts is a boolean parameter that specifies whether you - want the indexer to store keyword-in-context extracts for each of the hits in a - document. This increases the size of the index, but of course it makes for much - more user-friendly search results; instead of seeing just a score for each - document found, the user will see a series of short text strings with the - search keyword(s) highlighted. -

        Note that contexts are necessary for phrasal searching or wildcard searching.

        - - The minimum length of a term to be indexed. Default is 3 characters. - - - - -

        minWordLength specifies the minimum length in - characters of a sequence of text that will be considered to - be a word worth indexing. The default is 3, on the basis that - in most European languages, words of one or two letters are - typically not worth indexing, being articles, prepositions - and so on. If you set this to a lower limit for reasons specific - to your project, you should ensure that your stopword list excludes - any very common words that would otherwise make the indexing - process lengthy and increase the index size.

        -
        + + + + Controls the configuration of the results page. + + + The maximum number of document results to be displayed per page. + All results are displayed by default; setting resultsPerPage to a + positive integer creates a Show More/Show All widget at + the bottom of the batch of results. + + + + 0 + +

        For most sites, where the number of results is likely to be in the low thousands, + it's perfectly practical to show all the results at once, because the staticSearch + processor is so fast. However, if you have tens of thousands of documents, and it's + possible that users will do (for example) filter-only searches that retrieve a + large proportion of them, you can constrain the number of results which are shown + initially using this setting. All the results are still generated and output to + the page, but since most of them are hidden until the Show More + or Show All button is clicked, the browser will render them + much more quickly.

        +
        +
        + + Controls the maximum number of keyword-in-context extracts that will be shown + in the search page for each hit document returned. + + + + 25 + +

        maxKwicsToShow is useful for avoiding situations where a given query + may result in hundreds of results (especially when searching for common words, et cetera) + and make the results page difficult to navigate.

        +
        +
        + + The maximum number of results that can be returned for any search + before returning an error; if the number of documents in a result set exceeds this number, + then staticSearch will not render the results and will provide a message + saying that the search returned too many results. + + + + 2000 + +

        This configuration option is meant to prevent errors for sites where a given set of + filters or search terms can return a set of documents large enough to cause a browser's rendering + engine to fail. For smaller collections, it's unlikely + that this limit would ever be reached, but setting a limit may be helpful + for large document collections, projects that want to constrain the number + of possible results, or projects with memory-intensive or complex rendering.

        +

        This is set to 2000 by default, but you may want to have a higher or lower limit, + depending on the specific structure of your project.
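As a rough illustration of the limit check described above, the following Python sketch refuses to render when the result set is larger than the configured limit. The function name `render_results` and the wording of the message are assumptions made for this example; only the default of 2000 and the "exceeds the limit, show a message instead" behaviour come from the documentation.

```python
from typing import List, Union


def render_results(doc_ids: List[str],
                   results_limit: int = 2000) -> Union[str, List[str]]:
    """Return rendered result entries, or a message if there are too many."""
    if len(doc_ids) > results_limit:
        # Hypothetical message text; the real wording may differ.
        return "Your search returned too many results."
    return [f"result: {doc_id}" for doc_id in doc_ids]


small = render_results(["doc1.html", "doc2.html"])
too_many = render_results([f"doc{i}.html" for i in range(2001)])
at_limit = render_results([f"doc{i}.html" for i in range(2000)])
```

A result set of exactly 2000 documents is still rendered in this sketch, because the documentation says the message appears only when the number of documents exceeds the limit.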

        +
        +
        +
        - - Whether to link keyword-in-context extracts to the nearest id in the document. Default is true. + + Sets the folder into which the index data and JavaScript will + be placed. + + + - + + + + + + + staticSearch + +

        This should conform with the + XML Name specification.

        +
        +
        +
        -

        linkToFragmentId is a boolean parameter that specifies whether you want - the search engine to link each keyword-in-context extract with the closest element that - has an id. If the element has an ancestor with an id, then the indexer will associate - that keyword-in-context extract with that id; if there are no suitable ancestor elements that have - an id, then the extract is associated with first preceding element with an id.

        -

        We strongly recommend that you ensure your target documents have id attributes for any significant divisions - so that this parameter can be used effectively. With lots of ids throughout your documents, and this parameter - turned on, each keyword-in-context in the results page will be linked directly to the section of the - document in which the hit appears, making the search results much more useful.

        +

        When the staticSearch build process creates its output, many files need to be + added to the website for which an index is being created. For convenience, all of + these files are stored in a single folder. This element is used to specify the + name of that folder. The default is staticSearch, + but if you would prefer something else, you can specify it here. You may also use this element + if you are defining two different searches within the same site, so that their files are kept in + different locations.

        - - - If createContexts is set to true, then this parameter controls the length (in words) of - the harvested keyword-in-context string. - - - - -

        Obviously, the longer the keyword-in-context strings are, the larger the individual index - files will be, but the more useful the KWICs will be for users looking at the search results. - Note that the phrasal searching relies on the KWICs and thus longer KWICs allow for longer - phrasal searches.

        -
        -
        + + A class providing attributes that enable specification of document locations. + + + An XPath equivalent to the @match attribute of an xsl:template, which + specifies a context in a document. + + + + + + + + + A class providing a label attribute that can be used to identify/describe contexts. + + + A string identifier specifying the name for a given context. + + + + +

        When describing a context, the label attribute names a component of the page that + can be searched within (see ).

        +
        +
        +
        +
        - - This controls the maximum number of keyword-in-context extracts that will be - stored for each term in a document. - - - - -

        maxKwicsToHarvest controls the number of keyword-in-context extracts - that will be harvested from the data for each term in a document. For example, if a user - searches for the word elephant, and it occurs 27 times in a document, but the - maxKwicsToHarvest value is set to 5, then only the first five (sorted in document order) of these - keyword-in-context strings will be stored in the index. (This does not affect the - score of the document in the search results, of course.) If you set this to - a low number, the size of the JSON files will be constrained, but of course the - user will only be able to see the KWICs that have been harvested in their search results.

        -

        If phrasalSearch is set to true, the maxKwicsToHarvest setting is - ignored, because phrasal searches will only work properly if all contexts are - stored.

        -
        -
        - - This controls the maximum number of keyword-in-context extracts that will be shown - in the search page for each hit document returned. - - - - -

        A user may search for multiple common words, so hundreds of hits could be found in - a single document. If the keyword-in-context strings for all these hits are shown on - the results page, it would be too long and too difficult to navigate. This setting - controls how many of those hits you want to show for each document in the result set.

        -
        -
        - - The string that will be used to signal ellipsis at the beginning and end of a - keyword-in-context extract. Conventionally three periods, or an ellipsis - character (which is the default value). - - - - -

        The only reason you might need to specify a value for this parameter is - if the language of your search page conventionally uses a different ellipsis - character. Japanese, for example, uses the 3-dot-leader character.

        -
        -
        + + A class providing a file attribute that + can be used to specify a file path. + + + A pointer to a local file. + + + + + + - - Whether or not to support phrasal searches. If this is true, then the maxContexts - setting will be ignored, because all contexts are required to properly support phrasal search. - - - - -

        Phrasal search functionality enables your users to search for specific phrases - by surrounding them with quotation marks ("), as in many search engines. In order - to support this kind of search, createContexts must also be set to true as we store contexts for all - hits for each token in each document. Turning this on will make the index larger, because all - contexts must be stored, but once the index is built, it has very little - impact on the speed of searches, so we recommend turning this on. The - default value is true.

        -

        However, if your site is very large and your user base is unlikely to - use phrasal searching, it may not be worth the additional build time and - increased index size.

        -
        -
        + + A class providing a dir attribute that + can be used to specify a file path. + + + A pointer to a local directory. + + + + + + - - Whether or not to support wildcard searches. Note that wildcard searches are - more effective when phrasal searching is also turned on, because the contexts - available for phrasal searches are also used to provide wildcard results. + + The set of rules that control weighting of search terms + found in specific contexts. - + - -

        Wildcard searching can coexist with stemmed searching, but it is especially - useful when stemming is not available, either because there is no available stemmer - for the language of the site, or because the site contains multiple languages. - Unless your site is particularly large, we recommend turning on wildcard searching, - and therefore also phrasal searching (phrasalSearch).

        -
        - - The maximum number of document results to be displayed per page. All results - are displayed by default; setting resultsPerPage to a positive integer creates a - Show More/Show All widget at the bottom of the batch of results. + + A rule that specifies a document path as XPath in the + match attribute, and provides weighting for search + terms found in that context. + + + - + + + + The weighting to give to a search token found in the context specified by the + match attribute. Set to 0 to completely suppress indexing for a + specific context, or greater than 1 to give stronger weighting. + + + + + -

        For most sites, where the number of results is likely to be in the low thousands, - it's perfectly practical to show all the results at once, because the staticSearch - processor is so fast. However, if you have tens of thousands of documents, and it's - possible that users will do (for example) filter-only searches that retrieve a - large proportion of them, you can constrain the number of results which are shown - initially using this setting. All the results are still generated and output to - the page, but since most of them are hidden until the Show More - or Show All button is clicked, the browser will render them - much more quickly.

        +

        The rule element is used to identify nodes in the XHTML document collection which should be + treated in a special manner when indexed: either they might be ignored (if weight=0), + or any words found in them might be given greater weight than words in normal contexts + (if weight>1). Words appearing in headings or titles, for example, might + be weighted more heavily, while navigation menus, banners, or footers might be ignored completely.
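The effect of a rule's weight can be pictured with a small Python sketch. It assumes a simple multiplicative model (occurrences times weight), which is an assumption made for illustration and not necessarily how staticSearch computes scores; the function name `weighted_occurrences` is hypothetical.

```python
from typing import Optional


def weighted_occurrences(occurrences: int, weight: int = 1) -> Optional[int]:
    """Contribution of a term found in a context matched by a rule.

    Returns None when the context is suppressed (weight of 0), meaning
    the term is not indexed from that context at all.
    """
    if weight == 0:
        # A weight of 0 suppresses indexing for the context entirely.
        return None
    # Assumed model: each occurrence counts `weight` times.
    return occurrences * weight


in_heading = weighted_occurrences(3, weight=2)   # heading weighted more heavily
in_body = weighted_occurrences(3)                # normal context
in_footer = weighted_occurrences(3, weight=0)    # footer ignored
```

Under this assumed model, three occurrences in a heading with a weight of 2 count as six, while the same word in a footer with a weight of 0 contributes nothing to the index.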

        - - The maximum number of results that can be returned for any search before returning an error; if the number - of documents in a result set exceeds this number, then staticSearch will not render the results and will provide a message - saying that the search returned too many results. This is usually set to 2000 by default, but you may want to have a higher or lower limit, - depending on the specific structure of your project. + + The set of context elements that identify + contexts for keyword-in-context fragments. - + - -

        This configuration option is meant to prevent errors for sites where a given set of filters or search terms - can return a set of document that can cause a browser's rendering engine to fail. For smaller collections, it's unlikely - that this limit would ever be reached, but setting a limit may be helpful for large document collections, projects that want to constrain the number - of possible results, or projects with memory-intensive or complex rendering.

        -
        - - The relative path (from the config file) to a text file containing a list of - stopwords (words to be ignored when indexing). + + A context definition, providing a match attribute that identifies the context, + allowing keyword-in-context fragments to be bounded by a specific context. + + + + - + + + + + + + + ERROR: If a context has a label, it must be a context for the purposes of indexing. + + + + + + + + + + + + -

        A stopword is a word that will not be indexed, because it is too - common (the, a, you - and so on). There are common stopwords files for most languages available on the Web, but - it is probably a good idea to take one of these and customize it for your project, since - there will be words in the website which are so common that it makes no sense to index - them, but they are not ordinary stopwords. For example, in a Website dedicated to the - work of John Keats, the name keats should probably be added - to the stopwords file, since almost every page will include it, and searching for it - will be pointless. The project has a built-in set of common stopwords for English, which - you'll find in xsl/english_stopwords.txt. One way to find appropriate stopwords - for your site is to generate your index, then search for the largest JSON index files that are - generated, to see if they might be too common to be useful as search terms. You can also use the - Word Frequency table in the generated staticSearch report (see ).

        -

        +

        When the indexer is extracting keyword-in-context strings for each word, it uses a common-sense + approach based on common element definitions, so that for example when it reaches the end of a paragraph, + it will not continue into the next paragraph to get more context words. You may have special runs of + text in your document collection which do not appear to be bounding contexts, but actually are; for + example, you may have span elements with class=note that appear in the middle + of sentences but are not actually part of them. Use context elements to identify these + special contexts so that the indexer knows the right boundaries from which to retrieve its + keyword-in-context strings.

        - - The relative path (from the config file) to a dictionary file (one word per line) which will be used to check - tokens when indexing. + + + The set of exclusions, expressed as exclude elements, that control the subset of documents + or filters used for a particular search. - + - -

        The indexing process checks each word as it builds the index, and keeps a record - of all words which are not found in the configured dictionary. Though this does not have - any direct effect in the indexing process, all words not found in the dictionary are listed - in the staticSearch report (see ). This can be very useful: - all words listed are either foreign (not part of the language of the dictionary) or perhaps - misspelled (in which case they may not be correctly stemmed and index, and should be - corrected). There is a default dictionary in xsl/english_words.txt which - you might copy and adapt if you're working in English; lots of dictionaries for other - languages are available on the Web.

        -
        - - The name of the output folder into which the index data and JavaScript will - be placed in the site search. This should conform with the - XML Name specification. + + + An exclusion definition, which excludes either documents or filters + as defined by an XPath in the match attribute. + + + - + + + + + + Index exclusion + An exclusion that specifies an HTML fragment (which itself can be the root HTML element) to exclude from the document index. + + + Filter exclusion + An exclusion that matches an HTML meta tag to exclude from the filter controls on the search page. + + + + -

        When the staticSearch build process creates its output, many files need to be - added to the website for which an index is being created. For convenience, all of - these files are stored in a single folder. This element is used to specify the - name of that folder. The default is staticSearch, - but if you would prefer something else, you can specify it here. You may also use this element - if you are defining two different searches within the same site, so that their files are kept in - different locations.

        +

        exclude can be used to identify documents or parts of documents that are to be omitted from + indexing, but, unlike setting weight to zero, should be retained during the indexing process. This is helpful in cases where the text itself should be ignored by the indexer, but should still appear in KWICs. Another common use is for multiple search engines/pages that each have their own special features; in this case, you may want one specific search index/page to ignore filter controls (HTML meta elements, as + described in ) which are provided to support other search pages.

        diff --git a/schema/staticSearch.rng b/schema/staticSearch.rng index 21b7c9a..c69304f 100644 --- a/schema/staticSearch.rng +++ b/schema/staticSearch.rng @@ -5,7 +5,7 @@ xmlns:xlink="http://www.w3.org/1999/xlink" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes" ns="http://hcmc.uvic.ca/ns/staticSearch"> - - - - - 2000 - - -
        + version - - - + diff --git a/xsl/convert_v1_to_v2.xsl b/xsl/convert_v1_to_v2.xsl index 1924aed..ef78369 100644 --- a/xsl/convert_v1_to_v2.xsl +++ b/xsl/convert_v1_to_v2.xsl @@ -102,29 +102,38 @@ file="{hcmc:getString(stopwordsFile, '')}"/> - - + + + + + + + + + + + + + + + + - - - +