From be8e589de22e47a3763295a6400ffde4d87edd84 Mon Sep 17 00:00:00 2001 From: Tom Strange Date: Wed, 5 Feb 2025 16:16:13 +0000 Subject: [PATCH] Docs: Update sentence normalisation page --- docs/source/explanation/index.rst | 8 +- docs/source/explanation/normalisation.rst | 202 +++++++--------------- 2 files changed, 63 insertions(+), 147 deletions(-) diff --git a/docs/source/explanation/index.rst b/docs/source/explanation/index.rst index e8e28fb..8a0a2c5 100644 --- a/docs/source/explanation/index.rst +++ b/docs/source/explanation/index.rst @@ -7,8 +7,10 @@ Post-processing of the sequence of labels and tokens is then used to populate th The figure below shows the processing pipelines used for training the model and parsing a sentence. -.. image:: /_static/pipelines.svg - :alt: Training and parsing pipelines +.. figure:: /_static/pipelines.svg + :alt: Training and parsing pipelines. + + Training and parsing pipelines. The **first** step is normalising the input sentence. The goal of normalisation is to transform certain aspects of the sentence into a standardised form to make it easier for the model to learn the correct labels, and make subsequent post-processing easier too. @@ -45,9 +47,9 @@ The :doc:`Post-processing ` page provides more details on this p :maxdepth: 1 :hidden: - Data Sentence Normalisation Feature Generation + Data Training Model Usage Post-processing diff --git a/docs/source/explanation/normalisation.rst b/docs/source/explanation/normalisation.rst index b00c05f..fcc5ba0 100644 --- a/docs/source/explanation/normalisation.rst +++ b/docs/source/explanation/normalisation.rst @@ -3,18 +3,19 @@ Sentence Normalisation ====================== -Normalisation is the process of transforming the sentences to ensure that particular features of the sentence have a standard form. 
This pre-process step is there to remove as much of the variation in the data that can be reasonably foreseen, so that the model is presented with tidy and consistent data and therefore has an easier time of learning or labelling. +Normalisation is the process of transforming the sentences to ensure that particular features of the sentence have a standardised form. +This pre-processing step is there to remove as much of the variation in the data as can be reasonably foreseen, so that the model is presented with tidy and consistent data and therefore has an easier time assigning the correct labels. -The :class:`PreProcessor` class handles the sentence normalisation for us. +The :class:`PreProcessor` class handles the sentence normalisation. .. code:: python >>> from Preprocess import PreProcessor >>> p = PreProcessor("1/2 cup orange juice, freshly squeezed") >>> p.sentence - '0.5 cup orange juice, freshly squeezed' + '#1$2 cup orange juice, freshly squeezed' -The normalisation of the input sentence is done immediately when the :class:`PreProcessor` class is instantiated. The :func:`_normalise` method of the :class:`PreProcessor` class is called, which executes a number of steps to clean up the input sentence. +The normalisation of the input sentence is done on initialisation of a :class:`PreProcessor` object. The :func:`_normalise` method of the :class:`PreProcessor` class is called, which executes a number of steps to clean up the input sentence. .. literalinclude:: ../../../ingredient_parser/en/preprocess.py :pyobject: PreProcessor._normalise @@ -22,169 +23,82 @@ The normalisation of the input sentence is done immediately when the :class:`Pre .. tip:: - By setting ``show_debug_output=True`` when instantiating the :class:`PreProcessor` class, the sentence will be printed out at each step of the normalisation process. 
+ By setting ``show_debug_output=True`` when instantiating a :class:`PreProcessor` object, the sentence will be printed out at each step of the normalisation process. -Each of the normalisation functions are detailed below. +Each of the normalisation steps is described below. +#. ``_replace_en_em_dash`` + En-dashes (`–`) and em-dashes (`—`) are replaced with hyphens (`-`). This makes identification of ranges of quantities easier. -``_replace_en_em_dash`` -^^^^^^^^^^^^^^^^^^^^^^^ +#. ``_replace_html_fractions`` + Fractions written as HTML entities (e.g. ``&frac12;`` for 0.5) are replaced with Unicode equivalents (e.g. ½). + This is done using the standard library's :func:`html.unescape` function. -En-dashes and em-dashes are replaced with hyphens. +#. ``_replace_unicode_fractions`` + Unicode fraction characters are replaced with a textual format (e.g. ½ as 1/2), as defined by the dictionary in this function. + Because the HTML fractions were converted to Unicode in the previous step, they are converted here too. -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._replace_en_em_dash - :dedent: 4 - - -``_replace_html_fractions`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Fractions represented by html entities (e.g. 0.5 as ``&frac12;``) are replaced with Unicode equivalents (e.g. ½). This is done using the standard library :func:`html.unescape` function. -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._replace_html_fractions - :dedent: 4 - - -``_replace_unicode_fractions`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Fractions represented by Unicode fractions are replaced a textual format (.e.g ½ as 1/2), as defined by the dictionary in this function. The next step (``_replace_fake_fractions``) will turn these into decimal numbers. -We have to handle two cases: where the character before the unicode fraction is a hyphen and where it is not. 
In the latter case, we want to insert a space before the replacement so we don't accidentally merge with the character before. However, if the character before is a hyphen, we don't want to do this because we could end up splitting a range up. - -.. literalinclude:: ../../../ingredient_parser/en/_constants.py - :lines: 197-233 - -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._replace_unicode_fractions - :dedent: 4 - -``combine_quantities_split_by_and`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Fractional quantities split by 'and' e.g. 1 and 1/2 are converted to the format described in `_identify_fractions`_. - -A regular expression is used to find these in the sentence. - -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 60-62 - -.. literalinclude:: ../../../ingredient_parser/en/_utils.py - :pyobject: combine_quantities_split_by_and + There are two cases to consider: where the character before the Unicode fraction is a hyphen and where it is not. + In the second case, we insert a space before the replacement so we don't accidentally merge with the character before. + For example, we want **1½** to become **1 1/2** and not **11/2**. -``_identify_fractions`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^ + However, if the character before is a hyphen, we don't want to do this because we could end up splitting a range up. + For example, we want **½-¾** to become **1/2-3/4** and not **1/2- 3/4** (note the space before the 3). -Fractions represented in a textual format (e.g. 1/2, 3/4) are identified and modified so that they survive tokenisation as a single token. -A regular expression is used to find these in the sentence. The regular expression also matches fractions greater than 1. +#. ``combine_quantities_split_by_and`` + Fractional quantities split by 'and' (e.g. 1 and 1/2) are converted to the format described in the next step. + We do this now instead of later to avoid treating the 1/2 on its own. 
-For fractions less than 1, the foward slash is replaced by ``$`` and a ``#`` is prepended e.g. #1$2 for 1/2. +#. ``_identify_fractions`` + All remaining fractions are modified so that they survive tokenisation as a single token. + This is necessary so that we can convert them to :class:`fractions.Fraction` objects later. -For fractions greater than 1, the foward slash is replaced by ``$`` and a ``#`` is inserted between the integer and the fraction e.g. 2#3$4 for 2 3/4. + For fractions less than 1, the forward slash is replaced by ``$`` and a ``#`` is prepended, e.g. **1/2** becomes **#1$2**. -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 7-10 + For fractions greater than 1, the forward slash is replaced by ``$`` and a ``#`` is inserted between the integer and the fraction, e.g. **2 3/4** becomes **2#3$4**. -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._identify_fractions - :dedent: 4 +#. ``_split_quantity_and_units`` + A space is enforced between quantities and units to make sure they are tokenised into separate tokens. + If a quantity and unit are joined by a hyphen, this is also replaced by a space. + This takes into account certain strings that aren't technically units, but that we want to treat in the same way here, for example **x** in the context **1x** or **2x**. +#. ``_remove_unit_trailing_period`` + Units with a trailing period have the period removed. + This is only done for a subset of units where this has been observed in the model training data. -``_split_quantity_and_units`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +#. ``replace_string_range`` + Ranges are replaced with a standardised form of **X-Y**. + A regular expression searches for ranges in the sentence that match anything in the following forms: -A space is enforced between quantities and units to make sure they are tokenized to separate tokens. If an quantity and unit are joined by a hyphen, this is also replaced by a space. 
This also takes into account certain strings that aren't technically units, but we want to treat in the same way here. + * 1 to 2 + * 1- to 2- + * 1 or 2 + * 1- or 2- -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 15-35 + where the numbers 1 and 2 represent any decimal value or fraction as modified above. -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._split_quantity_and_units - :dedent: 4 + The purpose of this is to ensure the range is kept as a single token. +#. ``_replace_dupe_units_ranges`` + Ranges where the unit is given for both quantities are replaced with the standardised range format, e.g. **5 oz - 8 oz** is replaced by **5-8 oz**. + Cases where the same unit is used but in different forms (e.g. 5 oz - 8 ounce) are also handled, using the unit synonyms defined in the ``UNIT_SYNONYMS`` constant. -``_remove_unit_trailing_period`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Units with a trailing period have the period removed. This is only done for a subset of units where this has been observed. - -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._remove_unit_trailing_period - :dedent: 4 +#. ``_merge_quantity_x`` + Quantities followed by an "x" are merged together so they form a single token, for example: + * 1 x -> 1x + * 0.5 x -> 0.5x -``replace_string_range`` -^^^^^^^^^^^^^^^^^^^^^^^^^ - -Ranges are replaced with a standardised form of X-Y. The regular expression that searches for ranges in the sentence matches anything in the following forms: - -* 1 to 2 -* 1- to 2- -* 1 or 2 -* 1- to 2- - -where the numbers 1 and 2 represent any decimal value. - -The purpose of this is to ensure the range is kept as a single token. - -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 37-58 - -.. 
literalinclude:: ../../../ingredient_parser/en/_utils.py - :pyobject: replace_string_range - -``_replace_dupe_units_ranges`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Ranges where the unit is given for both quantities are replaced with the standardised range format, e.g. 5 oz - 8 oz is replaced by 5-8 oz. Cases where the same unit is used, but in a different form (e.g. 5 oz - 8 ounce) are also considered for the unit synonyms defined in ``UNIT_SYNONYMS``. - -.. literalinclude:: ../../../ingredient_parser/en/_constants.py - :lines: 404-415 - -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 64-87 - -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._replace_dupe_units_ranges - :dedent: 4 - -``_merge_quantity_x`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Merge quantities followed by an "x" into a single token, for example: - -* 1 x -> 1x -* 0.5 x -> 0.5x - -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 89-98 - -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._merge_quantity_x - :dedent: 4 - -``_collapse_ranges`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Remove any white space surrounding the hyphen in a range - -.. literalinclude:: ../../../ingredient_parser/en/_regex.py - :lines: 101-105 - -.. literalinclude:: ../../../ingredient_parser/en/preprocess.py - :pyobject: PreProcessor._collapse_ranges - :dedent: 4 - +#. ``_collapse_ranges`` + Any whitespace surrounding the hyphen in a range is removed. -``_singlarise_unit`` -^^^^^^^^^^^^^^^^^^^^ -Units are made singular using a predefined list of plural units and their singular form. -This step is actually performed after tokenisation (see :doc:`Extracting the features `) and we keep track of the index of each token that has been singularised. This is so we can automatically re-pluralise only the tokens that were singularised after the labelling by the model. +Singularising units +^^^^^^^^^^^^^^^^^^^ -.. 
literalinclude:: ../../../ingredient_parser/en/_constants.py - :lines: 6-124 +Units are converted to their singular form, using a predefined mapping of plural units to their singular forms. +This step is actually performed after tokenisation so that we can keep track of the index of each token that has been modified. +Once the model has labelled the sentence, this lets us automatically re-pluralise only the tokens that were singularised.
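
The index tracking described in the "Singularising units" text can be illustrated with a minimal sketch. This is not the project's implementation: the mapping ``PLURAL_TO_SINGULAR`` and the functions ``singularise`` and ``repluralise`` are hypothetical names invented for this example, and the real plural-to-singular table is assumed to be much larger.

```python
# Hypothetical, tiny plural -> singular table (assumption for illustration only).
PLURAL_TO_SINGULAR = {
    "cups": "cup",
    "ounces": "ounce",
    "tablespoons": "tablespoon",
}


def singularise(tokens):
    """Return the singularised tokens and the indices that were modified."""
    singular_tokens = []
    changed = []
    for idx, token in enumerate(tokens):
        singular = PLURAL_TO_SINGULAR.get(token.lower())
        if singular is None:
            # Not a known plural unit: keep the token unchanged.
            singular_tokens.append(token)
        else:
            singular_tokens.append(singular)
            changed.append(idx)
    return singular_tokens, changed


def repluralise(tokens, changed, original):
    """Restore the original token only at the indices recorded as modified."""
    changed_set = set(changed)
    return [original[i] if i in changed_set else tok for i, tok in enumerate(tokens)]


tokens = ["2", "cups", "flour"]
singular, changed = singularise(tokens)
restored = repluralise(singular, changed, tokens)
```

Recording the indices, rather than re-pluralising every unit-like token afterwards, means a token that was already singular in the input is never turned into a plural by mistake.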