Docs: Update sentence normalisation page
strangetom committed Feb 5, 2025
1 parent c699931 commit be8e589
Showing 2 changed files with 63 additions and 147 deletions.
8 changes: 5 additions & 3 deletions docs/source/explanation/index.rst
@@ -7,8 +7,10 @@ Post-processing of the sequence of labels and tokens is then used to populate th

The figure below shows the processing pipelines used for training the model and parsing a sentence.

.. image:: /_static/pipelines.svg
:alt: Training and parsing pipelines
.. figure:: /_static/pipelines.svg
:alt: Training and parsing pipelines.

Training and parsing pipelines.

The **first** step is normalising the input sentence.
The goal of normalisation is to transform certain aspects of the sentence into a standardised form to make it easier for the model to learn the correct labels, and make subsequent post-processing easier too.
@@ -45,9 +47,9 @@ The :doc:`Post-processing <postprocessing>` page provides more details on this p
:maxdepth: 1
:hidden:

Data <data>
Sentence Normalisation <normalisation>
Feature Generation <features>
Data <data>
Training <training>
Model Usage <usage>
Post-processing <postprocessing>
202 changes: 58 additions & 144 deletions docs/source/explanation/normalisation.rst
@@ -3,188 +3,102 @@
Sentence Normalisation
======================

Normalisation is the process of transforming the sentences to ensure that particular features of the sentence have a standard form. This pre-process step is there to remove as much of the variation in the data that can be reasonably foreseen, so that the model is presented with tidy and consistent data and therefore has an easier time of learning or labelling.
Normalisation is the process of transforming the sentences to ensure that particular features of the sentence have a standardised form.
This pre-processing step removes as much of the foreseeable variation in the data as possible, so that the model is presented with tidy and consistent data and therefore has an easier time assigning the correct labels.

The :class:`PreProcessor` class handles the sentence normalisation for us.
The :class:`PreProcessor` class handles the sentence normalisation.

.. code:: python
>>> from Preprocess import PreProcessor
>>> p = PreProcessor("1/2 cup orange juice, freshly squeezed")
>>> p.sentence
'0.5 cup orange juice, freshly squeezed'
'#1$2 cup orange juice, freshly squeezed'
The normalisation of the input sentence is done immediately when the :class:`PreProcessor` class is instantiated. The :func:`_normalise` method of the :class:`PreProcessor` class is called, which executes a number of steps to clean up the input sentence.
The normalisation of the input sentence is done on initialisation of a :class:`PreProcessor` object. The :func:`_normalise` method of the :class:`PreProcessor` class is called, which executes a number of steps to clean up the input sentence.

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._normalise
:dedent: 4

.. tip::

By setting ``show_debug_output=True`` when instantiating the :class:`PreProcessor` class, the sentence will be printed out at each step of the normalisation process.
By setting ``show_debug_output=True`` when instantiating a :class:`PreProcessor` object, the sentence will be printed out at each step of the normalisation process.

Each of the normalisation functions are detailed below.
Each of the normalisation steps is described below.

#. ``_replace_en_em_dash``
En-dashes (U+2013) and em-dashes (U+2014) are replaced with hyphens (``-``). This makes identification of ranges of quantities easier.

``_replace_en_em_dash``
^^^^^^^^^^^^^^^^^^^^^^^
#. ``_replace_html_fractions``
Fractions written as html entities (e.g. ``&frac12;`` for 0.5) are replaced with Unicode equivalents (e.g. ½).
This is done using the standard library's :func:`html.unescape` function.

En-dashes and em-dashes are replaced with hyphens.
#. ``_replace_unicode_fractions``
Fractions represented by Unicode characters are replaced with a textual format (e.g. ½ as 1/2), as defined by the dictionary in this function.
Because the html fractions were replaced with Unicode fractions in the previous step, they are converted here too.

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._replace_en_em_dash
:dedent: 4


``_replace_html_fractions``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fractions represented by html entities (e.g. 0.5 as ``&frac12;``) are replaced with Unicode equivalents (e.g. ½). This is done using the standard library :func:`html.unescape` function.

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._replace_html_fractions
:dedent: 4
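
As a minimal sketch of this step (the function name below is a hypothetical stand-in, not the library's own code), :func:`html.unescape` from the standard library already converts fraction entities to their Unicode characters:

```python
import html


def replace_html_fractions(sentence: str) -> str:
    """Hypothetical stand-in: convert HTML entities, including fractions, to Unicode."""
    return html.unescape(sentence)


print(replace_html_fractions("&frac12; cup sugar"))    # ½ cup sugar
print(replace_html_fractions("1&frac34; cups flour"))  # 1¾ cups flour
```

Note that :func:`html.unescape` converts every HTML entity in the sentence, not only fractions.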


``_replace_unicode_fractions``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fractions represented by Unicode characters are replaced with a textual format (e.g. ½ as 1/2), as defined by the dictionary in this function. The next step (``_replace_fake_fractions``) will turn these into decimal numbers.

We have to handle two cases: where the character before the unicode fraction is a hyphen and where it is not. In the latter case, we want to insert a space before the replacement so we don't accidentally merge with the character before. However, if the character before is a hyphen, we don't want to do this because we could end up splitting a range up.

.. literalinclude:: ../../../ingredient_parser/en/_constants.py
:lines: 197-233

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._replace_unicode_fractions
:dedent: 4
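
The two cases described above (hyphen before the fraction, or anything else) can be sketched as follows. This is an illustrative approximation only: the dictionary is a small assumed subset of the real one, and the regular expression is not taken from the library.

```python
import re

# Assumed subset for illustration; the library's dictionary covers more characters.
UNICODE_FRACTIONS = {"½": "1/2", "¼": "1/4", "¾": "3/4", "⅓": "1/3", "⅔": "2/3"}

# Capture an optional preceding hyphen together with the fraction character.
_PATTERN = re.compile("(-?)([" + "".join(UNICODE_FRACTIONS) + "])")


def replace_unicode_fractions(sentence: str) -> str:
    def _sub(match: re.Match) -> str:
        text = UNICODE_FRACTIONS[match.group(2)]
        if match.group(1) == "-":
            # Part of a range: keep the hyphen attached, no space inserted.
            return "-" + text
        # Otherwise insert a space so "1½" does not become "11/2".
        return " " + text

    return _PATTERN.sub(_sub, sentence)


print(replace_unicode_fractions("1½ cups milk"))        # 1 1/2 cups milk
print(replace_unicode_fractions("½-¾ cup oil").strip())  # 1/2-3/4 cup oil
```

In this sketch a fraction at the very start of the sentence gains a leading space, which is why the second example strips the result.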

``combine_quantities_split_by_and``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fractional quantities split by 'and' e.g. 1 and 1/2 are converted to the format described in `_identify_fractions`_.

A regular expression is used to find these in the sentence.

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 60-62

.. literalinclude:: ../../../ingredient_parser/en/_utils.py
:pyobject: combine_quantities_split_by_and
There are two cases to consider: where the character before the unicode fraction is a hyphen and where it is not.

In the second case, we insert a space before the replacement so we don't accidentally merge with the character before.
For example, we want **1½** to become **1 1/2** and not **11/2**.

``_identify_fractions``
^^^^^^^^^^^^^^^^^^^^^^^^^^^
However, if the character before is a hyphen, we don't want to do this because we could end up splitting a range up.
For example, we want **½-¾** to become **1/2-3/4** and not **1/2- 3/4** (note the space before the 3).

Fractions represented in a textual format (e.g. 1/2, 3/4) are identified and modified so that they survive tokenisation as a single token.
A regular expression is used to find these in the sentence. The regular expression also matches fractions greater than 1.
#. ``combine_quantities_split_by_and``
Fractional quantities split by 'and', e.g. 1 and 1/2, are converted to the format described in the next step.
We do this now instead of later to avoid treating the 1/2 on its own.

For fractions less than 1, the forward slash is replaced by ``$`` and a ``#`` is prepended e.g. #1$2 for 1/2.
#. ``_identify_fractions``
All remaining fractions are modified so that they survive tokenisation as a single token.
This is necessary so that we can convert them to :class:`fractions.Fraction` objects later.

For fractions greater than 1, the forward slash is replaced by ``$`` and a ``#`` is inserted between the integer and the fraction e.g. 2#3$4 for 2 3/4.
For fractions less than 1, the forward slash is replaced by ``$`` and a ``#`` is prepended e.g. **1/2** becomes **#1$2**.

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 7-10
For fractions greater than 1, the forward slash is replaced by ``$`` and a ``#`` is inserted between the integer and the fraction e.g. **2 3/4** becomes **2#3$4**.

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._identify_fractions
:dedent: 4
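
The ``#`` and ``$`` encoding described above can be sketched with a single regular expression. The pattern and both function names here are assumptions made for illustration, not the library's implementation; ``to_fraction`` shows one way such a token could later be turned into a :class:`fractions.Fraction`.

```python
import re
from fractions import Fraction

# Assumed pattern: an optional whole number, then numerator/denominator.
FRACTION_PATTERN = re.compile(r"(?:(\d+)\s+)?(\d+)/(\d+)")


def identify_fractions(sentence: str) -> str:
    def _sub(match: re.Match) -> str:
        whole, numerator, denominator = match.groups()
        # "1/2" -> "#1$2" and "2 3/4" -> "2#3$4"
        return f"{whole or ''}#{numerator}${denominator}"

    return FRACTION_PATTERN.sub(_sub, sentence)


def to_fraction(token: str) -> Fraction:
    """Hypothetical decoder for the encoded token."""
    whole, _, frac = token.partition("#")
    numerator, _, denominator = frac.partition("$")
    return Fraction(int(whole or 0)) + Fraction(int(numerator), int(denominator))


print(identify_fractions("1/2 cup orange juice"))  # #1$2 cup orange juice
print(identify_fractions("2 3/4 cups flour"))      # 2#3$4 cups flour
print(to_fraction("2#3$4"))                        # 11/4
```

Because the encoded form contains no spaces or slashes, a tokeniser that splits on those characters keeps each quantity as one token.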
#. ``_split_quantity_and_units``
A space is enforced between quantities and units to make sure they are tokenized to separate tokens.
If a quantity and unit are joined by a hyphen, this is also replaced by a space.
This takes into account certain strings that aren't technically units, but we want to treat in the same way here, for example **x** in the context **1x** or **2x**.

#. ``_remove_unit_trailing_period``
Units with a trailing period have the period removed.
This is only done for a subset of units where this has been observed in the model training data.

``_split_quantity_and_units``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. ``replace_string_range``
Ranges are replaced with a standardised form of **X-Y**.
A regular expression searches for ranges in the sentence that match anything in the following forms:

A space is enforced between quantities and units to make sure they are tokenized to separate tokens. If a quantity and unit are joined by a hyphen, this is also replaced by a space. This also takes into account certain strings that aren't technically units, but we want to treat in the same way here.
* 1 to 2
* 1- to 2-
* 1 or 2
* 1- or 2-

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 15-35
where the numbers 1 and 2 represent any decimal value or fraction as modified above.

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._split_quantity_and_units
:dedent: 4
The purpose of this is to ensure the range is kept as a single token.

#. ``_replace_dupe_units_ranges``
Ranges where the unit is given for both quantities are replaced with the standardised range format, e.g. **5 oz - 8 oz** is replaced by **5-8 oz**.
Cases where the same unit is used but in different forms (e.g. 5 oz - 8 ounce) are also considered for the unit synonyms defined in the ``UNIT_SYNONYMS`` constant.

``_remove_unit_trailing_period``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Units with a trailing period have the period removed. This is only done for a subset of units where this has been observed.

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._remove_unit_trailing_period
:dedent: 4
#. ``_merge_quantity_x``
Quantities followed by an "x" are merged together so they form a single token, for example:

* 1 x -> 1x
* 0.5 x -> 0.5x

``replace_string_range``
^^^^^^^^^^^^^^^^^^^^^^^^^

Ranges are replaced with a standardised form of X-Y. The regular expression that searches for ranges in the sentence matches anything in the following forms:

* 1 to 2
* 1- to 2-
* 1 or 2
* 1- or 2-

where the numbers 1 and 2 represent any decimal value.

The purpose of this is to ensure the range is kept as a single token.

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 37-58

.. literalinclude:: ../../../ingredient_parser/en/_utils.py
:pyobject: replace_string_range
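
A rough sketch of the range replacement for the forms listed above is given below. The pattern is a simplified assumption that only handles integer and decimal values, and the handling of a trailing hyphen (as in "2-inch") is a guess rather than the library's documented behaviour.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# Assumed pattern: number, optional hyphen, "to" or "or", then another number.
STRING_RANGE = re.compile(
    rf"(?<![\d.])({NUMBER})-?\s+(?:to|or)\s+({NUMBER})", re.IGNORECASE
)


def replace_string_range(sentence: str) -> str:
    # Rewrite "X to Y" / "X or Y" as the standardised "X-Y".
    return STRING_RANGE.sub(r"\1-\2", sentence)


print(replace_string_range("1 to 2 cups sugar"))    # 1-2 cups sugar
print(replace_string_range("1 or 2 eggs"))          # 1-2 eggs
print(replace_string_range("0.5 to 1.5 tsp salt"))  # 0.5-1.5 tsp salt
print(replace_string_range("1- to 2-inch piece"))   # 1-2-inch piece
```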

``_replace_dupe_units_ranges``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ranges where the unit is given for both quantities are replaced with the standardised range format, e.g. 5 oz - 8 oz is replaced by 5-8 oz. Cases where the same unit is used, but in a different form (e.g. 5 oz - 8 ounce) are also considered for the unit synonyms defined in ``UNIT_SYNONYMS``.

.. literalinclude:: ../../../ingredient_parser/en/_constants.py
:lines: 404-415

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 64-87

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._replace_dupe_units_ranges
:dedent: 4

``_merge_quantity_x``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Merge quantities followed by an "x" into a single token, for example:

* 1 x -> 1x
* 0.5 x -> 0.5x

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 89-98

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._merge_quantity_x
:dedent: 4

``_collapse_ranges``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Remove any white space surrounding the hyphen in a range

.. literalinclude:: ../../../ingredient_parser/en/_regex.py
:lines: 101-105

.. literalinclude:: ../../../ingredient_parser/en/preprocess.py
:pyobject: PreProcessor._collapse_ranges
:dedent: 4

#. ``_collapse_ranges``
Any white space surrounding the hyphen in a range is removed, e.g. **1 - 2** becomes **1-2**.
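
The last two steps are simple enough to sketch with one substitution each. The patterns below are illustrative assumptions and are not copied from the library's ``_regex.py``.

```python
import re


def merge_quantity_x(sentence: str) -> str:
    # "1 x" -> "1x"; the word boundary avoids matching words that start with x.
    return re.sub(r"(\d+(?:\.\d+)?)\s+[xX]\b", r"\1x", sentence)


def collapse_ranges(sentence: str) -> str:
    # Remove white space on either side of a hyphen that sits between digits.
    return re.sub(r"(\d)\s*-\s*(\d)", r"\1-\2", sentence)


print(merge_quantity_x("1 x 400 g can tomatoes"))  # 1x 400 g can tomatoes
print(merge_quantity_x("0.5 x"))                   # 0.5x
print(collapse_ranges("8 - 10 g yeast"))           # 8-10 g yeast
```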

``_singlarise_unit``
^^^^^^^^^^^^^^^^^^^^

Units are made singular using a predefined list of plural units and their singular form.

This step is actually performed after tokenisation (see :doc:`Extracting the features <features>`) and we keep track of the index of each token that has been singularised. This is so we can automatically re-pluralise only the tokens that were singularised after the labelling by the model.
Singularising units
^^^^^^^^^^^^^^^^^^^

.. literalinclude:: ../../../ingredient_parser/en/_constants.py
:lines: 6-124
Units are converted to their singular form, using a predefined list of plural units and their singular form.
This step is actually performed after tokenisation so that we can keep track of the index of each token that has been modified.
This is so we can automatically re-pluralise only the tokens that were singularised after the labelling by the model.
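
The index tracking described above might look something like the following sketch. The mapping and both function names are hypothetical and much smaller than the library's predefined list; the point is only that tokens are re-pluralised by index, so a unit that was singular in the original sentence stays singular.

```python
# Hypothetical, tiny mapping for illustration only.
PLURAL_TO_SINGULAR = {"cups": "cup", "tablespoons": "tablespoon", "ounces": "ounce"}
SINGULAR_TO_PLURAL = {v: k for k, v in PLURAL_TO_SINGULAR.items()}


def singularise(tokens: list[str]) -> tuple[list[str], list[int]]:
    """Return singularised tokens and the indices of the tokens that changed."""
    result, changed = [], []
    for index, token in enumerate(tokens):
        singular = PLURAL_TO_SINGULAR.get(token)
        if singular is None:
            result.append(token)
        else:
            result.append(singular)
            changed.append(index)
    return result, changed


def repluralise(tokens: list[str], changed: list[int]) -> list[str]:
    """Restore the plural form only at the recorded indices."""
    return [
        SINGULAR_TO_PLURAL.get(token, token) if index in changed else token
        for index, token in enumerate(tokens)
    ]


tokens = ["1", "cup", "plus", "2", "cups", "flour"]
singular, changed = singularise(tokens)
print(singular)                        # ['1', 'cup', 'plus', '2', 'cup', 'flour']
print(changed)                         # [4]
print(repluralise(singular, changed))  # ['1', 'cup', 'plus', '2', 'cups', 'flour']
```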
