Commit

explain use of UTF-8 BOM
ian-hoyle committed Oct 16, 2024
1 parent 665e752 commit a96216d
Showing 1 changed file with 8 additions and 10 deletions.
@@ -1,4 +1,4 @@
# 35. Metadata CSV file Validation
# 35. Metadata CSV file UTF-8 Validation

**Date**: 2024-10-07

@@ -12,14 +12,6 @@ The structure and integrity of the CSV file is checked before the actual [data i

1. **Virus check**: All files uploaded to TDR are checked for viruses using the virus check lambda
2. **[UTF-8](utf-8-validation)**: The CSV file must be in UTF-8 format
3. **[Valid CSV file](valid-csv-file)**: Is it a CSV file
4. **[Required columns](required-columns)**: Certain columns are required (such as closure information columns) for all transfers

### UTF-8 Validation

Technically it is not possible to [guarantee](https://www.quora.com/How-do-you-make-sure-a-file-is-UTF-8-encoded#:~:text=Technically%20you%20can't.,supplier%20how%20she%20encoded%20it) a file is UTF-8 encoded.
- "Technically you can’t. Files containing characters may be encoded in a wide variety of different ways, of which plain ASCII and UTF-8 are just 2 encodings.
- Basically to read a text file given to you by another party, the *only* reliable way to know its encoding is to ask the supplier how she encoded it. Absent that, you simply don’t know what encoding it has."
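
The point in the quotation can be illustrated with a minimal Python sketch. The sample string is an assumption chosen for illustration: the same bytes decode without error under two different encodings, so a successful decode alone does not prove which encoding the author intended.

```python
# The bytes produced by encoding "café" as UTF-8.
data = "café".encode("utf-8")  # b'caf\xc3\xa9'

# Decoding as UTF-8 recovers the original text.
as_utf8 = data.decode("utf-8")

# Decoding the very same bytes as Latin-1 also succeeds,
# but yields different (garbled) text: every byte is a valid Latin-1 character.
as_latin1 = data.decode("latin-1")

print(as_utf8)
print(as_latin1)
```

Neither decode raises an error, which is why a check can only ever say that a file is *consistent with* UTF-8, not that it was written as UTF-8.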

### UTF-8 'validation' options
- Check leading BOM 'EF BB BF'
@@ -28,5 +20,11 @@ Technically it is not possible to [guarantee](https://www.quora.com/How-do-you-m
The metadata CSV files are prepared by the transferring body. It is expected that **Microsoft Excel** will be used, and users can be required to save with the UTF-8 option.
Files saved in this manner will include the BOM.

The tdr-draft-metadata-validator will use the presence of the BOM (Byte order mark) at the beginning of the file. This indicates the file has been created with **Microsoft Excel** and will be UTF-8.
The tdr-draft-metadata-validator will check for the presence of the BOM (byte order mark) at the beginning of the file. Its presence indicates the file may have been created with **Microsoft Excel** and explicitly saved as UTF-8.
This method is restrictive, as other spreadsheet software such as LibreOffice (Linux) and Numbers (Mac) does not add the BOM.
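
A BOM check of this kind can be sketched in a few lines of Python. This is an illustration only, not the tdr-draft-metadata-validator implementation: the function name `has_utf8_bom` and the sample column names are assumptions made for the example.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (bytes EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical CSV content; the column names are made up for this example.
content = "Filename,Closure type\n".encode("utf-8")

# A file that begins with the BOM, followed by the content.
with_bom = codecs.BOM_UTF8 + content
# The same content written without a BOM.
without_bom = content

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

The second input is perfectly valid UTF-8 yet fails the check, which is the restriction described above.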

The TNA CSV validator uses the TNA utf8-validator library. This library checks for invalid byte sequences and invalid single bytes. It does not check for the BOM, which suggests a file was explicitly saved as UTF-8 from Microsoft Excel.
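
For comparison, a byte-level check can be approximated with Python's built-in strict UTF-8 decoder. This is a stand-in used for illustration, not the TNA utf8-validator library or its API, and the function name `is_valid_utf8` is an assumption.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is well-formed UTF-8."""
    try:
        # Strict decoding raises on invalid single bytes and malformed sequences.
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


valid_no_bom = b"caf\xc3\xa9"   # "café" in UTF-8, no BOM
latin1_bytes = b"caf\xe9"       # "café" in Latin-1; 0xE9 starts an incomplete UTF-8 sequence
stray_byte = b"abc\xffdef"      # 0xFF never appears in well-formed UTF-8

print(is_valid_utf8(valid_no_bom))
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(stray_byte))
```

Note that `valid_no_bom` passes this check although it carries no BOM, so the two approaches can disagree on the same file.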

Whilst TDR requires the user to save the metadata CSV in UTF-8 format from **Microsoft Excel**, the BOM validation method will be used and no further UTF-8 checks will be made.

