A validator for eQTL summary statistics TSV files, to be run prior to conversion to HDF5. The validator uses pandas_schema.
git clone https://github.com/eQTL-Catalogue/eQTL-sumstats-file-validator.git
cd eQTL-sumstats-file-validator/
pip install -r requirements.txt
pip install .
To run the validator on a file:
eqss-validate -f <file_to_validate.tsv> --logfile <logfile_name>
Information and errors are logged to the console, and errors are also written to the specified log file. Console output might look like:
(INFO): Validating headers...
(INFO): Validating file...
(ERROR): Length of row 7 is: 16 instead of 15
(ERROR): Please fix the table. Some rows have different numbers of columns to the header
(INFO): Rows with different numbers of columns to the header are not validated
(ERROR): {row: 1, column: "p_value"}: "-99" was not in the range [0, 1)
The errors in this output tell us that row 7 has too many columns (16 instead of 15) and that row 1 does not have a valid p-value. If these rows are not fixed, they will later be dropped and not converted to HDF5.
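The two kinds of error shown in the output can be illustrated with a minimal standard-library sketch. This is not the validator's actual implementation (which uses pandas_schema); the function name `check_rows`, the in-memory sample data, and the 1-based row numbering are assumptions made for illustration only.

```python
import csv
import io


def check_rows(handle, pvalue_column="p_value"):
    """Return a list of error strings for a tab-separated file handle.

    Hypothetical helper: it only mimics the two checks discussed above.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    p_idx = header.index(pvalue_column)
    errors = []
    # Assumption: data rows are numbered from 1, excluding the header.
    for row_num, row in enumerate(reader, start=1):
        # Rows whose length differs from the header are reported and skipped.
        if len(row) != len(header):
            errors.append(
                f"Length of row {row_num} is: {len(row)} instead of {len(header)}"
            )
            continue
        # The p-value must parse as a number in the half-open range [0, 1).
        try:
            p = float(row[p_idx])
        except ValueError:
            p = None
        if p is None or not (0 <= p < 1):
            errors.append(
                f'{{row: {row_num}, column: "{pvalue_column}"}}: '
                f'"{row[p_idx]}" was not in the range [0, 1)'
            )
    return errors


# Made-up sample: row 1 has an invalid p-value, row 3 has an extra column.
sample = "variant\tp_value\nrs1\t-99\nrs2\t0.05\nrs3\t0.5\textra\n"
errors = check_rows(io.StringIO(sample))
for message in errors:
    print(message)
```

As in the validator's output, a row with the wrong number of columns is reported but not checked any further.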
- --linelimit : int, default 1000
  Once this number of erroneous rows has been reached, stop looking for more.
- --drop-bad-lines : bool, default False
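The `--linelimit` behaviour described above amounts to an early exit once enough erroneous rows have been found. The sketch below illustrates that idea only; how the validator implements it internally is not shown here, and `find_bad_rows` and `is_bad` are hypothetical names.

```python
def find_bad_rows(rows, is_bad, linelimit=1000):
    """Collect indices of bad rows, stopping once `linelimit` have been found."""
    bad = []
    for index, row in enumerate(rows):
        if is_bad(row):
            bad.append(index)
            # Stop scanning as soon as the limit is reached.
            if len(bad) >= linelimit:
                break
    return bad


def pvalue_out_of_range(row):
    """Hypothetical check: the single field must lie in [0, 1)."""
    return not (0 <= float(row[0]) < 1)


# Made-up rows: three of the five have an out-of-range value.
rows = [["0.5"], ["-99"], ["2"], ["7"], ["0.1"]]
bad = find_bad_rows(rows, pvalue_out_of_range, linelimit=2)
print(bad)
```

With `linelimit=2` the scan stops after the second bad row, so the third one is never examined.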