Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packaging dep and IcfMetadata.__eq__ #322

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ACEnglish
Copy link

Change 1

After creating a clean python3 -m venv b2z and python3 -m pip install . in bio2zarr, running vcf2zarr raised the
error:

<truncated>/python3.13/site-packages/bio2zarr/zarr_utils.py", line 2, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'

After manually installing python3 -m pip install packaging, the error was no longer raised. Therefore,
adding packaging to the requirements in pyproject.toml, making a new venv, reinstalling, and successfully
ran vcf2zarr.

Change 2

When exploding VCFs with different Header field order, the error was raised

  File "<truncated>/python3.13/site-packages/bio2zarr/vcf2zarr/icf.py", line 328, in scan_vcfs
    raise ValueError("Incompatible VCF chunks")
ValueError: Incompatible VCF chunks

To fix this, add order to relevant dataclasses, add __eq__ on IcfMetadata and call sorted for field comparison

Example data:
test1.vcf

##fileformat=VCFv4.1
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##INFO=<ID=QNAME,Number=1,Type=String,Description="Query name">
##INFO=<ID=QSTART,Number=1,Type=Integer,Description="Query start">
##INFO=<ID=QSTRAND,Number=1,Type=String,Description="Query strand">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA12878
chr21	64506	.	C	CTCCAT	60	.	QNAME=cluster19_000000F;QSTART=25698928;QSTRAND=-	GT	1|0

test2.vcf

##fileformat=VCFv4.1
##contig=<ID=chr20,length=64444167>
##contig=<ID=chr21,length=46709983>
##INFO=<ID=QSTART,Number=1,Type=Integer,Description="Query start">
##INFO=<ID=QNAME,Number=1,Type=String,Description="Query name">
##INFO=<ID=QSTRAND,Number=1,Type=String,Description="Query strand">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA12878
chr20	63971	.	TTCCATTCCAC	T	60	.	QNAME=cluster19_000000F;QSTART=25698928;QSTRAND=-	GT	1|0

@coveralls
Copy link
Collaborator

coveralls commented Feb 23, 2025

Coverage Status

coverage: 98.812% (-0.08%) from 98.888%
when pulling 818ced8 on ACEnglish:main
into d851fc3 on sgkit-dev:main.

Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @ACEnglish!

I think the solution is slightly over cooking it, it it's easy enough to just compare on the fields we care about here. We'd like to add your test case to the vcf examples. Can you:

  1. Create a new directory, tests/data/vcf/out_of_order_fields and put your test1.vcf and test2.vcf in there (in bgzip format). Also add the csi or tabix indexes, whichever suits best. Add the directory to git.
  2. Add a new test class to tests/test_vcf_examples.py TestOutOfOrderFields``, and add a few simple tests to it (following, say, this one:
    class TestSmallExampleLocalAlleles:
    ) The details will slightly differ as you'll be passing multiple files to explode.
  3. Update the TestDu class here to account for the new files (sorry, this is a stupid test, I really need to think of a better way to do this!)

def __eq__(self, other):
if not isinstance(other, IcfMetadata):
return NotImplemented
return (sorted(self.samples) == sorted(other.samples) and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting the samples will lead to corruption I think, as we assume that the samples are all in the same order currently (that could be fixed, easily enough, though). How about we just add the sorted on the self.fields attribute, and keep all the others as plain ==?

class Filter:
id: str
description: str = ""


@dataclasses.dataclass
@dataclasses.dataclass(order=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need order=True here, do we? We're just defining the eq method, and don't need to sort the header definitions.

@ACEnglish
Copy link
Author

Hello,

Thanks for the feedback. I've added tests and made some corrections. I was unsure about if Filters/Format fields needed sorting, so I made them very out of order in the input.bcf files and found that indeed they didn't need order=True or sorted to pass the tests.

I also figured out this pre-commit which is new to me so hopefully there won't be any CI/CD fails.

Let me know if there's more to be done. I'm happy to help.

Have a great day,
~/Adam

@jeromekelleher
Copy link
Contributor

Looks great, thanks @ACEnglish ! Good to merge.

It would be nice to squash the commits down to two, if you don't mind (but no big deal if you don't want to/can't be bothered etc)

adding packing dependency and allowing header INFO fields to be in
different orders between multiple VCFs
@ACEnglish
Copy link
Author

Commits squished. I'm not sure if the du tests are going to pass. But everything I can see from tests on a mac seems fine.

@jeromekelleher
Copy link
Contributor

Thanks @ACEnglish, this is great. I'll merge it next week and follow up with a release (this week is deadline heavy, I'm afraid!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants