Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: store URIs for ingested VCFs #15

Merged
merged 17 commits into from
Nov 29, 2024
Merged

feat!: store URIs for ingested VCFs #15

merged 17 commits into from
Nov 29, 2024

Conversation

jsstevenson
Copy link
Contributor

@jsstevenson jsstevenson commented Nov 20, 2024

close #2

% vrsix load --help
Usage: vrsix load [OPTIONS] VCF [URI]

  Index the VRS annotations in a VCF by loading it into the sqlite DB.

  Optionally provide a custom file URI to describe how to retrieve VCF records
  after index lookup:

      % vrsix load input.vcf gs://my_storage/input.vcf

Options:
  --db-location PATH
  --help              Show this message and exit.
  • pass an arg to the load command to declare a custom file URI (e.g. above). Otherwise, use the local file address as the URI.
  • creates a second table within the sqlite DB for storing file URIs. Will need to make basic updates to fetch methods (just add a JOIN to get to the URIs table).
  • uri can be an arbitrary string here -- not sure if it's worth trying to validate upon ingest. I'm thinking it could either be a file descriptor a la file:///Users/james/code/my_vcf.vcf or a Google cloud storage address like gs://fc-secure-00000000-0000-0000-0000-00000000/files/input.vcf. My hope is that one string-based file designator is enough for a user to plug into their own workflow as needed.
  • added some more fine-grained tests directly into the Rust side for my own debugging.

Oh, and this will of course require an update to any fetch methods in the toolkit in order to retrieve it.

@jsstevenson jsstevenson changed the title feat: store URIs for ingested VCFs feat!: store URIs for ingested VCFs Nov 22, 2024
@jsstevenson jsstevenson requested a review from quinnwai November 22, 2024 20:58
Copy link

@quinnwai quinnwai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, minor comments. Anything I need to test for this PR?

README.md Outdated
@@ -8,6 +8,12 @@ From a VCF, ingest a VRS ID and the corresponding VCF-called location (i.e. suff
% vrsix load chr1.vcf
```

By default, this uses the input file's location in the local file system as a URI, but a custom URI can be declared:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly add it as more declarative:

... If you want to use custom URI instead, specify the URI as a second argument

what was the reasoning behind adding this as an argument vs a flag? Ig it's trivial in terms of functionality but just curious

Copy link
Contributor Author

@jsstevenson jsstevenson Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't feel super strongly about it... I think it felt a little bit more ergonomic to do it this way, but now that you mention it I'm kind of torn

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's not much else needed in terms of flag, I'm good passing in as an optional argument for now.


#[tokio::test]
async fn test_load_file_uri() {
let temp_file = NamedTempFile::new().expect("Failed to create temp file");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

help me out here, so what is this file testing? I'm unfamiliar with Rust so not sure how the db file is being created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good q. I think I just wanted to get a minimal test happening inside the Rust code, because failures that happen on the Rust side are often fairly opaque in the context of the Python integration tests (if it's not something you're explicitly looking for, it just looks like a panic without much detail). This test case literally just loads a single entry into the file_uris table and then verifies that it works and it gets the ID you expect for it. (the expect call on this specific line is sort of a red herring, the compiler is happy when you cover edge cases like a failure to create a tempfile when constructing an input for a test).

I'd originally just written end to end tests from the Python side to make sure things work the whole way through, but if/as this grows in complexity, it's probably smarter to focus more testing within the Rust code because it's much more granular/interpretable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha and for my understanding, how is the sqlite:/// interpreted? I was expecting to see an file exntension like .sqlite on the filename for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just there as a scheme declaration to indicate a sqlite file, I guess. I think it's an expectation imposed by the DB client library. Just checked and this test fails if you change it.

I don't know if there's really an accepted file extension for sqlite files (i just googled it and this stackoverflow answer is very unsatisfying). It's definitely not something we try to validate anywhere, although we could if it seems worthwhile.

@jsstevenson
Copy link
Contributor Author

jsstevenson commented Nov 26, 2024

Anything I need to test for this PR?

Theoretically the tests are all running in CI so... hopefully it's all going the way it looks like it should? Maybe some outstanding considerations could be

  • edge cases for URIs. Should any validation be performed or can the user just put whatever?
  • would any extra metadata need to be associated with the file URI?
  • Not necessarily related to this PR, but see if the development instructions work for you?

@jsstevenson jsstevenson requested a review from quinnwai November 26, 2024 21:26
Copy link

@quinnwai quinnwai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, feel free to close any other comments I've made.

edge cases for URIs. Should any validation be performed or can the user just put whatever?

since this is data provenance I think it's okay to have no validation for now. Log as a question/issue?

would any extra metadata need to be associated with the file URI?

I don't think so, only other thing I can think of is the index file.

Not necessarily related to this PR, but see if the development instructions work for you?

README setup comments:

  • typo, need to fix virtualenv to venv: python3 -m virtualenv venv?
  • everything else worked for me!

@jsstevenson
Copy link
Contributor Author

typo, need to fix virtualenv to venv: python3 -m virtualenv venv?

👍 I habitually use virtualenv but yeah it's not in the stdlib like venv is

@jsstevenson jsstevenson merged commit 8be8b7f into main Nov 29, 2024
21 checks passed
@jsstevenson jsstevenson deleted the issue-2 branch November 29, 2024 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle mapping between chromosomes and files
2 participants