-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: store URIs for ingested VCFs #15
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, minor comments. Anything I need to test for this PR?
README.md
Outdated
@@ -8,6 +8,12 @@ From a VCF, ingest a VRS ID and the corresponding VCF-called location (i.e. suff | |||
% vrsix load chr1.vcf | |||
``` | |||
|
|||
By default, this uses the input file's location in the local file system as a URI, but a custom URI can be declared: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly add it as more declarative:
... If you want to use custom URI instead, specify the URI as a second argument
what was the reasoning behind adding this as an argument vs a flag? Ig it's trivial in terms of functionality but just curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I didn't feel super strongly about it... I think it felt a little bit more ergonomic to do it this way, but now that you mention it I'm kind of torn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's not much else needed in terms of flag, I'm good passing in as an optional argument for now.
|
||
#[tokio::test] | ||
async fn test_load_file_uri() { | ||
let temp_file = NamedTempFile::new().expect("Failed to create temp file"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help me out here, so what is this file testing? I'm unfamiliar with Rust so not sure how the db file is being created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good q. I think I just wanted to get a minimal test happening inside the Rust code, because failures that happen on the Rust side are often fairly opaque in the context of the Python integration tests (if it's not something you're explicitly looking for, it just looks like a panic without much detail). This test case literally just loads a single entry into the file_uris
table and then verifies that it works and it gets the ID you expect for it. (the expect
call on this specific line is sort of a red herring, the compiler is happy when you cover edge cases like a failure to create a tempfile when constructing an input for a test).
I'd originally just written end to end tests from the Python side to make sure things work the whole way through, but if/as this grows in complexity, it's probably smarter to focus more testing within the Rust code because it's much more granular/interpretable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha and for my understanding, how is the sqlite:/// interpreted? I was expecting to see an file exntension like .sqlite on the filename for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just there as a scheme declaration to indicate a sqlite file, I guess. I think it's an expectation imposed by the DB client library. Just checked and this test fails if you change it.
I don't know if there's really an accepted file extension for sqlite files (i just googled it and this stackoverflow answer is very unsatisfying). It's definitely not something we try to validate anywhere, although we could if it seems worthwhile.
Theoretically the tests are all running in CI so... hopefully it's all going the way it looks like it should? Maybe some outstanding considerations could be
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, feel free to close any other comments I've made.
edge cases for URIs. Should any validation be performed or can the user just put whatever?
since this is data provenance I think it's okay to have no validation for now. Log as a question/issue?
would any extra metadata need to be associated with the file URI?
I don't think so, only other thing I can think of is the index file.
Not necessarily related to this PR, but see if the development instructions work for you?
README setup comments:
- typo, need to fix virtualenv to venv:
python3 -m virtualenv venv
? - everything else worked for me!
👍 I habitually use virtualenv but yeah it's not in the stdlib like |
close #2
% vrsix load --help Usage: vrsix load [OPTIONS] VCF [URI] Index the VRS annotations in a VCF by loading it into the sqlite DB. Optionally provide a custom file URI to describe how to retrieve VCF records after index lookup: % vrsix load input.vcf gs://my_storage/input.vcf Options: --db-location PATH --help Show this message and exit.
load
command to declare a custom file URI (e.g. above). Otherwise, use the local file address as the URI.uri
can be an arbitrary string here -- not sure if it's worth trying to validate upon ingest. I'm thinking it could either be a file descriptor a lafile:///Users/james/code/my_vcf.vcf
or a Google cloud storage address likegs://fc-secure-00000000-0000-0000-0000-00000000/files/input.vcf
. My hope is that one string-based file designator is enough for a user to plug into their own workflow as needed.Oh, and this will of course require an update to any
fetch
methods in the toolkit in order to retrieve it.