Add tutorial / how-to about filtering file (not granule) results by name #428

Open
mfisher87 opened this issue Jan 16, 2024 · 5 comments
Labels
impact: documentation Improvements or additions to documentation

Comments

@mfisher87
Collaborator

mfisher87 commented Jan 16, 2024

Based on #409 and a recent Slack discussion, this is a common need that we should document. We should perhaps also explore convenience features to make it easier. @betolink suggested e.g.

files = earthaccess.open(results, regex=["*B01*", "*B02*"])

Let's create a new ticket for such a feature?

@mfisher87 mfisher87 added the impact: documentation Improvements or additions to documentation label Jan 16, 2024
@andypbarrett
Collaborator

Playing devil's advocate here: I'm wondering if this is a level of abstraction too far, and whether the more general use case is filtering on any component of results, not just the data_link. That would mean teaching/suggesting a filtering step along the lines of

filtered_results = [r for r in results if <add_filter_condition_here>]

where the filter condition could be a regex on data links or on some other element, or even a random selection.
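For instance, here is a minimal sketch of that filtering step, matching a regular expression against each result's data links. It assumes each result exposes a `data_links()` method that returns a list of URL strings, as earthaccess granule results do. `FakeGranule` and the URLs are hypothetical stand-ins so the snippet runs without an Earthdata login.

```python
import re


class FakeGranule:
    """Hypothetical stand-in for a search result; it only mimics data_links()."""

    def __init__(self, links):
        self._links = links

    def data_links(self):
        return list(self._links)


# Made-up URLs, for illustration only.
results = [
    FakeGranule(["https://example.com/HLS.S30.T01KBU.2023340.B01.tif"]),
    FakeGranule(["https://example.com/HLS.S30.T01KBU.2023340.B10.tif"]),
    FakeGranule(["https://example.com/HLS.S30.T02KBU.2023340.B01.tif"]),
]

# Keep a result if any of its data links matches the pattern.
pattern = re.compile(r"T01KBU.*\.B\d{2}\.tif$")
filtered_results = [
    r for r in results
    if any(pattern.search(link) for link in r.data_links())
]
```

The condition inside the comprehension is the only part that changes: it could just as well test another metadata field or make a random selection.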

Also from a reproducibility point of view, someone might want to save filtered_results as a json or some other file. Abstracting this to open hides the filtering and makes documenting the actual files used more difficult.

@mfisher87
Collaborator Author

I see your point! I don't really know how to balance "user-friendliness" / "accessibility" with "just learn/write a bit of Python (e.g. list comprehension filtering) if you want to do this". The latter response can be super valuable for learners if coupled with excellent learning materials and guidance. Or it can be off-putting.

In this case, we're talking about list comprehensions, which are a critical Python skill. Maybe our criteria for what's in scope should include "are we abstracting away a critical Python skill?" as a reason to reject a feature request.

We definitely need to be rejecting some subset of feature requests, but I think we need a conversation about how to handle those rejections kindly. I don't want to create experiences where expectations are unclear and people do work and then feel bad when it's not accepted!

@andypbarrett
Collaborator

@mfisher87 I agree 100%. We might want to explore passing a regex to search_data using a keyword argument. I think this is possible, assuming that pycmr allows it.
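As a rough sketch of what a pattern-based search could look like at the CMR level: the CMR search API documents a `pattern` option that enables `*` and `?` wildcards (not full regular expressions) on some parameters. The snippet below only builds the query URL and makes no request. The helper name `build_pattern_query` is hypothetical, and applying the option to `readable_granule_name` is an assumption to check against the CMR documentation.

```python
from urllib.parse import urlencode

CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def build_pattern_query(collection_concept_id, name_pattern):
    """Build a CMR granule search URL that uses wildcard matching on the name.

    Hypothetical helper for illustration; it is not part of earthaccess.
    """
    params = {
        "collection_concept_id": collection_concept_id,
        "readable_granule_name": name_pattern,
        # Ask CMR to treat * and ? in the name as wildcards.
        "options[readable_granule_name][pattern]": "true",
    }
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


url = build_pattern_query("C2021957295-LPCLOUD", "HLS.S30.T01KBU.*")
```

Note that this narrows down which granule records come back; it does not pick individual files inside a record.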

@jhkennedy
Collaborator

jhkennedy commented Jan 23, 2024

As I understand it, this is not about allowing regex/pattern-based search, which is already supported in CMR:
https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#parameter-options

but opening a specific subset of the assets in a granule record -- here's the STAC formatted record for the example in #409:
https://cmr.earthdata.nasa.gov/search/granules.stac?collection_concept_id=C2021957295-LPCLOUD&producer_granule_id=HLS.S30.T01KBU.2023340T220911

So, wanting to open only Band 10 for this particular record would mean opening asset data8. But what a user knows it by is "Band 10", which you can only get to by pattern matching the asset title or URL.

While it is a decently simple comprehension to get the URLs, it seems like we should support directly opening and/or downloading a specific asset, or a specific subset of assets, from results. And in that vein, I'd be in favor of filtering based on a pattern.
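A minimal sketch of what pattern-based asset selection could look like, using glob-style patterns like the ones suggested at the top of this issue. `select_links` is a hypothetical helper, not an existing earthaccess function, and the URLs are made up for illustration.

```python
from fnmatch import fnmatchcase


def select_links(links, patterns):
    """Return the links whose file name matches any of the glob patterns.

    Hypothetical helper for illustration; it is not part of earthaccess.
    """
    return [
        link for link in links
        # Match against the file name only, not the whole URL.
        if any(fnmatchcase(link.rsplit("/", 1)[-1], p) for p in patterns)
    ]


# Made-up asset URLs for a single granule record.
links = [
    "https://example.com/HLS.S30.T01KBU.2023340T220911.v2.0.B01.tif",
    "https://example.com/HLS.S30.T01KBU.2023340T220911.v2.0.B02.tif",
    "https://example.com/HLS.S30.T01KBU.2023340T220911.v2.0.B10.tif",
    "https://example.com/HLS.S30.T01KBU.2023340T220911.v2.0.Fmask.tif",
]

selected = select_links(links, ["*B01*", "*B02*"])
```

With real search results, the input list could be gathered with something like `[link for r in results for link in r.data_links()]`, assuming each result provides `data_links()`. Whether the selected URLs can be handed straight to the open/download step should be checked against the earthaccess documentation.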

@jhkennedy
Collaborator

> Also from a reproducibility point of view, someone might want to save filtered_results as a json or some other file. Abstracting this to open hides the filtering and makes documenting the actual files used more difficult.

Importantly, this is selecting file URLs out of a record, not filtering down search results, so the filtered_results would be a list of URLs.

I would expect that, if you wanted a record for reproducibility, you'd write the resulting search records themselves to disk, with all their metadata, not just the URLs (especially since DAACs do delete individual files/records, for example when a scene is reprocessed).
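A minimal sketch of writing records to disk, assuming each search record is dict-like and JSON-serializable. The records below are made-up placeholders, not real CMR metadata, and the output file name is arbitrary.

```python
import json
import tempfile
from pathlib import Path

# Placeholder records standing in for real search results (assumption:
# real records are dict-like and contain only JSON-serializable values).
records = [
    {
        "meta": {"concept-id": "G0000000000-EXAMPLE"},
        "umm": {"GranuleUR": "EXAMPLE.GRANULE.0001"},
    },
    {
        "meta": {"concept-id": "G0000000001-EXAMPLE"},
        "umm": {"GranuleUR": "EXAMPLE.GRANULE.0002"},
    },
]

# Write the full records, metadata included, not just the URLs.
out_path = Path(tempfile.mkdtemp()) / "search_records.json"
out_path.write_text(json.dumps(records, indent=2))

# Reading the file back gives the same records for later reference.
restored = json.loads(out_path.read_text())
```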

Having .open and .download methods requires you to select, in some way, which specific files/assets you want to open/download. I think earthaccess can easily support that, and make a best guess at which assets to load/download if none are specified (e.g., all of them?), without putting a list comprehension step between search and access.

3 participants