-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Settings: Set ensemble paths, record additional site parameters #3456
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current workflows of ERA5
, ICs
, and soilinitcond
should already provide the feature of recording paths to the settings. I am worried the current implementation would reinvent the wheel and improve the difficulty of code maintenance.
@@ -73,12 +78,21 @@ createSitegroupMultiSettings <- function( | |||
#' @example examples/examples.MultiSite.MultiSettings.r | |||
createMultiSiteSettings <- function(templateSettings, siteIds) { | |||
templateSettings <- as.MultiSettings(templateSettings) | |||
|
|||
if (is.data.frame(siteIds)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The siteIds
as a data.frame
or list is a bit confusing as we usually use the term site_info
within which the id
field is usually set as site_id
. I am wondering if you want to put a single vector of site_ids
or a list site_info
that contains the site_id
field as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is confusing but thought changing the existing argument name would be too big a break in compatibility with existing scripts. The concept here is that a single vector of ids should continue working as it always has, but a list/dataframe should be accepted and incorporated if it's provided.
One alternate design I considered was to have two arguments, with siteIds
only accepting a vector and site_info
wanting a dataframe/list. To me that seemed like too much extra complexity, but maybe the improvement in name clarity would make up for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might just need a vector of site ids as the only inputs to make the function simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need? Sure, folks have gotten by for a long time with the existing design that only takes site ids. It's always possible to add the other info manually. But I have a lot of workflows that start from a CSV of site-level variables that I then want to copy into the Settings object, so I would like to be able to pass all that info as one dataframe rather than assigning it in multiple passes.
One use case that's relevant here is that many of the met retrieval steps expect to work from site$lat and site$lon, which are populated from a Bety lookup if only site$id is present. When taking Bety out of the loop, the lat and lon are probably ultimately going to come from site_info.csv, so why not set them at the outset?
Basically I think this lets the Settings package work more nicely for a very common way of organizing model workflows, and my claim is that the minor (and I think well-contained) increase in complexity at settings creation time lets me and other users eliminate a lot of downstream complexity from other scripts.
@@ -73,12 +78,21 @@ createSitegroupMultiSettings <- function( | |||
#' @example examples/examples.MultiSite.MultiSettings.r | |||
createMultiSiteSettings <- function(templateSettings, siteIds) { | |||
templateSettings <- as.MultiSettings(templateSettings) | |||
|
|||
if (is.data.frame(siteIds)) { | |||
ids <- siteIds$id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use site_id
instead of id
when extracting from the site_info
list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, but this also needs to align with workflows that use PEcAn.DB::query.site()
, which uses id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Sounds like some discrepancies between pecan settings and SDA workflows.
} | ||
site_info <- siteId | ||
} else { | ||
site_info <- list(id = siteId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same naming convention concerns here.
#' ) | ||
#' m2$run$site.a1$inputs | ||
#' @export | ||
setEnsemblePaths <- function( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this function, you are creating those paths just from the naming conventions, which might have the risks of a) putting paths to the settings without knowing if certain files have been generated or not and b) causing issues when future files have more complicated conventions.
My recommendation would be to 1) always check if the file exists on the server and don't record the file if it doesn't exist, 2) the current workflows for the ERA5_met_process
, soilinitcond
, and poolinitcond
already have the feature to write paths to the settings, I still prefer to use those functions instead of creating paths in the air, and 3) the current workflows (ERA5, IC, soil texture) on pecan should grab paths either when generating new files or grabbing paths from existing files, which means you can include those workflows in this function to track and record paths of files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
putting paths to the settings without knowing if certain files have been generated
I agree it's important to know files exist on the server, but I vigorously disagree that a settings-creation function is the right place to check it:
(1) Functions in the settings package should be concerned with making settings internally consistent and should not be interwoven with logic that tries to detect the state of a particular system. Both of those are complicated tasks that can only get harder if we mix them together in the same code!
(2) There are valid reasons to generate a settings that references paths that don't exist yet: Maybe the model will be run on a different machine with different paths, maybe the settings object will itself be the input to the file-checking process, maybe it'll be the input to the file-creation process.
when future files have more complicated conventions
The point here is to make it easy to implement a naming convention, so I don't see how this affects future files. If your naming convention changes in the future, then call setEnsembleNames again to update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I get that point. But I would still prefer using the existing functions (e.g., ‘ERA5’ and ‘soilinitcond’) to update the paths because the features are already there.
Description
setEnsemblePaths(settings, n_reps, path_template, ...)
inserts the paths to all n of your ensemble (met, IC, etc) files into eachsettings$run$<site>$inputs
blockcreateMultiSiteSettings
to record other site-level parameters (lat, lon, elevation, sitename, etc) alongside the siteID in eachsettings$run$<site>$site
blockwrite.settings
filename is null (probably mostly useful for printing to the console during debugging, but very handy for that)write.settings
: It does not in fact perform any fixes; those all happen upstream in other functions.Key high-level questions for reviewers:
run$<site>
blocks are now named "site.<siteid>
" instead of "settings.<n>
"? I see this as an improvement because it lets me extract individual sites without knowing their order in the file, but it is a potentially breaking change if anyone is writing something likesite_tag <- paste0("settings.", i); site_block <- settings$run[[site_tag]]...
. I took a quick look through existing scripts and only saw code that iterated over the blocks by positional index rather than by using their names, but please speak up if you have code that acts differently.Motivation and Context
Review Time Estimate
Types of changes
Checklist: