Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Settings: Set ensemble paths, record additional site parameters #3456

Open
wants to merge 17 commits into
base: develop
Choose a base branch
from

Conversation

infotroph
Copy link
Member

Description

  • New function setEnsemblePaths(settings, n_reps, path_template, ...) inserts the paths to all n of your ensemble (met, IC, etc) files into each settings$run$<site>$inputs block
  • Optionally pass a dataframe into createMultiSiteSettings to record other site-level parameters (lat, lon, elevation, sitename, etc) alongside the siteID in each settings$run$<site>$site block
  • Return XML as a string if write.settings filename is null (probably mostly useful for printing to the console during debugging, but very handy for that)
  • Corrected the Rd for write.settings: It does not in fact perform any fixes; those all happen upstream in other functions.

Key high-level questions for reviewers:

  • How well established are our norms about the multisite settings structure, and do these changes follow existing practice well enough?
  • Are we OK with the change that run$<site> blocks are now named "site.<siteid>" instead of "settings.<n>"? I see this as an improvement because it lets me extract individual sites without knowing their order in the file, but it is a potentially breaking change if anyone is writing something like site_tag <- paste0("settings.", i); site_block <- settings$run[[site_tag]].... I took a quick look through existing scripts and only saw code that iterated over the blocks by positional index rather than by using their names, but please speak up if you have code that acts differently.

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Copy link
Contributor

@DongchenZ DongchenZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current workflows of ERA5, ICs, and soilinitcond should already provide the feature of recording paths to the settings. I am worried the current implementation would reinvent the wheel and improve the difficulty of code maintenance.

@@ -73,12 +78,21 @@ createSitegroupMultiSettings <- function(
#' @example examples/examples.MultiSite.MultiSettings.r
createMultiSiteSettings <- function(templateSettings, siteIds) {
templateSettings <- as.MultiSettings(templateSettings)

if (is.data.frame(siteIds)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The siteIds as a data.frame or list is a bit confusing as we usually use the term site_info within which the id field is usually set as site_id. I am wondering if you want to put a single vector of site_ids or a list site_info that contains the site_id field as input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is confusing but thought changing the existing argument name would be too big a break in compatibility with existing scripts. The concept here is that a single vector of ids should continue working as it always has, but a list/dataframe should be accepted and incorporated if it's provided.

One alternate design I considered was to have two arguments, with siteIds only accepting a vector and site_info wanting a dataframe/list. To me that seemed like too much extra complexity, but maybe the improvement in name clarity would make up for it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might just need a vector of site ids as the only inputs to make the function simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need? Sure, folks have gotten by for a long time with the existing design that only takes site ids. It's always possible to add the other info manually. But I have a lot of workflows that start from a CSV of site-level variables that I then want to copy into the Settings object, so I would like to be able to pass all that info as one dataframe rather than assigning it in multiple passes.

One use case that's relevant here is that many of the met retrieval steps expect to work from site$lat and site$lon, which are populated from a Bety lookup if only site$id is present. When taking Bety out of the loop, the lat and lon are probably ultimately going to come from site_info.csv, so why not set them at the outset?

Basically I think this lets the Settings package work more nicely for a very common way of organizing model workflows, and my claim is that the minor (and I think well-contained) increase in complexity at settings creation time lets me and other users eliminate a lot of downstream complexity from other scripts.

@@ -73,12 +78,21 @@ createSitegroupMultiSettings <- function(
#' @example examples/examples.MultiSite.MultiSettings.r
createMultiSiteSettings <- function(templateSettings, siteIds) {
templateSettings <- as.MultiSettings(templateSettings)

if (is.data.frame(siteIds)) {
ids <- siteIds$id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use site_id instead of id when extracting from the site_info list.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, but this also needs to align with workflows that use PEcAn.DB::query.site(), which uses id

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Sounds like some discrepancies between pecan settings and SDA workflows.

}
site_info <- siteId
} else {
site_info <- list(id = siteId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same naming convention concerns here.

#' )
#' m2$run$site.a1$inputs
#' @export
setEnsemblePaths <- function(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this function, you are creating those paths just from the naming conventions, which might have the risks of a) putting paths to the settings without knowing if certain files have been generated or not and b) causing issues when future files have more complicated conventions.
My recommendation would be to 1) always check if the file exists on the server and don't record the file if it doesn't exist, 2) the current workflows for the ERA5_met_process, soilinitcond, and poolinitcond already have the feature to write paths to the settings, I still prefer to use those functions instead of creating paths in the air, and 3) the current workflows (ERA5, IC, soil texture) on pecan should grab paths either when generating new files or grabbing paths from existing files, which means you can include those workflows in this function to track and record paths of files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

putting paths to the settings without knowing if certain files have been generated

I agree it's important to know files exist on the server, but I vigorously disagree that a settings-creation function is the right place to check it:
(1) Functions in the settings package should be concerned with making settings internally consistent and should not be interwoven with logic that tries to detect the state of a particular system. Both of those are complicated tasks that can only get harder if we mix them together in the same code!
(2) There are valid reasons to generate a settings that references paths that don't exist yet: Maybe the model will be run on a different machine with different paths, maybe the settings object will itself be the input to the file-checking process, maybe it'll be the input to the file-creation process.

when future files have more complicated conventions

The point here is to make it easy to implement a naming convention, so I don't see how this affects future files. If your naming convention changes in the future, then call setEnsembleNames again to update it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I get that point. But I would still prefer using the existing functions (e.g., ‘ERA5’ and ‘soilinitcond’) to update the paths because the features are already there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants