Enable CDAT-migrated E3SM Diags #651

Merged: 2 commits, Jan 28, 2025

Conversation

@forsyth2 (Collaborator) commented Dec 10, 2024

Issue resolution

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

1. Does this do what we want it to do?

Objectives:

  • Enable zppy to call the CDAT-migrated E3SM Diags

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added or modified at least one "min-case" configuration file to test this change. Every objective above is represented in at least one cfg.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request introduces an important feature or bug fix that we must test often. I have updated the weekly-test configuration files, not just a "min-case" one.
  • Testing: this pull request adds at least one new possible parameter to the cfg. I have tested using this parameter with and without any other parameter that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zppy/conda, not just an import statement.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@forsyth2 (Collaborator, Author) commented Dec 10, 2024

To do:

  • (currently running) test just e3sm diags on all available sets: zppy -c tests/integration/generated/test_min_case_e3sm_diags_comprehensive_v3_chrysalis.cfg
  • Run the full weekly test suite
  • Merge this PR

Comment on lines 12 to 16
output = "#expand user_output#zppy_weekly_comprehensive_v3_output/#expand unique_id#/#expand case_name#"
partition = "#expand partition_short#"
qos = "#expand qos_short#"
www = "#expand user_www#zppy_weekly_comprehensive_v3_www/#expand unique_id#"
years = "1985:1989:2",

output and www need updated names!

@forsyth2

Currently running full weekly test suite (doing this pre-merge because the CDAT migration was a large change, so there's a higher chance of breaking main).

$ python tests/integration/utils.py 
CFG FILES HAVE BEEN GENERATED FROM TEMPLATES WITH THESE SETTINGS:
UNIQUE_ID=test-346-20241210
unified_testing=False
diags_environment_commands=source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate e3sm_diags_main_20241209
global_time_series_environment_commands=source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_dev_weekly_20241122
e3sm_to_cmip_environment_commands=
environment_commands=
Reminder: `e3sm_to_cmip_environment_commands=''` => the environment of the `ts` task will be used
zppy -c tests/integration/generated/test_weekly_comprehensive_v3_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_comprehensive_v2_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_bundles_chrysalis.cfg # Runs 1st part of bundles cfg

@forsyth2

Continuing testing from the previous comment gives:

zppy -c tests/integration/generated/test_weekly_bundles_chrysalis.cfg # Runs 2nd part of bundles cfg

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test-346-20241210/v3.LR.historical_0051/post/scripts/
grep -v "OK" *status

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test-346-20241210/v2.LR.historical_0201/post/scripts
grep -v "OK" *status

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_bundles_output/test-346-20241210/v3.LR.historical_0051/post/scripts
grep -v "OK" *status

# Run integration tests
cd ~/ez/zppy
pytest tests/integration/test_*.py

That gives:

======================================================================= short test summary info ========================================================================
FAILED tests/integration/test_bash_generation.py::test_bash_generation - assert 256 == 0
FAILED tests/integration/test_weekly.py::test_comprehensive_v2_images - AssertionError
FAILED tests/integration/test_weekly.py::test_comprehensive_v3_images - AssertionError
FAILED tests/integration/test_weekly.py::test_bundles_images - AssertionError
============================================================== 4 failed, 10 passed in 1505.03s (0:25:05) ===============================================================

Bash generation test

The FAILED tests/integration/test_bash_generation.py::test_bash_generation - assert 256 == 0 line is benign: the diffs arise because this PR updates the e3sm_diags.bash template, so they are expected.

Image check failures

There are a large number of image-check diffs in subdirectories of:

But as far as I can tell, the errors are mostly benign. Example errors:

  • Many involve actual having "RMSE" and "CORR" further to the left than in expected.
  • Slight differences in text shape/size -- e.g., actual and expected
  • Slight differences in contours like actual and expected
  • Differences in lat/lon labels -- actual and expected

Some errors seem more concerning:

Conclusions

@chengzhuzhang @tomvothecoder What's the best path forward here? None of the grep -v "OK" *status checks showed errors. That said, there are simply too many diffs in the image check tests to manually check them all. Most of the errors seem benign and likely arise from variations in matplotlib, but I found at least one instance of a more concerning change (noted above).

Do we accept these diffs, merge this PR, and update the expected images? Or do these need to be investigated more thoroughly?

@forsyth2 marked this pull request as ready for review December 10, 2024 19:58
@tomvothecoder (Collaborator)

@chengzhuzhang @tomvothecoder What's the best path forward here? None of the grep -v "OK" *status checks showed errors. That said, there are simply too many diffs in the image check tests to manually check them all. Most of the errors seem benign and likely arise from variations in matplotlib, but I found at least one instance of a more concerning change (noted above).

Do we accept these diffs, merge this PR, and update the expected images? Or do these need to be investigated more thoroughly?

I think we should expect that most of the differences will be benign. The underlying results from the dev branch won't exactly match the CDAT codebase, which can produce diffs.

Completely different stats and colors: actual and expected

However, any formatting differences (e.g., labels for axes) or major differences like the one you mentioned above should be investigated. I will open a separate GitHub issue in e3sm_diags to investigate the diffs and address the ones that should be fixed.

Afterwards, we can update the expected results and you can re-run the zppy test again.

@forsyth2

Thanks @tomvothecoder!

@forsyth2 commented Dec 23, 2024

zppy & post-CDAT-migration e3sm_diags to-do list, for the 1/15 Unified release-candidate deadline:

@tomvothecoder @chengzhuzhang ^FYI for timeline planning

@tomvothecoder

@forsyth2 I updated your checklist above. I just merged E3SM-Project/e3sm_diags#907 and you should be able to re-test now. Hopefully this fixes the diffs in zppy.

@forsyth2

@tomvothecoder Thanks! E3SM-Project/e3sm_diags#907 has extensive changes, so I will integrate that into zppy with this PR in the next release candidate.

@forsyth2 force-pushed the issue-346-diags-post-refactor branch 3 times, most recently from 84a2b59 to c2188bf, January 15, 2025 23:37
@forsyth2

Since the zppy release was delayed anyway yesterday (from the issues described on #634 (reply in thread)), I tried testing this pull request using the latest E3SM Diags (E3SM-Project/e3sm_diags#907); however, I ran into a number of issues.

# Set up branch
cd ~/ez/zppy
git checkout issue-346-diags-post-refactor

# Set up and test zppy-interfaces env
# zi_dev_weekly_20250114 still exists from weekly testing

(That environment was created in #634 (comment))

# Set up e3sm_diags env
cd ~/ez/e3sm_diags
git status
# Check no file changes will persist when we switch branches.
git fetch upstream
git checkout main
git reset --hard upstream/main
git log
# Confirmed last commit, from 1/15: Address diffs v2.12.1 to v3 (#907)
conda clean --all --y
conda env create -f conda-env/dev.yml -n e3sm_diags_20250115
conda activate e3sm_diags_20250115
pip install .

# Set up zppy env
cd ~/ez/zppy
conda clean --all --y
conda env create -f conda/dev.yml -n zppy_dev_pr651_20250115
conda activate zppy_dev_pr651_20250115
pip install .

# Unit tests for zppy
pytest tests/test_*.py
# 23 passed, 1 warning in 0.35s
# (DeprecationWarning: invalid escape sequence \.)

# Integration testing for zppy

# Edit tests/integration/utils.py:
# UNIQUE_ID = "test_zppy_pr651_20250115"
#        "diags_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate e3sm_diags_20250115",
#        "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_dev_weekly_20250114",
# Keep this as-is: generate_cfgs(unified_testing=False, dry_run=False)

# Run zppy to produce actual results to compare in the integration tests
python tests/integration/utils.py
# That prints:
# CFG FILES HAVE BEEN GENERATED FROM TEMPLATES WITH THESE SETTINGS:
# UNIQUE_ID=test_zppy_pr651_20250115
# unified_testing=False
# diags_environment_commands=source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate e3sm_diags_20250115
# global_time_series_environment_commands=source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_dev_weekly_20250114
# environment_commands=
# Reminder: `environment_commands=''` => the latest E3SM Unified environment will be used

zppy -c tests/integration/generated/test_weekly_comprehensive_v3_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_comprehensive_v2_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_bundles_chrysalis.cfg # Runs 1st part of bundles cfg
# Wait to finish

# Here as of 2025-01-15 end of day
# Picking up 2025-01-16 
cd ~/ez/zppy
lcrc_conda
conda activate zppy_dev_pr651_20250115
pip install .
sq # alias for: squeue -o "%8u %.7a %.4D %.9P %7i %.2t %.10r %.10M %.10l %j" --sort=P,-t,-p -u ac.forsyth2
# The MPAS-Analysis errors:
# USER     ACCOUNT NODE PARTITION JOBID   ST     REASON       TIME TIME_LIMIT NAME
# ac.forsy    e3sm    1   compute 666747  PD Dependency       0:00      30:00 mpas_analysis_ts_1980-1990_climo_1985-1990
# ac.forsy    e3sm    1     debug 666748  PD Dependency       0:00      30:00 global_time_series_1980-1990
# ac.forsy    e3sm    1   compute 666819  PD Dependency       0:00      30:00 mpas_analysis_ts_1980-1990_climo_1985-1990
# ac.forsy    e3sm    1     debug 666820  PD Dependency       0:00      30:00 global_time_series_1980-1990

(These actually appear to be from earlier runs. The MPAS-Analysis errors are noted starting with #634 (reply in thread)).

zppy -c tests/integration/generated/test_weekly_bundles_chrysalis.cfg # Runs 2nd part of bundles cfg
sq
# USER     ACCOUNT NODE PARTITION JOBID   ST     REASON       TIME TIME_LIMIT NAME
# ac.forsy    e3sm    1   compute 666747  PD Dependency       0:00      30:00 mpas_analysis_ts_1980-1990_climo_1985-1990
# ac.forsy    e3sm    1     debug 666748  PD Dependency       0:00      30:00 global_time_series_1980-1990
# ac.forsy    e3sm    1   compute 666819  PD Dependency       0:00      30:00 mpas_analysis_ts_1980-1990_climo_1985-1990
# ac.forsy    e3sm    1     debug 666820  PD Dependency       0:00      30:00 global_time_series_1980-1990
# ac.forsy    e3sm    1   compute 667322  PD   Priority       0:00    7:00:00 ilamb_1985-1986
# ac.forsy    e3sm    1   compute 667323  PD   Priority       0:00    7:00:00 bundle3
# ac.forsy    e3sm    1   compute 667324  PD   Priority       0:00    7:00:00 bundle1

# Notice, bundle1 is rerunning... probably because of failed MPAS-Analysis?

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test_zppy_pr651_20250115/v3.LR.historical_0051/post/scripts/
grep -v "OK" *status
# e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.status:ERROR (9)
# e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.status:RUNNING 666785
# e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1987-1988_vs_1985-1986.status:ERROR (9)
emacs e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.o666784 
# slurmstepd: error: *** STEP 666784.0 ON chr-0221 CANCELLED AT 2025-01-15T18:24:58 ***
# Only ran for 44s
emacs e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.o666785 
# slurmstepd: error: *** JOB 666785 ON chr-0222 CANCELLED AT 2025-01-15T23:24:35 DUE TO TIME LIMIT ***
emacs e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1987-1988_vs_1985-1986.o666786 
# slurmstepd: error: *** STEP 666786.0 ON chr-0140 CANCELLED AT 2025-01-15T18:24:47 ***
# Only ran for 36s
emacs mpas_analysis_ts_1985-1989_climo_1985-1989.o666787
# All good
emacs mpas_analysis_ts_1985-1995_climo_1990-1995.o666788
# All good
emacs global_time_series_1985-1995.o666789 
# cp: cannot stat 'global_time_series_1985-1995_results/viewer': No such file or directory

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_zppy_pr651_20250115/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1982-1983.status:RUNNING 666815
# e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1980-1981_vs_1980-1981.status:RUNNING 666816
# e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1982-1983_vs_1980-1981.status:RUNNING 666817
# global_time_series_1980-1990.status:WAITING 666820
# mpas_analysis_ts_1980-1984_climo_1980-1984.status:ERROR (1)
# mpas_analysis_ts_1980-1990_climo_1985-1990.status:WAITING 666819
emacs e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1982-1983.o666815 
# slurmstepd: error: *** JOB 666815 ON chr-0140 CANCELLED AT 2025-01-15T23:35:35 DUE TO TIME LIMIT ***
emacs e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1980-1981_vs_1980-1981.o666816
# slurmstepd: error: *** JOB 666816 ON chr-0221 CANCELLED AT 2025-01-15T23:26:35 DUE TO TIME LIMIT ***
emacs e3sm_diags_lnd_monthly_mvm_lnd_model_vs_model_1982-1983_vs_1980-1981.o666817 
# slurmstepd: error: *** JOB 666817 ON chr-0224 CANCELLED AT 2025-01-15T23:24:35 DUE TO TIME LIMIT ***
emacs mpas_analysis_ts_1980-1984_climo_1980-1984.o666818
# KeyError: 'config_am_conservationcheck_enable'
# ...
# There were errors in task climatologyMapArgoSalinity: plotJFM_latlon_depth_-600

# No errors
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_bundles_output/test_zppy_pr651_20250115/v3.LR.historical_0051/post/scripts
grep -v "OK" *status
# bundle1.status:ERROR
# bundle3.status:ERROR
# e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1985-1986.status:ERROR (9)
# e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.status:ERROR (9)
# e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.status:ERROR (9)
emacs bundle1.o667324 # 2nd run
# === e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1985-1986.bash ===
# ...
# OSError: No file found for '' and 'MAM' in /lcrc/group/e3sm/diagnostics/observations/Atm/climatology/
# ...
# slurmstepd: error: *** STEP 667324.1 ON chr-0074 CANCELLED AT 2025-01-16T11:17:29 ***
emacs bundle3.o667323
# === e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.bash ===
# ...
# OSError: The dataset file has no matching source variables for AODMOM
# ...
# slurmstepd: error: *** STEP 667323.0 ON chr-0224 CANCELLED AT 2025-01-16T11:17:07 ***

@chengzhuzhang (Collaborator) commented Jan 16, 2025

@forsyth2 thank you for continuing to test.
I looked at the zppy error logs.

For e3sm_diags:

  1. There is slurmstepd: error: mpi/pmi2: failed to read PMI1 init command in two tasks:
    atm_monthly_180x360_aave_model_vs_obs,
    atm_monthly_180x360_aave_mvm_model_vs_model
    This one is mysterious to me.

  2. However, the unit conversion error in lnd_monthly_mvm_lnd_model_vs_model_1987-1988 is likely related to the recent refactor that replaced genutil.udunits. I will file an issue in e3sm_diags for a fix.
    This run also failed with a slurm error within one minute:

srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 666786.0 ON chr-0140 CANCELLED AT 2025-01-15T18:24:47 ***

@forsyth2

Testing with CDAT-migrated E3SM Diags for zppy v3.0.0rc2.

Testing summary

v2 test:

| Commits | num. mismatched images | num. missing images | num. passing images | num. total images |
| --- | --- | --- | --- | --- |
| Both commits -- no more copying files* | 2322 | 81 | 1973 | 4376 |
| Just the first commit | 2299 | 117 | 1960 | 4376 |

v3 test:

| Commits | num. mismatched images | num. missing images | num. passing images | num. total images |
| --- | --- | --- | --- | --- |
| Both commits -- no more copying files* | 2868 | 102 | 2246 | 5216 |
| Just the first commit | 2845 | 138 | 2233 | 5216 |

*Address the issue of file paths (latest comment on that: E3SM-Project/e3sm_diags#866 (comment))

Note: the missing images don't show up as image diffs the way the mismatched images do. A missing image means we expected that image (it's in the expected results), but there was no actual image to diff against.

There are quite a few errors of both types (missing, mismatched) to consider...
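For context, the missing-image check boils down to walking the expected-results tree and flagging any image with no counterpart in the actual results. A minimal sketch of that logic (a hypothetical helper for illustration, not the actual zppy test code):

```python
import os

def find_missing_images(expected_dir, actual_dir):
    """List images present under expected_dir but absent under actual_dir.

    Hypothetical helper illustrating the missing-image check; the real
    zppy integration tests may differ.
    """
    missing = []
    for root, _dirs, files in os.walk(expected_dir):
        for name in files:
            if not name.endswith(".png"):
                continue  # only compare image files
            rel = os.path.relpath(os.path.join(root, name), expected_dir)
            if not os.path.exists(os.path.join(actual_dir, rel)):
                missing.append(rel)
    return sorted(missing)
```

A missing image then appears in this list (and in the test output) without ever producing an actual/diff/expected triple on disk, which is consistent with it not showing up as an image diff.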

Testing details

Testing steps
# Set up and test zppy-interfaces env
cd ~/ez/zppy-interfaces
git status
# Check no file changes will persist when we switch branches.
git fetch upstream main
git checkout -b test_zi_weekly_20250117 upstream/main
git log
# Last commit, from 1/17: Further PR template updates (#15)
conda clean --all --y
conda env create -f conda/dev.yml -n zi_dev_weekly_20250117
conda activate zi_dev_weekly_20250117
pip install .
pytest tests/unit/global_time_series/test_*.py
# 10 passed in 24.69s

# Set up e3sm_diags env
cd ~/ez/e3sm_diags
git status
# Check no file changes will persist when we switch branches.
git fetch upstream
git checkout main
git reset --hard upstream/main
git log
# Confirmed last commit, from 1/17: Revert "Drop Python 3.9 support (#919)"
conda clean --all --y
conda env create -f conda-env/dev.yml -n e3sm_diags_20250117
conda activate e3sm_diags_20250117
pip install .

With both commits:

# Set up branches
cd ~/ez/zppy
git status
# Check no file changes will persist when we switch branches.
git checkout issue-346-diags-post-refactor
git log
# b90506f7d18b77d41a4a74b5beff413bb6f1fe2b => 1/14 Update PR template (#658)
git fetch upstream
git rebase -i
# Drop the testing commit
# 2 commits:
# Remove symbolic links for ts
# Enable CDAT-migrated E3SM Diags
# Now rebased on 1/17: Further PR template updates (#662)

# Set up zppy env
conda clean --all --y
conda env create -f conda/dev.yml -n zppy_dev_pr651_both_commits_20250117
conda activate zppy_dev_pr651_both_commits_20250117
pip install .

# Unit tests for zppy
pytest tests/test_*.py
# 23 passed in 0.29s

# Integration testing for zppy

# Edit tests/integration/utils.py:
# UNIQUE_ID = "test_pr651_both_commits_20250117"
#        "diags_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate e3sm_diags_20250117",
#        "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_dev_weekly_20250117",
# Keep this as-is: generate_cfgs(unified_testing=False, dry_run=False)

# Run zppy to produce actual results to compare in the integration tests
python tests/integration/utils.py

zppy -c tests/integration/generated/test_weekly_comprehensive_v3_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_comprehensive_v2_chrysalis.cfg

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test_pr651_both_commits_20250117/v3.LR.historical_0051/post/scripts/
grep -v "OK" *status
# No errors
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_pr651_both_commits_20250117/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# No errors

cd ~/ez/zppy
git checkout issue-346-diags-post-refactor
git branch # Check that we're on issue-346-diags-post-refactor
ls tests/integration/test_*.py
pytest tests/integration/test_bash_generation.py
# 1 failed in 2.15s
# Diffs are expected
pytest tests/integration/test_campaign.py
# 6 passed in 1.93s
pytest tests/integration/test_defaults.py
# 1 passed in 0.32s
pytest tests/integration/test_last_year.py
# 1 passed in 0.26s
pytest tests/integration/test_weekly.py # Skip bundles tests for now
# FAILED tests/integration/test_weekly.py::test_comprehensive_v2_images - AssertionError
# FAILED tests/integration/test_weekly.py::test_comprehensive_v3_images - AssertionError
# 2 failed, 3 skipped in 1351.74s (0:22:31)

# For v2:
# Total number of images checked: 4376
# Missing images: 81
# For v3:
# Total number of images checked: 5216
# Missing images: 102

With first commit only:

# Set up branches
cd ~/ez/zppy
git status
# Check no file changes will persist when we switch branches.
git checkout issue-346-diags-post-refactor
git checkout -b issue-346-diags-post-refactor-1st-commit
git fetch upstream
git rebase -i upstream/main
# Keep only the 1st commit:
# Enable CDAT-migrated E3SM Diags
# Now rebased on 1/17: Further PR template updates (#662)

# Set up zppy env
conda clean --all --y
conda env create -f conda/dev.yml -n zppy_dev_pr651_1st_commit_20250117
conda activate zppy_dev_pr651_1st_commit_20250117
pip install .

# Unit tests for zppy
pytest tests/test_*.py
# 23 passed in 0.23s

# Integration testing for zppy

# Edit tests/integration/utils.py:
# UNIQUE_ID = "test_pr651_1st_commit_20250117"
#        "diags_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate e3sm_diags_20250117",
#        "global_time_series_environment_commands": "source /gpfs/fs1/home/ac.forsyth2/miniforge3/etc/profile.d/conda.sh; conda activate zi_dev_weekly_20250117",
# Keep this as-is: generate_cfgs(unified_testing=False, dry_run=False)

# Run zppy to produce actual results to compare in the integration tests
python tests/integration/utils.py

zppy -c tests/integration/generated/test_weekly_comprehensive_v3_chrysalis.cfg
zppy -c tests/integration/generated/test_weekly_comprehensive_v2_chrysalis.cfg

cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test_pr651_1st_commit_20250117/v3.LR.historical_0051/post/scripts/
grep -v "OK" *status
# No errors
cd /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_pr651_1st_commit_20250117/v2.LR.historical_0201/post/scripts
grep -v "OK" *status
# No errors

cd ~/ez/zppy
git branch # Check that we're on issue-346-diags-post-refactor-1st-commit
ls tests/integration/test_*.py
pytest tests/integration/test_bash_generation.py
# 1 failed in 2.49s
# Diffs are expected
pytest tests/integration/test_campaign.py
# 6 passed in 1.78s
pytest tests/integration/test_defaults.py
# 1 passed in 0.33s
pytest tests/integration/test_last_year.py
# 1 passed in 0.25s
pytest tests/integration/test_weekly.py # Skip bundles tests for now
# FAILED tests/integration/test_weekly.py::test_comprehensive_v2_images - AssertionError
# FAILED tests/integration/test_weekly.py::test_comprehensive_v3_images - AssertionError
# 2 failed, 3 skipped in 1339.78s (0:22:19)

# For v2:
# Total number of images checked: 4376
# Missing images: 117
# For v3:
# Total number of images checked: 5216
# Missing images: 138

@chengzhuzhang

@forsyth2 nice to see all task statuses show OK. Could you provide instructions on how to review the diffs?

@forsyth2

See /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/pr651_results_20250117

Diffs can be found at:

The image_check directories have multiple subdirectories. You'll need to navigate through a few layers to get to Actual, Diff, and Expected images.

I've also copied outputs of running the tests into the following files:

  • 1_commit_v2.txt
  • 1_commit_v3.txt
  • 2_commits_v2.txt
  • 2_commits_v3.txt

grep for Total number of images checked, Missing images, Mismatched images to get the counts shown in the table on #651 (comment).

@forsyth2

@chengzhuzhang @tomvothecoder I'm just checking in to see if either of you were able to look into the diffs any further.

I know we discussed the challenges of the image-diff checks. There are a number of ways to implement a more nuanced image-diff comparison, which I could try, but we would still have the problem that many diffs are qualitatively the same thing. For instance, looking at some of the diffs, I see axis-label differences like "-60 => 60 S". It would be fantastic if the code could automatically recognize these patterns and group them together -- e.g., "here are all the diffs that are just axis-label differences" or "here are all the diffs where the metrics differ". That might be infeasible, though.
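One rough direction for the grouping idea above: bucket each actual/expected pair by how much of the image differs before any manual review. A sketch under simplifying assumptions (pixels as plain Python sequences rather than decoded PNGs; the 2% threshold is arbitrary; none of these helpers exist in zppy):

```python
def classify_diff(actual, expected, minor_threshold=0.02):
    """Bucket an image pair by the fraction of differing pixels.

    actual/expected are equal-length sequences of pixel values
    (e.g. flattened grayscale values or RGB tuples).
    """
    if len(actual) != len(expected):
        return "shape-mismatch"
    if not actual:
        return "identical"
    differing = sum(1 for a, e in zip(actual, expected) if a != e)
    frac = differing / len(actual)
    if frac == 0:
        return "identical"
    return "minor" if frac < minor_threshold else "major"

def group_diffs(pairs):
    """Group (name, actual, expected) triples by diff category."""
    groups = {}
    for name, actual, expected in pairs:
        groups.setdefault(classify_diff(actual, expected), []).append(name)
    return groups
```

Recognizing semantic categories like "axis-label only" would need region-aware comparison (e.g. diffing only the plot margins), which is where this gets harder.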

@chengzhuzhang commented Jan 27, 2025

@forsyth2 thank you for the paths. Could you also provide the e3sm_diags run directories of these runs? The metrics overview tables are not included in the image checker.

Also, does the image checker output the filenames of missing images?

@forsyth2

Could you also provide the e3sm_diags run directories of these runs?

@chengzhuzhang For www paths, just go up one level on each, and then drill down accordingly:

For the output paths:

  • 1st commit only, v2: /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_pr651_1st_commit_20250117/v2.LR.historical_0201/post
  • 1st commit only, v3: /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test_pr651_1st_commit_20250117/v3.LR.historical_0051
  • Both commits, v2: /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_pr651_both_commits_20250117/v2.LR.historical_0201/post
  • Both commits, v3: /lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v3_output/test_pr651_both_commits_20250117/v3.LR.historical_0051

does the image checker output the filenames of missing images?

It prints them as part of the test output, but they don't show up in the image_check_failures directories. That is, in #651 (comment), they appear in the output under:

I've also copied outputs of running the tests into the following files

and not at the output under:

Diffs can be found at:

@chengzhuzhang

@forsyth2 thanks, but I'm still having trouble locating these files:

  • 1_commit_v2.txt
  • 1_commit_v3.txt
  • 2_commits_v2.txt
  • 2_commits_v3.txt

Could you provide a complete path for one of these?

@forsyth2

See /lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/pr651_results_20250117

Sorry, I meant that all these notes are in that directory.

@chengzhuzhang

@forsyth2 thank you. And could you also give the links or paths for the baseline runs (which provides the expected results)?

@forsyth2

paths for the baseline runs (which provides the expected results)?

It's the expected_dir in tests/integration/utils.py:

/lcrc/group/e3sm/public_html/zppy_test_resources
/global/cfs/cdirs/e3sm/www/zppy_test_resources/
/compyfs/www/zppy_test_resources/
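For illustration, the machine-dependent selection could look something like the sketch below (a hypothetical helper; the actual tests/integration/utils.py logic may differ, and the machine-to-path associations here are assumptions):

```python
# Hypothetical mapping from machine name to the baseline (expected) results.
# The three paths come from tests/integration/utils.py; which machine each
# belongs to is an assumption for this sketch.
EXPECTED_DIRS = {
    "chrysalis": "/lcrc/group/e3sm/public_html/zppy_test_resources",
    "perlmutter": "/global/cfs/cdirs/e3sm/www/zppy_test_resources/",
    "compy": "/compyfs/www/zppy_test_resources/",
}

def get_expected_dir(machine):
    """Return the baseline-results directory for a given machine."""
    try:
        return EXPECTED_DIRS[machine]
    except KeyError:
        raise ValueError(f"No expected_dir configured for machine {machine!r}")
```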

@forsyth2 left a comment

@chengzhuzhang We discussed just merging this PR, because it appears the integration test diffs are most likely coming from 1) e3sm_diags itself or 2) the testing code (e.g., a path might need to be updated).

I can do that, but can I first get a visual inspection from you on the combined commits? (https://github.com/E3SM-Project/zppy/pull/651/files, changes explained in comments as part of this self-review). Thanks!

@@ -67,64 +67,6 @@ create_links_climo_diurnal()
cd ..
}

create_links_ts()

  1. We completely remove the create_links functions. No more using cdscan to compile a list of files into an XML file.

# Create xml files for time series variables
ts_dir_source={{ output }}/post/atm/{{ grid }}/ts/monthly/{{ '%dyr' % (ts_num_years) }}
create_links_ts ${ts_dir_source} ${ts_dir_primary} ${Y1} ${Y2} 5
ts_dir_primary={{ output }}/post/atm/{{ grid }}/ts/monthly/{{ '%dyr' % (ts_num_years) }}

  1. Rather than using the create_links function to populate ts_dir_primary with an XML file listing files from ts_dir_source, we just set ts_dir_primary to point to the files directly. No more ts_dir_source at all. No more file copying or compiling files into a list. Just pointing zppy directly to the source directory.

ts_dir_source={{ reference_data_path_ts }}/{{ ts_num_years_ref }}yr
ts_dir_ref=ts_ref
create_links_ts ${ts_dir_source} ${ts_dir_ref} ${ref_Y1} ${ref_Y2} 6
ts_dir_ref={{ reference_data_path_ts }}/{{ ts_num_years_ref }}yr

  1. Same as above, but for model-vs-model's ref

{%- endif %}
ts_rof_dir_source="{{ output }}/post/rof/native/ts/monthly/{{ ts_num_years }}yr"
create_links_ts_rof ${ts_rof_dir_source} ${ts_rof_dir_primary} ${Y1} ${Y2} 7
ts_rof_dir_primary="{{ output }}/post/rof/native/ts/monthly/{{ ts_num_years }}yr"

  1. Same as above, but for streamflow

-ts_rof_dir_source={{ reference_data_path_ts_rof }}/{{ ts_num_years_ref }}yr
-ts_rof_dir_ref=ts_rof_ref
-create_links_ts_rof ${ts_rof_dir_source} ${ts_rof_dir_ref} ${ref_Y1} ${ref_Y2} 8
+ts_rof_dir_ref={{ reference_data_path_ts_rof }}/{{ ts_num_years_ref }}yr
  1. Same as above, but for streamflow's model-vs-model ref

@chengzhuzhang self-requested a review January 28, 2025 00:57
@chengzhuzhang left a comment

@forsyth2 I can confirm that with the second commit, the 36 missing figures based on monthly time-series data are now created. The code change also looks okay to me.

The mismatched figures are mostly due to small cosmetic changes from matplotlib updates, though a few cases are non-cosmetic and not directly related to zppy's code; I'm investigating whether these are known behaviors documented during the e3sm-diags refactor. To not delay e3sm-unified testing, I suggest merging this PR for now.

@forsyth2 force-pushed the issue-346-diags-post-refactor branch from 2af9bd7 to dadc602 on January 28, 2025 01:39
@forsyth2

Rebased and renamed the second commit for clarity/accuracy. Using a merge commit since there are two important commits to include here.

@forsyth2 merged commit 01d4f7b into main on Jan 28, 2025
4 checks passed
@forsyth2 deleted the issue-346-diags-post-refactor branch January 28, 2025 01:45

tomvothecoder commented Jan 28, 2025

> @forsyth2 I can confirm that with the second commit, the 36 missing figures based on monthly time-series data are now created. The code change also looks okay to me.
>
> The mismatched figures are mostly due to small cosmetic changes from matplotlib updates, though a few cases are non-cosmetic and not directly related to zppy's code; I'm investigating whether these are known behaviors documented during the e3sm-diags refactor. To not delay e3sm-unified testing, I suggest merging this PR for now.

@chengzhuzhang There are some cases where the CDAT-based code had bugs, resulting in mismatches. Can you document the specific variables/files and I'll confirm which ones? Here are some clues:

    if "MISRCOSP-CLDLOW_TAU1.3_9.4_MISR" not in f
    and "MISRCOSP-CLDLOW_TAU1.3_MISR" not in f
    and "MISRCOSP-CLDLOW_TAU9.4_MISR" not in f
    and "MISRCOSP-CLDTOT_TAU1.3_9.4_MISR" not in f
    and "MISRCOSP-CLDTOT_TAU1.3_MISR" not in f
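A self-contained version of that filter might look like the following sketch (hypothetical helper name `keep_file`; the excluded substrings are taken from the clue above, and the sample filenames are illustrative):

```python
# Hypothetical, runnable version of the filename filter hinted at above:
# exclude the MISR cloud-fraction variables whose CDAT-era results had
# known bugs, so they are skipped during image comparison.
EXCLUDED_SUBSTRINGS = [
    "MISRCOSP-CLDLOW_TAU1.3_9.4_MISR",
    "MISRCOSP-CLDLOW_TAU1.3_MISR",
    "MISRCOSP-CLDLOW_TAU9.4_MISR",
    "MISRCOSP-CLDTOT_TAU1.3_9.4_MISR",
    "MISRCOSP-CLDTOT_TAU1.3_MISR",
]

def keep_file(f: str) -> bool:
    """Return True if the file should be included in the comparison."""
    return not any(substring in f for substring in EXCLUDED_SUBSTRINGS)

files = [
    "MISRCOSP-CLDLOW_TAU1.3_MISR-ANN-global.png",
    "MERRA2-PSL-ANN-global.png",
]
print([f for f in files if keep_file(f)])  # → ['MERRA2-PSL-ANN-global.png']
```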

@chengzhuzhang

@tomvothecoder thank you! I was scratching my head trying to understand the mismatches. The list of documented issues helps a lot. I will try to document all the diffs, to see if there are additional items that need to be documented and addressed.

@tomvothecoder

Thank you for this PR @forsyth2!


chengzhuzhang commented Jan 31, 2025

It is a headache to review 2k+ mismatched images and capture the meaningful signal. Most of the differences were caused by:

  1. The location of the printed RMSE and CORR values differing between versions.
  2. An inconsistency of units between the upper and middle figures in the new e3sm_diags; for example, the old version uses W m-2 (if the variable is derived) instead of the W/m2 used in the model data. This is expected.
  3. ChatGPT helped a bit to filter out problems from 1 and 2, and I manually checked the rest of the lat-lon diffs. The images with actual differences are listed as follows:
ERA5-TREFHT-ANN-land.png
ERA5_ext-QREFHT-ANN-global.png   
ERA5_ext-U10-ANN-global.png      
GPCP_v3.2-PRECT-ANN-global.png   
HadISST_CL-SST-ANN-global.png    
HadISST_PD-SST-ANN-global.png    
HadISST_PI-SST-ANN-global.png    
MACv2-AODVIS-ANN-global.png      
MERRA2-OMEGA-850-ANN-global.png      
MERRA2-PSL-ANN-global.png       
MERRA2-T-850-ANN-global.png     
MERRA2-TAUXY-ANN-ocean.png      
MERRA2-TREFHT-ANN-land.png      
MERRA2-TREFMNAV-ANN-global.png  
MERRA2-TREFMXAV-ANN-global.png  
MISRCOSP-CLDLOW_TAU1.3_9.4_MISR-ANN-global.png
MISRCOSP-CLDLOW_TAU1.3_MISR-ANN-global.png
MISRCOSP-CLDLOW_TAU9.4_MISR-ANN-global.png
OMI-MLS-TCO-ANN-60S60N.png
ceres_ebaf_surface_v4.1-ALBEDO_SRF-ANN-global.png

I think these can be grouped further: the MISRCOSP diffs can be explained by a recent fix. Other than that, some figures are missing the last longitude grid line, some are missing the last latitude, and some have more mysterious diffs...
For reference, here are the image checker results.

@tomvothecoder maybe you can recognize if there are some diffs that are expected?
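The core of an image checker along these lines could be sketched as follows (a hypothetical, simplified stand-in: it flags images whose raw bytes differ, whereas the actual checker compares rendered pixels and reports per-image diffs):

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """MD5 of a file's raw bytes; byte-identical images hash identically."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def mismatched_images(expected_dir: Path, actual_dir: Path) -> list:
    """List PNG filenames present in both directories whose contents differ."""
    diffs = []
    for expected in sorted(expected_dir.glob("*.png")):
        actual = actual_dir / expected.name
        if actual.exists() and checksum(expected) != checksum(actual):
            diffs.append(expected.name)
    return diffs
```

A byte-level comparison like this over-reports (any rerendering changes bytes), which is why a pixel-level or perceptual diff is needed to separate cosmetic changes from real data differences.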

@tomvothecoder

> @tomvothecoder maybe you can recognize if there are some diffs that are expected?

Thank you for working on this @chengzhuzhang. I will take a closer look at those specific variables.


Successfully merging this pull request may close these issues.

Replace CDAT
3 participants