Refactor Data Guide (#162)
* Refactor Data Guide

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Finished Refactor of docs

* Fix relative ref

* More fixes

* gahh more

* Change infra ref title

* Fix markdown callout

* Update data_guide.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
jbusecke and pre-commit-ci[bot] authored Aug 22, 2024
1 parent db4bbe0 commit 8ef2f9c
Showing 23 changed files with 650 additions and 613 deletions.
32 changes: 16 additions & 16 deletions book/_toc.yml
@@ -4,29 +4,29 @@
format: jb-book
root: intro
parts:
- caption: LEAP-Pangeo
- caption: Tutorials
chapters:
- file: leap-pangeo/tutorial.md
- file: leap-pangeo/jupyterhub.md
- file: leap-pangeo/architecture
- file: leap-pangeo/implementation
- file: tutorials/getting_started
- caption: Guides
chapters:
- file: guides/hub_guides
- file: leap-pangeo/solutions
- file: guides/education
- file: guides/bootcamp
- file: guides/team_docs
- file: guides/faq
- file: guides/data_guide
- file: guides/compute_guide
- file: guides/education_guide
- file: guides/bootcamp_guide
- file: guides/vm_access
- caption: Policies
- file: guides/team_guide
- file: guides/faq
- caption: Explanation
chapters:
- file: policies/code_policy
- file: policies/data_policy
- file: policies/infrastructure_policy
- file: policies/users_roles
- file: explanation/architecture
- file: explanation/implementation
- file: explanation/code_policy
- file: explanation/data_policy
- file: explanation/infrastructure_policy
- caption: Reference
chapters:
- file: reference/infrastructure
- file: reference/membership
- file: reference/education
- caption: Miscellaneous
chapters:
@@ -1,3 +1,5 @@
(explanation.architecture)=

# LEAP-Pangeo Architecture

LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program.
@@ -27,6 +29,8 @@ LEAP-Pangeo high-level architecture diagram

There are four primary components to LEAP-Pangeo.

(explanation.architecture.data-library)=

### The Data Library

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP.
@@ -44,6 +48,22 @@ Examples of data that may become part of the library are
- Easily accessible syntheses of climate projections from [CMIP6 data](https://esgf-node.llnl.gov/projects/cmip6/), produced by the LEAP team,
for use by industry partners for business strategy and decision making.

(explanation.architecture.catalog)=

#### Data Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:

- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.
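
As a rough illustration of what a single catalog entry might hold, the sketch below builds a minimal STAC Item as a plain Python dictionary. The dataset id, bucket path, dates, and bounding box are hypothetical placeholders, not actual LEAP catalog entries; only the top-level field names follow the STAC Item specification.

```python
import json


def make_stac_item(item_id, href, start, end, bbox):
    """Build a minimal STAC Item (a GeoJSON Feature) as a plain dict."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        "geometry": {
            "type": "Polygon",
            # Closed ring: the first and last positions are identical.
            "coordinates": [[
                [west, south],
                [east, south],
                [east, north],
                [west, north],
                [west, south],
            ]],
        },
        "properties": {
            # `datetime` may be null when a start/end range is given instead.
            "datetime": None,
            "start_datetime": start,
            "end_datetime": end,
        },
        "links": [],
        # Assets carry the direct links to data in cloud object storage.
        "assets": {
            "data": {"href": href, "roles": ["data"]},
        },
    }


# Hypothetical example values, for illustration only.
item = make_stac_item(
    item_id="example-dataset",
    href="gs://example-bucket/example-dataset.zarr",
    start="2000-01-01T00:00:00Z",
    end="2020-12-31T23:59:59Z",
    bbox=(-180.0, -90.0, 180.0, 90.0),
)
print(json.dumps(item, indent=2))
```

The `assets` entry is the part that corresponds to the "direct links to raw data" described above; the remaining fields are the metadata a search index or notebook client would filter on.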

### Data Storage Service

The underlying technology for the LEAP Data catalog will be cloud object storage (e.g. Amazon S3),
@@ -82,20 +102,6 @@ This synergistic relationship with be mutually beneficial to two NSF-sponsored p
Using Pangeo Forge effectively will require LEAP scientists and data engineers to engage
with the open-source development process around Pangeo Forge and related technologies.

#### Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:

- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.

### The Hub

```{figure} https://jupyter.org/assets/homepage/labpreview.webp
@@ -130,9 +136,7 @@ with full-featured Python software environments for environmental data science.
These environments will be the starting point for LEAP environments.
They may be augmented as LEAP evolves with more specific software as needed by research projects.

User management and access control for the Hub are described in {doc}`/policies/users_roles`.
We use GitHub for identity management, in order to make it easy to include participants
from any partner institution.
User management and access control for the Hub are described in [](reference.membership).

### The Knowledge Graph

@@ -145,9 +149,9 @@ LEAP "outputs" will be of four main types:
- **Educational Modules** - used for teaching

All of these objects must be tracked and cataloged in a uniform way.
The {doc}`/policies/code_policy` and {doc}`/policies/data_policy` will help set these standards.
The [](explanation.code_policy) and [](explanation.data-policy) will help set these standards.

```{figure} LEAP_knowledge_graph.png
```{figure} ../images/LEAP_knowledge_graph.png
---
width: 600px
name: knowledge-graph
10 changes: 10 additions & 0 deletions book/explanation/code_policy.md
@@ -0,0 +1,10 @@
(explanation.code_policy)=

# LEAP-Pangeo Code Policy

(explanation.code-policy.dont-let-perfect-be-the-enemy-of-good)=

## Enable Science now, but keep evolving.

"Don't let perfect be the enemy of good"
🚧
30 changes: 28 additions & 2 deletions book/policies/data_policy.md → book/explanation/data_policy.md
@@ -3,15 +3,31 @@ abbreviations:
ARCO: Analysis-Ready Cloud-Optimized
---

# Data Policy
(explanation.data-policy)=

# LEAP-Pangeo Data Policy

(explanation.data-policies.access)=

## Data Access

🚧

(explanation.data-policy.reproducibility)=

## Reproducibility

🚧

(explanation.data-policy.types)=

## Types of Data Used at LEAP

Within the LEAP project we distinguish between several different types of data mostly based on whether the data was used or produced at LEAP and if the data is already accessible in {abbr}`ARCO` formats in the cloud.

:::\{admonition} LEAP produced
:class: dropdown
Data that has been created or modified by LEAP researchers.
Data that has been created or modified by LEAP researchers. We currently do not provide any way of ensuring that data is archived, and users should never rely on LEAP-Pangeo resources as the only copy of valuable data (see also [](guides.data.ingestion)).
:::

:::\{admonition} LEAP ingested
@@ -23,3 +39,13 @@ Data that is already publically available but has been ingested into cloud stora
:class: dropdown
Data that is already available in {abbr}`ARCO` formats in publicly accessible object storage. Adding this data to the LEAP-Pangeo Catalog enables us to visualize it with the Data Viewer, and collect all datasets of importance in one single location, but none of the data itself is modified.
:::

## Roles

Many different people at LEAP interact with data in various ways. Here is a list of typical roles (some people have multiple roles):

(explanation.data-policy.roles.data-expert)=

### Data Expert

🚧
File renamed without changes.
1 change: 1 addition & 0 deletions book/explanation/infrastructure_policy.md
@@ -0,0 +1 @@
# LEAP-Pangeo Infrastructure Policy
2 changes: 1 addition & 1 deletion book/guides/bootcamp.md → book/guides/bootcamp_guide.md
@@ -1,4 +1,4 @@
# Running LEAP bootcamps
# Bootcamp Guide

We collect all bootcamp materials in the [LEAP-Pangeo bootcamp repository](https://github.com/leap-stc/LEAP-bootcamps). Please keep all relevant information and materials in this repository to make it easier for participants to find them.

10 changes: 10 additions & 0 deletions book/guides/compute_guide.md
@@ -0,0 +1,10 @@
# Compute Guide

This is a set of guides for using the JupyterHub Compute Environment effectively.

## Dask

To help you scale up calculations using a cluster, the Hub is configured with Dask Gateway.
For a quick guide on how to start a Dask Cluster, consult this page from the Pangeo docs:

- <https://pangeo.io/cloud.html#dask>
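
For orientation, a minimal sketch of what starting a cluster through Dask Gateway typically looks like is shown below. It assumes the `dask-gateway` client package is installed in the Hub environment and that the gateway address and authentication are preconfigured there; the helper name `start_dask_cluster` and the default worker count are illustrative choices, not part of the Hub setup.

```python
def start_dask_cluster(n_workers=4):
    """Request a cluster from Dask Gateway and return (cluster, client)."""
    # Imported inside the function so this module can be loaded
    # even where dask-gateway is not installed.
    from dask_gateway import Gateway

    # With no arguments, Gateway() reads its address and auth from the
    # Dask configuration (assumed to be preset on the Hub).
    gateway = Gateway()
    cluster = gateway.new_cluster()
    cluster.scale(n_workers)  # request a fixed number of workers
    client = cluster.get_client()  # Dask computations now run on the cluster
    return cluster, client


# Typical use in a notebook on the Hub:
# cluster, client = start_dask_cluster(n_workers=4)
# ... run Dask computations ...
# cluster.shutdown()  # release the workers when finished
```

Shutting the cluster down explicitly when the work is done avoids leaving idle workers running.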