diff --git a/book/_config.yml b/book/_config.yml
index 03afeaa1..85d07015 100644
--- a/book/_config.yml
+++ b/book/_config.yml
@@ -34,7 +34,12 @@ sphinx:
   recursive_update: true
   extra_extensions:
     - sphinx_jinja
+    - sphinx_design
   config:
+    suppress_warnings:
+      - app.add_node
+      - app.add_directive
+      - app.add_role
     bibtex_reference_style: author_year
     jinja_contexts: # super weird to enter all this data here. Waiting on https://github.com/executablebooks/jupyter-book/issues/858#issuecomment-1767368922 to see if there is a better way.
       team-data:
diff --git a/book/_toc.yml b/book/_toc.yml
index cf6e37b1..cdfbb451 100644
--- a/book/_toc.yml
+++ b/book/_toc.yml
@@ -9,6 +9,7 @@ parts:
       - file: tutorials/getting_started
   - caption: Guides
     chapters:
+      - file: guides/code_guide
       - file: guides/data_guide
       - file: guides/compute_guide
       - file: guides/education_guide
diff --git a/book/guides/code_guide.md b/book/guides/code_guide.md
new file mode 100644
index 00000000..3aef38db
--- /dev/null
+++ b/book/guides/code_guide.md
@@ -0,0 +1,7 @@
+# Code Guide
+
+(guide.code.secrets)=
+
+## Handling Secrets
+
+🚧 Coming soon ...
diff --git a/book/guides/data_guide.md b/book/guides/data_guide.md
index e210a239..d293ad8c 100644
--- a/book/guides/data_guide.md
+++ b/book/guides/data_guide.md
@@ -2,7 +2,7 @@
 
 # Data Guide
 
-Data is fundamental to most people's work at LEAP. This guide describes best practices how to find, transfer, ingest, and catalog data.
+Data is fundamental to most people's work at LEAP. This guide describes best practices for how to find, read, write, transfer, ingest, and catalog data.
 
 ## Discovering Dataset
 
@@ -17,21 +17,166 @@ To help onboard you to this new way of working, we have written a guide to
 Files
 We recommend you read this thoroughly, especially the part about Git and GitHub.
 
 LEAP provides several [cloud buckets](reference.infrastructure.buckets), and the following steps illustrate how to work with data in object storage as opposed to a filesystem.
 
+### Tools
+
+There are many tools available to interact with cloud object storage. We currently have basic operations documented for two tools:
+
+- [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) (and its submodules [gcsfs](https://gcsfs.readthedocs.io/en/latest/) and [s3fs](https://s3fs.readthedocs.io/en/latest/)), which provides filesystem-like access from within a Python session. Fsspec is also used by xarray under the hood.
+
+- [rclone](https://rclone.org/), which provides a command line interface to many different storage backends.
+
+:::{admonition} Note on rclone documentation
+:class: tip, dropdown
+Rclone is an extensive and powerful tool, but with its many options it can be overwhelming at the beginning (at least it was for Julius). We only demonstrate the essential options here; for more details see the [docs](https://rclone.org/). If the instructions here do not work for your specific use case, please reach out so we can improve the docs.
+:::
+
+(data.config-files)=
+
+#### Configuration for Authenticated Access
+
+Unless a given cloud bucket allows anonymous access or is preauthenticated within your environment (as is the case for some of the [LEAP-Pangeo owned buckets](reference.infrastructure.buckets)), you will need to authenticate with a key/secret pair.
+
+:::{admonition} Always handle credentials with care!
+:class: warning
+Always handle secrets with care. Do not store them in plain text that is visible to others (e.g. in a notebook cell that is pushed to a public GitHub repository).
+See [](guide.code.secrets) for more instructions on how to keep secrets safe.
+:::
+
+We recommend storing your secrets in one of the following configuration files (which will be used in the following examples to read and write data):
+
+`````{tab-set}
+````{tab-item} Fsspec
+Fsspec supports named [AWS profiles](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-format) in a credentials file. You can generate an AWS credentials file via the [aws CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-examples) (installed on the hub by default):
+
+```shell
+aws configure --profile <profile-name>
+```
+
+Pick a sensible name for your profile, particularly if you are working with multiple profiles and buckets.
+
+The file `~/.aws/credentials` then contains your key/secret similar to this:
+
+```
+[<profile-name>]
+aws_access_key_id = ***
+aws_secret_access_key = ***
+```
+````
+
+````{tab-item} Rclone
+Rclone has its own [configuration file format](https://rclone.org/docs/#config-config-file) where you can specify the key and secret (and many other settings) in a similar fashion (note the missing `aws_` prefix though!).
+
+We recommend setting up the config file (show its default location with `rclone config file`) by hand to look something like this:
+
+```
+[<remote-name>]
+... # other values
+access_key_id = XXX
+secret_access_key = XXX
+```
+
+You can have multiple 'remotes' in this file for different cloud buckets.
+
+For the [](reference.infrastructure.osn_pod), use this remote definition:
+
+```
+[osn]
+type = s3
+provider = Ceph
+endpoint = https://nyu1.osn.mghpcc.org
+access_key_id = XXX
+secret_access_key = XXX
+```
+
+````
+`````
+
+:::{warning}
+Ideally we want to store these secrets in only one central location. The natural place for this seems to be an [AWS CLI profile](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-format), which can also be used by fsspec. There do, however, seem to be multiple issues ([here](https://forum.rclone.org/t/shared-credentials-file-is-not-recognised/46993)) around this feature in rclone, and so far we have not succeeded in using AWS profiles with rclone.
+According to those issues we can only make AWS profiles (or [source profiles?](https://forum.rclone.org/t/s3-profile-failing-when-explicit-s3-endpoint-is-present/36063/4?u=jbusecke), at least the credentials part of them) work if we define one config file per remote [and use the 'default' profile](https://forum.rclone.org/t/shared-credentials-file-is-not-recognised/46993/2?u=jbusecke), which presumably breaks compatibility with fsspec, and also does not work at all right now. So at the moment we will have to keep the credentials in two separate spots 🤷‍♂️. **Please make sure to apply proper caution when [handling secrets](guide.code.secrets) for each config file that stores secrets in plain text!**
+:::
+
+(hub.data.setup)=
+
+### 
+
 (hub.data.list)=
 
 ### Inspecting contents of the bucket
 
-We recommend using [gcsfs](https://gcsfs.readthedocs.io/en/latest/) or [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) which provide a filesytem-like interface for python.
-
-You can e.g. list the contents of your personal folder with
-
-```python
-import gcsfs
-
-fs = gcsfs.GCSFileSystem() # equivalent to fsspec.fs('gs')
-fs.ls("leap-persistent/funky-user")
-```
+`````{tab-set}
+````{tab-item} Fsspec
+The first step in working with fsspec is to create a `filesystem` object, which provides the abstraction on top of the different object storage systems.
+
+```python
+import fsspec
+
+# for Google Cloud Storage
+fs = fsspec.filesystem('gs')  # equivalent to gcsfs.GCSFileSystem()
+# for S3
+fs = fsspec.filesystem('s3')  # equivalent to s3fs.S3FileSystem()
+```
+
+For **authenticated access** you need to pass additional arguments. In this case (for the m2lines OSN pod) we pass a custom endpoint and an [AWS profile](data.config-files):
+
+```python
+fs = fsspec.filesystem(
+    's3',
+    profile='<profile-name>',  # this is the profile name you configured above
+    client_kwargs={'endpoint_url': 'https://nyu1.osn.mghpcc.org'}  # this is the endpoint for the m2lines OSN pod
+)
+```
+
+You can now use the `.ls` method to list the contents of a bucket and its prefixes.
+
+You can e.g. list the contents of your personal folder on the persistent GCS bucket with
+
+```python
+fs.ls("leap-persistent/funky-user")  # replace funky-user with your github username
+```
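+
+Beyond `.ls`, the same filesystem object has a few other handy methods for checking what is in a bucket. A quick sketch (the object name below is only a hypothetical example):
+
+```python
+# total size (in bytes) of everything under your personal prefix
+fs.du("leap-persistent/funky-user")
+
+# metadata (size, type, ...) of a single object; the file name here is hypothetical
+fs.info("leap-persistent/funky-user/example-file.nc")
+```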
+
+````
+
+````{tab-item} Rclone
+
+To inspect a bucket you can use rclone with the profile ('remote' in rclone terminology) set up [above](data.config-files):
+
+```shell
+rclone ls <remote-name>:bucket-name/funky-user
+```
+````
+`````
+
+### Moving Data
+
+`````{tab-set}
+````{tab-item} Fsspec
+🚧
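+
+Until this section is filled in, here is a minimal sketch of one way to copy files with an (authenticated) fsspec filesystem like the one created above; the bucket and file names are hypothetical:
+
+```python
+# upload a local file to the bucket (use recursive=True to copy whole directories)
+fs.put("path/to/local/file.nc", "some-bucket/funky-user/file.nc")
+
+# download from the bucket to your local machine
+fs.get("some-bucket/funky-user/file.nc", "path/to/local/file.nc")
+```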
+````
+
+````{tab-item} Rclone
+
+You can move directories from a local computer to cloud storage with rclone (make sure you are properly [authenticated](data.config-files)):
+
+```shell
+rclone copy path/to/local/dir/ <remote-name>:<bucket-name>/funky-user
+```
+
+You can also move data between cloud buckets using rclone:
+
+```shell
+rclone copy \
+  <source-remote>:<source-bucket>/funky-user \
+  <target-remote>:<target-bucket>/funky-user
+```
+
+:::{note}
+Copying with rclone will stream the data from the source to your computer and back out to the target, so the transfer speed is likely limited by the internet connection of your local machine.
+:::
+
+````
+`````
+
 (hub.data.read_write)=
 
 ### Basic writing to and reading from cloud buckets
diff --git a/book/reference/infrastructure.md b/book/reference/infrastructure.md
index 15f5bd3e..e4ee3c16 100644
--- a/book/reference/infrastructure.md
+++ b/book/reference/infrastructure.md
@@ -118,6 +118,14 @@ However, these will disappear when your server shuts down.
 For a more permanent solution we recommend building project specific dockerfiles and using those as [custom images](reference.infrastructure.hub.image.custom).
 
+## Cloud Storage
+
+(reference.infrastructure.osn_pod)=
+
+### m2lines OSN Pod
+
+🚧
+
 (reference.infrastructure.buckets)=
 
 ## LEAP-Pangeo Cloud Storage Buckets
diff --git a/environment.yml b/environment.yml
index 666c4379..ed97e250 100644
--- a/environment.yml
+++ b/environment.yml
@@ -5,6 +5,7 @@ dependencies:
   - sphinx==4.5.0
   - jupyter-book==0.12.3
   - sphinxcontrib-bibtex
+  - sphinx-design
   - pip
   - pip:
       - sphinx-jinja