Add basic rclone guide + config (#183)
* Add basic rclone guide + config

* Ignore redefinition warnings for directives, roles, and nodes
jbusecke authored Oct 14, 2024
1 parent 8850d12 commit 6a7709e
Showing 6 changed files with 173 additions and 6 deletions.
5 changes: 5 additions & 0 deletions book/_config.yml
@@ -34,7 +34,12 @@ sphinx:
recursive_update: true
extra_extensions:
- sphinx_jinja
- sphinx_design
config:
suppress_warnings:
- app.add_node
- app.add_directive
- app.add_role
bibtex_reference_style: author_year
jinja_contexts: # super weird to enter all this data here. Waiting on https://github.com/executablebooks/jupyter-book/issues/858#issuecomment-1767368922 to see if there is a better way.
team-data:
1 change: 1 addition & 0 deletions book/_toc.yml
@@ -9,6 +9,7 @@ parts:
- file: tutorials/getting_started
- caption: Guides
chapters:
- file: guides/code_guide
- file: guides/data_guide
- file: guides/compute_guide
- file: guides/education_guide
7 changes: 7 additions & 0 deletions book/guides/code_guide.md
@@ -0,0 +1,7 @@
# Code Guide

(guide.code.secrets)=

## Handling Secrets

🚧 Coming soon ...
157 changes: 151 additions & 6 deletions book/guides/data_guide.md
@@ -2,7 +2,7 @@

# Data Guide

Data is fundamental to most people's work at LEAP. This guide describes best practices for how to find, read, write, transfer, ingest, and catalog data.

## Discovering Datasets

@@ -17,21 +17,166 @@ To help onboard you to this new way of working, we have written a guide to Files

We recommend you read this thoroughly, especially the part about Git and GitHub. LEAP provides several [cloud buckets](reference.infrastructure.buckets), and the following steps illustrate how to work with data in object storage as opposed to a filesystem.

### Tools

There are many tools available to interact with cloud object storage. We currently have basic operations documented for two tools:

- [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) (and the companion packages [gcsfs](https://gcsfs.readthedocs.io/en/latest/) and [s3fs](https://s3fs.readthedocs.io/en/latest/)), which provides filesystem-like access from within a Python session. Fsspec is also used by xarray under the hood.

- [rclone](https://rclone.org/), which provides a command-line interface to many different storage backends.

:::{admonition} Note on rclone documentation
:class: tip, dropdown
Rclone is an extensive and powerful tool, but its many options can be overwhelming at first (at least they were for Julius). We only demonstrate the essential options here; for more details see the [docs](https://rclone.org/). If the instructions here do not work for your specific use case, please reach out so we can improve the docs.
:::

(data.config-files)=

#### Configuration for Authenticated Access

Unless a given cloud bucket allows anonymous access or is preauthenticated within your environment (as is the case for some of the [LEAP-Pangeo owned buckets](reference.infrastructure.buckets)), you will need to authenticate with a key/secret pair.

:::{admonition} Always handle credentials with care!
:class: warning
Always handle secrets with care. Do not store them in plain text that is visible to others (e.g. in a notebook cell that is pushed to a public GitHub repository). See [](guide.code.secrets) for more instructions on how to keep secrets safe.
:::
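
For example (a minimal sketch; the environment variable names are placeholders, not a LEAP convention), you can read credentials from environment variables instead of hardcoding them in a notebook:

```python
import os
import fsspec

# Hypothetical variable names -- set these in your shell, not in the notebook
fs = fsspec.filesystem(
    's3',
    key=os.environ['OSN_ACCESS_KEY_ID'],
    secret=os.environ['OSN_SECRET_ACCESS_KEY'],
)
```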

We recommend storing your secrets in one of the following configuration files (which will be used in the examples below to read and write data):

`````{tab-set}
````{tab-item} Fsspec
Fsspec supports named [AWS profiles](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-format) in a credentials file. You can generate one via the [aws CLI](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-examples) (installed on the hub by default):
```shell
aws configure --profile <pick_a_name>
```
Pick a sensible name for your profile, particularly if you are working with multiple profiles and buckets.
The file `~/.aws/credentials` then contains your key/secret pair, similar to this:
```
[<the_profile_name_you_picked>]
aws_access_key_id = ***
aws_secret_access_key = ***
```
````
````{tab-item} Rclone
Rclone has its own [configuration file format](https://rclone.org/docs/#config-config-file) where you can specify the key and secret (and many other settings) in a similar fashion (note the missing `aws_` prefix though!).
We recommend setting up the config file by hand (you can show its default location with `rclone config file`) to look something like this:
```
[<remote_name>]
... # other values
access_key_id = XXX
secret_access_key = XXX
```
You can have multiple 'remotes' in this file for different cloud buckets.
For the [](reference.infrastructure.osn_pod) use this remote definition:
```
[osn]
type = s3
provider = Ceph
endpoint = https://nyu1.osn.mghpcc.org
access_key_id = XXX
secret_access_key = XXX
```
````
`````
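
Once a profile or remote is configured, you can verify that it is picked up (a quick sanity check using standard CLI commands):

```shell
# Show the settings the aws CLI resolves for your profile
aws configure list --profile <the_profile_name_you_picked>
# List all remotes defined in your rclone config
rclone listremotes
```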

:::{warning}
Ideally we would store these secrets in only one central location. The natural place seems to be an [AWS CLI profile](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html#cli-configure-files-format), which can also be used by fsspec. However, there seem to be multiple issues ([here](https://forum.rclone.org/t/shared-credentials-file-is-not-recognised/46993)) around this feature in rclone, and so far we have not succeeded in using AWS profiles with rclone.
According to those issues, we can only make AWS profiles (or perhaps [source profiles?](https://forum.rclone.org/t/s3-profile-failing-when-explicit-s3-endpoint-is-present/36063/4?u=jbusecke), at least the credentials part of it) work by defining one config file per remote [and using the 'default' profile](https://forum.rclone.org/t/shared-credentials-file-is-not-recognised/46993/2?u=jbusecke), which presumably breaks compatibility with fsspec, and also does not work at all right now. So at the moment we have to keep the credentials in two separate spots 🤷‍♂️. **Please make sure to apply proper caution when [handling secrets](guide.code.secrets) for each config file that stores secrets in plain text!**
:::

(hub.data.setup)=

### Setup

(hub.data.list)=

### Inspecting contents of the bucket

We recommend using [gcsfs](https://gcsfs.readthedocs.io/en/latest/) or [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), which provide a filesystem-like interface for Python.
`````{tab-set}
````{tab-item} Fsspec
The initial step in working with fsspec is to create a `filesystem` object, which provides the abstraction on top of different object storage systems.
```python
import fsspec

# for Google Cloud Storage
fs = fsspec.filesystem('gs')  # equivalent to gcsfs.GCSFileSystem()
# for s3
fs = fsspec.filesystem('s3')  # equivalent to s3fs.S3FileSystem()
```
For **authenticated access** you need to pass additional arguments. In this case (for the m2lines OSN pod) we pass a custom endpoint and an [aws profile](data.config-files):
```python
import fsspec

fs = fsspec.filesystem(
    's3',
    profile='<the_profile_name_you_picked>',  # the profile name you configured above
    client_kwargs={'endpoint_url': 'https://nyu1.osn.mghpcc.org'},  # the endpoint for the m2lines OSN pod
)
```
You can now use the `.ls` method to list the contents of a bucket and its prefixes.
For example, list the contents of your personal folder on the persistent GCS bucket with
```python
fs = fsspec.filesystem('gs')  # equivalent to gcsfs.GCSFileSystem()
fs.ls("leap-persistent/funky-user")  # replace funky-user with your github username
```
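A few other inspection methods from the standard fsspec API can be handy as well (the paths here are placeholders):
```python
fs.du("leap-persistent/funky-user")  # total size in bytes under a prefix
fs.info("leap-persistent/funky-user/example.zarr")  # metadata for a single key
fs.glob("leap-persistent/funky-user/**/*.zarr")  # list keys matching a pattern
```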
````
````{tab-item} Rclone
To inspect a bucket you can use rclone with the remote set up [above](data.config-files) (a 'remote' is rclone's term for a configured profile):
```shell
rclone ls <remote_name>:<bucket-name>/funky-user
```
````
`````

### Moving Data

`````{tab-set}
````{tab-item} Fsspec
🚧
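In the meantime, here is a minimal sketch (assuming the same profile/endpoint setup as in the inspection section above; paths are placeholders) of how one might upload a local directory with fsspec:
```python
import fsspec

# Authenticated filesystem, configured as above
fs = fsspec.filesystem(
    's3',
    profile='<the_profile_name_you_picked>',
    client_kwargs={'endpoint_url': 'https://nyu1.osn.mghpcc.org'},
)
# Recursively upload a local directory to the bucket
fs.put('path/to/local/dir/', '<bucket-name>/funky-user/dir/', recursive=True)
```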
````
````{tab-item} Rclone
You can move directories from a local computer to cloud storage with rclone (make sure you are properly [authenticated](data.config-files)):
```shell
rclone copy path/to/local/dir/ <remote_name>:<bucket-name>/funky-user
```
You can also move data between cloud buckets using rclone:
```shell
rclone copy \
  <remote_name_a>:<bucket-name>/funky-user \
  <remote_name_b>:<bucket-name>/funky-user
```
:::{note}
Copying with rclone streams the data from the source through the machine running rclone to the target, so transfer speed is likely limited by the internet connection of your local machine.
:::
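Two generally useful rclone flags here: `--dry-run` previews what would be transferred without copying anything, and `-P`/`--progress` shows live transfer progress:
```shell
rclone copy --dry-run path/to/local/dir/ <remote_name>:<bucket-name>/funky-user
rclone copy -P path/to/local/dir/ <remote_name>:<bucket-name>/funky-user
```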
````
`````

(hub.data.read_write)=

### Basic writing to and reading from cloud buckets
8 changes: 8 additions & 0 deletions book/reference/infrastructure.md
@@ -118,6 +118,14 @@ However, these will disappear when your server shuts down.

For a more permanent solution we recommend building project specific dockerfiles and using those as [custom images](reference.infrastructure.hub.image.custom).

## Cloud Storage

(reference.infrastructure.osn_pod)=

### m2lines OSN Pod

🚧

(reference.infrastructure.buckets)=

## LEAP-Pangeo Cloud Storage Buckets
1 change: 1 addition & 0 deletions environment.yml
@@ -5,6 +5,7 @@ dependencies:
- sphinx==4.5.0
- jupyter-book==0.12.3
- sphinxcontrib-bibtex
- sphinx-design
- pip
- pip:
- sphinx-jinja
