Tropomi2grid quick start

How does the Grid work

You interact with the Grid via the dedicated tropomi UI machine trui.grid.surfsara.nl. You describe each job in a JDL (Job Description Language) file, where you list which program should be executed and what the worker node requirements are. From the UI, you submit jobs to multiple clusters with the glite-* commands. The resource broker, called WMS (short for Workload Management System), accepts your jobs, assigns them to the most appropriate CE (Computing Element), records the job statuses and retrieves the output.

Preamble

In order to run your work on the Grid, you need:

  1. A personal Grid Certificate, so that you will be identified on the various clusters. It is issued by a Certificate Authority (CA) and contains your name and your organisation.
  2. A UI account on trui.grid.surfsara.nl, so that you can interact with the Grid.
  3. Membership of the projects.nl/tropomi VO (Virtual Organisation), so that you can run your jobs on certain clusters.

Simple example

Login to the UI:

ssh hhu@trui.grid.surfsara.nl
  • Create a proxy valid for a week:
startGridSession projects.nl:/projects.nl/tropomi
  • Get the example and compile it:
cp /tmp/tropomi/fractals.c .
cc fractals.c -o fractals -lm
  • Run it locally with a set of parameters:
./fractals -o output -q 0.184 -d 2280 -m 4400 # feel free to try different values for -q, -d and -m
  • Display the image:
cp /tmp/tropomi/convert .
./convert output "output.png"
display output.png
  • Create the jdl file:

In the JDL you specify the content of the input and output sandboxes. These sandboxes allow you to transfer small files to or from the Grid. The input sandbox contains all the files that you want to send with your job to the worker node, for example a script that you want executed. The output sandbox contains all the files that you want to have transferred back to the UI.
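For reference, a minimal gLite JDL for a job like this typically contains the attributes below (a hypothetical sketch; the actual fractals.jdl provided in /tmp/tropomi may differ):

Executable = "/bin/sh";
Arguments = "wrapper.sh";
StdOutput = "stdout.log";
StdError = "stderr.log";
InputSandbox = {"wrapper.sh", "fractals"};
OutputSandbox = {"stdout.log", "stderr.log", "output"};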

cp /tmp/tropomi/fractals.jdl .
cp /tmp/tropomi/wrapper.sh .
  • Submit the job to the Grid:
glite-wms-job-submit -d $USER -o jobIds fractals.jdl
  • Check the job status from command line and your browser:
glite-wms-job-status https://wms2.grid.sara.nl:9000/6swP5FEfGVZ69tVB3PwnDQ #replace with your jobID
#or
glite-wms-job-status -i jobIds
  • Get the job output to the UI:
glite-wms-job-output --dir . -i jobIds

Grid Storage

The backend of the Grid storage at SURFsara is dCache. The storage element located at SURFsara is accessible from any Grid cluster or UI for storing and retrieving large amounts of data. It consists of magnetic tape storage and hard disk storage, and both are addressed by a common file system.

You can refer to your files on the Grid in two different ways, depending on which storage client you use to manage your files:

  • Transport URL or TURL:
gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/
  • Storage URL or SURL:
srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/projects.nl/tropomi/

The InputSandbox and OutputSandbox attributes in the JDL file are the basic way to move files between the User Interface (UI) and the Worker Node (WN). However, for large files (from about 100 MB and larger) you should not use these sandboxes to move data around. Instead, use a Grid storage client such as uberftp or globus-url-copy.

  • Listing directories:
uberftp -ls gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/
  • Make a directory:
uberftp -mkdir gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/
  • Remove a file or directory:
uberftp -rm gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/testdir/file
uberftp -rm -r gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/testdir/
  • Transfer files:
globus-url-copy -vb file:///${PWD}/file gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/file # transfer file from UI or worker node to Grid Storage
globus-url-copy -vb gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/file file:///${PWD}/file2 # transfer file from Grid Storage to UI or worker node
  • Recursive download
globus-url-copy -vb -cd -r gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/testdir/ file:///${PWD}/testdir/ # transfer a directory from the Grid Storage to the UI or worker node

Exercise: Run the Grid job again and fetch the output image from the Grid storage this time ...

  • Recursive upload
# first create the test directory on the Grid storage
globus-url-copy -vb -cd -r file:///${PWD}/testdir/ gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/testdir/ # transfer a directory from UI or worker node to Grid Storage
  • Rename a file
# gfal storage client allows remote rename
gfal-rename gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/filename  gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/haili/fileNEWname 
  • Third party transfers

To copy data from a [source] directory to a [target] directory that are both located on the Grid storage (dCache), we use third party transfers. This method copies files directly between two dCache endpoints without downloading them to an intermediate machine:

# first create the [target] directory on the Grid storage, e.g. ecmwf
# start a screen session and copy the files with:
globus-url-copy -cd -r gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/[source]/ gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/[target]/

This will copy the content of [source] to the [target] directory. Use the -vb flag to display verbose information for the transferred files. NB: you must include the trailing slash after the [source] and [target] directories!

Example:

uberftp -mkdir gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/s5p/input_2/ecmwf/analysis_1
screen
globus-url-copy -cd -r gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/s5p/input/ecmwf/analysis_1/ gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/s5p/input_2/ecmwf/analysis_1/

JDL requirements

  • To exclude a specific site, add the following line in the .jdl:
Requirements=(!RegExp("iihe.ac.be", other.GlueCEUniqueID));
  • To schedule the jobs on a specific site, add the following line in the .jdl:
Requirements=(RegExp("gina", other.GlueCEUniqueID));
  • To request 3 cores per job on the same node, add the following to the .jdl:
SmpGranularity = 3;
CPUNumber = 3;
  • Requirement for Nikhef high memory jobs:

Request a node with at least 8 GB of memory available.

CERequirements = "other.GlueHostMainMemoryRAMSize >= 8192";
  • Check which clusters match your jdl requirements:
glite-wms-job-list-match -a <your_jdl>

Downtime notifications

  • Ongoing maintenances

You can look up announcements about the current status of the Grid systems on the SURFsara user info website by browsing to:

Systems > System status > Grid: National e-Infrastructure Grid Maintenances
  • Upcoming maintenances

If you want to receive notifications about upcoming downtimes and maintenances, you can create personal subscriptions to receive announcements tailored specifically to the Grid services and clusters. Create a subscription at this link (from a browser where your Grid certificate is installed): https://operations-portal.egi.eu/downtimes/subscription by using the following rule:

Rule    Region  Site  Node  Service  VO
I WANT  NGI_NL  ALL   ALL   ALL      ALL

Softdrive

  • Login to the Softdrive UI with the tropomi account s5p:
ssh s5p@softdrive.grid.surfsara.nl
  • Install gcc-4.8.5, netcdf-4.3.3.1 and netcdf-fortran-4.4.2 into your cvmfs user path:
cp -r /cvmfs/softdrive.nl/lyklev/gcc-4.8.5/ /cvmfs/softdrive.nl/lyklev/netcdf-4.3.3.1/ /cvmfs/softdrive.nl/lyklev/netcdf-fortran-4.4.2/ /cvmfs/softdrive.nl/s5p/
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-2.4.0-Linux-x86_64.sh
  • Install anaconda with PREFIX=/cvmfs/softdrive.nl/s5p/anaconda-2-2.4.0:
bash Anaconda2-2.4.0-Linux-x86_64.sh
  • Install netcdf4-python:
conda install netcdf4 h5py
  • Trigger software publication after any change:
publish-my-softdrive

The software will become available on the UI and Grid worker nodes after about an hour. Check from trui.grid.surfsara.nl with: ls /cvmfs/softdrive.nl/s5p/

  • Add the install locations to your Grid job wrapper, Makefile or bash_profile:
export PATH=/cvmfs/softdrive.nl/s5p/gcc-4.8.5/bin:/cvmfs/softdrive.nl/s5p/anaconda-2-2.4.0/bin:$PATH
export LD_LIBRARY_PATH=/cvmfs/softdrive.nl/s5p/gcc-4.8.5/lib64:/cvmfs/softdrive.nl/s5p/netcdf-4.3.3.1/lib/:/cvmfs/softdrive.nl/s5p/netcdf-fortran-4.4.2/lib/:/cvmfs/softdrive.nl/s5p/anaconda-2-2.4.0/lib:$LD_LIBRARY_PATH

PiCaS pilot jobs

Application Design

Let's say that the application "fractals" needs to be executed in parallel a certain number of times on the Grid. The pipeline of the job is as follows:

  • We have a number of tasks (fractals application) to be processed on the Grid Worker Nodes.
  • Each task requires a set of parameters to run. The parameters of each task constitute an individual piece of work, called a token. NB: the token is just a description, not the task itself.
  • This approach launches a number of Grid jobs called pilot jobs, smaller than the number of tasks. The pilot jobs run in parallel.
  • Each pilot job is like a normal job, but instead of executing a task directly, it asks a central repository for the next task once it is running on a worker node. A task corresponds to one token. NB: only one client can work on a given token at any one time.
  • A pilot job terminates when no more tokens are available.

This approach is based on a wrapper script for the arrangement of subsequent Fractals runs (tasks) on the Grid, making use of the PiCaS job management system. The procedure consists of the following two steps:

  1. Tokens - PiCaS pool server: we upload the parameters as tokens to the central repository (the PiCaS token pool); a sketch of such a token document is shown after this list.
  2. Application - Run Pilot Jobs: we submit pilot jobs to the Grid Worker Nodes. When a pilot job starts, it:
  • makes a connection to the PiCaS server and gets the next available token.
  • runs a wrapper script to execute a task with the parameters fetched from the token.
  • attaches the log files to the corresponding token when the task is done.
  • asks for the next available token, or dies if there is no token left.
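As an illustration, a token is just a small CouchDB document in the PiCaS database. The sketch below creates one by hand with curl (hypothetical document id and "input" field; the real tokens are created for you by createTokens.py in the example that follows). The lock, done and exit_code fields are the ones the Monitor views key on:

curl -X PUT https://picas-tropomi.grid.surfsara.nl:6984/$PICAS_DB/token_0001 \
     -u $PICAS_USR:$PICAS_USR_PWD \
     -H "Content-Type: application/json" \
     -d '{"type": "token", "lock": 0, "done": 0, "exit_code": "", "input": "-q 0.184 -d 2280 -m 4400"}'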

Run the PiCaS example

Prerequisites

To be able to run the Fractals example, you must have:

  • An account on trui (trui.grid.surfsara.nl)
  • A Grid certificate
  • A VO membership for /projects.nl/tropomi
  • An account on PiCaS

Preparation

  • Login to trui and get the complete example:
cp -r /tmp/tropomi/pilot_picas_fractals/ .
  • List the contents of the pilot_picas_fractals/ directory:
cd pilot_picas_fractals/
ls -l

Tokens/
Application/

Detailed information regarding the operations performed in each of the scripts below is embedded in the comments inside each script.

1. Tokens - PiCaS pool server

  • Move to the Tokens directory:
cd Tokens
  • List the files in Tokens directory:
ls -l

createTokens
createTokens.py
createViews.py
picas/ #picas client repository
couchdb/ #python library for couchdb - current v0.9

This example includes a bash script (createTokens) that generates a parameter file, with each line representing a set of parameters that the fractals program can be called with. Without arguments it creates a sensible set of 24 parameter lines.

  • Run the createTokens script and inspect the file it creates:
./createTokens 
# /tmp/tmp.fZ33Kd8wXK
# cat /tmp/tmp.fZ33Kd8wXK
  • Upload the Tokens:
python createTokens.py /tmp/tmp.fZ33Kd8wXK $PICAS_DB $PICAS_USR $PICAS_USR_PWD  # replace with your /tmp/tmp.XXX file, picas database name, picas username and picas password

Check the tropomi database here: https://picas-tropomi.grid.surfsara.nl:6984/_utils/ The tokens should now be visible there.

  • Create the Views (pools):
python createViews.py $PICAS_DB $PICAS_USR $PICAS_USR_PWD

Refresh and check the tropomi database here: https://picas-tropomi.grid.surfsara.nl:6984/_utils/
The views are now loaded. Unfold "View -> All documents" on the top right. Browse each view (Monitor: done, error, locked, overview_total, todo) and inspect the javascript code.

2. Application - Run Pilot Jobs

  • Create a proxy:
startGridSession projects.nl:/projects.nl/tropomi
  • Move to the Application directory:
cd Application/
  • List the files in Application directory:
ls -l

fractals.jdl
sandbox/

ls -l sandbox/

couchdb.tar # couchdb client
fractals # fractals executable
fractals.c # fractals source
picas.tar # picas client
startpilot.sh # sets the environment for picas and starts the pilot job
pilot.py # the pilot job iterator. initiates the master script that runs the application
master.sh # starts the application

  • Submit the pilot jobs:

a) Execution on trui.grid.surfsara.nl for debugging (or another machine with python >=2.6):

cd sandbox/
. startpilot.sh $PICAS_DB $PICAS_USR $PICAS_USR_PWD 

Refresh and check the tropomi database here: https://picas-tropomi.grid.surfsara.nl:6984/_utils/
Select the "overview_total" view and refresh.
The first token should now be locked. While the UI is processing some tokens, start the pilot jobs on the Grid from a new terminal in order to process the rest of the tokens in parallel. See the next section.

b) Execution on GRID:

  • Modify fractals.jdl by replacing [$PICAS_DB] [$PICAS_USR] [$PICAS_USR_PWD] with your credentials (hard-coded):
cd Application/
# vim fractals.jdl
  • Submit the pilot jobs to the Grid
glite-wms-job-submit -d $USER -o jobIDs fractals.jdl

This is a parametric Grid job that creates five child jobs (see the JDL file, line Parameters=5;). The five child jobs run the same pilot job in parallel. Each pilot job keeps fetching tokens in iteration until it reaches the queue walltime limit or no tokens are left.
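For reference, the parametric part of such a gLite JDL typically looks like the excerpt below (a sketch; the shipped fractals.jdl may differ in details):

JobType = "Parametric";
Parameters = 5;
ParameterStart = 0;
ParameterStep = 1;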

This example stores the output images on the Grid storage here:

uberftp -ls gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/

Each pilot job iteratively generates images based on the parameters received from PiCaS.
Refresh and check the tropomi database here: https://picas-tropomi.grid.surfsara.nl:6984/_utils and browse through the different PiCaS views.

Combined PiCaS databases

Accessing multiple databases from the same script is possible with the couchdb package. The different databases should be defined in the config file, e.g.:

cat picasconfigcombined.py

PICAS_HOST_URL="https://picas-tropomi.grid.surfsara.nl:6984/"
PICAS_DATABASE="nataliedb"
PICAS_DATABASE2="openstack_tests"
PICAS_USERNAME="natalie_user"
PICAS_PASSWORD="..."

An example snippet to parse the tokens of two different databases in a certain view is:

import couchdb
import picasconfigcombined
url=picasconfigcombined.PICAS_HOST_URL
username=picasconfigcombined.PICAS_USERNAME
password=picasconfigcombined.PICAS_PASSWORD
dbname=picasconfigcombined.PICAS_DATABASE
dbname2=picasconfigcombined.PICAS_DATABASE2
db = couchdb.Database(url + "/" + dbname)
db2 = couchdb.Database(url + "/" + dbname2)
db.resource.credentials = (username, password)
db2.resource.credentials = (username, password)
db
 #<Database u'nataliedb'>
db2
 #<Database u'openstack_tests'>

v=db.iterview("Monitor" + "/" + "todo", 100)
to_show=[]
for x in v:
    doc = db[x['key']]
    to_show.append(doc)
to_show

v2=db2.iterview("Monitor" + "/" + "todo", 100)
to_show2 = []
for x in v2:
    doc = db2[x['key']]
    to_show2.append(doc)
to_show2

Dedicated Tropomi-PiCaS

All pilot job scripts should be updated with the URL of the dedicated Tropomi-PiCaS instance, https://picas-tropomi.grid.surfsara.nl:6984/.
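Assuming your scripts read the server location from a picasconfig.py as in the examples on this page, a quick way to switch them over could be the following one-liner (hypothetical; adjust the filename to your setup):

sed -i 's|^PICAS_HOST_URL=.*|PICAS_HOST_URL="https://picas-tropomi.grid.surfsara.nl:6984/"|' picasconfig.py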

The databases and all the document revisions have been migrated from the generic PiCaS instance. Access to the generic PiCaS instance is disabled to make sure that there are no conflicts with document updates on different locations.

Grid toolkit installation

This section describes how to install gridFTP clients without root permissions on a Debian-based machine. It installs the srm and globus-url-copy clients so you can interact with the Grid storage directly from your local machine.

Prerequisite: java

Installation

  • Copy your ~/.globus (Grid certificates) directory from the UI to your machine.
  • Login to your machine and check that the permissions in your ~/.globus directory are correct:
-rw-r--r-- usercert.pem
-r-------- userkey.pem
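If the permissions differ from the listing above, you can set them with, for example:

chmod 644 ~/.globus/usercert.pem
chmod 400 ~/.globus/userkey.pem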
  • Download the tarball “projectsnl_grid_clients.tar.gz”:
wget -O projectsnl_grid_clients.tar.gz https://surfdrive.surf.nl/files/index.php/s/5qQ1UOXYl3gsZpW/download
  • Untar the package:
tar -xvzf projectsnl_grid_clients.tar.gz
  • Source the "init_java6.sh":
. projectsnl_grid/init_java6.sh
  • Update the certificates:
cd projectsnl_grid
. update_certificates_eugridpma.sh
  • Set the proxy environment variable to a custom location, e.g. in your bashrc. The default is /tmp on the UI.
export X509_USER_PROXY=/tmp/x509up_uXXX  #replace XXX with your proxy ID, check your ID in /tmp of UI
  • Set the expected ports range for the storage clients:
export GLOBUS_TCP_PORT_RANGE=20000,25000
  • Install globus-url-copy:
wget -O globus_toolkit.tar.gz downloads.globus.org/toolkit/gt6/stable/installers/linux/globus_toolkit-6.0.1502136246-x86_64-unknown-linux-gnu-Build-14.tar.gz
tar --strip-components=1 -xvf globus_toolkit.tar.gz
export PATH=$(pwd)/globus/bin:$PATH

Usage

  • Create a proxy
voms-proxy-init --voms  projects.nl:/projects.nl/tropomi --valid 168:00
  • List the files in a directory
globus-url-copy -list gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/
  • Transfer files
globus-url-copy -vb gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/bootstrap2.log file:///${PWD}/testfile

Staging data

The Grid storage at SURFsara (dCache) consists of magnetic tape storage and hard disk storage. Data stored on magnetic tape has to be copied to a hard drive before it can be used. This action is called staging or bringing files online.

The Grid storage files remain online as long as there is free space on the disk pools. When a pool group is full (maximum of assigned quota) and free space is needed, dCache will purge the least recently used cached files. The tape replica will remain on tape.

The amount of time that a file is requested to stay on disk is called pin lifetime. The file will not be purged until the pin lifetime has expired.

To run the staging commands, you need a valid proxy on the machine from which you execute them.

srmbringonline practice

The practice for staging files with srmbringonline has been archived in the page staging with srm commands.

The preferred method for bulk staging is the gfal2 practice below.

gfal2 practice

The gfal2 API provides a set of functions to support staging operations. Given a list of SURLs, the necessary scripts (based on gfal2) are:

  • state.py: display the locality (state) of the files
  • stage.py: stage (bring online) the files and return a token
  • release.py: release the files based on the token

Here is an example:

Preparation

  • Download the scripts on trui and inspect the usage instructions inside each of state.py, stage.py, release.py.

  • Create a file that contains a list of SURLs (srm://srm.grid.sara.nl/..) of the files you want to stage from tape, e.g. the file called mysurls inside the folder datasets:

cat datasets/mysurls
#srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/file1 
#srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/file2 
#srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/file3 
  • In case you plan to stage a large bulk of data, split the filelist into chunks of 1000 files:
split -l 1000 --numeric-suffixes [datasets/mysurls] [output_prefix]

State operations

  • Check the state of the files with:
python state.py --file [datasets/mysurls]
#srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/file1 ONLINE_AND_NEARLINE
#srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/file2 NEARLINE
#srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/file3 NEARLINE

ONLINE: The file is only on disk
NEARLINE: The file is only on tape
ONLINE_AND_NEARLINE: The file is both on disk and on tape

NB: If your surls filelist is too long, it is better to redirect the output of this command to a file.

  • Display the total number of files on tape and/or disk:
python state.py --file [datasets/mysurls] | awk '{print $2}' | sort | uniq --count
#1 ONLINE_AND_NEARLINE
#2 NEARLINE
  • If your surls filelist is too long, please use the token returned by stage.py (see the section below) to retrieve the state of your files. It improves performance significantly! Given a token ID [5d43ee2e:-1992583391], the equivalent commands would be:
python state.py --file [datasets/mysurls] --token [5d43ee2e:-1992583391]
python state.py --file [datasets/mysurls] --token [5d43ee2e:-1992583391] | awk '{print $2}' | sort | uniq --count

Stage operations

  • Submit a staging request for your surls filelist. The command returns a token that you can use to check the state of the files:
python stage.py --file [datasets/mysurls]
#Got token 5d43ee2e:-1992583391 

NB: You can store the output in a file, e.g. stage.log, to make sure that you keep the token IDs of your stage requests safe. These are required to state/release the files later on.
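For example, using standard shell redirection (the stage.log filename is just a suggestion):

python stage.py --file [datasets/mysurls] | tee -a stage.log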

NB: When all files in your surls filelist are already ONLINE, the stage.py script will return a message and not a token.

NB: The token always corresponds to the given surls file. You need to use the exact same surls file in your state/release commands. If you add/remove any file in your surls filelist and use the old token, you will get an error.

NB: In case you plan to stage a large bulk of data, submit each chunk of 1000 files with a 1-minute interval to prevent overloading the staging namespace server.
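A simple shell loop over the chunks produced by the split command above could look like the sketch below (assuming chunk_ was chosen as the [output_prefix]):

for f in chunk_*; do
    python stage.py --file $f | tee -a stage.log   # keep the returned token IDs
    sleep 60                                       # wait a minute between chunks
done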

  • The pin lifetime is set in seconds; in this example script the requested pin time is two weeks (1209600 sec). You can edit the desired pin time on line 55 inside stage.py:

(status, token) = context.bring_online(surls, 1209600, 604800, True) => the pin time is the 2nd argument of bring_online function

NB: The pin lifetime counts from the moment you submit the request, independently of the actual time that the files are on disk.

Unpinning operations

  • Release the pin of your bulk of files with:
python release.py --file [datasets/mysurls] --token [5d43ee2e:-1992583391]

NB: After submitting the unpinning command above, the files will remain cached but purgeable until new requests claim the available space.

NB: If your surls filelist is too long, it is better to redirect the output of this command to a file.

Gina reservation

We reserved 12 nodes of the newest type (Fujitsu RX2530 and Dell R630) that can be accessed via a dedicated TROPOMI queue. The specifications of the reserved nodes are: 24 cores per node, 100 GB (Fujitsu) or 300 GB (Dell) scratch per core, and 8 GB RAM per core.

  • Access the reservation:

Add the following line to your JDL and comment out any other requirement related to queue selection:

Requirements = RegExp("tropomi",other.GlueCEUniqueID);

The dedicated queue wall-time is 96 hours.

  • Access to the regular Grid nodes:

Next to the reservation, you can also submit jobs to the generic Grid cluster with the regular JDL requirements:

#Requirements=(RegExp("gina.sara.nl:8443/cream-pbs-short", other.GlueCEUniqueID));
#Requirements=(RegExp("gina.sara.nl:8443/cream-pbs-medium", other.GlueCEUniqueID));
#Requirements=(RegExp("gina.sara.nl:8443/cream-pbs-long", other.GlueCEUniqueID));
#Requirements=((RegExp("gina", other.GlueCEUniqueID))&&(other.GlueCEPolicyMaxWallClockTime >= 2160));  

Get job ID in token

In order to save the job URL when a token is locked follow these steps:

  • In your processing token creation scripts add a new attribute to store the Grid job ID: 'wms_jobid':''

  • In your picas library (picas.tar) edit the script modifiers.py:

    • Add this import on top: from os import environ
    • Find the function lock() of the BasicTokenModifier class and replace it with the content below:
      def lock(self, token):
          """Function which modifies the token such that it is locked.
          @param key: the key generated by the couchdb view.
          @param token: the token content.
          @return: modified token.
          """
    
          wms_jobid = environ.get("GLITE_WMS_JOBID")
          lock_content = {
              'hostname': socket.gethostname(),
              'lock': int( time.time() ),
              'wms_jobid': wms_jobid
          }
          token.update(lock_content)
          return token
  • Save the script and tar picas library again to ship it in your sandbox

Note: if the token is fetched by a non-Grid job (e.g. on trui), the value of 'wms_jobid' will be set to 'null' once the token gets locked. Otherwise, the value is set to the Grid job URL.

Proxy renewal on trui

Trui has a mechanism for automatic proxy renewal. This mechanism depends on the validity time of the registered certificate(s). Every time the certificate is renewed (and installed in .globus), the following command has to be executed once from trui to trigger the proxy renewal mechanism:

start_proxy_renewal

Grid storage graphs

ECMWF download

The proposed solution to transfer data from the ECMWF cluster directly to the tropomi Grid storage requires the following steps:

  • Create a shared link
  • Install a webdav client
  • Transfer data to dCache

Create a shared link

A shared link can be created on dCache with a method called macaroons. Macaroons are bearer tokens that authorize someone to access certain directories or files on dCache without a Grid certificate. A macaroon may contain caveats that limit access. Such caveats can be based on the data path, the activities that may be performed with the data (list, download, upload, etc.), the IP address of the client, or a maximum validity period. To make it easy to obtain a macaroon, we've created a script called get-macaroon that you can run from our UIs (trui, spider-login, etc.).

We will create a shared link for the ecmwf directory on the tropomi Grid storage that will be valid for one year to allow automation of data transfers. You can create a macaroon based on your existing authentication, which is proxy authentication in the case of tropomi. Please note that the lifetime of your proxy does not limit the lifetime of the macaroon, and you can create multiple macaroons with different permissions, paths, validity times, etc.

Here is an example on how to create a macaroon:

  • Login to a UI where you have a valid proxy, e.g. trui
ssh username@trui.grid.surfsara.nl
  • Create the macaroon by authenticating with your proxy. Run this command:
get-macaroon --url https://webdav.grid.surfsara.nl:2883/pnfs/grid.sara.nl/data/projects.nl/tropomi/natalie/ --duration PT2H --chroot --proxy --permissions DOWNLOAD,UPLOAD,DELETE,MANAGE,LIST,READ_METADATA,UPDATE_METADATA --output rclone tokenfile_tropomi

What happened?
--url: the path to the Grid storage that will be shared
--duration: the lifetime of the macaroon; in this case it is 2 hours
--chroot: make the specified path in the url the root directory, hiding upper directories
--proxy: authenticate with the local personal proxy
--permissions: actions allowed for anyone holding this macaroon on the specified url path
--output: when rclone is specified, it saves an rclone config file with the .conf extension. If you use the same name for different macaroons it will overwrite the file (e.g. tokenfile_tropomi.conf). To prevent overwriting the file, specify a different name in --output rclone <name>

If the command executes successfully, it creates two files locally:

  1. ~/macaroons.log
    This file is created for logging purposes and is updated every time you create a new macaroon. Inside this file you will find information about the macaroon identifier, its expiry date, the relevant path and more.

  2. ./tokenfile_tropomi.conf
    This file is created in the working directory and will be used in the next steps as our configuration file for the webdav transfers. Inside this file you will find the actual token. Keep it safe, as anyone holding this file can create the shared link and access the directory (and subdirectories) we specified upon creation, with the permissions set in our get-macaroon command.

Optionally paste the shared link in your browser to verify that everything worked. For this:

  • Open the ./tokenfile_tropomi.conf file and copy the token to the clipboard. The token is the value of the variable bearer_token (copy the long string after the equals sign).
  • Paste this in your browser and hit enter: https://webdav.grid.surfsara.nl:2883/?authz=PASTE-HERE-YOUR-TOKEN

Another way to verify your macaroon is with the command line client rclone. Try the following command on trui:

rclone --config=tokenfile_tropomi.conf ls tokenfile_tropomi:

It should list the contents of the directory we used in our macaroon path.

Play with a few macaroons and, once you feel confident, create the macaroon for our ecmwf transfers. Replace the path with the path to the directory where the ecmwf data will be stored and change the duration to --duration P365D to set a year-long lifetime for the token.
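For example, a year-long macaroon for a hypothetical ecmwf destination directory could be created like this, reusing the flags from the command above (adjust the path and output name to your setup):

get-macaroon --url https://webdav.grid.surfsara.nl:2883/pnfs/grid.sara.nl/data/projects.nl/tropomi/ecmwf/ --duration P365D --chroot --proxy --permissions DOWNLOAD,UPLOAD,DELETE,MANAGE,LIST,READ_METADATA,UPDATE_METADATA --output rclone tokenfile_ecmwf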

Install a webdav client

Only WebDAV clients support bearer tokens, for example curl, rclone and (read only) ordinary browsers such as Firefox. We recommend the rclone client for working with your files with macaroon authentication.

You can install rclone on any machine without root access from a precompiled binary. In order to transfer data from the ecmwf cluster, ask the local admins to install rclone system-wide, or install it in user space as follows:

curl -O https://downloads.rclone.org/rclone-current-linux-amd64.zip
unzip rclone-current-linux-amd64.zip
cd rclone-*-linux-amd64
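If you install it in user space, you will probably also want the unpacked binary on your PATH so that the test commands below work from any directory (an assumption about a user-space setup; alternatively invoke ./rclone directly):

export PATH=$(pwd):$PATH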

Test that the installation works with:

rclone --help
rclone --version

Transfer data to dCache

We have created a macaroon (shared link) for our Grid storage directory on a SURFsara UI and installed rclone on the ecmwf cluster to enable transfers. We are now ready to trigger data transfers between the ecmwf cluster and dCache without installing any part of the Grid toolkit. Here are the steps (they can be applied to any machine):

  1. Copy the token configuration file (e.g. tokenfile_tropomi.conf) to the ecmwf cluster. Make sure to keep this file safe and readable only by you.

  2. Start a data transfer from the ecmwf cluster to your macaroon directory on the Grid storage. You can either copy the files or sync the directories as a mirror:

# Create a source directory `./test-dir/` locally and add some test files in it
# Create a new directory in your destination directory
rclone --config=tokenfile_tropomi.conf mkdir tokenfile_tropomi:/test-dir
# Copy a single file from source to destination
rclone --config=tokenfile_tropomi.conf copyto ./test-dir/src-file tokenfile_tropomi:/test-dir/dest-file -P
# Check
rclone --config=tokenfile_tropomi.conf ls tokenfile_tropomi:/test-dir
# Copy all files from source directory to destination, skipping already copied. It will not remove files that exist in destination and not in source
rclone --config=tokenfile_tropomi.conf copy ./test-dir/ tokenfile_tropomi:/test-dir/ -P
# Check
rclone --config=tokenfile_tropomi.conf ls tokenfile_tropomi:/test-dir
# Make source and destination identical, modifying destination only. Destination is updated to match source and files that exist in destination and not in source will be removed
rclone --config=tokenfile_tropomi.conf sync ./test-dir/ tokenfile_tropomi:/test-dir/ -P
# Check
rclone --config=tokenfile_tropomi.conf ls tokenfile_tropomi:/test-dir

For further information on rclone commands, check out rclone copy and rclone sync.

SIF input download

We will use the ADA (Advanced dCache API) tool to retrieve the SIF band 6 data from dCache to Spider. ADA uses the dCache API and the webdav protocol to access and process your data on dCache from any platform and with various authentication methods.

In this example we will use a so-called 'macaroon' token to authenticate and work with dCache instead of a Grid proxy certificate, in order to simplify the process. This token has been created for you and you can find it in the following project space location on Spider:

ls -l /project/tropomi/Data/SIF/sif-band6.conf
#-rw-rw---- 1 tropomi-adanezi    tropomi-adanezi    782 Jun 22 17:40 sif-band6.conf

The token above can be used to list, download or stage any data under the Grid storage directory /pnfs/grid.sara.nl/data/projects.nl/tropomi/s5p/input_2/l1b/01. The specific token expires in 10 days (starting on 22/06/2020), which should be sufficient time to stage and download the data.

NB: In case you need a new token after this one expires, you can create it with the commands below:

#create a proxy on Spider
voms-proxy-init --voms projects.nl:/projects.nl/tropomi
#create a new token to download data from Grid storage (optionally, change the duration to match the desired expiration time)
get-macaroon --url https://webdav.grid.surfsara.nl:2883/pnfs/grid.sara.nl/data/projects.nl/tropomi/s5p/input_2/l1b/01 --proxy --chroot --duration P15D --permissions DOWNLOAD,LIST --output rclone sif-band6

This command generates the token file "sif-band6.conf".

Let's list our files on dCache using the token. Note that the root directory / in the commands below points to the Grid storage path that we used to generate the macaroon token, /pnfs/grid.sara.nl/data/projects.nl/tropomi/s5p/input_2/l1b/01:

#browse to the directory where the token is located (or provide the path to the tokenfile in the `ada` commands that follow)
cd /project/tropomi/Data/SIF/

#list the directory where our token allows permissions
ada --tokenfile sif-band6.conf --list /
#2017/
#2018/
#2019/
#2020/

#list with more details
ada --tokenfile sif-band6.conf --longlist /

#list a subfolder
ada --tokenfile sif-band6.conf --longlist /2017

Let's find if there are any files on tape that need to be staged:

#check all subdirectories
ada --tokenfile sif-band6.conf --longlist /2017/ | grep ' NEARLINE' 
ada --tokenfile sif-band6.conf --longlist /2018/ | grep ' NEARLINE' 
ada --tokenfile sif-band6.conf --longlist /2019/ | grep ' NEARLINE' 
ada --tokenfile sif-band6.conf --longlist /2020/ | grep ' NEARLINE' 

We see that there are some files in the "NEARLINE" state within each subdirectory that need to be staged before download. You can either stage the files one by one or in a batch from a file. Here are examples of both options.

Stage files one by one with ADA

List the properties of a tape file and submit a stage request:

ada --tokenfile sif-band6.conf --longlist /2020/S5P_OFFL_L1B_RA_BD8_20200105T163146_20200105T181316_11553_01_010000_20200105T200222.nc
#/2020/S5P_OFFL_L1B_RA_BD8_20200105T163146_20200105T181316_11553_01_010000_20200105T200222.nc  2093647083  2020-01-09 00:08 UTC  tape  NEARLINE
ada --tokenfile sif-band6.conf --stage /2020/S5P_OFFL_L1B_RA_BD8_20200105T163146_20200105T181316_11553_01_010000_20200105T200222.nc
#success
#if we list the file again we see that the staging is triggered: tape→disk+tape
ada --tokenfile sif-band6.conf --longlist /2020/S5P_OFFL_L1B_RA_BD8_20200105T163146_20200105T181316_11553_01_010000_20200105T200222.nc
#/2020/S5P_OFFL_L1B_RA_BD8_20200105T163146_20200105T181316_11553_01_010000_20200105T200222.nc  2093647083  2020-01-09 00:08 UTC  tape→disk+tape  NEARLINE
#Once the file is staged the locality will change to "ONLINE_AND_NEARLINE"

Repeat this for all files on tape one by one, or better, stage the files together with the method below.

Stage files from file

Create a file that contains only the files on tape ("NEARLINE"):

ada --tokenfile sif-band6.conf --longlist /2017 | grep ' NEARLINE' | awk '{print "/2017/" $1}' >> file-to-stage
ada --tokenfile sif-band6.conf --longlist /2018 | grep ' NEARLINE' | awk '{print "/2018/" $1}' >> file-to-stage
ada --tokenfile sif-band6.conf --longlist /2019 | grep ' NEARLINE' | awk '{print "/2019/" $1}' >> file-to-stage
ada --tokenfile sif-band6.conf --longlist /2020 | grep ' NEARLINE' | awk '{print "/2020/" $1}' >> file-to-stage

List the files inside "file-to-stage" to verify their locality. This can take a while if the file contains many entries:

ada --tokenfile sif-band6.conf --longlist --from-file file-to-stage

Submit a stage request for all the files inside "file-to-stage":

ada --tokenfile sif-band6.conf --stage --from-file file-to-stage

The files will start staging from tape to disk. You can check their status using the ada --longlist command.

Once all files are "ONLINE_AND_NEARLINE", you can trigger the download on Spider with the following command:

#start a screen session
screen
#browse to the download folder
cd /project/tropomi/Data/SIF
#start downloading the data with rclone. The command below will trigger 4 parallel transfers
#rclone --config=sif-band6.conf copy sif-band6:/[source directory on Grid storage]/ [destination on Spider] -P
rclone --config=sif-band6.conf copy sif-band6:/2020/ ./sif-band6-folder -P
#you can stop and rerun the transfer command at will; it will skip already copied files

Once the download is complete, you can release the staged files with:

ada --tokenfile sif-band6.conf --unstage --from-file file-to-stage

Further, investigate other ADA commands with:

ada --help

Intel compiler

Using the Intel compiler on Softdrive

  • Set the environment on Softdrive as:
. /opt/intel/parallel_studio_xe_2017/psxevars.sh intel64

Using the Intel compiler on Spider

On Spider we have installed Intel Parallel Studio XE Cluster Edition for Linux 2019 on the login node, which is connected to our license server. In order to test access and set the environment to use the Intel compiler, run the command:

. /opt/intel/parallel_studio_xe_2019/bin/psxevars.sh intel64

Using GCC9

On Spider we have multiple GCC versions. GCC 4.8 is still the default compiler on the system, but you can switch to GCC 9 with the following commands:

scl enable devtoolset-9 bash
export PATH="/opt/rh/devtoolset-9/root/usr/bin:${PATH}"

PiCaS JSON error

When querying or changing a design view from the web interface of a database that contains too many documents, the action triggers a load of re-indexing processes and possible timeouts with the error "Error: SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data".

A possible solution is to purge the design document of the view that throws the error and recreate it. This has no impact on the tokens. Here is an example:

    1. Suppose that the affected views are tobiasb/todo and tobiasb/locked. We will then purge the design document "tobiasb" and recreate it. NB: Before deleting the views document, make sure that you are able to recreate it from your createviews scripts. If not, save the '_design/tobiasb' document locally, e.g. save the content of the views field of the document in a text editor.
    2. Purge the views document. This does not affect the tokens:
In [1]: import couchdb
   ...: import picasconfig
   ...: url=picasconfig.PICAS_HOST_URL
   ...: username=picasconfig.PICAS_USERNAME
   ...: password=picasconfig.PICAS_PASSWORD
   ...: dbname=picasconfig.PICAS_DATABASE
   ...: db = couchdb.Database(url + "/" + dbname)
   ...: db.resource.credentials = (username, password)
   ...:

In [2]: db
Out[2]: <Database u'tropomi_processing_2'>

In [3]: to_purge_design_docs = []

In [4]: to_purge_design_docs.append(db["_design/tobiasb"])

In [5]: to_purge_design_docs
   ...:
Out[5]: [<Document u'_design/tobiasb'@u'2-cb9b25e441bb9a495c0b9a42e70c5553' {u'language': u'javascript', u'views': {u'all': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n       if (doc.lock == 0 && doc.done == 0){\n          emit(\'todo\', [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n       }\n       if(doc.lock > 0 && doc.done == 0) {\n          emit(\'locked\', [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n       }\n       if(doc.lock > 0 && doc.done > 0 && parseInt(doc.exit_code) == 0) {\n          emit(\'done\', [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n       } \n       if(doc.lock > 0 && doc.done > 0 && parseInt(doc.exit_code) != 0&& parseInt(doc.exit_code) != 64) {\n          emit(\'error\', [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n       }\n       if(doc.lock > 0 && doc.done > 0 && parseInt(doc.exit_code) == 64) {\n          emit(\'empty\', [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n       }\n       if(doc.done < 0) {\n          emit(\'waiting\',[doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n       }\n   }\n}\n', u'reduce': u'function (key, values, rereduce) {\n    var stats = {number: 0};\n    if (rereduce) {\n        for(var i=0; i < values.length; i++) {\n            stats.number += values[i].number;\n        }\n        return stats;\n   }\n   stats.number=values.length\n   return stats;\n}\n'}, u'locked': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock > 0 && doc.done == 0) {\n      emit(doc._id, [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n    }\n  }\n}\n'}, u'waiting': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock == 0 && doc.done < 0) {\n      emit(doc._id, [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n    }\n  }\n}\n'}, u'done': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock > 0 && doc.done > 0  && parseInt(doc.exit_code) == 0) {\n      emit(doc._id, [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n    }\n  }\n}\n'}, u'error': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock > 0 && doc.done > 0 && parseInt(doc.exit_code) != 0 && parseInt(doc.exit_code) != 64) {\n      emit(doc._id, [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n    }\n  }\n}\n'}, u'todo': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock == 0 && doc.done == 0) {\n      emit(doc._id, doc._id);\n    }\n  }\n}\n'}, u'empty': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock > 0 && doc.done > 0 && parseInt(doc.exit_code) == 64) {\n      emit(doc._id, [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n    }\n  }\n}\n'}, u'deleteme': {u'map': u'function(doc) {\n   if(doc.type == "token" && (doc.algorithm=="ch4") && (doc.version_algorithm==9)  && (doc.user == "tobiasb" || doc.user =="tobiasb")) {\n    if(doc.lock > 0 && doc.done > 0  && parseInt(doc.exit_code) == 0) {\n      emit(doc._id, [doc.algorithm, doc.orbitid, doc.runid, doc.version_algorithm]);\n    }\n  }\n}\n'}}}>]

In [6]: db.purge(to_purge_design_docs)
   ...:

Out[6]:
{u'purge_seq': 159,
 u'purged': {u'_design/tobiasb': [u'2-cb9b25e441bb9a495c0b9a42e70c5553']}}
    3. Finally, recreate the purged design document either from your python client or from the web UI: press "New document" and add the following fields:
       _id: _design/tobiasb
       language: javascript
       views: <PASTE HERE THE CONTENT SAVED IN YOUR EDITOR in step 1>
    

Migration to Dirac

Dirac basics

  • Always enable the Dirac environment on the UI first, before using the Dirac commands. This line can be placed in your bashrc file after the project has been fully migrated:
source /etc/diracosrc
  • Create a Dirac proxy. This step will be automated on trui once the project is migrated:
dirac-proxy-init -b 2048 -g tropomi_user -M --valid 168:00
  • Submit a Dirac job and save the jobID in the jobIds local file:
dirac-wms-job-submit parametric.jdl -f jobIds
  • Check the status of a Dirac job or a list of jobs:
dirac-wms-job-status [jobID]
dirac-wms-job-status -f jobIds
  • Get the output of a Dirac job or a list of jobs:
dirac-wms-job-get-output [jobID]
dirac-wms-job-get-output -f jobIds

Dirac Extras

  • Get more detailed information for a particular job status:
dirac-wms-job-logging-info [jobID]
  • Delete a job or a list of jobs from the queue, if running it will be killed:
dirac-wms-job-delete [jobID]
dirac-wms-job-delete -f jobIds
  • Get the JDL file of a submitted job:
dirac-wms-job-get-jdl [jobID]
  • Track progress of submitted jobs:
dirac-wms-job-status -f jobIds  | awk '{print $5}' | sort | uniq --count 
  • Track amount of done jobs:
watch -n 60 'dirac-wms-job-status -f jobIds  | grep -c "Status=Done;"' 
  • Submit multiple jobs of a certain JDL type:
for i in {1..100}; do dirac-wms-job-submit parametric.jdl -f jobIds; done

Migration from glite to Dirac

1. Convert all of the JDLs

An example of the requirements needed in the new Dirac format is given below. The "run_ch4.jdl" would look like:

[
  Site = "GRID.SURF.nl";         #Job destination site
  NumberOfProcessors = 1;        #Number of cores on a single node, options: 1,2,4,or 8
  Tags = {"long"};               #Queue name, long default walltime is 96 hours

  Type = "Job";
  JobName = "Parametric";    
  ParameterStart=0;
  ParameterStep=1;
  Parameters=50;
  
  Executable = "/bin/sh";
  Arguments = "startpilot.sh ch4";
  Stdoutput = "parametricjob.out";
  StdError = "parametricjob.err";
  InputSandbox = {"sandbox/couchdb.tar","sandbox/picas.tar","sandbox/startpilot.sh","sandbox/pilot.py","sandbox/master.sh","sandbox/picasconfig.py","sandbox/run_tropomi_ch4.sh","sandbox/copy_tropomi_ch4.py","sandbox/geolocate_subset.py", "sandbox/plot_s5p_l2_ch4.py"};  
  OutputSandbox = {"parametricjob.out", "parametricjob.err"};
]

Please note that only the first three lines are new Dirac-specific requirements; all other requirement lines from the old JDLs should be removed.

2. Adjust cron jobs

The job submission commands need to be converted from glite to Dirac in all scripts, including the cron jobs.
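For example, based on the submission commands shown earlier on this page, a submission line in a script or crontab changes roughly as follows (sketch; adjust the JDL and jobIds file names to your setup):

# old glite submission
glite-wms-job-submit -d $USER -o jobIds fractals.jdl
# new Dirac submission
dirac-wms-job-submit fractals.jdl -f jobIds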

3. Update the function handling the JobID on PiCaS

In order to save the Dirac job ID when a token is locked, follow these steps:

  • In your picas library (picas.tar) edit the script modifiers.py:

    • Add this import on top: import os
    • Find the function lock() of the BasicTokenModifier class and replace the following line:
    wms_jobid = environ.get("GLITE_WMS_JOBID") 
    # replace with the following command to fetch the Dirac job ID
    wms_jobid = os.popen('cat job.info | grep JobID | cut -f 2 -d "="').read().strip()
  • Save the script and tar picas library again to ship it in your sandbox

Monitor Dirac jobs from web browser

  • Open the following link from your Firefox browser where your Grid certificate is installed: https://dirac.surfsara.nl/
  • From the left pane select: Applications -> Job Monitor
  • In the new page that is displayed:
    • To get all tropomi jobs status select : OwnerGroup -> tropomi_user
    • To get your own jobs status select: Owner -> [your_username]
  • Click "Submit" on the bottom left page

Migration to Couchdb3

The PiCaS server has been migrated from CouchDB version 1.7.1 to CouchDB version 3.3.2. The new CouchDB version offers new features and optimisations. A summary of some changes that may affect user functionality can be found here.

Query the reduce function for a certain view

  • From the Web interface

In the old CouchDB 1 environment you had to browse into the view document and check the reduce box (see screenshot: design-views-before).

In the new CouchDB 3 environment the reduce function can be found with the steps shown in the screenshot design-views-after.

  • From command-line, the REST API:

The same result can be retrieved from the command line with the following curl command. In this example, the authentication details are read from a netrc file:

$ cat .netrc-picas-tropomi

machine picas-tropomi.grid.surfsara.nl
login USERNAME
password PASSWORD

$ curl --silent --netrc-file .netrc-picas-tropomi -X GET https://picas-tropomi.grid.surfsara.nl:6984/tropomi_input_2/_design/Monitor/_view/overview_total?group=true

{"rows":[
{"key":"todo","value":124895}
]}