Merge pull request #13 from UCSC-Treehouse/12_aglyle

Updates to documentation for initial setup - finally actually merging this

e-t-k authored Jan 10, 2019
2 parents f072781 + 7d711e0 commit 9e028eb

# Treeshop Cluster Processing

To process multiple samples through the [Treehouse pipelines Makefile](https://github.com/UCSC-Treehouse/pipelines/blob/master/Makefile) we use [docker-machine](https://docs.docker.com/machine/overview/) to spin up a cluster of machines on Openstack and a simple [Fabric](http://www.fabfile.org/) file to control the compute.

## Requirements


## Getting Started

#### Installing docker-machine

From your home directory (type `cd` to get there), type:

    curl -L https://github.com/docker/machine/releases/download/v0.14.0/docker-machine-`uname -s`-`uname -m` > ~/docker-machine
    install ~/docker-machine ~/bin/docker-machine
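To check that the install worked (a quick sanity check that assumes `~/bin` is on your `PATH`):

    # Print the installed docker-machine version; if this fails, check that ~/bin is on your PATH
    docker-machine version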

Congratulations, you are now ready to set up your docker-machine.

### Set Up

Clone this repository:

    git clone https://github.com/UCSC-Treehouse/pipelines.git

Create the needed directory and navigate into the newly cloned repository:

    mkdir ~/.aws
    cd pipelines
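The purpose of `~/.aws` isn't spelled out here; if your site uses it to hold S3-style credentials for the pipelines, the conventional file format looks like the sketch below (placeholder values only, and only an assumption about your configuration):

    # Write placeholder credentials in the standard AWS format -- replace with your site's values
    cat > ~/.aws/credentials <<'EOF'
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
    EOF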

### Processing the test sample

Create folders that match the [Treehouse storage layout](https://github.com/UCSC-Treehouse/pipelines/blob/master/fabfile.py#L12):

    mkdir -p treeshop/primary/original/TEST treeshop/downstream

Copy the TEST fastq samples into the storage hierarchy:

    cp samples/*.fastq.gz treeshop/primary/original/TEST/

Spin up a single cluster machine (make sure you have created your SSH key):

    fab up
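If `fab up` complains about a missing SSH key, one way to generate one is sketched below (the key type and path are assumptions; check what your fabfile and docker-machine setup expect):

    # Generate an RSA key pair at the default location (skip this if ~/.ssh/id_rsa already exists)
    ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""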


Process the samples in manifest.tsv, with source and destination under the treeshop folder, sending log output to the console and to log.txt:

    fab process:manifest=manifest.tsv,base=treeshop 2>&1 | tee log.txt

Output:

    [10.50.102.245] Executing task 'process'
    Warning: run() received nonzero return code 1 while executing 'docker stop $(docker ps -a -q)'!
    Warning: run() received nonzero return code 1 while executing 'docker rm $(docker ps -a -q)'!
    [10.50.102.245] put: /scratch/username/pipelines/Makefile -> /mnt/Makefile
    10.50.102.245 processing TEST

    ...lots and lots of output...

    Done.

After this you should have the following under downstream:

    ├── methods.json
    └── mini.ann.vcf
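A quick, purely illustrative way to eyeball everything the run produced:

    # List all files the run placed under downstream
    find treeshop/downstream -type f | sort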

### Shut Down

After confirming that you successfully processed your data, you may want to shut down your docker-machines.
This will free up resources and space for other users.

To shut down all docker-machines type:

    fab down
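You can confirm that nothing is left running with docker-machine itself:

    # The Treeshop machines should no longer appear in this list
    docker-machine ls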

## Notes

Error output with respect to finding and copying files will be written to error.log. All of the output for all machines running in parallel will end up in log.txt. As a result, if there are internal errors in the pipelines, you'll need to sort through log.txt.
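One illustrative way to pull likely problems out of a large log.txt (adjust the patterns to taste):

    # Show numbered lines that look like errors or warnings
    grep -inE "error|warning|nonzero" log.txt | less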

Treeshop is a cheap and cheerful option for processing tens of samples up to about 100 at a time. Larger-scale projects will require a more sophisticated distributed computing approach. If you are not comfortable ssh'ing into various machines, running docker, and scp'ing results around, then you may want to find someone who is before trying Treeshop.

To set up multiple machines to process larger numbers of samples, you can give the `fab up` command a numeric argument.
For example, to spin up 5 machines, type:

    fab up:5

When processing multiple samples, you will need to format your manifest.tsv appropriately.
Each sample name needs to be placed on its own line.
For example:

    TEST1
    TEST2
    TEST3
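Equivalently, you could create that file from the shell (the sample IDs are placeholders):

    # Write one sample ID per line to the manifest
    printf "TEST1\nTEST2\nTEST3\n" > manifest.tsv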

The fabfile will automatically assign samples to the docker-machines.

WARNING: Running `fab process` will stop any docker containers currently running on your docker-machines before starting on the newly assigned samples.
Make sure your docker-machines have finished processing their samples before starting a new run.
Users comfortable with modifying commands may wish to restrict which machines are used to process samples via Fabric's hosts parameter ([Fabfile hosts](http://docs.fabfile.org/en/1.14/usage/execution.html#globally-via-the-command-line)).
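As a sketch of the idea only (not verified against this fabfile, since a fabfile can override Fabric's global host list), restricting a run to particular machines from the command line looks like:

    # Run the process task only against the listed host (the IP is a placeholder)
    fab --hosts=10.50.102.245 process:manifest=manifest.tsv,base=treeshop 2>&1 | tee log.txt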

While a run is in progress, `fab top` will show you which dockers are running on each machine. After an initial
delay copying the fastqs over, you should see the alpine container running (calculating md5s) and then rnaseq.
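If you would rather look at one machine directly, docker-machine can run a command over ssh for you (machine-01 is a placeholder name; `docker-machine ls` shows the real ones):

    # Show the containers currently running on a single cluster machine
    docker-machine ssh machine-01 docker ps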

The first sample on a fresh machine will cause all the dockers to be pulled, so later samples will be faster.
The fabfile also adds quite a bit of extra provenance by writing methods.json files as well as organizing output
per the Treehouse storage layout. That said, if you have some custom additional pipelines you want to
run, it's fairly easy to just add another target to the Makefile and then copy/paste inside of the
fabfile.py process method.
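If you do add a target, one low-risk way to preview it on a machine before wiring it into fabfile.py is a dry run over ssh (my-new-target is hypothetical, and this assumes the Makefile has already been copied to /mnt as shown in the log above):

    # make -n prints what the hypothetical target would run without executing anything
    docker-machine ssh machine-01 "cd /mnt && make -n my-new-target"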

#### Advanced options

Users seeking more information on using multiple fabfiles or other command-line options should visit the Fabric documentation: [Fabric options](http://docs.fabfile.org/en/1.14/usage/fab.html).

For more information on selectively shutting down docker-machines, review the docker-machine documentation: [docker-machine rm](https://docs.docker.com/machine/reference/rm/).
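For example, to remove a single machine by name rather than tearing down the whole cluster (the name is a placeholder):

    # List machines, then remove just one; -y skips the confirmation prompt
    docker-machine ls
    docker-machine rm -y machine-01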
