
Commit

Update documentation
npcarter committed Jan 16, 2025
1 parent 5c25ffd commit b749ced
Showing 5 changed files with 178 additions and 93 deletions.
1 change: 1 addition & 0 deletions index.html
@@ -96,6 +96,7 @@ <h1 class="websitetitle"><a href="https://eddyrivaslab.github.io/">Eddy and Riva
<h2>Available HOWTOs</h2>
<ul>
<li><a href="https://eddyrivaslab.github.io/pages/cluster-computing-in-the-eddy-and-rivas-labs.html">Eddy and Rivas Lab Cluster Resources and how to Access Them</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/leaving-the-lab.html">Leaving the Lab</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/modifying-this-website.html">Modifying This Website</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/my-jobs-arent-running.html">My Jobs Aren't Running</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/running-jobs-on-our-cluster.html">Running Jobs on Our RC Machines</a></li>
83 changes: 23 additions & 60 deletions pages/cluster-computing-in-the-eddy-and-rivas-labs.html
@@ -98,7 +98,7 @@ <h2>Overview</h2>
When you log in, that's where you'll land. You have 100GB of space
here. </p>
<p>Our <em>lab storage</em> is <code>/n/eddy_lab/</code>. We have 400TB of what RC calls
Tier 1 storage. </p>
Tier 1 storage, which is fast but expensive. </p>
<p>Both your home directory and our lab storage are backed up nightly to
what RC calls <em>snapshots</em>, and periodically to what RC calls <em>disaster
recovery</em> (DR) backups.</p>
@@ -109,45 +109,18 @@ <h2>Overview</h2>
machine using <code>samba</code>. (Warning: a samba mount is slow, and may
sometimes be flaky; don't rely on it except for lightweight tasks.)
Instructions are below.</p>
<p>RC also provides <em>shared scratch storage</em> for us in
<code>/n/holyscratch01/eddy_lab</code>. You have write access here, so at any
time you can create your own temp directory(s). Best practice is to
use a directory of your own, in
<code>/n/holyscratch01/eddy_lab/Users/&lt;username&gt;</code>. We have a 50TB
allocation. This space can't be remote mounted, isn't backed up, and
is automatically deleted after 90 days.</p>
<p>RC also provides <em>shared scratch storage</em>, which is very fast but not backed up. Files on the scratch storage that are older than 90 days are automatically deleted, and RC strongly frowns on playing tricks to make files look younger than they are. Because RC occasionally moves the scratch storage to different devices, the easiest way to access it is through the <code>&dollar;SCRATCH</code> variable, which is defined on all RC machines. Our lab has an eddy_lab directory on the scratch space with a 50TB quota, which contains a Users directory, so <code>&dollar;SCRATCH/eddy_lab/Users/yourusername</code> will point to your directory on the scratch space.<span class="marginnote">The Users directory was pre-populated with space for a set of usernames at some point in the past. If your username wasn't included, you'll have to email RC to get a directory created for you.</span></p>
<p>The scratch space is intended for temporary data, so is a great place to put input or output files from jobs, particularly if you intend to post-process your outputs to extract a smaller amount of data from them.</p>
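<p>For example, a sketch (not official RC instructions) of staging job data on scratch; <code>$USER</code> is the standard shell variable holding your username, and <code>myjob_tmp</code> and <code>big_input.fa</code> are placeholder names:</p>
<div class="highlight"><pre><span></span><code># confirm your per-user scratch directory exists
ls $SCRATCH/eddy_lab/Users/$USER

# stage a working directory and an input file for a job
mkdir -p $SCRATCH/eddy_lab/Users/$USER/myjob_tmp
cp big_input.fa $SCRATCH/eddy_lab/Users/$USER/myjob_tmp/
</code></pre></div>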
<p>You can read
<a href="https://docs.rc.fas.harvard.edu/kb/cluster-storage/">more documentation on how RC storage works</a>.</p>
<p>We have three compute partitions dedicated to our lab (the <code>-p</code>, for
partition, will make sense when you learn how to launch compute jobs
with the <code>slurm</code> scheduler):</p>
<ul>
<li>
<p><strong>-p eddy:</strong> 640 cores, 16 nodes (40 cores/node). We use this partition for most of
our computing.</p>
</li>
<li>
<p><strong>-p eddy_gpu:</strong>
4 GPU nodes [holyb0909,holyb0910,holygpu2c0923,holygpu2c1121].
Each holyb node has 4 <a href="https://www.nvidia.com/en-us/data-center/v100/">NVIDIA Tesla V100 NVLINK GPUs</a>
with 32G VRAM, 2 16-core Xeon CPUs, and 192G RAM [installed 2018].
Each holygpu2c node has 8 <a href="https://www.nvidia.com/en-us/data-center/a40/">NVIDIA Ampere A40 GPUs</a>
with 48G VRAM, 2 24-core Xeon CPUs, and 768G RAM [installed 2022].</p>
</li>
</ul>
<p>We are awaiting one more GPU node with 4 <a href="https://www.nvidia.com/en-us/data-center/hgx/">NVIDIA HGX A100 GPUs</a>
with 80G VRAM, 2 24-core AMD CPUs, and 1024G RAM [shipping expected Nov 2022].</p>
<p>We use this partition for GPU-enabled machine learning stuff, TensorFlow and the like.</p>
<ul>
<li><strong>-p eddy_hmmer:</strong> 576 cores in 16 nodes. These are older cores
(circa 2016). We use this partition for long-running or large jobs, to
keep them from getting in people's way on <code>-p eddy</code>.</li>
</ul>
<p>We are awaiting installation of another 1536 CPU cores (in 24 nodes,
64 cores/node) [expected fall 2022].</p>
<p>All of our lab's computing equipment is in the <code>eddy</code> partition, which contains 1,872 cores. Most of our machines have 8GB of RAM per core. In addition, we have three GPU-equipped machines in the partition: holygpu2c0923, holygpu2c1121, and holygpu7c0920.<span class="marginnote">The "holy" at the beginning of our machine names refers to their location in the Holyoke data center.</span></p>
<p>Each holygpu2c node has 8 <a href="https://www.nvidia.com/en-us/data-center/a40/">NVIDIA Ampere A40 GPUs</a>
with 48G VRAM [installed 2022]. </p>
<p>The holygpu7 node has 4 <a href="https://www.nvidia.com/en-us/data-center/hgx/">NVIDIA HGX A100 GPUs</a>
with 80G VRAM [installed 2023]. </p>
<p>We can also use Harvard-wide shared partitions on the RC cluster. <code>-p
shared</code> is 17,952 cores (in 375 nodes), for example. RC has
<a href="https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions">much more documentation on available partitions</a>.</p>
shared</code> is 19,104 cores (in 399 nodes), for example (as of Jan 2023). RC has
<a href="https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions">much more documentation on available partitions</a>. </p>
<h2>Accessing the cluster</h2>
<h3>logging on, first time</h3>
<ul>
@@ -197,20 +170,20 @@ <h3>configuring an ssh host alias</h3>

<p>You still have to authenticate by password and OpenAuth code, though.</p>
<h3>configuring single sign-on scp access</h3>
<p>It can get tedious to have to authenticate every time you <code>ssh</code> to RC,
especially if you're using ssh-based tools like <code>scp</code> to copy
individual files back and forth. You can streamline this using
<p>Even better, but a little more complicated: you can make it so you
only have to authenticate once, and every ssh or scp after that is
passwordless. To do this, I use
<a href="https://docs.rc.fas.harvard.edu/kb/using-ssh-controlmaster-for-single-sign-on/">SSH ControlMaster for single sign-on</a>,
to open a single <code>ssh</code> connection that you authenticate once, and all
subsequent <code>ssh</code>-based traffic to RC goes via that connection.</p>
<p>RC's
<a href="https://docs.rc.fas.harvard.edu/kb/using-ssh-controlmaster-for-single-sign-on/">instructions are here</a>
but briefly:</p>
<ul>
<li>Add another hostname alias to your <code>.ssh/config</code> file. Mine is
called <strong>odx</strong>:</li>
<li>Replace the above hostname alias in your <code>.ssh/config</code> file with
something like this:</li>
</ul>
<div class="highlight"><pre><span></span><code>Host odx
<div class="highlight"><pre><span></span><code>Host ody
User seddy
HostName login.rc.fas.harvard.edu
ControlMaster auto
@@ -221,19 +194,19 @@ <h3>configuring single sign-on scp access</h3>
<ul>
<li>Add some aliases to your <code>.bashrc</code> file:</li>
</ul>
<div class="highlight"><pre><span></span><code> <span class="nb">alias</span> odx-start<span class="o">=</span><span class="s1">&#39;ssh -Y -o ServerAliveInterval=30 -fN odx&#39;</span>
<span class="nb">alias</span> odx-stop<span class="o">=</span><span class="s1">&#39;ssh -O stop odx&#39;</span>
<span class="nb">alias</span> odx-kill<span class="o">=</span><span class="s1">&#39;ssh -O exit odx&#39;</span>
<div class="highlight"><pre><span></span><code> <span class="nb">alias</span> ody-start<span class="o">=</span><span class="s1">&#39;ssh -Y -o ServerAliveInterval=30 -fN ody&#39;</span>
<span class="nb">alias</span> ody-stop<span class="o">=</span><span class="s1">&#39;ssh -O stop ody&#39;</span>
<span class="nb">alias</span> ody-kill<span class="o">=</span><span class="s1">&#39;ssh -O exit ody&#39;</span>
</code></pre></div>

<p>Now you can launch a session with:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% odx-start</span><span class="w"></span>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% ody-start</span><span class="w"></span>
</code></pre></div>

<p>It'll ask you to authenticate. After you do this, all your ssh-based
commands (in any terminal window) will work without further
authentication. To stop the connection, do</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% odx-stop</span><span class="w"></span>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% ody-stop</span><span class="w"></span>
</code></pre></div>

<p>If you forget to stop it, no big deal, the connection will eventually
@@ -395,17 +368,7 @@ <h3>writing an sbatch script</h3>
format. An example that (stupidly) loads gcc and just calls
<code>hostname</code>, so the output will be the name of the compute node the
script ran on:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="ch">#!/bin/bash</span>
<div class="highlight"><pre><span></span><code><span class="ch">#!/bin/bash</span>
<span class="c1">#SBATCH -c 1 # Number of cores/threads</span>
<span class="c1">#SBATCH -N 1 # Ensure that all cores are on one machine</span>
<span class="c1">#SBATCH -t 6-00:00 # Runtime in D-HH:MM</span>
@@ -416,7 +379,7 @@ <h3>writing an sbatch script</h3>

module load gcc
hostname
</code></pre></div></td></tr></table></div>
</code></pre></div>

<p>Save this to a file (<code>foo.sh</code> for example) and submit it with <code>sbatch</code>:</p>
<div class="highlight"><pre><span></span><code> sbatch foo.sh
140 changes: 140 additions & 0 deletions pages/leaving-the-lab.html
@@ -0,0 +1,140 @@
<!DOCTYPE html>
<html lang="en"
xmlns:og="http://ogp.me/ns#"
xmlns:fb="https://www.facebook.com/2008/fbml">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="A cast of tens">

<title>Leaving the Lab | Eddy and Rivas Labs Resource Page</title>

<link rel="canonical" href="https://eddyrivaslab.github.io/pages/leaving-the-lab.html">

<link href="/theme/css/tufte.css" rel="stylesheet">
<link href="/theme/css/latex.css" rel="stylesheet">


<!-- Haddock syntax highlighting -->
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
code > span.kw { color: #0000ff; } /* Keyword */
code > span.ch { color: #008080; } /* Char */
code > span.st { color: #008080; } /* String */
code > span.co { color: #008000; } /* Comment */
code > span.ot { color: #ff4000; } /* Other */
code > span.al { color: #ff0000; } /* Alert */
code > span.er { color: #ff0000; font-weight: bold; } /* Error */
code > span.wa { color: #008000; font-weight: bold; } /* Warning */
code > span.cn { } /* Constant */
code > span.sc { color: #008080; } /* SpecialChar */
code > span.vs { color: #008080; } /* VerbatimString */
code > span.ss { color: #008080; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { } /* Variable */
code > span.cf { color: #0000ff; } /* ControlFlow */
code > span.op { } /* Operator */
code > span.bu { } /* BuiltIn */
code > span.ex { } /* Extension */
code > span.pp { color: #ff4000; } /* Preprocessor */
code > span.do { color: #008000; } /* Documentation */
code > span.an { color: #008000; } /* Annotation */
code > span.cv { color: #008000; } /* CommentVar */
code > span.at { } /* Attribute */
code > span.in { color: #008000; } /* Information */
</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>

<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->

<!-- <link rel="apple-touch-icon-precomposed" sizes="57x57" href="/theme/apple-touch-icon-57x57.png" />
<link rel="apple-touch-icon-precomposed" sizes="114x114" href="/theme/apple-touch-icon-114x114.png" />
<link rel="apple-touch-icon-precomposed" sizes="72x72" href="/theme/apple-touch-icon-72x72.png" />
<link rel="apple-touch-icon-precomposed" sizes="144x144" href="/theme/apple-touch-icon-144x144.png" />
<link rel="apple-touch-icon-precomposed" sizes="120x120" href="/theme/apple-touch-icon-120x120.png" />
<link rel="apple-touch-icon-precomposed" sizes="152x152" href="/theme/apple-touch-icon-152x152.png" />
<link rel="icon" type="image/png" href="/theme/favicon-32x32.png" sizes="32x32" />
<link rel="icon" type="image/png" href="/theme/favicon-16x16.png" sizes="16x16" />
<meta name="application-name" content="Scorecard Diplomacy"/>
<meta name="msapplication-TileColor" content="#FFFFFF" />
<meta name="msapplication-TileImage" content="/theme/mstile-144x144.png" /> -->

<script src="//use.typekit.net/vpe6zfl.js"></script>
<script>try{Typekit.load({ async: true });}catch(e){}</script>

</head>
<body class="">
<div class="wrap">
<header>
<nav class="group">
<h1 class="websitetitle"><a href="https://eddyrivaslab.github.io/">Eddy and Rivas Labs Resource Page</a></h1>
</nav>
<nav class="bottom-group">
</nav>
</header>
<article>
<h1>Leaving the Lab</h1>

<p>All good things must end, and that includes everyone's time in the
Eddy lab. To help make sure that all your work is preserved after you
leave, and to help keep us out of trouble with Harvard and our funding
agencies, please work through the following checklist as you get ready to
leave:</p>
<ol>
<li>
<p>Copy any relevant data/code/results from your RC home directory,
laptop, or other computing devices into your directory under
/n/eddy_lab/users.</p>
</li>
<li>
<p>Work with your mentor/supervisor to determine which of your
work/data needs to stay accessible after you leave, and copy/transfer
that data somewhere group-accessible, such as /n/eddy_lab/data.</p>
</li>
<li>
<p>If you have a lab laptop or other computing device, back it up to
an external hard disk.</p>
</li>
<li>
<p>The last thing you should do on the research computing machines is
to cd to /n/eddy_lab/users and run
/n/eddy_lab/software/prep_for_archive.py &lt;yourusername&gt; (the exact commands are sketched after this list). This will go
through your home directory and mark everything in it accessible to
the entire lab. It will then move your home directory under
/n/eddy_lab/Lab and create a tarfile of your data there. Doing this
will make it easy for us to archive your data to tape so that it costs
less to maintain but is still accessible if we need it.</p>
</li>
<li>
<p>Return any lab "loaner" laptops to Nick. Turn in lab-owned
computing equipment, including any backup disks, to Ariane or her
successor. Turn in any notebooks to Ariane as well
so that they can be scanned to satisfy Harvard's research retention
rules.</p>
</li>
<li>
<p>Enjoy the rest of your life, and come back to visit!</p>
</li>
</ol>
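<p>Concretely, that last step on the RC machines looks something like this (a sketch; replace <code>yourusername</code> with your own RC username):</p>
<pre><code>cd /n/eddy_lab/users
/n/eddy_lab/software/prep_for_archive.py yourusername
</code></pre>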

</article>
<footer>Powered by <a href="https://getpelican.com/">Pelican</a>. Site theme is a modified version of <a href="https://github.com/andrewheiss/ath-tufte-pelican">ath-tufte-pelican</a>.</footer>
</div>
</body>
</html>