From 227a87325bb1ff6f1238c4814609842476e67883 Mon Sep 17 00:00:00 2001
From: James Corbett <corbett8@llnl.gov>
Date: Fri, 1 Nov 2024 23:17:48 -0700
Subject: [PATCH 1/3] coral2: update rabbit JGF generation instructions

Problem: the instructions for generating rabbit JGF are out of date.

Update them to describe the new way of generating JGF using the
rabbitmapping script and resources defined in the config file.
---
 tutorials/lab/coral2.rst | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/tutorials/lab/coral2.rst b/tutorials/lab/coral2.rst
index 8837264..0eebd04 100644
--- a/tutorials/lab/coral2.rst
+++ b/tutorials/lab/coral2.rst
@@ -154,18 +154,22 @@ on the same node as the rank 0 broker of the system instance
 a kubeconfig file in its home directory granting it read
 and write access to, at a minimum, ``Storages``, ``Workflows``,
 ``Servers``, and ``Computes`` resources (all of which are defined by
-dataworkflowservices).
+dataworkflowservices). There are instructions for how to grant Flux
+the minimum permissions necessary by setting up role-based access control
+`here <https://nearnodeflash.github.io/latest/guides/rbac-for-users/readme/#rbac-for-workload-manager-wlm>`_.
 
 Lastly, the Fluxion scheduler must be configured to recognize rabbit
-resources. This can be done by taking ``R`` for the cluster (see the
-"Configuring Resources" section of the Flux Administrator's guide)
-and piping it to ``flux dws2jgf`` like so:
+resources. This can be done by generating a file describing the rabbit layout
+for the cluster and then running ``flux dws2jgf`` like so:
 
 .. code-block:: bash
 
-    cat /etc/flux/system/R | flux dws2jgf [--no-validate] [--cluster-name=CLUSTER_NAME] > new_R
+    flux rabbitmapping > /tmp/rabbitmapping.json
+    flux dws2jgf [--no-validate] --from-config /etc/flux/system/conf.d/resource.toml --only-sched /tmp/rabbitmapping.json
 
-The output (which may be large) must replace the old ``R`` for the cluster.
+The output (which may be large) must be saved to a file and pointed to with the
+``resource.scheduling`` config key (see
+`here <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man5/flux-config-resource.html#keys>`_).
 
 In order to facilitate Fluxion restart when using this new JGF
 (as it is called), Fluxion must be configured to use a ``match-format``

From 82a9cc6a9530edc8def91a5e112f21ff84623564 Mon Sep 17 00:00:00 2001
From: James Corbett <corbett8@llnl.gov>
Date: Sat, 2 Nov 2024 10:31:43 -0700
Subject: [PATCH 2/3] rabbit: move config to separate file

Problem: the rabbit config information deserves its own file, rather
than sharing a file with CORAL2 MPI config.

Move it.
---
 tutorials/lab/coral2.rst        | 49 --------------------------------
 tutorials/lab/index.rst         |  1 +
 tutorials/lab/rabbit_config.rst | 50 +++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 49 deletions(-)
 create mode 100644 tutorials/lab/rabbit_config.rst

diff --git a/tutorials/lab/coral2.rst b/tutorials/lab/coral2.rst
index 0eebd04..bd7eb3f 100644
--- a/tutorials/lab/coral2.rst
+++ b/tutorials/lab/coral2.rst
@@ -132,52 +132,3 @@ in the PMI bootstrapping environment.  When this happens it may be useful to:
   0: libpmi2: barrier: success
   1: libpmi2: finalize: success
   0: libpmi2: finalize: success
-
------------------------------
-Configuring Flux with Rabbits
------------------------------
-
-In order for a Flux system instance to be able to allocate
-rabbit storage, the ``dws_jobtap.so`` plugin must be loaded.
-The plugin can be loaded in a  config file like so:
-
-.. code-block::
-
-    [job-manager]
-    plugins = [
-      { load = "dws-jobtap.so" }
-    ]
-
-Also, the ``flux-coral2-dws`` systemd service must be started
-on the same node as the rank 0 broker of the system instance
-(i.e. the management node). The ``flux`` user must have
-a kubeconfig file in its home directory granting it read
-and write access to, at a minimum, ``Storages``, ``Workflows``,
-``Servers``, and ``Computes`` resources (all of which are defined by
-dataworkflowservices). There are instructions for how to grant Flux
-the minimum permissions necessary by setting up role-based access control
-`here <https://nearnodeflash.github.io/latest/guides/rbac-for-users/readme/#rbac-for-workload-manager-wlm>`_.
-
-Lastly, the Fluxion scheduler must be configured to recognize rabbit
-resources. This can be done by generating a file describing the rabbit layout
-for the cluster and then running ``flux dws2jgf`` like so:
-
-.. code-block:: bash
-
-    flux rabbitmapping > /tmp/rabbitmapping.json
-    flux dws2jgf [--no-validate] --from-config /etc/flux/system/conf.d/resource.toml --only-sched /tmp/rabbitmapping.json
-
-The output (which may be large) must be saved to a file and pointed to with the
-``resource.scheduling`` config key (see
-`here <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man5/flux-config-resource.html#keys>`_).
-
-In order to facilitate Fluxion restart when using this new JGF
-(as it is called), Fluxion must be configured to use a ``match-format``
-of ``rv1`` instead of the otherwise recommended default of ``rv1_nosched``.
-
-For example, in a config file:
-
-.. code-block:: toml
-
-    [sched-fluxion-resource]
-    match-format = "rv1"
diff --git a/tutorials/lab/index.rst b/tutorials/lab/index.rst
index 884517c..4a8d251 100644
--- a/tutorials/lab/index.rst
+++ b/tutorials/lab/index.rst
@@ -14,4 +14,5 @@ provided there are for collaborating systems.
    coral
    coral2
    rabbit
+   rabbit_config
 
diff --git a/tutorials/lab/rabbit_config.rst b/tutorials/lab/rabbit_config.rst
new file mode 100644
index 0000000..b56d9a0
--- /dev/null
+++ b/tutorials/lab/rabbit_config.rst
@@ -0,0 +1,50 @@
+.. _rabbitconfig:
+
+=============================
+Configuring Flux with Rabbits
+=============================
+
+In order for a Flux system instance to be able to allocate
+rabbit storage, the ``dws_jobtap.so`` plugin must be loaded.
+The plugin can be loaded in a  config file like so:
+
+.. code-block::
+
+    [job-manager]
+    plugins = [
+      { load = "dws-jobtap.so" }
+    ]
+
+Also, the ``flux-coral2-dws`` systemd service must be started
+on the same node as the rank 0 broker of the system instance
+(i.e. the management node). The ``flux`` user must have
+a kubeconfig file in its home directory granting it read
+and write access to, at a minimum, ``Storages``, ``Workflows``,
+``Servers``, and ``Computes`` resources (all of which are defined by
+dataworkflowservices). There are instructions for how to grant Flux
+the minimum permissions necessary by setting up role-based access control
+`here <https://nearnodeflash.github.io/latest/guides/rbac-for-users/readme/#rbac-for-workload-manager-wlm>`__.
+
+Lastly, the Fluxion scheduler must be configured to recognize rabbit
+resources. This can be done by generating a file describing the rabbit layout
+for the cluster and then running ``flux dws2jgf`` like so:
+
+.. code-block:: bash
+
+    flux rabbitmapping > /tmp/rabbitmapping.json
+    flux dws2jgf [--no-validate] --from-config /etc/flux/system/conf.d/resource.toml --only-sched /tmp/rabbitmapping.json
+
+The output (which may be large) must be saved to a file and pointed to with the
+``resource.scheduling`` config key (see
+`here <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man5/flux-config-resource.html#keys>`__).
+
+In order to facilitate Fluxion restart when using this new JGF
+(as it is called), Fluxion must be configured to use a ``match-format``
+of ``rv1`` instead of the otherwise recommended default of ``rv1_nosched``.
+
+For example, in a config file:
+
+.. code-block:: toml
+
+    [sched-fluxion-resource]
+    match-format = "rv1"

From 0605fb365e58d8c4d664fa6e928b49ceaf44d615 Mon Sep 17 00:00:00 2001
From: James Corbett <corbett8@llnl.gov>
Date: Sat, 2 Nov 2024 10:51:56 -0700
Subject: [PATCH 3/3] rabbit: add documentation on rabbit config table

Problem: there is no documentation on rabbit configuration with
a config file.

Add it.
---
 auto_examples/auto_examples_jupyter.zip | Bin 4225 -> 4225 bytes
 auto_examples/auto_examples_python.zip  | Bin 1596 -> 1596 bytes
 tutorials/lab/rabbit_config.rst         |  66 +++++++++++++++++++++++-
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/auto_examples/auto_examples_jupyter.zip b/auto_examples/auto_examples_jupyter.zip
index 13b1be2ae51f829c333edc7d382e2f8db3ef981b..a4e9a8acf7a42149ce1f1fcea16fd901f3c65a95 100644
GIT binary patch
delta 38
pcmZovY*ge6@MdNaVE}>Y)A++T^2rOZfEbev1tb`kO!g8m2LPV#2-W}q

delta 38
ocmZovY*ge6@MdNaVE_S{{OYiceDVS;AjV`v0SU$}lf4AY0hR9vcmMzZ

diff --git a/auto_examples/auto_examples_python.zip b/auto_examples/auto_examples_python.zip
index 6ef22f8a13e585eb679201c7d4160554fee920c0..fa74302a4108d0c6b963e499d165a1939d40a19a 100644
GIT binary patch
delta 43
ucmdnPvxkQ-z?+#xgaHJmPvZ~Y$aj~O1;m*Ao>hW*Nh8BVX_?7tY+?ZPZwrI~

delta 43
ucmdnPvxkQ-z?+#xgaHI(@~gu(^4(=+0Wl`OXO&>y(#SATT4u5un-~D;nhPcX

diff --git a/tutorials/lab/rabbit_config.rst b/tutorials/lab/rabbit_config.rst
index b56d9a0..9cb7afd 100644
--- a/tutorials/lab/rabbit_config.rst
+++ b/tutorials/lab/rabbit_config.rst
@@ -6,7 +6,7 @@ Configuring Flux with Rabbits
 
 In order for a Flux system instance to be able to allocate
 rabbit storage, the ``dws_jobtap.so`` plugin must be loaded.
-The plugin can be loaded in a  config file like so:
+The plugin can be loaded in a config file like so:
 
 .. code-block::
 
@@ -48,3 +48,67 @@ For example, in a config file:
 
     [sched-fluxion-resource]
     match-format = "rv1"
+
+Rabbit Config Options
+---------------------
+
+The ``rabbit`` config table captures site-general policies and options for
+Flux's interactions with the rabbits.
+
+
+**kubeconfig** (string)
+  (optional) Path to kubeconfig file for Flux to use, ideally with restricted permissions.
+  This can be left undefined if the file is placed at the path ``~flux/.kube/config``
+  (assuming the ``flux`` user is the instance owner).
+
+**tc_timeout** (integer)
+  (optional) Time in seconds to tolerate a workflow stuck in TransientCondition state
+  before killing the associated job. Defaults to 10 seconds.
+
+**drain_compute_nodes** (boolean)
+  (optional) Whether to automatically drain compute nodes that lose PCIe connection
+  with their rabbit. Defaults to true.
+
+**save_datamovements** (integer)
+  (optional) Number of ``nnfdatamovement`` resources to save to jobs' KVS. This may be
+  useful for debugging, but saving too many may degrade performance. Defaults to 0.
+
+**restrict_persistent_creation** (boolean)
+  (optional) Restrict the creation of persistent file systems to the instance owner
+  (in most cases the ``flux`` user).
+
+**policy.maximums** (table)
+  (optional) The maximum filesystem capacity per node, in GiB, that users may
+  request. Leave undefined for no limit. See below for an example.
+
+**presets** (table)
+  (optional) Defines preset #DW strings. Presets can save users time and effort by
+  allowing them to run, for instance, ``flux alloc -N1 -S dw=NAME`` rather than
+  ``flux alloc -N1 -S "dw=#DW jobdw ..."``. See below for an example.
+
+
+Example
+~~~~~~~
+
+.. code-block:: toml
+
+    [rabbit]
+
+    kubeconfig = "/var/flux/.kube/config"
+    tc_timeout = 600
+    drain_compute_nodes = true
+    save_datamovements = 5
+    restrict_persistent_creation = true
+
+    # maximum filesystem capacity per node, in GiB
+    [rabbit.policy.maximums]
+    xfs = 1024
+    gfs2 = 2048
+    raw = 4096
+    lustre = 1024
+
+    # defines preset #DW strings
+    [rabbit.presets]
+
+    small_xfs = "#DW jobdw type=xfs capacity=100GiB name=smallxfs"
+    large_lustre = "#DW jobdw type=lustre capacity=50TiB name=largelustre"