update the examples for SDK version 1.2.0

Cerebras · Jul 11, 2024 · commit b7edf77 · 1 parent 66a1eb6
Showing 185 changed files with 9,872 additions and 2,034 deletions.
diff --git a/README.rst b/README.rst
@@ -1,7 +1,7 @@
 CSL Examples
 ============
 
-Register for access to the Cerebras SDK `here <https://www.cerebras.net/homepage-landing/developers/sdk-request/>`_.
+Register to access the Cerebras SDK `here <https://www.cerebras.net/homepage-landing/developers/sdk-request/>`_.
 Documentation for the SDK can be found `here <https://sdk.cerebras.net>`_.
 
 This repository contains examples of CSL code. Each example has the following
@@ -35,7 +35,7 @@ This is the place to start.
 There are 10 tutorials which teach basic CSL language features and
 ``SdkRuntime`` host runtime features by building up an increasingly
 complex code to compute a GEMV.
-There are an additional 12 tutorial examples which illustrate
+There are an additional 15 tutorial examples which illustrate
 specific language features,
 and 3 tutorial examples which build an increasingly complex
 pipelined computation.
@@ -95,9 +95,9 @@ Branches
 
 For each release of the SDK, there is a corresponding release tag in this
 repository which contains a version of the CSL examples which are compatible
-with that SDK release. For example, the tag ``rel-sdk-1.1.0`` in this
+with that SDK release. For example, the tag ``rel-sdk-1.2.0`` in this
 repository contains a version of the CSL examples which will work (compile and
-simulate) with the SDK 1.1.0 release. The ``master`` branch is identical to the
+simulate) with the SDK 1.2.0 release. The ``master`` branch is identical to the
 newest release.
 
 Full backward compatibility of the SDK is not guaranteed.

diff --git a/RELEASE-NOTES.rst b/RELEASE-NOTES.rst
@@ -4,6 +4,25 @@ Release Notes
 The following are the release notes for the CSL Examples repository,
 ``csl-examples``.
 
+Version 1.2.0
+-------------
+
+- The examples are improved and updated to comply with the SDK version 1.2.0.
+
+- All tutorial example programs have been updated to support WSE-3.
+
+- Two new example programs for switches, demonstrating use of the
+  ``<control>`` library, have been added.
+
+- A new example program demonstrating the ``<simprint>`` library has been
+  added.
+
+- ``wide-multiplication``, ``residual``, ``mandelbrot``,
+  ``gemv-collectives_2d``, ``gemv-checkerboard-pattern``,
+  ``gemm-collectives_2d``, ``stencil-3d-7pts``, ``bicgstab``,
+  ``conjugateGradient``, ``preconditionedConjugateGradient``, and
+  ``powerMethod`` programs have been updated to support WSE-3.
+
 Version 1.1.0
 -------------
 

diff --git a/benchmarks/25-pt-stencil/README.rst b/benchmarks/25-pt-stencil/README.rst
@@ -0,0 +1,66 @@
+25-Point Stencil
+================
+
+The stencil code is a time-marching app, requiring the following three inputs:
+
+- scalar ``iterations``: number of time steps
+- tensor ``vp``: velocity field
+- tensor ``source``: source term
+
+and producing the following three outputs:
+
+- maximum and minimum values of the vector field at the last time step, two f32 per PE
+- timestamps of the time marching, three uint32 per PE
+- vector field ``z`` at the last time step, ``zdim`` f32 per PE
+
+The stencil code uses 21 colors and task IDs for communication patterns,
+and ``SdkRuntime`` reserves 6 colors,
+so only 4 colors are left for ``streaming`` H2D/D2H transfers
+and some entrypoints for control flow.
+We use one color (color 0) to launch kernel functions
+and one entrypoint (color 2) to trigger the time marching.
+The ``copy`` mode of memcpy is used for two inputs and two outputs.
+
+After the simulator (or WSE) has been launched,
+we first send input tensors ``vp`` and ``source`` to the device via ``copy`` mode.
+
+Second, we launch time marching with the argument ``iterations``.
+
+In this example, we have two kernel launches.
+One performs time marching after ``vp`` and ``source`` are received,
+and the other prepares the output data ``zValues``.
+The former has the function symbol ``f_activate_comp``
+and the latter has the function symbol ``f_prepare_zout``.
+Here ``SdkRuntime.launch()`` triggers a host-callable function, in which
+the first argument is the function symbol ``f_activate_comp``,
+and the second argument is ``iterations``,
+which is received as an argument by ``f_activate_comp``.
+
+At the end of time marching, ``f_checkpoint()`` (in ``task.csl``)
+records the maximum and minimum values
+of the vector field and timing info into an array ``d2h_buf_f32``.
+The host calls ``memcpy_d2h()`` to receive the data in ``d2h_buf_f32``.
+
+To receive the vector field of the last time step,
+the function ``f_prepare_zout()`` is called by ``SdkRuntime.launch()``
+to prepare this data into a temporary array ``zout``,
+because the result is in either ``zValues[0, :]`` or ``zValues[1, :]``.
+
+The last operation, ``memcpy_d2h()``, sends the array ``zout`` back to the host.
+
+When ``f_activate_comp`` is launched, it triggers the entrypoint ``f_comp()``
+to start the time-marching and to record the starting time.
+
+At the end of time marching, the function ``epilog()`` checks
+``iterationCount``.
+If it reaches the given ``iterations``, ``epilog()`` triggers the entrypoint
+``CHECKPOINT`` to prepare the data for the first ``memcpy_d2h()``.
+
+The function ``f_checkpoint()`` calls ``unblock_cmd_stream()`` to process the
+next operation which is the first ``memcpy_d2h()``.
+Without ``unblock_cmd_stream()``, the program stalls because the
+``memcpy_d2h()`` is never scheduled.
+
+The function ``f_prepare_zout()`` prepares the vector field into ``zout``.
+It also calls ``unblock_cmd_stream()`` to process the next operation, which is
+the second ``memcpy_d2h()``.
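Note on the host flow described in this README: the order of the host operations, and the reason `unblock_cmd_stream()` matters, can be illustrated with a toy model. The sketch below is not the `SdkRuntime` API. `ToyCommandStream`, `host_sequence` and the `unblocks` table are hypothetical names that only record which commands get scheduled, assuming (as the README states) that a launched function must unblock the command stream before the next host command can run.

```python
from collections import deque


class ToyCommandStream:
    """Toy model of an in-order host command stream (NOT the SdkRuntime API)."""

    def __init__(self, unblocks):
        # unblocks[name] is True if the launched function (or a task it
        # triggers) eventually calls unblock_cmd_stream().
        self.unblocks = unblocks
        self.pending = deque()
        self.scheduled = []
        self.blocked = False

    def enqueue(self, kind, name):
        self.pending.append((kind, name))

    def run(self):
        # Commands are scheduled in order until a launch leaves the stream blocked.
        while self.pending and not self.blocked:
            kind, name = self.pending.popleft()
            self.scheduled.append((kind, name))
            if kind == "launch":
                self.blocked = not self.unblocks[name]
        return self.scheduled


def host_sequence(unblocks):
    stream = ToyCommandStream(unblocks)
    stream.enqueue("memcpy_h2d", "vp")
    stream.enqueue("memcpy_h2d", "source")
    stream.enqueue("launch", "f_activate_comp")   # receives `iterations`
    stream.enqueue("memcpy_d2h", "d2h_buf_f32")   # min/max values and timestamps
    stream.enqueue("launch", "f_prepare_zout")
    stream.enqueue("memcpy_d2h", "zout")          # vector field of the last step
    return stream.run()


ok = host_sequence({"f_activate_comp": True, "f_prepare_zout": True})
stalled = host_sequence({"f_activate_comp": False, "f_prepare_zout": True})
print(len(ok), len(stalled))  # prints: 6 3
```

In the stalled case the first `memcpy_d2h` never appears in the schedule, which mirrors the stall the README warns about when `f_checkpoint()` does not call `unblock_cmd_stream()`.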
diff --git a/benchmarks/25-pt-stencil/cmd_parser.py b/benchmarks/25-pt-stencil/cmd_parser.py
@@ -0,0 +1,112 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This is not a real test, but a module that gets imported in other tests.
+
+"""parse command line for the 25-point stencil example
+
+   --name <str>          test name
+   --zDim <int>          size of the Z dimension
+   --size <int>          size of the domain in x and y dims
+   --iterations <int>    number of timesteps to simulate
+   --dx <int>            dx value (impacting the boundary)
+   --skip-compile, --skip-run            skip compilation or run from python
+   --fabric_width, --fabric_height <int> fabric size we are compiling for
+   --cmaddr <IP:port>    address of the CS system
+   --debug               debug flag (boolean)
+   --width-west-buf, --width-east-buf <int>  width of west or east buffer
+   --n_channels <int>    number of memcpy channels
+"""
+
+
+import argparse
+
+
+SIZE = 10
+ZDIM = 10
+ITERATIONS = 10
+DX = 20
+
+
+def parse_args():
+  parser = argparse.ArgumentParser()
+
+  parser.add_argument('--name', help='the test name')
+  parser.add_argument(
+            '--zDim', type=int, help='size of the Z dimension', default=ZDIM
+            )
+  parser.add_argument(
+            '--size', type=int, help='size of the domain in x and y dims', default=SIZE
+            )
+
+  parser.add_argument(
+            '--skip-compile', action="store_true",
+            help='Skip compilation of the code from python'
+            )
+
+  parser.add_argument(
+            '--skip-run', action="store_true",
+            help='Skip run of the code from python'
+            )
+
+  parser.add_argument(
+            '--iterations',
+            type=int,
+            help='number of timesteps to simulate',
+            default=ITERATIONS
+            )
+
+  parser.add_argument(
+            '--dx',
+            type=int,
+            help='dx value (impacting the boundary)', default=DX
+            )
+
+  parser.add_argument(
+            '--fabric_width',
+            type=int,
+            help='Width of the fabric we are compiling for',
+            )
+
+  parser.add_argument(
+            '--fabric_height',
+            type=int,
+            help='Height of the fabric we are compiling for',
+            )
+
+  parser.add_argument('--cmaddr', help='IP:port for CS system')
+
+  parser.add_argument(
+            "--debug",
+            help="show A, x, and b", action="store_true"
+            )
+
+  parser.add_argument(
+            "--width-west-buf",
+            default=0, type=int,
+            help="width of west buffer")
+  parser.add_argument(
+            "--width-east-buf",
+            default=0, type=int,
+            help="width of east buffer")
+  parser.add_argument(
+            "--n_channels",
+            default=1, type=int,
+            help="Number of memcpy \"channels\" (LVDS/streamers for both input and output) to use "
+            "when memcpy support is compiled with this program. If this argument is not present, "
+            "or is 0, then the previous single-LVDS version is compiled.")
+
+  args = parser.parse_args()
+
+  return args
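Usage note for `parse_args()`: the sketch below rebuilds a trimmed copy of the parser (same option names, types and defaults as above; `build_parser` is a name introduced here for illustration) and parses the flags that `commands.sh` passes to `run.py`. It shows how `argparse` derives attribute names: dashes in option names become underscores, while case and existing underscores are kept.

```python
import argparse


def build_parser():
    # Trimmed copy of the parser above: same option names, types and defaults.
    parser = argparse.ArgumentParser()
    parser.add_argument('--name', help='the test name')
    parser.add_argument('--zDim', type=int, default=10)
    parser.add_argument('--size', type=int, default=10)
    parser.add_argument('--iterations', type=int, default=10)
    parser.add_argument('--dx', type=int, default=20)
    parser.add_argument('--skip-compile', action='store_true')
    parser.add_argument('--width-west-buf', type=int, default=0)
    parser.add_argument('--fabric_width', type=int)
    return parser


# Same flags that commands.sh passes to run.py.
args = build_parser().parse_args(
    ['--name', 'out', '--iterations=10', '--dx=20', '--skip-compile'])

print(args.name, args.iterations, args.dx)      # out 10 20
print(args.skip_compile, args.width_west_buf)   # True 0
print(args.zDim, args.fabric_width)             # 10 None
```

Options without a default, such as `--fabric_width`, come back as `None` when omitted, so callers need to handle that case explicitly.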
diff --git a/benchmarks/25-pt-stencil/commands.sh b/benchmarks/25-pt-stencil/commands.sh
@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./layout.csl --arch=wse2 --fabric-dims=17,12 --fabric-offsets=4,1 \
+-o=out_code --params=width:10,height:10,zDim:10,sourceLength:10,dx:20 \
+--params=srcX:0,srcY:0,srcZ:0 --verbose --memcpy --channels=1 \
+--width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out \
+--iterations=10 --dx=20 --skip-compile
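Note on the numbers in `commands.sh`: the kernel is 10 x 10 PEs (`width:10,height:10`), placed at `--fabric-offsets=4,1` inside a `--fabric-dims=17,12` fabric. The sketch below only checks that arithmetic. The margins of 7 extra columns and 2 extra rows are an assumption inferred from these values, not something this commit states, and the helper names are hypothetical. The west and east buffer widths are ignored because both are 0 here.

```python
# Assumed margins, inferred from the values used in commands.sh.
OFFSET_X, OFFSET_Y = 4, 1   # --fabric-offsets
EXTRA_X, EXTRA_Y = 7, 2     # fabric columns/rows beyond the kernel rectangle


def fabric_dims(width, height):
    """Fabric size implied by the assumed margins for a width x height kernel."""
    return (width + EXTRA_X, height + EXTRA_Y)


def kernel_fits(fabric, offsets, width, height):
    """True if the kernel rectangle placed at `offsets` lies inside `fabric`."""
    fabric_x, fabric_y = fabric
    offset_x, offset_y = offsets
    return offset_x + width <= fabric_x and offset_y + height <= fabric_y


dims = fabric_dims(10, 10)
print(dims)                                                  # (17, 12)
print(kernel_fits(dims, (OFFSET_X, OFFSET_Y), 10, 10))       # True
```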
diff --git a/benchmarks/25-pt-stencil/consts.csl b/benchmarks/25-pt-stencil/consts.csl
@@ -0,0 +1,107 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+param pattern: u16;
+param paddedZDim: u16;
+
+const math = @import_module("<math>");
+// We need to allocate space for not just the (padded) size of the problem (in
+// the Z dimension), but also space for ghost cells.
+const zBufferSize = paddedZDim + 2 * (pattern - 1);
+
+fn initBuffer() [2, zBufferSize]f32 {
+  return @zeros([2, zBufferSize]f32);
+}
+
+// Minimig - main.c:15-23, target_3d.c:23, and target_3d.c:30
+fn computeMinimigConsts(dx: u16) [9]f32 {
+  @comptime_assert(pattern == 5);
+  const dx2:f32 = @as(f32, dx * dx);
+  const c0:f32 = -205.0 / 72.0 / dx2;
+  const c1:f32 = 8.0 / 5.0 / dx2;
+  const c2:f32 = -1.0 / 5.0 / dx2;
+  const c3:f32 = 8.0 / 315.0 / dx2;
+  const c4:f32 = -1.0 / 560.0 / dx2;
+
+  return [9]f32 {
+    c4,
+    c3,
+    c2,
+    c1,
+    c0 * 3.0,
+    c1,
+    c2,
+    c3,
+    c4,
+  };
+}
+
+// `computeMinimigConsts()` computes constants for both the positive and the
+// negative direction of the X, Y, and Z dimensions.  However, for any given
+// axis, our implementation splits communication and computation in two: one
+// part for the positive direction and one for the negative direction.  This
+// function extracts the first half of the constants, and optionally includes
+// the center element.
+fn fetchFirstHalfConsts(consts: [2 * pattern - 1]f32, self: bool) [pattern]f32 {
+  var idx: u16 = 0;
+  var result = @zeros([pattern]f32);
+
+  if (!self) {
+    idx += 1;
+  }
+
+  while (idx < pattern) : (idx += 1) {
+    result[idx] = consts[pattern - idx - 1];
+  }
+
+  return result;
+}
+
+fn fetchSecondHalfConsts(consts: [2 * pattern - 1]f32, self: bool) [pattern]f32 {
+  var idx: u16 = 0;
+  var result = @zeros([pattern]f32);
+
+  if (!self) {
+    idx += 1;
+  }
+
+  while (idx < pattern) : (idx += 1) {
+    result[idx] = consts[pattern + idx - 1];
+  }
+
+  return result;
+}
+
+// The sequence in which each PE receives wavelets from its neighbors depends
+// on the relative placement of the PE within each group of `pattern` PEs.  This
+// function reorders the constants to match the sequence of source PE IDs so
+// that we multiply the incoming data with the right constants.
+fn permuteConsts(pattId: u16, originalConsts: [pattern]f32) [pattern]f32 {
+  const start = pattId;
+  var result = @zeros([pattern]f32);
+
+  var idx: u16 = 0;
+  while (idx < pattern) : (idx += 1) {
+    var value: f32 = 0.0;
+    if (start < idx) {
+      value = originalConsts[(start + pattern) - idx];
+    } else {
+      value = originalConsts[start - idx];
+    }
+
+    result[idx] = value;
+  }
+
+  return result;
+}
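Note on the index arithmetic in `consts.csl`: the helpers above can be mirrored in plain Python to check what they return. This is a sketch for illustration, not part of the commit. The Python names are introduced here, and Python floats are double precision whereas the CSL code uses `f32`. The five coefficients match the standard eighth-order central-difference weights for a second derivative, which sum to zero across the full stencil.

```python
PATTERN = 5  # matches the @comptime_assert in computeMinimigConsts()


def compute_minimig_consts(dx):
    dx2 = float(dx * dx)
    c0 = -205.0 / 72.0 / dx2
    c1 = 8.0 / 5.0 / dx2
    c2 = -1.0 / 5.0 / dx2
    c3 = 8.0 / 315.0 / dx2
    c4 = -1.0 / 560.0 / dx2
    # The center weight is counted once per axis (X, Y, Z), hence c0 * 3.
    return [c4, c3, c2, c1, c0 * 3.0, c1, c2, c3, c4]


def fetch_first_half(consts, include_self):
    # Walks from the center toward index 0; slot 0 stays 0.0 without the center.
    result = [0.0] * PATTERN
    for idx in range(0 if include_self else 1, PATTERN):
        result[idx] = consts[PATTERN - idx - 1]
    return result


def fetch_second_half(consts, include_self):
    # Walks from the center toward the last index.
    result = [0.0] * PATTERN
    for idx in range(0 if include_self else 1, PATTERN):
        result[idx] = consts[PATTERN + idx - 1]
    return result


def permute_consts(patt_id, original):
    # Both CSL branches reduce to one modular index: (patt_id - idx) mod PATTERN.
    return [original[(patt_id - idx) % PATTERN] for idx in range(PATTERN)]


consts = compute_minimig_consts(20)
print(fetch_first_half(consts, False) == fetch_second_half(consts, False))  # True
print(permute_consts(2, [0, 1, 2, 3, 4]))  # [2, 1, 0, 4, 3]
```

Because the nine-element array is symmetric about its center, both halves hold the same values. The two functions differ only in the direction they walk from the center.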