diff --git a/README.rst b/README.rst
index 14f6723..9bf8467 100644
--- a/README.rst
+++ b/README.rst
@@ -57,20 +57,20 @@ The sample applications available are:
   (GEMM) using the ``collectives`` library.
 * ``residual``: Computes the norm of the residual of a matrix-vector
   multiplication. Builds on the ``gemv-checkerboard-pattern`` example.
-* ``stencil-v2``: A 3D 25-point stencil finite difference code for solving a
+* ``25-pt-stencil``: A 3D 25-point stencil finite difference code for solving a
   wave equation with a source perturbation.
-* ``bandwidthTest``: Benchmarks the bandwidth of data transfers between host
+* ``bandwidth-test``: Benchmarks the bandwidth of data transfers between host
   and device using the ``memcpy`` framework and the ``SdkRuntime`` host API.
 * ``spmv-hypersparse``: Computes a sparse matrix-vector product using a
   hypersparse matrix.
-* ``stencil-3d-7pts``: Computes a sparse matrix-vector product using a matrix
-  generated by a 7-point stencil.
-* ``powerMethod``: Implements the Power method to compute the eigenvector
+* ``7pt-stencil-spmv``: Computes a sparse matrix-vector product using a matrix
+  generated by a 3D 7-point stencil.
+* ``power-method``: Implements the Power method to compute the eigenvector
   of the largest eigenvalue of a matrix generated by a 7-point stencil.
-* ``conjugateGradient``: Implements the Conjugate Gradient (CG) method to 
+* ``conjugate-gradient``: Implements the Conjugate Gradient (CG) method to 
   approximate the solution to a system of linear equations ``A*x = b``,
   where ``A`` is a matrix generated by a 7-point stencil.
-* ``preconditionedConjugateGradient``: Implements the Preconditioned Conjugate
+* ``preconditioned-conjugate-gradient``: Implements the Preconditioned Conjugate
   Gradient method (PCG) to approximate the solution to a system of linear
   equations ``A*x = b``, where ``A`` is a matrix generated by a 7-point
   stencil.
@@ -89,15 +89,20 @@ The sample applications available are:
 * ``FFT``: Implements 1D and 2D Discrete Fourier Transforms (DFT).
 * ``single-tile-matvec``: Implements highly optimized ``N x N`` matrix-vector
   products, in which each PE performs the same matrix-vector computation.
+* ``row-col-broadcast``: Benchmarks the bandwidth of data transfers between
+  host and device, where data is broadcast across a row or column of PEs,
+  using ``memcpy_h2d_colbcast`` and ``memcpy_h2d_rowbcast``.
+* ``game-of-life``: Implements Conway's Game of Life, where each PE is treated
+  as a single cell.
 
 Branches
 --------
 
 For each release of the SDK, there is a corresponding release tag in this
 repository which contains a version of the CSL examples which are compatible
-with that SDK release. For example, the tag ``rel-sdk-1.2.0`` in this
+with that SDK release. For example, the tag ``rel-sdk-1.3.0`` in this
 repository contains a version of the CSL examples which will work (compile and
-simulate) with the SDK 1.2.0 release. The ``master`` branch is identical to the
+simulate) with the SDK 1.3.0 release. The ``master`` branch is identical to the
 newest release.
 
 Full backward compatibility of the SDK is not guaranteed.
diff --git a/RELEASE-NOTES.rst b/RELEASE-NOTES.rst
index fe83768..c51d1fc 100644
--- a/RELEASE-NOTES.rst
+++ b/RELEASE-NOTES.rst
@@ -4,6 +4,19 @@ Release Notes
 The following are the release notes for the CSL Examples repository,
 ``csl-examples``.
 
+Version 1.3.0
+-------------
+
+- The examples are improved and updated to comply with the SDK version 1.3.0.
+
+- A new example program ``row-col-broadcast`` has been introduced which
+  benchmarks the bandwidth of data transfers between host and device,
+  where data is broadcast across a row or column of PEs,
+  using the new ``memcpy_h2d_colbcast`` and ``memcpy_h2d_rowbcast`` APIs.
+
+- A new example program ``game-of-life`` has been introduced which implements
+  Conway's Game of Life, where each PE is treated as a single cell.
+
 Version 1.2.0
 -------------
 
diff --git a/benchmarks/25-pt-stencil/commands.sh b/benchmarks/25-pt-stencil/commands_wse2.sh
similarity index 100%
rename from benchmarks/25-pt-stencil/commands.sh
rename to benchmarks/25-pt-stencil/commands_wse2.sh
diff --git a/benchmarks/stencil-3d-7pts/README.rst b/benchmarks/7pt-stencil-spmv/README.rst
similarity index 97%
rename from benchmarks/stencil-3d-7pts/README.rst
rename to benchmarks/7pt-stencil-spmv/README.rst
index 8d338e0..e8bf48b 100644
--- a/benchmarks/stencil-3d-7pts/README.rst
+++ b/benchmarks/7pt-stencil-spmv/README.rst
@@ -1,5 +1,5 @@
-stencil-3d-7pts
-===============
+3D 7-Point Stencil SpMV
+=======================
 
 This example evaluates the performance of 7-point stencil. The kernel records
 the ``start`` and ``end`` of ``spmv`` by tsc counter. In addition the tsc
diff --git a/benchmarks/stencil-3d-7pts/cmd_parser.py b/benchmarks/7pt-stencil-spmv/cmd_parser.py
similarity index 93%
rename from benchmarks/stencil-3d-7pts/cmd_parser.py
rename to benchmarks/7pt-stencil-spmv/cmd_parser.py
index 7fab1ad..7a72006 100644
--- a/benchmarks/stencil-3d-7pts/cmd_parser.py
+++ b/benchmarks/7pt-stencil-spmv/cmd_parser.py
@@ -47,6 +47,8 @@ def parse_args():
       "-n",
       default=1, type=int,
       help="number of columns")
+  parser.add_argument("--simulator", action="store_true",
+                      help="Runs on simulator")
   parser.add_argument(
       "-k",
       default=1, type=int,
@@ -74,10 +76,9 @@ def parse_args():
   parser.add_argument(
       "--run-only",
       help="Run only", action="store_true")
-  # arch = wse1 or wse2
   parser.add_argument(
       "--arch",
-      help="wse1 or wse2. Default is wse1 when not supplied.")
+      help="wse2 or wse3. Default is wse2 when not supplied.")
   parser.add_argument(
       "--width-west-buf",
       default=0, type=int,
@@ -108,4 +109,7 @@ def parse_args():
     print(f"create {logs_dir} to store log files")
     os.mkdir(logs_dir)
 
+  if args.cmaddr is None:
+    args.simulator = False
+
   return args, logs_dir
diff --git a/benchmarks/stencil-3d-7pts/commands.sh b/benchmarks/7pt-stencil-spmv/commands_wse2.sh
similarity index 85%
rename from benchmarks/stencil-3d-7pts/commands.sh
rename to benchmarks/7pt-stencil-spmv/commands_wse2.sh
index 25a43c2..17d2c53 100755
--- a/benchmarks/stencil-3d-7pts/commands.sh
+++ b/benchmarks/7pt-stencil-spmv/commands_wse2.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./layout.csl --arch wse2 --fabric-dims=12,7 --fabric-offsets=4,1 \
+cslc ./src/layout.csl --arch wse2 --fabric-dims=12,7 --fabric-offsets=4,1 \
 --params=width:5,height:5,MAX_ZDIM:5 --params=BLOCK_SIZE:2 --params=C0_ID:0 \
 --params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 --params=C5_ID:5 \
 --params=C6_ID:6 --params=C7_ID:7 --params=C8_ID:8 -o=out \
diff --git a/benchmarks/7pt-stencil-spmv/commands_wse3.sh b/benchmarks/7pt-stencil-spmv/commands_wse3.sh
new file mode 100755
index 0000000..8517be9
--- /dev/null
+++ b/benchmarks/7pt-stencil-spmv/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,MAX_ZDIM:5 --params=BLOCK_SIZE:2 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 --params=C5_ID:5 \
+--params=C6_ID:6 --params=C7_ID:7 --params=C8_ID:8 -o=out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --channels=1 \
+--width-west-buf=0 --width-east-buf=0 --zDim=5 --run-only
diff --git a/benchmarks/7pt-stencil-spmv/run.appliance.py b/benchmarks/7pt-stencil-spmv/run.appliance.py
new file mode 100644
index 0000000..1de0716
--- /dev/null
+++ b/benchmarks/7pt-stencil-spmv/run.appliance.py
@@ -0,0 +1,438 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" test 7-point stencil
+
+    The Laplacian operator L on a 3-dimensional domain can be represented by a 7-point
+  stencil based on the standard 2nd-order Finite Difference Method. The operator
+  with Dirichlet boundary conditions can be written as
+         L[u](i,j,k) = u(i+1, j,  k  ) + u(i-1, j,   k  ) +
+                       u(i,   j+1,k  ) + u(i,   j-1, k  ) +
+                       u(i,   j,  k+1) + u(i,   j,   k-1) +
+                      -6*u(i, j, k)
+  In general the coefficients of those 7 points can vary. To minimize memory
+  consumption, this example assumes the coefficients are independent of index k and
+  the whole vector u(i,j,:) is placed in one PE (px=j, py=i).
+  The above formula can be rewritten as
+     c_west   * x[i-1][j  ][k  ] + c_east  * x[i+1][j  ][k  ] +
+     c_south  * x[i  ][j-1][k  ] + c_north * x[i  ][j+1][k  ] +
+     c_bot    * x[i  ][j  ][k-1] + c_top   * x[i  ][j  ][k+1] +
+     c_center * x[i][j][k]
+  Each PE only holds 7 coefficients organized by c_west, c_east, c_south, c_north,
+  c_bot, c_top and c_center.
+
+  This example provides two modules: allreduce and stencil_3d_7pts.
+  The "allreduce" module synchronizes all PEs to form a reference clock.
+  The "stencil_3d_7pts" module computes y = A*x, where A is the matrix from the 7-point stencil.
+
+  The framework is
+  ---
+       sync()      // synchronize all PEs to sample the reference clock
+       tic()       // record start time
+       spmv(zdim)  // compute y = A*x
+       toc()       // record end time
+  ---
+
+  The tic() samples "time_start" and toc() samples "time_end". The sync() samples
+  "time_ref" which is used to shift "time_start" and "time_end".
+  The elapsed time is measured by
+       cycles_send = max(time_end) - min(time_start)
+
+  The overall runtime is computed via the following formula
+       time_send = (cycles_send / 0.85) *1.e-3 us
+  where a PE runs with clock speed 850MHz
+
+  Each PE needs to gather six f32 from six neighbors, the cost of the communication is
+        6*h*w*zDim*4 bytes
+  where w-by-h is the core rectangle and zDim is the length of local vector.
+
+  Here is the list of parameters:
+    -m=<int> is the height of the core
+    -n=<int> is the width of the core
+    -k=<int> is size of x and y allocated in the core
+    --zDim=<int> is the number of f32 per PE, computed by y = A*x
+                 zDim must not be greater than k
+    --channels=<int> specifies the number of I/O channels, no bigger than 16
+"""
+
+
+import struct
+import os
+from typing import Optional
+from pathlib import Path
+import shutil
+import subprocess
+import random
+import json
+
+import numpy as np
+
+
+from cmd_parser import parse_args
+
+
+from util import (
+    hwl_2_oned_colmajor,
+    oned_to_hwl_colmajor,
+    laplacian,
+)
+from cerebras.sdk.client import (
+        SdkCompiler,
+        SdkRuntime,
+)
+
+from cerebras.appliance.pb.sdk.sdk_common_pb2 import (
+        MemcpyDataType,
+        MemcpyOrder,
+)
+
+hash_filename = "hash.json"
+
+
+def float_to_hex(f):
+  return hex(struct.unpack('<I', struct.pack('<f', f))[0])
+
+def make_u48(words):
+  return words[0] + (words[1] << 16) + (words[2] << 32)
+
+
+def csl_compile_core(
+    csl_path: str, # path to CSL files
+    width: int,  # width of the core
+    height: int, # height of the core
+    pe_length: int,
+    blockSize: int,
+    file_config: str,
+    elf_dir: str,
+    fabric_width: int,
+    fabric_height: int,
+    core_fabric_offset_x: int, # fabric-offsets of the core
+    core_fabric_offset_y: int,
+    arch: Optional[str],
+    C0: int,
+    C1: int,
+    C2: int,
+    C3: int,
+    C4: int,
+    C5: int,
+    C6: int,
+    C7: int,
+    C8: int,
+    channels: int,
+    width_west_buf: int,
+    width_east_buf: int
+):
+  compiler = SdkCompiler()
+  args = []
+  args.append(f"--fabric-dims={fabric_width},{fabric_height}")
+  args.append(f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}")
+  args.append(f"--params=width:{width},height:{height},MAX_ZDIM:{pe_length}")
+  args.append(f"--params=BLOCK_SIZE:{blockSize}")
+  args.append(f"--params=C0_ID:{C0}")
+  args.append(f"--params=C1_ID:{C1}")
+  args.append(f"--params=C2_ID:{C2}")
+  args.append(f"--params=C3_ID:{C3}")
+  args.append(f"--params=C4_ID:{C4}")
+  args.append(f"--params=C5_ID:{C5}")
+  args.append(f"--params=C6_ID:{C6}")
+  args.append(f"--params=C7_ID:{C7}")
+  args.append(f"--params=C8_ID:{C8}")
+
+  args.append(f"-o={elf_dir}")
+  if arch is not None:
+    args.append(f"--arch={arch}")
+  args.append("--memcpy")
+  args.append(f"--channels={channels}")
+  args.append(f"--width-west-buf={width_west_buf}")
+  args.append(f"--width-east-buf={width_east_buf}")
+
+  args_str = " ".join(args)
+  hashstr = compiler.compile(csl_path, file_config, args_str)
+  print("compile artifact:", hashstr)
+  return hashstr
+
+
+
+
+
+# How to compile:
+#   python run.py -m=5 -n=5 -k=5 --latestlink latest --channels=1 \
+#     --width-west-buf=0 --width-east-buf=0 --compile-only
+#
+# How to run:
+#   python run.py -m=5 -n=5 -k=5 --latestlink latest --channels=1 \
+#     --width-west-buf=0 --width-east-buf=0 --run-only
+#
+def main():
+  """Main method to run the example code."""
+
+  random.seed(127)
+
+  args, dirname = parse_args()
+
+  width_west_buf = args.width_west_buf
+  width_east_buf = args.width_east_buf
+  channels = args.channels
+  assert channels <= 16, "only support up to 16 I/O channels"
+  assert channels >= 1, "number of I/O channels must be at least 1"
+
+  print(f"width_west_buf = {width_west_buf}")
+  print(f"width_east_buf = {width_east_buf}")
+  print(f"channels = {channels}")
+
+  height = args.m
+  width = args.n
+  pe_length = args.k
+  zDim = args.zDim
+  blockSize = args.blockSize
+
+  print(f"width = {width}, height = {height}, pe_length={pe_length}, zDim={zDim}, blockSize={blockSize}")
+  assert pe_length >= 2, "the maximum size of z must be greater than 1"
+  assert zDim <= pe_length, "[0, zDim) cannot exceed the storage"
+
+  np.random.seed(2)
+  # x is h-by-w-by-l
+  x = np.arange(height*width*pe_length).reshape(height, width, pe_length).astype(np.float32) + 100
+
+  x_1d = hwl_2_oned_colmajor(height, width, pe_length, x, np.float32)
+
+  # stencil coefficients have the following order
+  # {c_west, c_east, c_south, c_north, c_bottom, c_top, c_center}
+  stencil_coeff = np.zeros((height, width, 7), dtype = np.float32)
+  for i in range(height):
+    for j in range(width):
+      stencil_coeff[(i, j, 0)] = -1 # west
+      stencil_coeff[(i, j, 1)] = -2 # east
+      stencil_coeff[(i, j, 2)] = -3 # south
+      stencil_coeff[(i, j, 3)] = -4 # north
+      stencil_coeff[(i, j, 4)] = -5 # bottom
+      stencil_coeff[(i, j, 5)] = -6 # top
+      stencil_coeff[(i, j, 6)] = 6  # center
+
+  stencil_coeff_1d = hwl_2_oned_colmajor(height, width, 7, stencil_coeff, np.float32)
+
+  y_ref = np.zeros((height, width, pe_length), dtype=np.float32)
+
+  laplacian(stencil_coeff, zDim, x, y_ref)
+
+  # fabric-offsets = 1,1
+  fabric_offset_x = 1
+  fabric_offset_y = 1
+  # starting point of the core rectangle = (core_fabric_offset_x, core_fabric_offset_y)
+  # memcpy framework requires 3 columns at the west of the core rectangle
+  # memcpy framework requires 2 columns at the east of the core rectangle
+  core_fabric_offset_x = fabric_offset_x + 3 + width_west_buf
+  core_fabric_offset_y = fabric_offset_y
+  # (min_fabric_width, min_fabric_height) is the minimal dimension to run the app
+  min_fabric_width = (core_fabric_offset_x + width + 2 + 1 + width_east_buf)
+  min_fabric_height = (core_fabric_offset_y + height + 1)
+
+  fabric_width = 0
+  fabric_height = 0
+  if args.fabric_dims:
+    w_str, h_str = args.fabric_dims.split(",")
+    fabric_width = int(w_str)
+    fabric_height = int(h_str)
+
+  if fabric_width == 0 or fabric_height == 0:
+    fabric_width = min_fabric_width
+    fabric_height = min_fabric_height
+
+  assert fabric_width >= min_fabric_width
+  assert fabric_height >= min_fabric_height
+
+  # prepare the simulation
+  print('store ELFs and log files in the folder ', dirname)
+
+  # layout of a rectangle
+  code_csl = "layout.csl"
+
+  C0 = 0
+  C1 = 1
+  C2 = 2
+  C3 = 3
+  C4 = 4
+  C5 = 5
+  C6 = 6
+  C7 = 7
+  C8 = 8
+
+  csl_path = "./src"
+
+  if args.compile_only:
+    print("WARNING: compile the code, don't run SdkRuntime because the server is down after the compilation")
+    hashstr = csl_compile_core(
+      csl_path,
+      width,
+      height,
+      pe_length,
+      blockSize,
+      code_csl,
+      dirname,
+      fabric_width,
+      fabric_height,
+      core_fabric_offset_x,
+      core_fabric_offset_y,
+      args.arch,
+      C0,
+      C1,
+      C2,
+      C3,
+      C4,
+      C5,
+      C6,
+      C7,
+      C8,
+      channels,
+      width_west_buf,
+      width_east_buf
+    )
+    print(f"dump artifact name to file {hash_filename}")
+    with open(hash_filename, "w") as write_file:
+      json.dump(hashstr, write_file)
+    print("COMPILE ONLY: EXIT")
+    return
+
+  print(f"load artifact name from file {hash_filename}")
+  with open(hash_filename, "r") as f:
+    hashstr = json.load(f)
+
+  memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
+  with SdkRuntime(hashstr, simulator=args.simulator) as runner:
+
+    symbol_x = runner.get_id("x")
+    symbol_y = runner.get_id("y")
+    symbol_time_memcpy = runner.get_id("time_memcpy")
+    symbol_stencil_coeff = runner.get_id("stencil_coeff")
+    symbol_time_buf_u16 = runner.get_id("time_buf_u16")
+    symbol_time_ref = runner.get_id("time_ref")
+
+    # load() and run() are called by client.SdkRuntime.__enter__
+    #runner.load()
+    #runner.run()
+
+    print("copy vector x of type f32")
+    # the size of x per PE is pe_length
+    runner.memcpy_h2d(symbol_x, x_1d, 0, 0, width, height, pe_length,\
+          streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+    print("copy coefficients of type f32")
+    # each PE holds 7 coefficients
+    runner.memcpy_h2d(symbol_stencil_coeff, stencil_coeff_1d, 0, 0, width, height, 7,\
+          streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+    print("step 1: sync all PEs")
+    runner.launch("f_sync", np.int16(1), nonblock=False)
+
+    print("step 2: tic() records time_start")
+    runner.launch("f_tic", nonblock=True)
+
+    print(f"step 3: compute y = A*x with zDim = {zDim}")
+    # positive zDim can be smaller than pe_length
+    runner.launch("f_spmv", np.int16(zDim), nonblock=False)
+
+    print("step 4: toc() records time_end")
+    runner.launch("f_toc", nonblock=False)
+
+    print("step 5: prepare (time_start, time_end)")
+    runner.launch("f_memcpy_timestamps", nonblock=False)
+
+    print("step 6: D2H (time_start, time_end)")
+    # time_start/time_end is of type u16[3]
+    time_memcpy_hwl_1d = np.zeros(height*width*6, np.uint32)
+    runner.memcpy_d2h(time_memcpy_hwl_1d, symbol_time_buf_u16, 0, 0, width, height, 6,\
+      streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
+    time_memcpy_hwl = oned_to_hwl_colmajor(height, width, 6, time_memcpy_hwl_1d, np.uint16)
+
+    print("step 7: D2H y of type f32")
+    y_1d = np.zeros(height*width*pe_length, np.float32)
+    runner.memcpy_d2h(y_1d, symbol_y, 0, 0, width, height, pe_length,\
+      streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
+    y_wse = np.reshape(y_1d, (height, width, pe_length), order='F')
+
+    print("step 8: prepare reference clock")
+    runner.launch("f_reference_timestamps", nonblock=False)
+
+    print("step 9: D2H reference clock")
+    time_ref_1d = np.zeros(height*width*3, np.uint32)
+    runner.memcpy_d2h(time_ref_1d, symbol_time_ref, 0, 0, width, height, 3,\
+      streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
+    time_ref_hwl = oned_to_hwl_colmajor(height, width, 3, time_ref_1d, np.uint16)
+
+    # stop() is called by client.SdkRuntime.__exit__
+    #runner.stop()
+
+  # time_start = start time of spmv
+  time_start = np.zeros((height, width)).astype(int)
+  # time_end = end time of spmv
+  time_end = np.zeros((height, width)).astype(int)
+  word = np.zeros(3).astype(np.uint16)
+  for w in range(width):
+    for h in range(height):
+      word[0] = time_memcpy_hwl[(h, w, 0)]
+      word[1] = time_memcpy_hwl[(h, w, 1)]
+      word[2] = time_memcpy_hwl[(h, w, 2)]
+      time_start[(h,w)] = make_u48(word)
+      word[0] = time_memcpy_hwl[(h, w, 3)]
+      word[1] = time_memcpy_hwl[(h, w, 4)]
+      word[2] = time_memcpy_hwl[(h, w, 5)]
+      time_end[(h,w)] = make_u48(word)
+
+  # time_ref = reference clock
+  time_ref = np.zeros((height, width)).astype(int)
+  word = np.zeros(3).astype(np.uint16)
+  for w in range(width):
+    for h in range(height):
+      word[0] = time_ref_hwl[(h, w, 0)]
+      word[1] = time_ref_hwl[(h, w, 1)]
+      word[2] = time_ref_hwl[(h, w, 2)]
+      time_ref[(h, w)] = make_u48(word)
+
+  # adjust the reference clock by the propagation delay
+  # the bottom-right PE signals other PEs, the propagation delay is
+  #     (h-1) - py + (w-1) - px
+  for py in range(height):
+    for px in range(width):
+      time_ref[(py, px)] = time_ref[(py, px)] - ((width+height-2)-(px + py))
+
+  # shift time_start and time_end by time_ref
+  time_start = time_start - time_ref
+  time_end = time_end - time_ref
+
+  # cycles_send = max(time_end) - min(time_start)
+  # 850MHz --> 1 cycle = (1/0.85) ns = (1/0.85)*1.e-3 us
+  # time_send = (cycles_send / 0.85) *1.e-3 us
+  #
+  # each PE needs to gather six f32 from six neighbors, the cost of the communication is
+  #      6*h*w*zDim*4 bytes
+  #
+  # bandwidth = ((wvlts * 4)/time_send) MB/s
+  wvlts = 6*height*width*zDim
+  min_time_start = time_start.min()
+  max_time_end = time_end.max()
+  cycles_send = max_time_end - min_time_start
+  time_send = (cycles_send / 0.85) *1.e-3
+  bandwidth = ((wvlts * 4)/time_send)
+  print(f"cycles_send = {cycles_send} cycles")
+  print(f"time_send = {time_send} us")
+  print(f"bandwidth = {bandwidth} MB/S ")
+
+  z = y_ref.ravel() - y_wse.ravel()
+  nrm_z = np.linalg.norm(z, np.inf)
+  print(f"|y_ref - y_wse| = {nrm_z}")
+  np.testing.assert_allclose(y_ref.ravel(), y_wse.ravel(), 1.e-5)
+  print("\nSUCCESS!")
+
+if __name__ == "__main__":
+  main()
diff --git a/benchmarks/stencil-3d-7pts/run.py b/benchmarks/7pt-stencil-spmv/run.py
similarity index 92%
rename from benchmarks/stencil-3d-7pts/run.py
rename to benchmarks/7pt-stencil-spmv/run.py
index e7c0a11..0b15730 100644
--- a/benchmarks/stencil-3d-7pts/run.py
+++ b/benchmarks/7pt-stencil-spmv/run.py
@@ -251,7 +251,7 @@ def main():
   sim_log = os.path.join(dirname, "sim.log")
 
   # layout of a rectangle
-  code_csl = "layout.csl"
+  code_csl = "src/layout.csl"
 
   C0 = 0
   C1 = 1
@@ -295,67 +295,67 @@ def main():
     return
 
   memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
-  simulator = SdkRuntime(dirname, cmaddr=args.cmaddr)
+  runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
 
-  symbol_x = simulator.get_id("x")
-  symbol_y = simulator.get_id("y")
-  symbol_stencil_coeff = simulator.get_id("stencil_coeff")
-  symbol_time_buf_u16 = simulator.get_id("time_buf_u16")
-  symbol_time_ref = simulator.get_id("time_ref")
+  symbol_x = runner.get_id("x")
+  symbol_y = runner.get_id("y")
+  symbol_stencil_coeff = runner.get_id("stencil_coeff")
+  symbol_time_buf_u16 = runner.get_id("time_buf_u16")
+  symbol_time_ref = runner.get_id("time_ref")
 
-  simulator.load()
-  simulator.run()
+  runner.load()
+  runner.run()
 
   print(f"copy vector x of type f32")
   # the size of x per PE is pe_length
-  simulator.memcpy_h2d(symbol_x, x_1d, 0, 0, width, height, pe_length,\
+  runner.memcpy_h2d(symbol_x, x_1d, 0, 0, width, height, pe_length,\
           streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
   print(f"copy coefficients of type f32")
   # each PE holds 7 coefficients
-  simulator.memcpy_h2d(symbol_stencil_coeff, stencil_coeff_1d, 0, 0, width, height, 7,\
+  runner.memcpy_h2d(symbol_stencil_coeff, stencil_coeff_1d, 0, 0, width, height, 7,\
           streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
   print("step 1: sync all PEs")
-  simulator.launch("f_sync", np.int16(1), nonblock=False)
+  runner.launch("f_sync", np.int16(1), nonblock=False)
 
   print("step 2: tic() records time_start")
-  simulator.launch("f_tic", nonblock=True)
+  runner.launch("f_tic", nonblock=True)
 
   print(f"step 3: compute y = A*x with zDim = {zDim}")
   # positive zDim can be smaller than pe_length
-  simulator.launch("f_spmv", np.int16(zDim), nonblock=False)
+  runner.launch("f_spmv", np.int16(zDim), nonblock=False)
 
   print("step 4: toc() records time_end")
-  simulator.launch("f_toc", nonblock=False)
+  runner.launch("f_toc", nonblock=False)
 
   print("step 5: prepare (time_start, time_end)")
-  simulator.launch("f_memcpy_timestamps", nonblock=False)
+  runner.launch("f_memcpy_timestamps", nonblock=False)
 
   print("step 6: D2H (time_start, time_end)")
   time_memcpy_hwl_1d = np.zeros(height*width*6, np.uint32)
-  simulator.memcpy_d2h(time_memcpy_hwl_1d, symbol_time_buf_u16, 0, 0, width, height, 6,\
+  runner.memcpy_d2h(time_memcpy_hwl_1d, symbol_time_buf_u16, 0, 0, width, height, 6,\
     streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
   time_memcpy_hwl = oned_to_hwl_colmajor(height, width, 6, time_memcpy_hwl_1d, np.uint16)
 
   print("step 7: D2H y of type f32")
   y_1d = np.zeros(height*width*pe_length, np.float32)
-  simulator.memcpy_d2h(y_1d, symbol_y, 0, 0, width, height, pe_length,\
+  runner.memcpy_d2h(y_1d, symbol_y, 0, 0, width, height, pe_length,\
     streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
   y_wse = np.reshape(y_1d, (height, width, pe_length), order='F')
 
   print("step 8: prepare reference clock")
-  simulator.launch("f_reference_timestamps", nonblock=False)
+  runner.launch("f_reference_timestamps", nonblock=False)
 
   print("step 9: D2H reference clock")
   time_ref_1d = np.zeros(height*width*3, np.uint32)
-  simulator.memcpy_d2h(time_ref_1d, symbol_time_ref, 0, 0, width, height, 3,\
+  runner.memcpy_d2h(time_ref_1d, symbol_time_ref, 0, 0, width, height, 3,\
     streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
   time_ref_hwl = oned_to_hwl_colmajor(height, width, 3, time_ref_1d, np.uint16)
 
-  simulator.stop()
+  runner.stop()
 
-  if args.cmaddr is None:
+  if args.simulator:
     # move simulation log and core dump to the given folder
     dst_log = Path(f"{dirname}/sim.log")
     src_log = Path("sim.log")
diff --git a/benchmarks/stencil-3d-7pts/kernel.csl b/benchmarks/7pt-stencil-spmv/src/kernel.csl
similarity index 95%
rename from benchmarks/stencil-3d-7pts/kernel.csl
rename to benchmarks/7pt-stencil-spmv/src/kernel.csl
index 1adcebc..ff57614 100644
--- a/benchmarks/stencil-3d-7pts/kernel.csl
+++ b/benchmarks/7pt-stencil-spmv/src/kernel.csl
@@ -32,7 +32,7 @@ const timestamp = @import_module("<time>");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 2
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -41,7 +41,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/stencil-3d-7pts/layout.csl b/benchmarks/7pt-stencil-spmv/src/layout.csl
similarity index 96%
rename from benchmarks/stencil-3d-7pts/layout.csl
rename to benchmarks/7pt-stencil-spmv/src/layout.csl
index e5768dc..960d9ed 100644
--- a/benchmarks/stencil-3d-7pts/layout.csl
+++ b/benchmarks/7pt-stencil-spmv/src/layout.csl
@@ -55,14 +55,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/powerMethod/util.py b/benchmarks/7pt-stencil-spmv/util.py
similarity index 100%
rename from benchmarks/powerMethod/util.py
rename to benchmarks/7pt-stencil-spmv/util.py
diff --git a/benchmarks/FFT/commands.sh b/benchmarks/FFT/commands_wse2.sh
similarity index 100%
rename from benchmarks/FFT/commands.sh
rename to benchmarks/FFT/commands_wse2.sh
diff --git a/benchmarks/FFT/commands_wse3.sh b/benchmarks/FFT/commands_wse3.sh
new file mode 100755
index 0000000..5b7cc3b
--- /dev/null
+++ b/benchmarks/FFT/commands_wse3.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 \
+--params=DIM:1,Nz:4,FP:2 --memcpy --channels=1 -o out-1D
+cs_python run.py --name out-1D
+cs_python run.py --inverse --name out-1D
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 --fabric-offsets=4,1 \
+--params=DIM:2,Nz:4,FP:1 --memcpy --channels=1 -o out-2D
+cs_python run.py --name out-2D
+cs_python run.py --inverse --name out-2D
diff --git a/benchmarks/bandwidthTest/README.rst b/benchmarks/bandwidth-test/README.rst
similarity index 100%
rename from benchmarks/bandwidthTest/README.rst
rename to benchmarks/bandwidth-test/README.rst
diff --git a/benchmarks/bandwidthTest/bw_cmd_parser.py b/benchmarks/bandwidth-test/bw_cmd_parser.py
similarity index 93%
rename from benchmarks/bandwidthTest/bw_cmd_parser.py
rename to benchmarks/bandwidth-test/bw_cmd_parser.py
index c3d9e65..983cbd5 100644
--- a/benchmarks/bandwidthTest/bw_cmd_parser.py
+++ b/benchmarks/bandwidth-test/bw_cmd_parser.py
@@ -47,6 +47,8 @@ def parse_args():
       "-n",
       default=1, type=int,
       help="number of columns")
+  parser.add_argument("--simulator", action="store_true",
+                      help="Runs on simulator")
   parser.add_argument(
       "-k",
       default=1, type=int,
@@ -70,10 +72,9 @@ def parse_args():
   parser.add_argument(
       "--run-only",
       help="Run only", action="store_true")
-  # arch = wse1 or wse2
   parser.add_argument(
       "--arch",
-      help="wse1 or wse2. Default is wse1 when not supplied.")
+      help="wse2 or wse3. Default is wse2 when not supplied.")
   parser.add_argument(
       "--width-west-buf",
       default=0, type=int,
@@ -107,4 +108,7 @@ def parse_args():
     print(f"create {logs_dir} to store log files")
     os.mkdir(logs_dir)
 
+  if args.cmaddr is None:
+    args.simulator = False
+
   return args, logs_dir
diff --git a/benchmarks/bandwidthTest/commands.sh b/benchmarks/bandwidth-test/commands_wse2.sh
similarity index 80%
rename from benchmarks/bandwidthTest/commands.sh
rename to benchmarks/bandwidth-test/commands_wse2.sh
index 429f990..9c988cb 100755
--- a/benchmarks/bandwidthTest/commands.sh
+++ b/benchmarks/bandwidth-test/commands_wse2.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./bw_sync_layout.csl --arch wse2 --fabric-dims=12,7 --fabric-offsets=4,1 \
+cslc ./src/bw_sync_layout.csl --arch wse2 --fabric-dims=12,7 --fabric-offsets=4,1 \
 --params=width:5,height:5,pe_length:5 --params=C0_ID:0 \
 --params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 -o=out \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
diff --git a/benchmarks/bandwidth-test/commands_wse3.sh b/benchmarks/bandwidth-test/commands_wse3.sh
new file mode 100755
index 0000000..4a5d8fb
--- /dev/null
+++ b/benchmarks/bandwidth-test/commands_wse3.sh
@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/bw_sync_layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,pe_length:5 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 -o=out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --channels=1 \
+--width-west-buf=0 --width-east-buf=0 --run-only --loop_count=1
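The `--fabric-dims=12,7` and `--fabric-offsets=4,1` values in the command above are consistent with the minimal-fabric arithmetic in this benchmark's host scripts: 3 memcpy columns west of the core, 2 east of it, plus one-PE margins. The following is a minimal sketch of that arithmetic only. The helper name `min_fabric` is hypothetical, and the margins are taken from the comments in the run script rather than from an SDK API.

```python
def min_fabric(width, height, width_west_buf=0, width_east_buf=0,
               fabric_offset_x=1, fabric_offset_y=1):
    """Return (core_offset_x, core_offset_y, min_fabric_width, min_fabric_height)."""
    # memcpy needs 3 columns west of the core; optional west buffer columns follow
    core_x = fabric_offset_x + 3 + width_west_buf
    core_y = fabric_offset_y
    # core width, 2 memcpy columns to the east, 1 margin column, optional east buffer
    min_w = core_x + width + 2 + 1 + width_east_buf
    min_h = core_y + height + 1
    return core_x, core_y, min_w, min_h


print(min_fabric(5, 5))  # (4, 1, 12, 7)
```

For a 5-by-5 core with no buffers this yields offsets 4,1 and dimensions 12,7, matching the flags passed to `cslc`. Each extra buffer column widens the required fabric by one.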
diff --git a/benchmarks/bandwidth-test/run.appliance.py b/benchmarks/bandwidth-test/run.appliance.py
new file mode 100644
index 0000000..6cf081e
--- /dev/null
+++ b/benchmarks/bandwidth-test/run.appliance.py
@@ -0,0 +1,439 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" test bandwidth between host and device
+
+    The host connects to the device via 100 Gbps Ethernet links. The data is
+  distributed across several I/O channels. The maximum bandwidth of a single
+  channel is around 7 Gbps (gigabits per second). In addition, the overhead of
+  TCP is about 200 us, a non-negligible cost when the transaction is small.
+
+  The bandwidth is affected by the following factors:
+  (1) number of I/O channels
+      The number of I/O channels is controlled by the flag --channels=<int>
+      The more channels, the higher the bandwidth.
+  (2) buffers to hold input/output data to hide the long latency of I/O
+      Although the I/O channels and the core are independent, if the core has a
+      heavy computation such that it cannot respond to the I/O request, there is
+      backpressure from the core upstream to the I/O channels. The backpressure
+      stalls the data transfer and the host can no longer push the data.
+      The I/O channel resumes only when the core responds to the request; however,
+      there is a long latency before the core can receive the data.
+      To overlap computation and communication (that is, to hide this long
+      latency), we can insert buffers to hold the data from the I/O channels
+      while the core is busy with something else.
+      The user can use the flag --width-west-buf=<int> to set a buffer for the
+      input and the flag --width-east-buf=<int> to set a buffer for the output.
+      Each PE in the buffer has a 46KB FIFO to store the data. If an H2D/D2H has
+      "pe_length" u32 per PE and "width" PEs per row, it needs
+      (pe_length*width)*4/46K columns.
+  (3) blocking (sync) or nonblocking (async)
+      The long latency of I/O can be amortized if multiple requests are combined
+      together into one TCP transfer (200 us overhead per TCP transaction). The
+      runtime can aggregate multiple nonblocking H2D/D2H commands implicitly.
+      The user can set the parameter 'nonblock=True' to enable async operations.
+
+  The framework of bandwidthTest is
+  ---
+       sync   // synchronize all PEs to sample the reference clock
+       tic()  // record start time
+       for j = 0:loop_count
+          H2D or D2H (sync or async)
+       end
+       toc()  // record end time
+  ---
+
+  Recording the elapsed time on the host may not show the true timing because the
+  runtime may not start the transaction when the user calls an H2D/D2H command.
+  For example, the runtime can aggregate multiple nonblocking commands together.
+  Instead, this bandwidthTest samples the timing on the device.
+
+  The strategy is to record the "start" and "end" times of H2D/D2H on each PE and
+  to compute the elapsed time as the difference between max(end) and min(start).
+  However, the tsc timers are not synchronized, so taking a max or min across
+  PEs can be far off. To obtain reliable timing info, we need to
+  synchronize all PEs and use one PE to trigger the timer such that all PEs can
+  start at "the same" time. The "sync" operation samples the reference clock,
+  which is the initial time t0 for all PEs.
+  Even if we shift the "start clock" by the "reference clock", each PE does not
+  have the same number because of the propagation delay of the signal. The delay
+  of the "start clock" is roughly the dimension of the WSE.
+
+  Here is the list of parameters:
+    The flag --loop_count=<int> decides how many H2Ds/D2Hs are called.
+    The flag --d2h measures the bandwidth of D2H, otherwise bandwidth of H2D is
+        measured.
+    The flag --channels specifies the number of I/O channels, no bigger than 16.
+
+  The tic() samples "time_start" and toc() samples "time_end". The sync() samples
+  "time_ref" which is used to shift "time_start" and "time_end".
+  The elapsed time is measured by
+       cycles_send = max(time_end) - min(time_start)
+
+  The overall runtime is computed via the following formula
+       time_send = (cycles_send / 0.85) *1.e-3 us
+  where a PE runs with clock speed 850MHz
+
+  The bandwidth is calculated by
+       bandwidth = ((wvlts * 4)/time_send)*loop_count
+"""
+
+
+import struct
+import os
+from typing import Optional
+from pathlib import Path
+import shutil
+import subprocess
+import random
+import json
+
+import numpy as np
+
+from bw_cmd_parser import parse_args
+
+from cerebras.sdk.client import (
+        SdkCompiler,
+        SdkRuntime,
+)
+
+from cerebras.appliance.pb.sdk.sdk_common_pb2 import (
+        MemcpyDataType,
+        MemcpyOrder,
+)
+
+hash_filename = "hash.json"
+
+def float_to_hex(f):
+  return hex(struct.unpack('<I', struct.pack('<f', f))[0])
+
+def make_u48(words):
+  return int(words[0]) + (int(words[1]) << 16) + (int(words[2]) << 32)
+
+def cast_uint32(x):
+  if isinstance(x, (np.float16, np.int16, np.uint16)):
+    z = x.view(np.uint16)
+    val = np.uint32(z)
+  elif isinstance(x, (np.float32, np.int32, np.uint32)):
+    val = x.view(np.uint32)
+  elif isinstance(x, int):
+    val = np.uint32(x)
+  elif isinstance(x, float):
+    z = np.float32(x)
+    val = z.view(np.uint32)
+  else:
+    raise RuntimeError(f"type of x {type(x)} is not supported")
+
+  return val
+
+def csl_compile_core(
+    csl_path: str, # path to CSL files
+    out_path: str, # path where to store the artifact
+    width: int,  # width of the core
+    height: int, # height of the core
+    pe_length: int,
+    file_config: str,
+    elf_dir: str,
+    fabric_width: int,
+    fabric_height: int,
+    core_fabric_offset_x: int, # fabric-offsets of the core
+    core_fabric_offset_y: int,
+    arch: Optional[str],
+    C0: int,
+    C1: int,
+    C2: int,
+    C3: int,
+    C4: int,
+    channels: int,
+    width_west_buf: int,
+    width_east_buf: int
+):
+    compiler = SdkCompiler()
+    args = []
+    args.append(f"--fabric-dims={fabric_width},{fabric_height}")
+    args.append(f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}")
+    args.append(f"--params=width:{width},height:{height},pe_length:{pe_length}")
+    args.append(f"--params=C0_ID:{C0}")
+    args.append(f"--params=C1_ID:{C1}")
+    args.append(f"--params=C2_ID:{C2}")
+    args.append(f"--params=C3_ID:{C3}")
+    args.append(f"--params=C4_ID:{C4}")
+
+    args.append(f"-o={elf_dir}")
+
+    if arch is not None:
+      args.append(f"--arch={arch}")
+    args.append("--memcpy")
+    args.append(f"--channels={channels}")
+    args.append(f"--width-west-buf={width_west_buf}")
+    args.append(f"--width-east-buf={width_east_buf}")
+
+    args_str = " ".join(args)
+    hashstr = compiler.compile(csl_path, file_config, args_str, out_path)
+    print("compile artifact:", hashstr)
+    return hashstr
+
+
+def hwl_2_oned_colmajor(
+    height: int,
+    width: int,
+    pe_length: int,
+    A_hwl: np.ndarray
+):
+  """
+    Given a 3-D tensor A[height][width][pe_length], flatten it to a
+    1-D array in column-major order (height varies fastest)
+  """
+  A_1d = np.zeros(height*width*pe_length, np.float32)
+  idx = 0
+  for l in range(pe_length):
+    for w in range(width):
+      for h in range(height):
+        A_1d[idx] = A_hwl[(h, w, l)]
+        idx = idx + 1
+  return A_1d
+
+
+# How to compile:
+#  python run.appliance.py -m=5 -n=5 -k=5 --latestlink latest --channels=1 \
+#    --width-west-buf=0 --width-east-buf=0 \
+#    --compile-only
+#
+# How to run:
+#  python run.appliance.py -m=5 -n=5 -k=5 --latestlink latest --channels=1 \
+#   --width-west-buf=0 --width-east-buf=0 \
+#   --run-only --loop_count=1
+#
+def main():
+  """Main method to run the example code."""
+
+  random.seed(127)
+
+  args, dirname = parse_args()
+
+  width_west_buf = args.width_west_buf
+  width_east_buf = args.width_east_buf
+  channels = args.channels
+  assert channels <= 16, "only support up to 16 I/O channels"
+  assert channels >= 1, "number of I/O channels must be at least 1"
+
+  print(f"width_west_buf = {width_west_buf}")
+  print(f"width_east_buf = {width_east_buf}")
+  print(f"channels = {channels}")
+
+  height = args.m
+  width = args.n
+  pe_length = args.k
+  loop_count = args.loop_count
+
+  print(f"width = {width}, height = {height}, pe_length={pe_length}, loop_count = {loop_count}")
+
+  np.random.seed(2)
+  # A is h-by-w-by-l
+  A = np.arange(height*width*pe_length).reshape(height, width, pe_length).astype(np.float32)
+
+  A_1d = hwl_2_oned_colmajor(height, width, pe_length, A)
+
+  # fabric-offsets = 1,1
+  fabric_offset_x = 1
+  fabric_offset_y = 1
+  # starting point of the core rectangle = (core_fabric_offset_x, core_fabric_offset_y)
+  # memcpy framework requires 3 columns at the west of the core rectangle
+  # memcpy framework requires 2 columns at the east of the core rectangle
+  core_fabric_offset_x = fabric_offset_x + 3 + width_west_buf
+  core_fabric_offset_y = fabric_offset_y
+  # (min_fabric_width, min_fabric_height) is the minimal dimension to run the app
+  min_fabric_width = (core_fabric_offset_x + width + 2 + 1 + width_east_buf)
+  min_fabric_height = (core_fabric_offset_y + height + 1)
+
+  fabric_width = 0
+  fabric_height = 0
+  if args.fabric_dims:
+    w_str, h_str = args.fabric_dims.split(",")
+    fabric_width = int(w_str)
+    fabric_height = int(h_str)
+
+  if fabric_width == 0 or fabric_height == 0:
+    fabric_width = min_fabric_width
+    fabric_height = min_fabric_height
+
+  assert fabric_width >= min_fabric_width
+  assert fabric_height >= min_fabric_height
+
+  # prepare the simulation
+  print('store ELFs and log files in the folder ', dirname)
+
+  # layout of a rectangle
+  code_csl = "bw_sync_layout.csl"
+
+  C0 = 0
+  C1 = 1
+  C2 = 2
+  C3 = 3
+  C4 = 4
+
+  csl_path = "./src"
+  out_path = "."
+
+  if args.compile_only:
+    print("WARNING: compile only; SdkRuntime is not run because it should be done in a separate appliance job")
+    hashstr = csl_compile_core(
+      csl_path,
+      out_path,
+      width,
+      height,
+      pe_length,
+      code_csl,
+      dirname,
+      fabric_width,
+      fabric_height,
+      core_fabric_offset_x,
+      core_fabric_offset_y,
+      args.arch,
+      C0,
+      C1,
+      C2,
+      C3,
+      C4,
+      channels,
+      width_west_buf,
+      width_east_buf
+    )
+    print(f"Dump artifact name to file {hash_filename}")
+    with open(hash_filename, "w") as write_file:
+      json.dump(hashstr, write_file)
+    print("COMPILE ONLY: EXIT")
+    return
+
+  print(f"Load artifact name from file {hash_filename}")
+  with open(hash_filename, "r") as f:
+    artifact_path = json.load(f)
+
+  # output tensor via D2H
+  E_1d = np.zeros(height*width*pe_length, np.float32)
+
+  memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
+
+  with SdkRuntime(artifact_path, simulator=args.simulator) as runner:
+
+    symbol_A = runner.get_id("A")
+    symbol_time_memcpy = runner.get_id("time_memcpy")
+    symbol_time_ref = runner.get_id("time_ref")
+
+    # load() and run() are called by client.SdkRuntime.__enter__
+    #runner.load()
+    #runner.run()
+
+    print("step 1: sync() synchronizes all PEs and records reference clock")
+    runner.call("f_sync", [], nonblock=True)
+
+    print("step 2: tic() records time_start")
+    runner.call("f_tic", [], nonblock=True)
+
+    if args.d2h:
+      for j in range(loop_count):
+        print(f"step 3: measure D2H with loop_count = {loop_count}, {j}-th")
+        runner.memcpy_d2h(E_1d, symbol_A, 0, 0, width, height, pe_length,\
+          streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+    else:
+      for j in range(loop_count):
+        print(f"step 3: measure H2D with loop_count = {loop_count}, {j}-th")
+        runner.memcpy_h2d(symbol_A, A_1d, 0, 0, width, height, pe_length,\
+          streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+    print("step 4: toc() records time_end")
+    runner.call("f_toc", [], nonblock=False)
+
+    print("step 5: prepare (time_start, time_end)")
+    runner.call("f_memcpy_timestamps", [], nonblock=False)
+
+    print("step 6: D2H (time_start, time_end)")
+    # time_start/time_end is of type u16[3]
+    # {time_start, time_end} is packed into three f32
+    time_memcpy_1d_f32 = np.zeros(height*width*3, np.float32)
+    runner.memcpy_d2h(time_memcpy_1d_f32, symbol_time_memcpy, 0, 0, width, height, 3,\
+      streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=True)
+
+    print("step 7: prepare reference clock")
+    runner.call("f_reference_timestamps", [], nonblock=False)
+
+    print("step 8: D2H reference clock")
+    # time_ref is of type u16[3], packed into two f32
+    time_ref_1d_f32 = np.zeros(height*width*2, np.float32)
+    runner.memcpy_d2h(time_ref_1d_f32, symbol_time_ref, 0, 0, width, height, 2,\
+      streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
+
+    # stop() is called by client.SdkRuntime.__exit__
+    #runner.stop()
+
+  time_memcpy_hwl = np.reshape(time_memcpy_1d_f32, (height, width, 3), order='C')
+  time_ref_hwl = np.reshape(time_ref_1d_f32, (height, width, 2), order='C')
+
+  # time_start = start time of H2D/D2H
+  time_start = np.zeros((height, width)).astype(int)
+  # time_end = end time of H2D/D2H
+  time_end = np.zeros((height, width)).astype(int)
+  word = np.zeros(3).astype(np.uint16)
+  for w in range(width):
+    for h in range(height):
+      hex_t0 = int(float_to_hex(time_memcpy_hwl[(h, w, 0)]), base=16)
+      hex_t1 = int(float_to_hex(time_memcpy_hwl[(h, w, 1)]), base=16)
+      hex_t2 = int(float_to_hex(time_memcpy_hwl[(h, w, 2)]), base=16)
+      word[0] = hex_t0 & 0x0000ffff
+      word[1] = (hex_t0 >> 16) & 0x0000ffff
+      word[2] = hex_t1 & 0x0000ffff
+      time_start[(h, w)] = make_u48(word)
+      word[0] = (hex_t1 >> 16) & 0x0000ffff
+      word[1] = hex_t2 & 0x0000ffff
+      word[2] = (hex_t2 >> 16) & 0x0000ffff
+      time_end[(h, w)] = make_u48(word)
+
+  # time_ref = reference clock
+  time_ref = np.zeros((height, width)).astype(int)
+  word = np.zeros(3).astype(np.uint16)
+  for w in range(width):
+    for h in range(height):
+      hex_t0 = int(float_to_hex(time_ref_hwl[(h, w, 0)]), base=16)
+      hex_t1 = int(float_to_hex(time_ref_hwl[(h, w, 1)]), base=16)
+      word[0] = hex_t0 & 0x0000ffff
+      word[1] = (hex_t0 >> 16) & 0x0000ffff
+      word[2] = hex_t1 & 0x0000ffff
+      time_ref[(h, w)] = make_u48(word)
+  # adjust the reference clock by the propagation delay
+  for py in range(height):
+    for px in range(width):
+      time_ref[(py, px)] = time_ref[(py, px)] - (px + py)
+
+  # shift time_start and time_end by time_ref
+  time_start = time_start - time_ref
+  time_end = time_end - time_ref
+
+  # cycles_send = max(time_end) - min(time_start)
+  # 850MHz --> 1 cycle = (1/0.85) ns = (1/0.85)*1.e-3 us
+  # time_send = (cycles_send / 0.85) *1.e-3 us
+  # bandwidth = ((wvlts * 4)/time_send)*loop_count MB/s
+  wvlts = height*width*pe_length
+  min_time_start = time_start.min()
+  max_time_end = time_end.max()
+  cycles_send = max_time_end - min_time_start
+  time_send = (cycles_send / 0.85) *1.e-3
+  bandwidth = ((wvlts * 4)/time_send)*loop_count
+  print(f"wvlts = {wvlts}, loop_count = {loop_count}")
+  print(f"cycles_send = {cycles_send} cycles")
+  print(f"time_send = {time_send} us")
+  print(f"bandwidth = {bandwidth} MB/s")
+
+
+if __name__ == "__main__":
+  main()
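The timing arithmetic described in the docstring of `run.appliance.py` can be sketched in isolation. The example below assumes, as the script does, 48-bit timestamps split into three 16-bit words (low word first) and an 850 MHz PE clock. The helper name `bandwidth_mb_per_s` and the sample cycle counts are hypothetical and chosen only for illustration.

```python
def make_u48(words):
    """Combine three 16-bit words (least significant first) into one integer."""
    return int(words[0]) + (int(words[1]) << 16) + (int(words[2]) << 32)


def bandwidth_mb_per_s(time_start, time_end, wvlts, loop_count, clock_ghz=0.85):
    """Bandwidth from per-PE start/end cycle counts (already shifted by time_ref)."""
    cycles_send = max(time_end) - min(time_start)
    time_send_us = (cycles_send / clock_ghz) * 1.0e-3  # cycles -> ns -> us
    return (wvlts * 4 * loop_count) / time_send_us     # bytes per us == MB/s


# Hypothetical sample: 125 wavelets (4 bytes each) sent once over 850 cycles.
starts = [make_u48([100, 0, 0]), make_u48([105, 0, 0])]
ends = [make_u48([940, 0, 0]), make_u48([950, 0, 0])]
print(bandwidth_mb_per_s(starts, ends, wvlts=125, loop_count=1))
```

The `int()` casts keep the shifts in Python integers, so the result does not depend on how NumPy promotes `uint16` scalars. With the sample values the transfer spans about 1 us, so the reported bandwidth is close to 500 MB/s.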
diff --git a/benchmarks/bandwidthTest/run.py b/benchmarks/bandwidth-test/run.py
similarity index 93%
rename from benchmarks/bandwidthTest/run.py
rename to benchmarks/bandwidth-test/run.py
index 1e0cea7..f0d0ebc 100644
--- a/benchmarks/bandwidthTest/run.py
+++ b/benchmarks/bandwidth-test/run.py
@@ -202,7 +202,7 @@ def hwl_2_oned_colmajor(
 
 
 # How to compile:
-#  <path/to/cslc> bw_sync_layout.csl --fabric-dims=12,7 --fabric-offsets=4,1 \
+#  <path/to/cslc> src/bw_sync_layout.csl --fabric-dims=12,7 --fabric-offsets=4,1 \
 #    --params=width:5,height:5,pe_length:5 \
 #    --params=C0_ID:0 --params=C1_ID:1 --params=C2_ID:2 \
 #    --params=C3_ID:3 --params=C4_ID:4 \
@@ -291,7 +291,7 @@ def main():
   sim_log = os.path.join(dirname, "sim.log")
 
   # layout of a rectangle
-  code_csl = "bw_sync_layout.csl"
+  code_csl = "src/bw_sync_layout.csl"
 
   C0 = 0
   C1 = 1
@@ -329,60 +329,60 @@ def main():
   E_1d = np.zeros(height*width*pe_length, np.float32)
 
   memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
-  simulator = SdkRuntime(dirname, cmaddr=args.cmaddr)
+  runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
 
-  symbol_A = simulator.get_id("A")
-  symbol_time_memcpy = simulator.get_id("time_memcpy")
-  symbol_time_ref = simulator.get_id("time_ref")
+  symbol_A = runner.get_id("A")
+  symbol_time_memcpy = runner.get_id("time_memcpy")
+  symbol_time_ref = runner.get_id("time_ref")
 
-  simulator.load()
-  simulator.run()
+  runner.load()
+  runner.run()
 
   print("step 1: sync() synchronizes all PEs and records reference clock")
-  simulator.call("f_sync", [], nonblock=True)
+  runner.call("f_sync", [], nonblock=True)
 
   print("step 2: tic() records time_start")
-  simulator.call("f_tic", [], nonblock=True)
+  runner.call("f_tic", [], nonblock=True)
 
   if args.d2h:
     for j in range(loop_count):
       print(f"step 3: measure D2H with loop_count = {loop_count}, {j}-th")
-      simulator.memcpy_d2h(E_1d, symbol_A, 0, 0, width, height, pe_length,\
+      runner.memcpy_d2h(E_1d, symbol_A, 0, 0, width, height, pe_length,\
           streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
   else:
     for j in range(loop_count):
       print(f"step 3: measure H2D with loop_count = {loop_count}, {j}-th")
-      simulator.memcpy_h2d(symbol_A, A_1d, 0, 0, width, height, pe_length,\
+      runner.memcpy_h2d(symbol_A, A_1d, 0, 0, width, height, pe_length,\
           streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
   print("step 4: toc() records time_end")
-  simulator.call("f_toc", [], nonblock=False)
+  runner.call("f_toc", [], nonblock=False)
 
   print("step 5: prepare (time_start, time_end)")
-  simulator.call("f_memcpy_timestamps", [], nonblock=False)
+  runner.call("f_memcpy_timestamps", [], nonblock=False)
 
   print("step 6: D2H (time_start, time_end)")
   # time_start/time_end is of type u16[3]
   # {time_start, time_end} is packed into three f32
   time_memcpy_1d_f32 = np.zeros(height*width*3, np.float32)
-  simulator.memcpy_d2h(time_memcpy_1d_f32, symbol_time_memcpy, 0, 0, width, height, 3,\
+  runner.memcpy_d2h(time_memcpy_1d_f32, symbol_time_memcpy, 0, 0, width, height, 3,\
     streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
   time_memcpy_hwl = np.reshape(time_memcpy_1d_f32, (height, width, 3), order='C')
 
   print("step 7: prepare reference clock")
-  simulator.call("f_reference_timestamps", [], nonblock=False)
+  runner.call("f_reference_timestamps", [], nonblock=False)
 
   print("step 8: D2H reference clock")
   # time_ref is of type u16[3], packed into two f32
   time_ref_1d_f32 = np.zeros(height*width*2, np.float32)
-  simulator.memcpy_d2h(time_ref_1d_f32, symbol_time_ref, 0, 0, width, height, 2,\
+  runner.memcpy_d2h(time_ref_1d_f32, symbol_time_ref, 0, 0, width, height, 2,\
     streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
   time_ref_hwl = np.reshape(time_ref_1d_f32, (height, width, 2), order='C')
 
-  #simulator.stop(core_path)
-  simulator.stop()
+  #runner.stop(core_path)
+  runner.stop()
 
-  if args.cmaddr is None:
+  if args.simulator:
     # move simulation log and core dump to the given folder
     dst_log = Path(f"{dirname}/sim.log")
     src_log = Path("sim.log")
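The `hwl_2_oned_colmajor` helper named in the hunk header above flattens an `A[height][width][pe_length]` tensor with `height` varying fastest. As a hedged sketch, the same ordering can be obtained from NumPy's Fortran-order flatten. The function `flatten_colmajor` is a hypothetical stand-in, not part of the benchmark; the loop version is reproduced only to check the equivalence.

```python
import numpy as np


def hwl_2_oned_colmajor(height, width, pe_length, A_hwl):
    """Reference loop: h varies fastest, then w, then l."""
    A_1d = np.zeros(height * width * pe_length, np.float32)
    idx = 0
    for l in range(pe_length):
        for w in range(width):
            for h in range(height):
                A_1d[idx] = A_hwl[h, w, l]
                idx += 1
    return A_1d


def flatten_colmajor(A_hwl):
    """Vectorized equivalent (assumption: A_hwl is indexed [h, w, l])."""
    return np.asarray(A_hwl, dtype=np.float32).ravel(order="F")


A = np.arange(2 * 3 * 4).reshape(2, 3, 4).astype(np.float32)
assert np.array_equal(hwl_2_oned_colmajor(2, 3, 4, A), flatten_colmajor(A))
```

The vectorized form avoids a Python-level triple loop, which matters mainly when the tensor is large. For the 5x5x5 case in the sample commands either version is adequate.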
diff --git a/benchmarks/bandwidthTest/bw_sync_kernel.csl b/benchmarks/bandwidth-test/src/bw_sync_kernel.csl
similarity index 100%
rename from benchmarks/bandwidthTest/bw_sync_kernel.csl
rename to benchmarks/bandwidth-test/src/bw_sync_kernel.csl
diff --git a/benchmarks/bandwidthTest/bw_sync_layout.csl b/benchmarks/bandwidth-test/src/bw_sync_layout.csl
similarity index 100%
rename from benchmarks/bandwidthTest/bw_sync_layout.csl
rename to benchmarks/bandwidth-test/src/bw_sync_layout.csl
diff --git a/benchmarks/bandwidthTest/sync/layout.csl b/benchmarks/bandwidth-test/src/sync/layout.csl
similarity index 100%
rename from benchmarks/bandwidthTest/sync/layout.csl
rename to benchmarks/bandwidth-test/src/sync/layout.csl
diff --git a/benchmarks/bandwidthTest/sync/pe.csl b/benchmarks/bandwidth-test/src/sync/pe.csl
similarity index 100%
rename from benchmarks/bandwidthTest/sync/pe.csl
rename to benchmarks/bandwidth-test/src/sync/pe.csl
diff --git a/benchmarks/bicgstab/cmd_parser.py b/benchmarks/bicgstab/cmd_parser.py
index 6023f0c..7e7bd51 100644
--- a/benchmarks/bicgstab/cmd_parser.py
+++ b/benchmarks/bicgstab/cmd_parser.py
@@ -78,10 +78,9 @@ def parse_args():
   parser.add_argument(
       "--run-only",
       help="Run only", action="store_true")
-  # arch = wse1 or wse2
   parser.add_argument(
       "--arch",
-      help="wse1 or wse2. Default is wse1 when not supplied.")
+      help="wse2 or wse3. Default is wse2 when not supplied.")
   parser.add_argument(
       "--width-west-buf",
       default=0, type=int,
diff --git a/benchmarks/bicgstab/commands.sh b/benchmarks/bicgstab/commands_wse2.sh
similarity index 100%
rename from benchmarks/bicgstab/commands.sh
rename to benchmarks/bicgstab/commands_wse2.sh
diff --git a/benchmarks/bicgstab/commands_wse3.sh b/benchmarks/bicgstab/commands_wse3.sh
new file mode 100755
index 0000000..a1aa2f8
--- /dev/null
+++ b/benchmarks/bicgstab/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,MAX_ZDIM:5 --params=BLOCK_SIZE:2 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 --params=C5_ID:5 \
+--params=C6_ID:6 --params=C7_ID:7 --params=C8_ID:8 -o=out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --channels=1 \
+--width-west-buf=0 --width-east-buf=0 --zDim=5 --run-only --max-ite=2
diff --git a/benchmarks/bicgstab/blas.csl b/benchmarks/bicgstab/src/blas.csl
similarity index 100%
rename from benchmarks/bicgstab/blas.csl
rename to benchmarks/bicgstab/src/blas.csl
diff --git a/benchmarks/bicgstab/kernel.csl b/benchmarks/bicgstab/src/kernel.csl
similarity index 97%
rename from benchmarks/bicgstab/kernel.csl
rename to benchmarks/bicgstab/src/kernel.csl
index 9b7622b..f9b156e 100644
--- a/benchmarks/bicgstab/kernel.csl
+++ b/benchmarks/bicgstab/src/kernel.csl
@@ -34,7 +34,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -43,7 +43,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/bicgstab/kernel_bicgstab.csl b/benchmarks/bicgstab/src/kernel_bicgstab.csl
similarity index 98%
rename from benchmarks/bicgstab/kernel_bicgstab.csl
rename to benchmarks/bicgstab/src/kernel_bicgstab.csl
index 1fea728..d98ae6e 100644
--- a/benchmarks/bicgstab/kernel_bicgstab.csl
+++ b/benchmarks/bicgstab/src/kernel_bicgstab.csl
@@ -36,7 +36,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = f_trigger_state_machine,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -45,7 +45,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = f_trigger_state_machine,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/bicgstab/layout.csl b/benchmarks/bicgstab/src/layout.csl
similarity index 97%
rename from benchmarks/bicgstab/layout.csl
rename to benchmarks/bicgstab/src/layout.csl
index 91dd053..a8387f2 100644
--- a/benchmarks/bicgstab/layout.csl
+++ b/benchmarks/bicgstab/src/layout.csl
@@ -66,14 +66,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/bicgstab/layout_bicgstab.csl b/benchmarks/bicgstab/src/layout_bicgstab.csl
similarity index 97%
rename from benchmarks/bicgstab/layout_bicgstab.csl
rename to benchmarks/bicgstab/src/layout_bicgstab.csl
index 2e29073..1d02c99 100644
--- a/benchmarks/bicgstab/layout_bicgstab.csl
+++ b/benchmarks/bicgstab/src/layout_bicgstab.csl
@@ -69,14 +69,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/cholesky/commands.sh b/benchmarks/cholesky/commands_wse2.sh
similarity index 100%
rename from benchmarks/cholesky/commands.sh
rename to benchmarks/cholesky/commands_wse2.sh
diff --git a/benchmarks/cholesky/commands_wse3.sh b/benchmarks/cholesky/commands_wse3.sh
new file mode 100755
index 0000000..0ae99ed
--- /dev/null
+++ b/benchmarks/cholesky/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=17,12 --fabric-offsets=4,1 \
+--params=P:10,Nt:4 -o out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/benchmarks/conjugateGradient/README.rst b/benchmarks/conjugate-gradient/README.rst
similarity index 100%
rename from benchmarks/conjugateGradient/README.rst
rename to benchmarks/conjugate-gradient/README.rst
diff --git a/benchmarks/conjugateGradient/cg.py b/benchmarks/conjugate-gradient/cg.py
similarity index 100%
rename from benchmarks/conjugateGradient/cg.py
rename to benchmarks/conjugate-gradient/cg.py
diff --git a/benchmarks/conjugateGradient/cmd_parser.py b/benchmarks/conjugate-gradient/cmd_parser.py
similarity index 97%
rename from benchmarks/conjugateGradient/cmd_parser.py
rename to benchmarks/conjugate-gradient/cmd_parser.py
index 6023f0c..7e7bd51 100644
--- a/benchmarks/conjugateGradient/cmd_parser.py
+++ b/benchmarks/conjugate-gradient/cmd_parser.py
@@ -78,10 +78,9 @@ def parse_args():
   parser.add_argument(
       "--run-only",
       help="Run only", action="store_true")
-  # arch = wse1 or wse2
   parser.add_argument(
       "--arch",
-      help="wse1 or wse2. Default is wse1 when not supplied.")
+      help="wse2 or wse3. Default is wse2 when not supplied.")
   parser.add_argument(
       "--width-west-buf",
       default=0, type=int,
diff --git a/benchmarks/conjugateGradient/commands.sh b/benchmarks/conjugate-gradient/commands_wse2.sh
similarity index 100%
rename from benchmarks/conjugateGradient/commands.sh
rename to benchmarks/conjugate-gradient/commands_wse2.sh
diff --git a/benchmarks/conjugate-gradient/commands_wse3.sh b/benchmarks/conjugate-gradient/commands_wse3.sh
new file mode 100755
index 0000000..a1aa2f8
--- /dev/null
+++ b/benchmarks/conjugate-gradient/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,MAX_ZDIM:5 --params=BLOCK_SIZE:2 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 --params=C5_ID:5 \
+--params=C6_ID:6 --params=C7_ID:7 --params=C8_ID:8 -o=out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --channels=1 \
+--width-west-buf=0 --width-east-buf=0 --zDim=5 --run-only --max-ite=2
diff --git a/benchmarks/conjugateGradient/run.py b/benchmarks/conjugate-gradient/run.py
similarity index 100%
rename from benchmarks/conjugateGradient/run.py
rename to benchmarks/conjugate-gradient/run.py
diff --git a/benchmarks/conjugateGradient/run_cg.py b/benchmarks/conjugate-gradient/run_cg.py
similarity index 100%
rename from benchmarks/conjugateGradient/run_cg.py
rename to benchmarks/conjugate-gradient/run_cg.py
diff --git a/benchmarks/conjugateGradient/blas.csl b/benchmarks/conjugate-gradient/src/blas.csl
similarity index 100%
rename from benchmarks/conjugateGradient/blas.csl
rename to benchmarks/conjugate-gradient/src/blas.csl
diff --git a/benchmarks/conjugateGradient/kernel.csl b/benchmarks/conjugate-gradient/src/kernel.csl
similarity index 97%
rename from benchmarks/conjugateGradient/kernel.csl
rename to benchmarks/conjugate-gradient/src/kernel.csl
index c893e1b..451dbc0 100644
--- a/benchmarks/conjugateGradient/kernel.csl
+++ b/benchmarks/conjugate-gradient/src/kernel.csl
@@ -34,7 +34,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -43,7 +43,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/conjugateGradient/kernel_cg.csl b/benchmarks/conjugate-gradient/src/kernel_cg.csl
similarity index 98%
rename from benchmarks/conjugateGradient/kernel_cg.csl
rename to benchmarks/conjugate-gradient/src/kernel_cg.csl
index 79e78f8..d92a0e3 100644
--- a/benchmarks/conjugateGradient/kernel_cg.csl
+++ b/benchmarks/conjugate-gradient/src/kernel_cg.csl
@@ -36,7 +36,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = f_trigger_state_machine,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -45,7 +45,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = f_trigger_state_machine,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/conjugateGradient/layout.csl b/benchmarks/conjugate-gradient/src/layout.csl
similarity index 97%
rename from benchmarks/conjugateGradient/layout.csl
rename to benchmarks/conjugate-gradient/src/layout.csl
index 1469b40..5d44e99 100644
--- a/benchmarks/conjugateGradient/layout.csl
+++ b/benchmarks/conjugate-gradient/src/layout.csl
@@ -66,14 +66,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/conjugateGradient/layout_cg.csl b/benchmarks/conjugate-gradient/src/layout_cg.csl
similarity index 97%
rename from benchmarks/conjugateGradient/layout_cg.csl
rename to benchmarks/conjugate-gradient/src/layout_cg.csl
index 6d15b35..fc56467 100644
--- a/benchmarks/conjugateGradient/layout_cg.csl
+++ b/benchmarks/conjugate-gradient/src/layout_cg.csl
@@ -69,14 +69,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/conjugateGradient/util.py b/benchmarks/conjugate-gradient/util.py
similarity index 100%
rename from benchmarks/conjugateGradient/util.py
rename to benchmarks/conjugate-gradient/util.py
diff --git a/benchmarks/csl-libs/stencil_3d_7pts/pe.csl b/benchmarks/csl-libs/stencil_3d_7pts/pe.csl
index eb74994..7d5c135 100644
--- a/benchmarks/csl-libs/stencil_3d_7pts/pe.csl
+++ b/benchmarks/csl-libs/stencil_3d_7pts/pe.csl
@@ -58,9 +58,9 @@ const api_wse3 = @is_arch("wse3");
 // The user must specify --import-path=<path to csl-libs>
 fn get_stencil_module() comptime_string {
   if (api_wse3) {
-    return "../csl-libs/stencil_3d_7pts/wse3/pe.csl";
+    return "../../csl-libs/stencil_3d_7pts/wse3/pe.csl";
   }else{
-    return "../csl-libs/stencil_3d_7pts/wse2/pe.csl";
+    return "../../csl-libs/stencil_3d_7pts/wse2/pe.csl";
   }
 }
 
diff --git a/benchmarks/game-of-life/README.rst b/benchmarks/game-of-life/README.rst
new file mode 100644
index 0000000..8c84138
--- /dev/null
+++ b/benchmarks/game-of-life/README.rst
@@ -0,0 +1,36 @@
+Conway's Game of Life
+=====================
+
+This program implements
+`Conway's Game of Life <https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life>`_
+on the WSE.
+
+Conway's Game of Life is a cellular automaton which evolves on a 2D grid of
+square cells. Each cell is in one of two possible states, LIVE or DEAD.
+Every cell interacts with its neighbors, which are the cells horizontally,
+vertically, or diagonally adjacent. At each step in time, the following
+transitions occur:
+
+- Any LIVE cell with fewer than two LIVE neighbors becomes a DEAD cell.
+- Any LIVE cell with two or three LIVE neighbors stays a LIVE cell.
+- Any LIVE cell with more than three LIVE neighbors becomes a DEAD cell.
+- Any DEAD cell with exactly three LIVE neighbors becomes a LIVE cell.
+
+This program implements the Game of Life by assigning one cell to each PE.
+Zero boundary conditions are used, and thus the neighbors of a border PE that
+fall outside of the program rectangle are treated as always DEAD.
+
+In each generation, each PE sends its state to its four N, S, E, and W
+neighbors. Each PE receives the state of its four N, S, E, and W neighbors, and
+also forwards the received state from its N and S neighbors to its E and W
+neighbors. Thus, each PE receives from its E and W links the states of its
+E and W adjacent neighbors, as well as those of its four diagonal neighbors.
+
+The program implements two initial conditions, ``random`` and ``glider``.
+``random`` randomly initializes the state of all cells. ``glider`` generates
+several glider objects across the grid. The initial condition can be set with
+the ``--initial-state`` flag.
+
+The ``--show-ascii-animation`` flag will generate an ASCII animation of the
+cellular automaton's evolution when the program is complete.
+``--save-animation`` will save a GIF of the automaton's evolution.
diff --git a/benchmarks/game-of-life/commands_wse2.sh b/benchmarks/game-of-life/commands_wse2.sh
new file mode 100755
index 0000000..09a131c
--- /dev/null
+++ b/benchmarks/game-of-life/commands_wse2.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse2 ./layout.csl --fabric-dims=19,14 --fabric-offsets=4,1 \
+--params=x_dim:12,y_dim:12 --memcpy --channels=1 -o out
+cs_python run.py --name out --initial-state glider --iters 20
diff --git a/benchmarks/game-of-life/commands_wse3.sh b/benchmarks/game-of-life/commands_wse3.sh
new file mode 100755
index 0000000..17e4629
--- /dev/null
+++ b/benchmarks/game-of-life/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=19,14 --fabric-offsets=4,1 \
+--params=x_dim:12,y_dim:12 --memcpy --channels=1 -o out
+cs_python run.py --name out --initial-state glider --iters 20
diff --git a/benchmarks/game-of-life/layout.csl b/benchmarks/game-of-life/layout.csl
new file mode 100644
index 0000000..cad0671
--- /dev/null
+++ b/benchmarks/game-of-life/layout.csl
@@ -0,0 +1,129 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// kernel dimensions
+param x_dim: i16;
+param y_dim: i16;
+
+// Colors
+const east_color_0:  color = @get_color(0);
+const east_color_1:  color = @get_color(1);
+const west_color_0:  color = @get_color(2);
+const west_color_1:  color = @get_color(3);
+const south_color_0: color = @get_color(4);
+const south_color_1: color = @get_color(5);
+const north_color_0: color = @get_color(6);
+const north_color_1: color = @get_color(7);
+
+// This example uses x_dim x y_dim PEs
+const memcpy = @import_module("<memcpy/get_params>", .{
+  .width = x_dim,
+  .height = y_dim
+});
+
+layout {
+  // PE coordinates are (column, row)
+  @set_rectangle(x_dim, y_dim);
+
+  const x_even_params = .{
+    .send_east_color = east_color_0, .send_west_color = west_color_1,
+    .recv_east_color = west_color_0, .recv_west_color = east_color_1,
+  };
+
+  const x_odd_params = .{
+    .send_east_color = east_color_1, .send_west_color = west_color_0,
+    .recv_east_color = west_color_1, .recv_west_color = east_color_0,
+  };
+
+  const y_even_params = .{
+    .send_south_color = south_color_0, .send_north_color = north_color_1,
+    .recv_south_color = north_color_0, .recv_north_color = south_color_1,
+  };
+
+  const y_odd_params = .{
+    .send_south_color = south_color_1, .send_north_color = north_color_0,
+    .recv_south_color = north_color_1, .recv_north_color = south_color_0,
+  };
+
+  for (@range(i16, x_dim)) |pe_x| {
+    const west_edge = (pe_x == 0);
+    const east_edge = (pe_x == x_dim-1);
+
+    const x_color_params = if (pe_x % 2 == 0) x_even_params else x_odd_params;
+
+    const x_params = @concat_structs(
+                       .{ .is_west_edge = west_edge, .is_east_edge = east_edge,
+                          .memcpy_params = memcpy.get_params(pe_x) },
+                       x_color_params
+                     );
+
+    for (@range(i16, y_dim)) |pe_y| {
+      const north_edge = (pe_y == 0);
+      const south_edge = (pe_y == y_dim-1);
+
+      const y_color_params = if (pe_y % 2 == 0) y_even_params else y_odd_params;
+
+      const y_params = @concat_structs(
+                         .{ .is_north_edge = north_edge, .is_south_edge = south_edge },
+                         y_color_params
+                       );
+
+      @set_tile_code(pe_x, pe_y, "pe_program.csl", @concat_structs(x_params, y_params));
+    }
+  }
+
+  // Create route values
+  const RX_R_TX_E = .{ .rx = .{ RAMP  }, .tx = .{ EAST  }};
+  const RX_W_TX_R = .{ .rx = .{ WEST  }, .tx = .{ RAMP  }};
+  const RX_R_TX_W = .{ .rx = .{ RAMP  }, .tx = .{ WEST  }};
+  const RX_E_TX_R = .{ .rx = .{ EAST  }, .tx = .{ RAMP  }};
+
+  const RX_R_TX_S = .{ .rx = .{ RAMP  }, .tx = .{ SOUTH }};
+  const RX_N_TX_R = .{ .rx = .{ NORTH }, .tx = .{ RAMP  }};
+  const RX_R_TX_N = .{ .rx = .{ RAMP  }, .tx = .{ NORTH }};
+  const RX_S_TX_R = .{ .rx = .{ SOUTH }, .tx = .{ RAMP  }};
+
+
+  for (@range(i16, x_dim)) |pe_x| {
+    for (@range(i16, y_dim)) |pe_y| {
+      if (pe_x % 2 == 0) {
+        @set_color_config(pe_x, pe_y, east_color_0, .{ .routes = RX_R_TX_E });
+        @set_color_config(pe_x, pe_y, east_color_1, .{ .routes = RX_W_TX_R });
+        @set_color_config(pe_x, pe_y, west_color_0, .{ .routes = RX_E_TX_R });
+        @set_color_config(pe_x, pe_y, west_color_1, .{ .routes = RX_R_TX_W });
+      } else {
+        @set_color_config(pe_x, pe_y, east_color_0, .{ .routes = RX_W_TX_R });
+        @set_color_config(pe_x, pe_y, east_color_1, .{ .routes = RX_R_TX_E });
+        @set_color_config(pe_x, pe_y, west_color_0, .{ .routes = RX_R_TX_W });
+        @set_color_config(pe_x, pe_y, west_color_1, .{ .routes = RX_E_TX_R });
+      }
+
+      if (pe_y % 2 == 0) {
+        @set_color_config(pe_x, pe_y, south_color_0, .{ .routes = RX_R_TX_S });
+        @set_color_config(pe_x, pe_y, south_color_1, .{ .routes = RX_N_TX_R });
+        @set_color_config(pe_x, pe_y, north_color_0, .{ .routes = RX_S_TX_R });
+        @set_color_config(pe_x, pe_y, north_color_1, .{ .routes = RX_R_TX_N });
+      } else {
+        @set_color_config(pe_x, pe_y, south_color_0, .{ .routes = RX_N_TX_R });
+        @set_color_config(pe_x, pe_y, south_color_1, .{ .routes = RX_R_TX_S });
+        @set_color_config(pe_x, pe_y, north_color_0, .{ .routes = RX_R_TX_N });
+        @set_color_config(pe_x, pe_y, north_color_1, .{ .routes = RX_S_TX_R });
+      }
+    }
+  }
+
+  // export symbol names
+  @export_name("states", [*]u32, true);
+  @export_name("generate", fn(u16)void);
+}
diff --git a/benchmarks/game-of-life/pe_program.csl b/benchmarks/game-of-life/pe_program.csl
new file mode 100644
index 0000000..c153eff
--- /dev/null
+++ b/benchmarks/game-of-life/pe_program.csl
@@ -0,0 +1,318 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+param memcpy_params: comptime_struct;
+
+param is_east_edge:  bool;
+param is_west_edge:  bool;
+param is_south_edge: bool;
+param is_north_edge: bool;
+
+// Colors
+param send_east_color:  color;
+param send_west_color:  color;
+param send_south_color: color;
+param send_north_color: color;
+
+param recv_east_color:  color;
+param recv_west_color:  color;
+param recv_south_color: color;
+param recv_north_color: color;
+
+// Queue IDs
+const send_east_oq:  output_queue = @get_output_queue(2);
+const send_west_oq:  output_queue = @get_output_queue(3);
+const send_south_oq: output_queue = @get_output_queue(4);
+const send_north_oq: output_queue = @get_output_queue(5);
+
+const recv_east_iq:  input_queue  = @get_input_queue(2);
+const recv_west_iq:  input_queue  = @get_input_queue(3);
+const recv_south_iq: input_queue  = @get_input_queue(4);
+const recv_north_iq: input_queue  = @get_input_queue(5);
+
+// Task IDs
+const send_task_id:           local_task_id = @get_local_task_id(8);
+const sync_send_task_id:      local_task_id = @get_local_task_id(9);
+const sync_fwd_task_id:       local_task_id = @get_local_task_id(10);
+const start_next_gen_task_id: local_task_id = @get_local_task_id(11);
+const fwd_east_west_task_id:  local_task_id = @get_local_task_id(12);
+const exit_task_id:           local_task_id = @get_local_task_id(13);
+
+// On WSE-2, data task IDs are created from colors; on WSE-3, from input queues
+const recv_east_task_id: data_task_id =
+  if      (@is_arch("wse2")) @get_data_task_id(recv_east_color)
+  else if (@is_arch("wse3")) @get_data_task_id(recv_east_iq);
+const recv_west_task_id: data_task_id =
+  if      (@is_arch("wse2")) @get_data_task_id(recv_west_color)
+  else if (@is_arch("wse3")) @get_data_task_id(recv_west_iq);
+const recv_south_task_id: data_task_id =
+  if      (@is_arch("wse2")) @get_data_task_id(recv_south_color)
+  else if (@is_arch("wse3")) @get_data_task_id(recv_south_iq);
+const recv_north_task_id: data_task_id =
+  if      (@is_arch("wse2")) @get_data_task_id(recv_north_color)
+  else if (@is_arch("wse3")) @get_data_task_id(recv_north_iq);
+
+
+// memcpy module provides infrastructure for copying data
+// and launching functions from the host
+const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
+const layout_mod = @import_module("<layout>");
+
+const MAX_GENERATIONS = 1000; // Max num total generations that can be stored
+
+// Number of neighboring PEs for this cell
+const num_neighbors: u16 = (if (is_west_edge)  0 else 1) + (if (is_east_edge)  0 else 1) // W, E
+                         + (if (is_north_edge) 0 else 1) + (if (is_south_edge) 0 else 1) // N, S
+                         + (if (is_north_edge or is_west_edge) 0 else 1)  // NW
+                         + (if (is_north_edge or is_east_edge) 0 else 1)  // NE
+                         + (if (is_south_edge or is_west_edge) 0 else 1)  // SW
+                         + (if (is_south_edge or is_east_edge) 0 else 1); // SE
+
+const num_west_nbrs: u16 = if (is_west_edge) 0
+                           else (1 + (if (is_north_edge) 0 else 1) + (if (is_south_edge) 0 else 1));
+const num_east_nbrs: u16 = if (is_east_edge) 0
+                           else (1 + (if (is_north_edge) 0 else 1) + (if (is_south_edge) 0 else 1));
+
+const num_ns_nbrs: u16 = (if (is_north_edge) 0 else 1) + (if (is_south_edge) 0 else 1);
+
+var iters: u16 = 0; // Number of generations for current run
+var current_iter: u16 = 0; // Track num generations completed so far
+
+// Store states of all cells for each generation
+var states: [MAX_GENERATIONS]u32;
+var states_ptr: [*]u32 = &states;
+var state_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{1} -> states[i] });
+
+// For current generation, track received states from neighbors
+var num_recv: u16 = 0;
+var current_sum: u32 = 0;
+
+var num_west_recv: u16 = 0;
+var num_east_recv: u16 = 0;
+var num_ns_recv: u16 = 0;
+
+// Store values received from N and S to forward E and W
+var fwd_vals: [2]u32;
+
+// DSDs for sending values to N, S, E, W neighbors
+const send_west_dsd = @get_dsd(fabout_dsd, .{
+  .fabric_color = send_west_color, .extent = 1, .output_queue = send_west_oq });
+const send_east_dsd = @get_dsd(fabout_dsd, .{
+  .fabric_color = send_east_color, .extent = 1, .output_queue = send_east_oq });
+const send_north_dsd = @get_dsd(fabout_dsd, .{
+  .fabric_color = send_north_color, .extent = 1, .output_queue = send_north_oq });
+const send_south_dsd = @get_dsd(fabout_dsd, .{
+  .fabric_color = send_south_color, .extent = 1, .output_queue = send_south_oq });
+
+// Send current state to all four neighbors
+task send() void {
+  if (!is_north_edge) @fmovs(send_north_dsd, state_dsd, .{ .async = true });
+  if (!is_south_edge) @fmovs(send_south_dsd, state_dsd, .{ .async = true });
+
+  // When sending to E and W finishes, allow sync_fwd task to proceed
+  // sync_fwd allows us to begin forwarding states received from N/ S to E/ W
+  if (!is_west_edge) @fmovs(send_west_dsd, state_dsd,
+                            .{ .async = true, .unblock = sync_fwd_task_id });
+  if (!is_east_edge) @fmovs(send_east_dsd, state_dsd,
+                            .{ .async = true, .activate = sync_fwd_task_id });
+
+  if (is_west_edge) @unblock(sync_fwd_task_id);
+  if (is_east_edge) @activate(sync_fwd_task_id);
+
+  // Do not send again until we forward N/ S recvs to E/ W neighbors
+  @block(send_task_id);
+}
+
+// Guarantee that we do not begin forwarding N/ S recvs to E/ W neighbors
+// until E/ W sends from our cell complete
+task sync_fwd() void {
+  @block(sync_fwd_task_id);
+  @unblock(fwd_east_west_task_id);
+}
+
+// Forward states received from N/ S neighbors to E/ W neighbors
+task fwd_east_west() void {
+  // fwd_vals[0] is N neighbor forwarded to E and W
+  // fwd_vals[1] is S neighbor forwarded to E and W
+  // if we are N edge, there is no N neighbor to forward, so we access only fwd_vals[1]
+  const offset = if (is_north_edge) 1 else 0;
+  const fwd_dsd = @get_dsd(mem1d_dsd,
+                           .{ .tensor_access = |i|{num_ns_nbrs} -> fwd_vals[i + offset] });
+
+  // When forwarding to E and W finishes, allow sync_send task to proceed
+  // sync_send allows us to begin sending next generation
+  if (!is_west_edge) @fmovs(send_west_dsd, fwd_dsd,
+                            .{ .async = true, .unblock = sync_send_task_id });
+  if (!is_east_edge) @fmovs(send_east_dsd, fwd_dsd,
+                            .{ .async = true, .activate = sync_send_task_id });
+
+  if (is_west_edge) @unblock(sync_send_task_id);
+  if (is_east_edge) @activate(sync_send_task_id);
+
+  // Do not forward again until we complete next generation E/ W sends
+  // from our cell to neighbors
+  @block(fwd_east_west_task_id);
+}
+
+// Guarantee that we do not begin sending next generation until we have forwarded
+// all neighbors from current generation
+task sync_send() void {
+  @block(sync_send_task_id);
+  @unblock(send_task_id);
+}
+
+// In each generation, PE will receive from W up to three times:
+// W neighbor, NW neighbor, and SW neighbor
+task recv_west(val: u32) void {
+  num_west_recv += 1;
+  num_recv += 1;
+  current_sum += val;
+
+  // If we have received from all W neighbors, block to prevent
+  // any activations until we begin next generation
+  if (num_west_recv == num_west_nbrs) @block(recv_west_task_id);
+  // If we have received from all neighbors, begin next generation
+  if (num_recv == num_neighbors) @activate(start_next_gen_task_id);
+}
+
+// In each generation, PE will receive from E up to three times
+// E neighbor, NE neighbor, and SE neighbor
+task recv_east(val: u32) void {
+  num_east_recv += 1;
+  num_recv += 1;
+  current_sum += val;
+
+  // If we have received from all E neighbors, block to prevent
+  // any activations until we begin next generation
+  if (num_east_recv == num_east_nbrs) @block(recv_east_task_id);
+  // If we have received from all neighbors, begin next generation
+  if (num_recv == num_neighbors) @activate(start_next_gen_task_id);
+}
+
+// In each generation, PE will receive from N if there is N neighbor
+task recv_north(val: u32) void {
+  num_ns_recv += 1;
+  num_recv += 1;
+  current_sum += val;
+
+  // Per generation, we only receive from N once. Block to prevent any
+  // activations until we begin next generation.
+  @block(recv_north_task_id);
+
+  // Store value received from N to forward to E and W neighbors
+  fwd_vals[0] = val;
+
+  // If we have received from N and S, fwd to E and W neighbors
+  if (num_ns_recv == num_ns_nbrs) @activate(fwd_east_west_task_id);
+  // If we have received from all neighbors, begin next generation
+  if (num_recv == num_neighbors) @activate(start_next_gen_task_id);
+}
+
+// In each generation, PE will receive from S if there is S neighbor
+task recv_south(val: u32) void {
+  num_ns_recv += 1;
+  num_recv += 1;
+  current_sum += val;
+
+  // Per generation, we only receive from S once. Block to prevent any
+  // activations until we begin next generation.
+  @block(recv_south_task_id);
+
+  // Store value received from S to forward to E and W neighbors
+  fwd_vals[1] = val;
+
+  // If we have received from N and S, fwd to E and W neighbors
+  if (num_ns_recv == num_ns_nbrs) @activate(fwd_east_west_task_id);
+  // If we have received from all neighbors, begin next generation
+  if (num_recv == num_neighbors) @activate(start_next_gen_task_id);
+}
+
+// Update current state and begin sending next generation to neighbors
+task start_next_gen() void {
+
+  current_iter += 1;
+  state_dsd = @increment_dsd_offset(state_dsd, 1, u32);
+
+  // Previous generation of cell is alive
+  if (states[current_iter-1] == 1) {
+    states[current_iter] = if (current_sum == 2 or current_sum == 3) 1 else 0;
+  // Previous generation of cell is dead
+  } else {
+    states[current_iter] = if (current_sum == 3) 1 else 0;
+  }
+
+  if (current_iter == iters - 1) {
+    @activate(exit_task_id);
+  } else {
+    current_sum = 0;
+    num_recv = 0;
+    num_west_recv = 0;
+    num_east_recv = 0;
+    num_ns_recv = 0;
+    @unblock(recv_west_task_id);
+    @unblock(recv_east_task_id);
+    @unblock(recv_north_task_id);
+    @unblock(recv_south_task_id);
+    @activate(send_task_id);
+  }
+}
+
+task exit() void {
+  sys_mod.unblock_cmd_stream();
+}
+
+fn generate(num_gen: u16) void {
+  // Set number of generations for current run
+  iters = num_gen;
+  @assert(iters <= MAX_GENERATIONS);
+
+  // Begin sending to neighbors
+  @activate(send_task_id);
+}
+
+comptime {
+  @bind_local_task(send, send_task_id);
+  @bind_local_task(sync_send, sync_send_task_id);
+  @bind_local_task(sync_fwd, sync_fwd_task_id);
+  @bind_local_task(start_next_gen, start_next_gen_task_id);
+  @bind_local_task(fwd_east_west, fwd_east_west_task_id);
+  @bind_local_task(exit, exit_task_id);
+
+  @bind_data_task(recv_west,  recv_west_task_id);
+  @bind_data_task(recv_east,  recv_east_task_id);
+  @bind_data_task(recv_north, recv_north_task_id);
+  @bind_data_task(recv_south, recv_south_task_id);
+
+  @block(sync_send_task_id);
+  @block(sync_fwd_task_id);
+
+  // Will only become unblocked after first execution of sync_fwd
+  @block(fwd_east_west_task_id);
+
+  // On WSE-3, we must explicitly initialize input and output queues
+  if (@is_arch("wse3")) {
+    @initialize_queue(send_west_oq,  .{ .color = send_west_color });
+    @initialize_queue(send_east_oq,  .{ .color = send_east_color });
+    @initialize_queue(send_north_oq, .{ .color = send_north_color });
+    @initialize_queue(send_south_oq, .{ .color = send_south_color });
+
+    @initialize_queue(recv_west_iq,  .{ .color = recv_west_color });
+    @initialize_queue(recv_east_iq,  .{ .color = recv_east_color });
+    @initialize_queue(recv_north_iq, .{ .color = recv_north_color });
+    @initialize_queue(recv_south_iq, .{ .color = recv_south_color });
+  }
+
+  @export_symbol(states_ptr, "states");
+  @export_symbol(generate);
+}
diff --git a/benchmarks/game-of-life/run.py b/benchmarks/game-of-life/run.py
new file mode 100644
index 0000000..b94618f
--- /dev/null
+++ b/benchmarks/game-of-life/run.py
@@ -0,0 +1,214 @@
+#!/usr/bin/env cs_python
+
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import argparse
+import json
+import subprocess
+import time
+import matplotlib
+import matplotlib.pyplot as plt
+from matplotlib.animation import FuncAnimation, PillowWriter
+import numpy as np
+
+from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder # pylint: disable=no-name-in-module
+
+matplotlib.use('Agg')
+
+
+def game_of_life_ref(initial_state, num_generations):
+  """Compute reference to check WSE result for game of life generation"""
+
+  x_dim = initial_state.shape[0]
+  y_dim = initial_state.shape[1]
+
+  ref_states = np.zeros((x_dim, y_dim, num_generations))
+  ref_states[:,:,0] = initial_state
+
+  for gen in range(1, num_generations):
+    for i in range(x_dim):
+      for j in range(y_dim):
+        total = (0 if (i == 0)                           else ref_states[i-1,j,  gen-1]) \
+              + (0 if (i == x_dim-1)                     else ref_states[i+1,j,  gen-1]) \
+              + (0 if (j == 0)                           else ref_states[i,  j-1,gen-1]) \
+              + (0 if (j == y_dim-1)                     else ref_states[i,  j+1,gen-1]) \
+              + (0 if ((i == 0)     or (j == 0))         else ref_states[i-1,j-1,gen-1]) \
+              + (0 if ((i == 0)     or (j == y_dim-1))   else ref_states[i-1,j+1,gen-1]) \
+              + (0 if ((i == x_dim-1) or (j == 0))       else ref_states[i+1,j-1,gen-1]) \
+              + (0 if ((i == x_dim-1) or (j == y_dim-1)) else ref_states[i+1,j+1,gen-1])
+
+        if (ref_states[i, j, gen-1] == 1):
+          ref_states[i, j, gen] = 1 if (total in (2, 3)) else 0
+        else:
+          ref_states[i, j, gen] = 1 if (total == 3) else 0
+
+  return ref_states
+
+
+def show_ascii_animation(states):
+  """Generate a command-line ASCII animation"""
+
+  num_generations = states.shape[2]
+  try:
+    for i in range(num_generations):
+      subprocess.run(['clear'], shell=True, check=True)
+      print(f'Generation {i}:\n')
+      for row in states[:, :, i]:
+        print(' '.join(['#' if cell else '.' for cell in row]))
+      print('\nPress Ctrl+C to exit.')
+      time.sleep(0.1)  # Wait for 0.1 seconds before displaying the next frame
+  except KeyboardInterrupt:
+    print('\nAnimation stopped.')
+
+
+def save_animation(states, fname):
+  """Save an animation as a GIF"""
+
+  fig, ax = plt.subplots()
+  ax.set_xticks([])
+  ax.set_yticks([])
+  ax.axis('off')
+
+  frame_image = ax.imshow(states[:, :, 0], cmap='Greys', vmin=0, vmax=1)
+
+  def update_plot(frame_index):
+    frame_image.set_data(states[:, :, frame_index])
+    return [frame_image]
+
+  anim = FuncAnimation(
+    fig,
+    update_plot,
+    frames=states.shape[2],
+    interval=100,  # 0.1 seconds per frame
+    blit=True
+  )
+
+  output_file = fname + '.gif'
+  anim.save(output_file, writer=PillowWriter(fps=10))
+
+
+def create_initial_state(state_type, x_dim, y_dim):
+  """Generate initial state for Game of Life"""
+
+  initial_state = np.zeros((x_dim, y_dim), dtype=np.uint32)
+
+  if state_type == 'glider':
+    assert x_dim >= 4 and y_dim >= 4, \
+           'For glider initial state, x_dim and y_dim must be at least 4'
+
+    glider = np.array([[0, 0, 1],
+                       [1, 0, 1],
+                       [0, 1, 1]])
+
+    for i in range(x_dim//4):
+      for j in range(y_dim//4):
+        if i%2 == 0 and j%2 == 0:
+          initial_state[4*i:4*i+3, 4*j:4*j+3] = glider
+        elif i%2 == 0 and j%2 == 1:
+          initial_state[4*i:4*i+3, 4*j:4*j+3] = glider[:,::-1]
+        elif i%2 == 1 and j%2 == 0:
+          initial_state[4*i:4*i+3, 4*j:4*j+3] = glider[::-1,:]
+        elif i%2 == 1 and j%2 == 1:
+          initial_state[4*i:4*i+3, 4*j:4*j+3] = glider[::-1,:]
+
+  else: # state_type == 'random'
+    np.random.seed(seed=7)
+    initial_state = np.random.binomial(1, 0.5, (x_dim, y_dim)).astype(np.uint32)
+
+  return initial_state
+
+
+def main():
+  """Main method to run the example code."""
+
+  # Read arguments
+  parser = argparse.ArgumentParser()
+  parser.add_argument('--name', help='the test compile output dir', required=True)
+  parser.add_argument('--cmaddr', help='IP:port for CS system')
+  parser.add_argument('--iters', type=int, default=10, help='Number of generations (default: 10)')
+  parser.add_argument('--initial-state', choices=['glider', 'random'], default='glider',
+    help='Specify the initial state of the system (default: glider)'
+  )
+  parser.add_argument('--save-animation', action='store_true',
+    help="Save animated GIF of states"
+  )
+  parser.add_argument('--show-ascii-animation', action='store_true',
+    help="Show ascii animation of states"
+  )
+  args = parser.parse_args()
+
+  # Get matrix dimensions from compile metadata
+  with open(f'{args.name}/out.json', encoding='utf-8') as json_file:
+    compile_data = json.load(json_file)
+
+  # PE grid dimensions
+  x_dim = int(compile_data['params']['x_dim'])
+  y_dim = int(compile_data['params']['y_dim'])
+
+  # Number of generations
+  iters = args.iters
+
+  initial_state = create_initial_state(args.initial_state, x_dim, y_dim)
+
+  # Construct a runner using SdkRuntime
+  runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
+
+  states_symbol = runner.get_id('states')
+
+  # Load and run the program
+  runner.load()
+  runner.run()
+
+  print('Copy initial state to device...')
+  # Copy initial state into all PEs
+  runner.memcpy_h2d(states_symbol, initial_state.flatten(), 0, 0, x_dim, y_dim, 1,
+    streaming=False, order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT,
+    nonblock=False)
+
+  print(f'Run for {iters} generations...')
+  # Launch the generate function on device
+  runner.launch('generate', np.uint16(iters), nonblock=False)
+
+  # Copy states back
+  states_result = np.zeros([x_dim * y_dim * iters], dtype=np.uint32)
+  runner.memcpy_d2h(states_result, states_symbol, 0, 0, x_dim, y_dim, iters, streaming=False,
+    order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)
+
+  # Stop the program
+  runner.stop()
+
+  print('Create output...')
+
+  # Reshape states results to x_dim x y_dim frames
+  all_states = states_result.reshape((x_dim, y_dim, iters))
+
+  # Loop through the frames and display them
+  if args.show_ascii_animation:
+    show_ascii_animation(all_states)
+
+  # Generate animated GIF of generations
+  if args.save_animation:
+    save_animation(all_states, 'game_of_life')
+
+  print('Create reference solution...')
+  ref_states = game_of_life_ref(initial_state, iters)
+
+  # Test that wafer output is equal to the reference
+  np.testing.assert_equal(ref_states, all_states)
+  print('SUCCESS!')
+
+if __name__ == '__main__':
+  main()
diff --git a/benchmarks/gemm-collectives_2d/commands.sh b/benchmarks/gemm-collectives_2d/commands_wse2.sh
similarity index 100%
rename from benchmarks/gemm-collectives_2d/commands.sh
rename to benchmarks/gemm-collectives_2d/commands_wse2.sh
diff --git a/benchmarks/gemm-collectives_2d/commands_wse3.sh b/benchmarks/gemm-collectives_2d/commands_wse3.sh
new file mode 100755
index 0000000..e0e89eb
--- /dev/null
+++ b/benchmarks/gemm-collectives_2d/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,6 --fabric-offsets=4,1 \
+--params=P:4,Mt:14,Kt:14,Nt:14 \
+--memcpy --channels=1 -o out
+cs_python run.py --name out
diff --git a/benchmarks/gemv-checkerboard-pattern/commands.sh b/benchmarks/gemv-checkerboard-pattern/commands_wse2.sh
similarity index 100%
rename from benchmarks/gemv-checkerboard-pattern/commands.sh
rename to benchmarks/gemv-checkerboard-pattern/commands_wse2.sh
diff --git a/benchmarks/gemv-checkerboard-pattern/commands_wse3.sh b/benchmarks/gemv-checkerboard-pattern/commands_wse3.sh
new file mode 100755
index 0000000..2966fb9
--- /dev/null
+++ b/benchmarks/gemv-checkerboard-pattern/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,6 --fabric-offsets=4,1 \
+--colors=x_in:1,ax_out:3,b_in:4 -o out \
+--params=kernel_rows:4,kernel_cols:4,matrix_rows:32,matrix_cols:16 \
+--params=MEMCPYH2D_DATA_1_ID:10 --params=MEMCPYH2D_DATA_2_ID:11 \
+--params=MEMCPYD2H_DATA_1_ID:12 \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/benchmarks/gemv-collectives_2d/commands.sh b/benchmarks/gemv-collectives_2d/commands_wse2.sh
similarity index 100%
rename from benchmarks/gemv-collectives_2d/commands.sh
rename to benchmarks/gemv-collectives_2d/commands_wse2.sh
diff --git a/benchmarks/gemv-collectives_2d/commands_wse3.sh b/benchmarks/gemv-collectives_2d/commands_wse3.sh
new file mode 100755
index 0000000..d6c747d
--- /dev/null
+++ b/benchmarks/gemv-collectives_2d/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,6 --fabric-offsets=4,1 \
+--params=kernel_rows:4,kernel_cols:4,matrix_rows:32,matrix_cols:16 \
+--memcpy --channels=1 -o out
+cs_python run.py --name out
diff --git a/benchmarks/histogram-torus/commands.sh b/benchmarks/histogram-torus/commands_wse2.sh
similarity index 100%
rename from benchmarks/histogram-torus/commands.sh
rename to benchmarks/histogram-torus/commands_wse2.sh
diff --git a/benchmarks/mandelbrot/commands.sh b/benchmarks/mandelbrot/commands_wse2.sh
similarity index 100%
rename from benchmarks/mandelbrot/commands.sh
rename to benchmarks/mandelbrot/commands_wse2.sh
diff --git a/benchmarks/mandelbrot/commands_wse3.sh b/benchmarks/mandelbrot/commands_wse3.sh
new file mode 100755
index 0000000..249e303
--- /dev/null
+++ b/benchmarks/mandelbrot/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./code.csl --fabric-dims=11,6 --fabric-offsets=4,1 -o out \
+--params=MEMCPYD2H_DATA_1_ID:1 \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/benchmarks/powerMethod/README.rst b/benchmarks/power-method/README.rst
similarity index 100%
rename from benchmarks/powerMethod/README.rst
rename to benchmarks/power-method/README.rst
diff --git a/benchmarks/preconditionedConjugateGradient/cmd_parser.py b/benchmarks/power-method/cmd_parser.py
similarity index 97%
rename from benchmarks/preconditionedConjugateGradient/cmd_parser.py
rename to benchmarks/power-method/cmd_parser.py
index 6023f0c..7e7bd51 100644
--- a/benchmarks/preconditionedConjugateGradient/cmd_parser.py
+++ b/benchmarks/power-method/cmd_parser.py
@@ -78,10 +78,9 @@ def parse_args():
   parser.add_argument(
       "--run-only",
       help="Run only", action="store_true")
-  # arch = wse1 or wse2
   parser.add_argument(
       "--arch",
-      help="wse1 or wse2. Default is wse1 when not supplied.")
+      help="wse2 or wse3. Default is wse2 when not supplied.")
   parser.add_argument(
       "--width-west-buf",
       default=0, type=int,
diff --git a/benchmarks/powerMethod/commands.sh b/benchmarks/power-method/commands_wse2.sh
similarity index 100%
rename from benchmarks/powerMethod/commands.sh
rename to benchmarks/power-method/commands_wse2.sh
diff --git a/benchmarks/power-method/commands_wse3.sh b/benchmarks/power-method/commands_wse3.sh
new file mode 100755
index 0000000..5aba2b6
--- /dev/null
+++ b/benchmarks/power-method/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,MAX_ZDIM:5 --params=BLOCK_SIZE:2 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 --params=C5_ID:5 \
+--params=C6_ID:6 --params=C7_ID:7 --params=C8_ID:8 -o=out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --channels=1 \
+--width-west-buf=0 --width-east-buf=0 --zDim=5 --run-only --max-ite=1
diff --git a/benchmarks/powerMethod/power_method.py b/benchmarks/power-method/power_method.py
similarity index 100%
rename from benchmarks/powerMethod/power_method.py
rename to benchmarks/power-method/power_method.py
diff --git a/benchmarks/powerMethod/run.py b/benchmarks/power-method/run.py
similarity index 100%
rename from benchmarks/powerMethod/run.py
rename to benchmarks/power-method/run.py
diff --git a/benchmarks/powerMethod/run_power.py b/benchmarks/power-method/run_power.py
similarity index 100%
rename from benchmarks/powerMethod/run_power.py
rename to benchmarks/power-method/run_power.py
diff --git a/benchmarks/powerMethod/blas.csl b/benchmarks/power-method/src/blas.csl
similarity index 100%
rename from benchmarks/powerMethod/blas.csl
rename to benchmarks/power-method/src/blas.csl
diff --git a/benchmarks/powerMethod/kernel.csl b/benchmarks/power-method/src/kernel.csl
similarity index 96%
rename from benchmarks/powerMethod/kernel.csl
rename to benchmarks/power-method/src/kernel.csl
index ad6550f..97c00cf 100644
--- a/benchmarks/powerMethod/kernel.csl
+++ b/benchmarks/power-method/src/kernel.csl
@@ -34,7 +34,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -43,7 +43,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/powerMethod/kernel_power.csl b/benchmarks/power-method/src/kernel_power.csl
similarity index 97%
rename from benchmarks/powerMethod/kernel_power.csl
rename to benchmarks/power-method/src/kernel_power.csl
index eac9469..8fb3b6f 100644
--- a/benchmarks/powerMethod/kernel_power.csl
+++ b/benchmarks/power-method/src/kernel_power.csl
@@ -36,7 +36,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = f_trigger_state_machine,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -45,7 +45,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = f_trigger_state_machine,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/powerMethod/layout.csl b/benchmarks/power-method/src/layout.csl
similarity index 97%
rename from benchmarks/powerMethod/layout.csl
rename to benchmarks/power-method/src/layout.csl
index 25fa661..3402bd6 100644
--- a/benchmarks/powerMethod/layout.csl
+++ b/benchmarks/power-method/src/layout.csl
@@ -66,14 +66,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/powerMethod/layout_power.csl b/benchmarks/power-method/src/layout_power.csl
similarity index 97%
rename from benchmarks/powerMethod/layout_power.csl
rename to benchmarks/power-method/src/layout_power.csl
index 5c41853..001a4a2 100644
--- a/benchmarks/powerMethod/layout_power.csl
+++ b/benchmarks/power-method/src/layout_power.csl
@@ -69,14 +69,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/stencil-3d-7pts/util.py b/benchmarks/power-method/util.py
similarity index 100%
rename from benchmarks/stencil-3d-7pts/util.py
rename to benchmarks/power-method/util.py
diff --git a/benchmarks/preconditionedConjugateGradient/README.rst b/benchmarks/preconditioned-conjugate-gradient/README.rst
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/README.rst
rename to benchmarks/preconditioned-conjugate-gradient/README.rst
diff --git a/benchmarks/powerMethod/cmd_parser.py b/benchmarks/preconditioned-conjugate-gradient/cmd_parser.py
similarity index 97%
rename from benchmarks/powerMethod/cmd_parser.py
rename to benchmarks/preconditioned-conjugate-gradient/cmd_parser.py
index 6023f0c..7e7bd51 100644
--- a/benchmarks/powerMethod/cmd_parser.py
+++ b/benchmarks/preconditioned-conjugate-gradient/cmd_parser.py
@@ -78,10 +78,9 @@ def parse_args():
   parser.add_argument(
       "--run-only",
       help="Run only", action="store_true")
-  # arch = wse1 or wse2
   parser.add_argument(
       "--arch",
-      help="wse1 or wse2. Default is wse1 when not supplied.")
+      help="wse2 or wse3. Default is wse2 when not supplied.")
   parser.add_argument(
       "--width-west-buf",
       default=0, type=int,
diff --git a/benchmarks/preconditionedConjugateGradient/commands.sh b/benchmarks/preconditioned-conjugate-gradient/commands_wse2.sh
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/commands.sh
rename to benchmarks/preconditioned-conjugate-gradient/commands_wse2.sh
diff --git a/benchmarks/preconditioned-conjugate-gradient/commands_wse3.sh b/benchmarks/preconditioned-conjugate-gradient/commands_wse3.sh
new file mode 100755
index 0000000..a1aa2f8
--- /dev/null
+++ b/benchmarks/preconditioned-conjugate-gradient/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,MAX_ZDIM:5 --params=BLOCK_SIZE:2 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 --params=C5_ID:5 \
+--params=C6_ID:6 --params=C7_ID:7 --params=C8_ID:8 -o=out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --channels=1 \
+--width-west-buf=0 --width-east-buf=0 --zDim=5 --run-only --max-ite=2
diff --git a/benchmarks/preconditionedConjugateGradient/pcg.py b/benchmarks/preconditioned-conjugate-gradient/pcg.py
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/pcg.py
rename to benchmarks/preconditioned-conjugate-gradient/pcg.py
diff --git a/benchmarks/preconditionedConjugateGradient/run.py b/benchmarks/preconditioned-conjugate-gradient/run.py
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/run.py
rename to benchmarks/preconditioned-conjugate-gradient/run.py
diff --git a/benchmarks/preconditionedConjugateGradient/run_pcg.py b/benchmarks/preconditioned-conjugate-gradient/run_pcg.py
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/run_pcg.py
rename to benchmarks/preconditioned-conjugate-gradient/run_pcg.py
diff --git a/benchmarks/preconditionedConjugateGradient/blas.csl b/benchmarks/preconditioned-conjugate-gradient/src/blas.csl
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/blas.csl
rename to benchmarks/preconditioned-conjugate-gradient/src/blas.csl
diff --git a/benchmarks/preconditionedConjugateGradient/kernel.csl b/benchmarks/preconditioned-conjugate-gradient/src/kernel.csl
similarity index 97%
rename from benchmarks/preconditionedConjugateGradient/kernel.csl
rename to benchmarks/preconditioned-conjugate-gradient/src/kernel.csl
index 1742576..2032420 100644
--- a/benchmarks/preconditionedConjugateGradient/kernel.csl
+++ b/benchmarks/preconditioned-conjugate-gradient/src/kernel.csl
@@ -34,7 +34,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -43,7 +43,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = sys_mod.unblock_cmd_stream,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/preconditionedConjugateGradient/kernel_pcg.csl b/benchmarks/preconditioned-conjugate-gradient/src/kernel_pcg.csl
similarity index 98%
rename from benchmarks/preconditionedConjugateGradient/kernel_pcg.csl
rename to benchmarks/preconditioned-conjugate-gradient/src/kernel_pcg.csl
index 523b187..212d211 100644
--- a/benchmarks/preconditionedConjugateGradient/kernel_pcg.csl
+++ b/benchmarks/preconditioned-conjugate-gradient/src/kernel_pcg.csl
@@ -36,7 +36,7 @@ const blas_lib = @import_module("blas.csl");
 const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
 
 // allreduce uses input queue/output queue 1
-const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
+const reduce_mod = @import_module( "../../csl-libs/allreduce/pe.csl", @concat_structs(reduceParams, .{
      .f_callback = f_trigger_state_machine,
      .queues = [1]u16{2},
      .dest_dsr_ids = [1]u16{1},
@@ -45,7 +45,7 @@ const reduce_mod = @import_module( "../csl-libs/allreduce/pe.csl", @concat_struc
      }));
 
 // output queue cannot overlap input queues
-const stencil_mod = @import_module( "../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
+const stencil_mod = @import_module( "../../csl-libs/stencil_3d_7pts/pe.csl", @concat_structs(stencilParams, .{
      .f_callback = f_trigger_state_machine,
      .input_queues = [4]u16{4, 5, 6, 7},
      .output_queues = if (@is_arch("wse3")) [4]u16{4, 5, 6, 7} else [1]u16{3},
diff --git a/benchmarks/preconditionedConjugateGradient/layout.csl b/benchmarks/preconditioned-conjugate-gradient/src/layout.csl
similarity index 97%
rename from benchmarks/preconditionedConjugateGradient/layout.csl
rename to benchmarks/preconditioned-conjugate-gradient/src/layout.csl
index c21617b..de37353 100644
--- a/benchmarks/preconditionedConjugateGradient/layout.csl
+++ b/benchmarks/preconditioned-conjugate-gradient/src/layout.csl
@@ -66,14 +66,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/preconditionedConjugateGradient/layout_pcg.csl b/benchmarks/preconditioned-conjugate-gradient/src/layout_pcg.csl
similarity index 97%
rename from benchmarks/preconditionedConjugateGradient/layout_pcg.csl
rename to benchmarks/preconditioned-conjugate-gradient/src/layout_pcg.csl
index 6794765..97baa74 100644
--- a/benchmarks/preconditionedConjugateGradient/layout_pcg.csl
+++ b/benchmarks/preconditioned-conjugate-gradient/src/layout_pcg.csl
@@ -69,14 +69,14 @@ const EN_REDUCE_2: local_task_id = @get_local_task_id(18);
 const EN_REDUCE_3: local_task_id = @get_local_task_id(19);
 const EN_REDUCE_4: local_task_id = @get_local_task_id(20);
 
-const stencil = @import_module( "../csl-libs/stencil_3d_7pts/layout.csl", .{
+const stencil = @import_module( "../../csl-libs/stencil_3d_7pts/layout.csl", .{
     .colors = [8]color{C0, C1, C2, C3, C4, C5, C6, C7},
     .entrypoints = [3]local_task_id{EN_STENCIL_1, EN_STENCIL_2, EN_STENCIL_3},
     .width = width,
     .height = height
     });
 
-const reduce = @import_module( "../csl-libs/allreduce/layout.csl", .{
+const reduce = @import_module( "../../csl-libs/allreduce/layout.csl", .{
     .colors = [1]color{C8},
     .entrypoints = [4]local_task_id{EN_REDUCE_1, EN_REDUCE_2, EN_REDUCE_3, EN_REDUCE_4},
     .width = width,
diff --git a/benchmarks/preconditionedConjugateGradient/util.py b/benchmarks/preconditioned-conjugate-gradient/util.py
similarity index 100%
rename from benchmarks/preconditionedConjugateGradient/util.py
rename to benchmarks/preconditioned-conjugate-gradient/util.py
diff --git a/benchmarks/residual/commands.sh b/benchmarks/residual/commands_wse2.sh
similarity index 100%
rename from benchmarks/residual/commands.sh
rename to benchmarks/residual/commands_wse2.sh
diff --git a/benchmarks/residual/commands_wse3.sh b/benchmarks/residual/commands_wse3.sh
new file mode 100755
index 0000000..915348e
--- /dev/null
+++ b/benchmarks/residual/commands_wse3.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./layout.csl --arch=wse3 --fabric-dims=9,4 --fabric-offsets=4,1 \
+--params=width:2,height:2 \
+--params=LOCAL_OUT_SZ:3,LOCAL_IN_SZ:2 -o=out --memcpy --channels=1 \
+--width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/benchmarks/row-col-broadcast/README.rst b/benchmarks/row-col-broadcast/README.rst
new file mode 100644
index 0000000..256e263
--- /dev/null
+++ b/benchmarks/row-col-broadcast/README.rst
@@ -0,0 +1,29 @@
+Host-to-Device Broadcast Test
+=============================
+
+This example shows how to use row or column broadcast. For example, to broadcast
+a column of data [1.0, 2.0, 3.0, 4.0] to a region of interest starting at (1,1)
+with width 3 and height 4, one element per PE, the plain H2D API requires the
+user to prepare the following tensor of width 3 and height 4:
+
+.. code-block::
+
+   | 1.0  1.0  1.0 |
+   | 2.0  2.0  2.0 |
+   | 3.0  3.0  3.0 |
+   | 4.0  4.0  4.0 |
+
+and call ``memcpy_h2d()`` to stream all 12 elements to the device, which uses 3x
+the host bandwidth that the 4 distinct values need.
+With the new API, ``memcpy_h2d_rowbcast()``, the user streams only the 4 elements,
+and each one is broadcast across its row of the region of interest.
+
+Column broadcasting works the same way: the user provides the data for a single
+row and calls ``memcpy_h2d_colbcast()``.
+
+The new broadcasting scheme supports only H2D, not D2H.
+
+The kernel of ``row-col-broadcast`` is the same as that of ``bandwidth-test``, and
+``run.py`` also calculates the bandwidth.
+The bandwidth formula is the same as in ``bandwidth-test``, so the user can see
+how much time the new API saves.
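Review note on the README above: the 3x figure can be checked with a small host-side sketch. It does not call the SDK. The helper `expand_row_broadcast` is a hypothetical name used only for illustration. It builds the tensor that a plain `memcpy_h2d()` transfer would need and compares its element count with the 4 values that a row broadcast streams.

```python
# Host-side illustration only: no Cerebras SDK calls are made here.
# `expand_row_broadcast` is a hypothetical helper, not part of any SDK API.

def expand_row_broadcast(column, width):
    """Replicate each element of `column` across `width` PEs, one row per element."""
    return [[value] * width for value in column]


column = [1.0, 2.0, 3.0, 4.0]   # data from the README example
width = 3                        # width of the region of interest
height = len(column)             # height of the region of interest

# Tensor the host would have to prepare for a plain memcpy_h2d() transfer
full_tensor = expand_row_broadcast(column, width)

elements_plain = sum(len(row) for row in full_tensor)  # elements streamed without broadcast
elements_bcast = len(column)                           # elements streamed with row broadcast

for row in full_tensor:
    print(row)
print(f"plain H2D: {elements_plain} elements, "
      f"row broadcast: {elements_bcast} elements, "
      f"ratio: {elements_plain // elements_bcast}x")
```

The ratio equals the width of the region of interest, so the saving in host traffic grows with the number of PEs that share each value.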
diff --git a/benchmarks/row-col-broadcast/cmd_parser.py b/benchmarks/row-col-broadcast/cmd_parser.py
new file mode 100644
index 0000000..a9a55cc
--- /dev/null
+++ b/benchmarks/row-col-broadcast/cmd_parser.py
@@ -0,0 +1,92 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This is not a real test, but a module that gets imported in other tests.
+
+"""command parser for broadcast
+
+   -m <int>      number of rows of the core rectangle
+   -n <int>      number of columns of the core rectangle
+   -k <int>      number of elements of local tensor
+   --latestlink  working directory
+   --cmaddr      IP address of a WSE
+   --roi_px      starting column index of region of interest
+   --roi_py      starting row index of region of interest
+   --roi_w       width of region of interest
+   --roi_h       height of region of interest
+"""
+
+
+import argparse
+import os
+
+
+def parse_args():
+    """command parser"""
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-m", default=1, type=int, help="number of rows")
+    parser.add_argument("-n", default=1, type=int, help="number of columns")
+    parser.add_argument("-k", default=1, type=int, help="size of local tensor")
+    parser.add_argument(
+        "--latestlink", help="folder to contain the log files (default: latest)"
+    )
+    parser.add_argument(
+        "--cmaddr", help="CM address and port, i.e. <IP>:<port>"
+    )
+    parser.add_argument(
+        "--arch", help="wse2 or wse3. Default is wse2 when not supplied."
+    )
+    parser.add_argument(
+        "--channels", default=1, type=int, help="number of channels"
+    )
+    parser.add_argument(
+        "--roi_px", default=1, type=int, help="starting column index of ROI"
+    )
+    parser.add_argument(
+        "--roi_py", default=1, type=int, help="starting row index of ROI"
+    )
+    parser.add_argument("--roi_w", default=3, type=int, help="width of ROI")
+    parser.add_argument("--roi_h", default=3, type=int, help="height of ROI")
+    parser.add_argument(
+        "--use_col_major",
+        action="store_true",
+        help="use column major to send the row or column broadcast",
+    )
+    parser.add_argument(
+        "--is_row_bcast",
+        action="store_true",
+        help="use row broadcast (default: column broadcast)",
+    )
+    parser.add_argument("--fabric-dims", help="Fabric dimension, i.e. <W>,<H>")
+    parser.add_argument(
+        "--loop_count",
+        default=1,
+        type=int,
+        help="number of back-to-back H2D/D2H",
+    )
+
+    args = parser.parse_args()
+
+    logs_dir = "latest"
+    if args.latestlink:
+        logs_dir = args.latestlink
+
+    dir_exist = os.path.isdir(logs_dir)
+    if dir_exist:
+        print(f"{logs_dir} already exists")
+    else:
+        print(f"create {logs_dir} to store log files")
+        os.mkdir(logs_dir)
+
+    return args, logs_dir
diff --git a/benchmarks/row-col-broadcast/commands_wse2.sh b/benchmarks/row-col-broadcast/commands_wse2.sh
new file mode 100755
index 0000000..b8c73e6
--- /dev/null
+++ b/benchmarks/row-col-broadcast/commands_wse2.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout.csl --arch wse2 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,pe_length:5 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 -o=out \
+--memcpy --channels=2 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --is_row_bcast --loop_count=1
diff --git a/benchmarks/row-col-broadcast/commands_wse3.sh b/benchmarks/row-col-broadcast/commands_wse3.sh
new file mode 100755
index 0000000..4ed65f7
--- /dev/null
+++ b/benchmarks/row-col-broadcast/commands_wse3.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout.csl --arch wse3 --fabric-dims=12,7 --fabric-offsets=4,1 \
+--params=width:5,height:5,pe_length:5 --params=C0_ID:0 \
+--params=C1_ID:1 --params=C2_ID:2 --params=C3_ID:3 --params=C4_ID:4 -o=out \
+--memcpy --channels=2 --width-west-buf=0 --width-east-buf=0
+cs_python ./run.py -m=5 -n=5 -k=5 --latestlink out --is_row_bcast --loop_count=1
diff --git a/benchmarks/row-col-broadcast/compile.py b/benchmarks/row-col-broadcast/compile.py
new file mode 100644
index 0000000..c0ae284
--- /dev/null
+++ b/benchmarks/row-col-broadcast/compile.py
@@ -0,0 +1,147 @@
+#!/usr/bin/env python3
+
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+""" compile the kernel
+"""
+
+import subprocess
+from glob import glob
+from typing import List, Optional
+
+from cmd_parser import parse_args
+
+
+def csl_compile_core(
+    width: int,  # width of the core
+    height: int,  # height of the core
+    pe_length: int,
+    file_config: str,
+    comp_dir: str,
+    fabric_width: int,
+    fabric_height: int,
+    core_fabric_offset_x: int,  # fabric-offsets of the core
+    core_fabric_offset_y: int,
+    arch: Optional[str],
+    C0: int,
+    C1: int,
+    C2: int,
+    C3: int,
+    C4: int,
+    channels: int,
+) -> List[str]:
+    """use cslc to compile the kernel and return the generated ELF files"""
+
+    cslc = "cslc"
+
+    args = []
+    args.append(cslc)  # command
+
+    args.append(file_config)
+    if arch is not None:
+        args.append(f"--arch={arch}")
+    args.append(f"--fabric-dims={fabric_width},{fabric_height}")
+    args.append(
+        f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}"
+    )
+    args.append(f"--params=width:{width},height:{height},pe_length:{pe_length}")
+    args.append(f"--params=C0_ID:{C0}")
+    args.append(f"--params=C1_ID:{C1}")
+    args.append(f"--params=C2_ID:{C2}")
+    args.append(f"--params=C3_ID:{C3}")
+    args.append(f"--params=C4_ID:{C4}")
+    args.append(f"-o={comp_dir}")
+    args.append("--memcpy")
+    args.append(f"--channels={channels}")
+    args.append("--width-west-buf=0")
+    args.append("--width-east-buf=0")
+
+    print(f"subprocess.check_call(args = {args})")
+    subprocess.check_call(args)
+
+    elfs = glob(f"{comp_dir}/bin/out_[0-9]*.elf")
+
+    return elfs
+
+
+def main():
+    """Main method to run the example code."""
+
+    args, dirname = parse_args()
+
+    height = args.m
+    width = args.n
+    pe_length = args.k
+    channels = args.channels
+
+    # prepare the simulation
+    print('store ELFs and log files in the folder ', dirname)
+
+    code_csl = "src/layout.csl"
+
+    # "+5" is "demux adaptor" + "demux" + "cmd fan" + "mux" + "mux adaptor"
+    # "+2" means halo of size 1
+    min_fabric_width = width + 5 + 2
+    min_fabric_height = height + 2
+
+    core_fabric_offset_x = 4
+    core_fabric_offset_y = 1
+
+    fabric_width = 0
+    fabric_height = 0
+    if args.fabric_dims:
+        w_str, h_str = args.fabric_dims.split(",")
+        fabric_width = int(w_str)
+        fabric_height = int(h_str)
+
+    if fabric_width == 0 or fabric_height == 0:
+        fabric_width = min_fabric_width
+        fabric_height = min_fabric_height
+
+    assert fabric_width >= min_fabric_width
+    assert fabric_height >= min_fabric_height
+
+    C0 = 0
+    C1 = 1
+    C2 = 2
+    C3 = 3
+    C4 = 4
+
+    elf_list = csl_compile_core(
+        width,
+        height,
+        pe_length,
+        code_csl,
+        dirname,
+        fabric_width,
+        fabric_height,
+        core_fabric_offset_x,
+        core_fabric_offset_y,
+        args.arch,
+        C0,
+        C1,
+        C2,
+        C3,
+        C4,
+        channels,
+    )
+
+    if elf_list is None or len(elf_list) == 0:
+        raise RuntimeError("Must have a non-empty list of ELFs to run")
+
+
+if __name__ == "__main__":
+    main()
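The fabric-size arithmetic in compile.py can be sketched on its own. This is a minimal illustration, not part of the patch: the helper name `min_fabric_dims` is hypothetical, and the constants are taken from the comments in the script (5 extra columns for the memcpy infrastructure, plus a halo of size 1 on each side).

```python
from typing import Tuple


def min_fabric_dims(width: int, height: int) -> Tuple[int, int]:
    """Smallest fabric (width, height) that fits a width-by-height core.

    Hypothetical helper; the constants mirror the comments in compile.py.
    """
    memcpy_columns = 5  # demux adaptor + demux + cmd fan + mux + mux adaptor
    halo = 2  # halo of size 1 on each side
    return width + memcpy_columns + halo, height + halo


# A 5-by-5 core needs a 12-by-7 fabric, which matches the
# --fabric-dims=12,7 value used in commands_wse3.sh.
print(min_fabric_dims(5, 5))  # (12, 7)
```

For a 5-by-5 core this gives the same dimensions that the shell scripts pass to cslc by hand.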
diff --git a/benchmarks/row-col-broadcast/run.py b/benchmarks/row-col-broadcast/run.py
new file mode 100644
index 0000000..d12ea8a
--- /dev/null
+++ b/benchmarks/row-col-broadcast/run.py
@@ -0,0 +1,341 @@
+#!/usr/bin/env cs_python
+
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# pylint: disable=too-many-function-args
+
+""" Test row or column broadcast
+    The kernel is the same as bandwidth-test.
+    The bandwidth calculation follows bandwidth-test.
+
+    Here is the list of parameters:
+    -m=<int> specifies the height of the core.
+    -n=<int> specifies the width of the core.
+    -k=<int> specifies the maximum number of elements per PE in the core.
+    --roi_px=<int> specifies the starting column index of region of interest
+    --roi_py=<int> specifies the starting row index of region of interest
+    --roi_w=<int> specifies the width of region of interest
+    --roi_h=<int> specifies the height of region of interest
+    --channels specifies the number of I/O channels, no bigger than 16.
+"""
+
+import random
+import struct
+
+import numpy as np
+from cmd_parser import parse_args
+
+from cerebras.sdk.runtime.sdkruntimepybind import (  # pylint: disable=no-name-in-module
+    MemcpyDataType,
+    MemcpyOrder,
+    SdkRuntime,
+)
+
+
+def float_to_hex(f):
+    return hex(struct.unpack('<I', struct.pack('<f', f))[0])
+
+
+def make_u48(words):
+    return int(words[0]) + (int(words[1]) << 16) + (int(words[2]) << 32)
+
+
+def main():
+    """Main method to run the example code."""
+
+    random.seed(127)
+
+    args, dirname = parse_args()
+
+    height = args.m
+    width = args.n
+    pe_length = args.k
+    use_col_major = args.use_col_major
+    is_row_bcast = args.is_row_bcast
+    loop_count = args.loop_count
+
+    print(f"core: width = {width}, height = {height}, pe_length={pe_length}")
+
+    np.random.seed(2)
+    if is_row_bcast:
+        print("row broadcast mode: only prepare data for 1 column")
+        # A is h-by-1-by-l
+        A = (
+            np.arange(height * 1 * pe_length)
+            .reshape(height, 1, pe_length)
+            .astype(np.uint32)
+        )
+    else:
+        print("column broadcast mode: only prepare data for 1 row")
+        # A is 1-by-w-by-l
+        A = (
+            np.arange(1 * width * pe_length)
+            .reshape(1, width, pe_length)
+            .astype(np.uint32)
+        )
+    print(f"shape(A) = {A.shape}")
+    print(f"A = {A}")
+
+    px = args.roi_px
+    py = args.roi_py
+    pw = args.roi_w
+    ph = args.roi_h
+
+    print(f"ROI: px = {px}, py = {py}, pw = {pw}, ph = {ph}")
+
+    assert 0 <= px, "px must be non-negative"
+    assert 0 <= py, "py must be non-negative"
+    assert width >= pw, "pw must not be greater than width"
+    assert height >= ph, "ph must not be greater than height"
+
+    # extract ROI from A
+    if is_row_bcast:
+        B = A[py : (py + ph), 0:, 0:]
+    else:
+        B = A[0:, px : (px + pw), 0:]
+    print(f"shape(B) = {B.shape}")
+    print(f"B = {B}")
+
+    bx, by, bz = B.shape
+    if is_row_bcast:
+        assert bx == ph
+        assert by == 1
+        assert bz == pe_length
+    else:
+        assert bx == 1
+        assert by == pw
+        assert bz == pe_length
+
+    print(f"use_col_major = {use_col_major}")
+    if use_col_major:
+        B_1d = B.T.ravel()
+    else:
+        B_1d = B.ravel()
+
+    print('store ELFs and log files in the folder ', dirname)
+
+    memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
+
+    runner = SdkRuntime(
+        dirname,
+        suppress_simfab_trace=True,
+        # msg_level="DEBUG",
+        cmaddr=args.cmaddr,
+    )
+
+    symbol_A = runner.get_id("A")
+    symbol_time_memcpy = runner.get_id("time_memcpy")
+    symbol_time_ref = runner.get_id("time_ref")
+
+    runner.load()
+    runner.run()
+
+    print("step 1: sync() synchronizes all PEs and records reference clock")
+    runner.call("f_sync", [], nonblock=True)
+
+    print("step 2: tic() records time_start")
+    runner.call("f_tic", [], nonblock=True)
+
+    print(f"len(B_1d) = {len(B_1d)}")
+    print(f"B_1d = {B_1d}")
+    for j in range(loop_count):
+        if is_row_bcast:
+            print("step 3: memcpy_h2d_rowbcast(B)")
+            runner.memcpy_h2d_rowbcast(
+                symbol_A,
+                B_1d,
+                px,
+                py,
+                pw,
+                ph,
+                pe_length,
+                streaming=False,
+                data_type=memcpy_dtype,
+                order=(
+                    MemcpyOrder.COL_MAJOR
+                    if use_col_major
+                    else MemcpyOrder.ROW_MAJOR
+                ),
+                nonblock=True,
+            )
+        else:
+            print("step 3: memcpy_h2d_colbcast(B)")
+            runner.memcpy_h2d_colbcast(
+                symbol_A,
+                B_1d,
+                px,
+                py,
+                pw,
+                ph,
+                pe_length,
+                streaming=False,
+                data_type=memcpy_dtype,
+                order=(
+                    MemcpyOrder.COL_MAJOR
+                    if use_col_major
+                    else MemcpyOrder.ROW_MAJOR
+                ),
+                nonblock=True,
+            )
+
+    print("step 4: toc() records time_end")
+    runner.call("f_toc", [], nonblock=False)
+
+    print("step 5: prepare (time_start, time_end)")
+    runner.call("f_memcpy_timestamps", [], nonblock=False)
+
+    print("step 6: D2H (time_start, time_end)")
+    # time_start/time_end is of type u16[3]
+    # {time_start, time_end} is packed into three f32
+    time_memcpy_1d_f32 = np.zeros(height * width * 3, np.float32)
+    runner.memcpy_d2h(
+        time_memcpy_1d_f32,
+        symbol_time_memcpy,
+        0,
+        0,
+        width,
+        height,
+        3,
+        streaming=False,
+        data_type=memcpy_dtype,
+        order=MemcpyOrder.ROW_MAJOR,
+        nonblock=False,
+    )
+    time_memcpy_hwl = np.reshape(
+        time_memcpy_1d_f32, (height, width, 3), order='C'
+    )
+
+    print("step 7: prepare reference clock")
+    runner.call("f_reference_timestamps", [], nonblock=False)
+
+    print("step 8: D2H reference clock")
+    # time_ref is of type u16[3], packed into two f32
+    time_ref_1d_f32 = np.zeros(height * width * 2, np.float32)
+    runner.memcpy_d2h(
+        time_ref_1d_f32,
+        symbol_time_ref,
+        0,
+        0,
+        width,
+        height,
+        2,
+        streaming=False,
+        data_type=memcpy_dtype,
+        order=MemcpyOrder.ROW_MAJOR,
+        nonblock=False,
+    )
+    time_ref_hwl = np.reshape(time_ref_1d_f32, (height, width, 2), order='C')
+
+    print("step 9: D2H(A)")
+    E_1d = np.zeros(height * width * pe_length, A.dtype)
+    runner.memcpy_d2h(
+        E_1d,
+        symbol_A,
+        0,
+        0,
+        width,
+        height,
+        pe_length,
+        streaming=False,
+        data_type=memcpy_dtype,
+        order=MemcpyOrder.COL_MAJOR,
+        nonblock=False,
+    )
+
+    runner.stop()
+
+    print("DONE")
+
+    # E is h-by-w-by-l
+    E_hwl = np.reshape(E_1d, (height, width, pe_length), order='F')
+    print(f"E_hwl (from device) = {E_hwl}")
+
+    # B_ext is the expected result
+    B_ext = (
+        np.zeros(height * width * pe_length)
+        .reshape(height, width, pe_length)
+        .astype(A.dtype)
+    )
+    if is_row_bcast:
+        # copy B to each column of ROI
+        for w in range(pw):
+            B_ext[py : (py + ph), (px + w) : (px + w + 1), 0:] = B
+    else:
+        # copy B to each row of ROI
+        for h in range(ph):
+            B_ext[(py + h) : (py + h + 1), px : (px + pw), 0:] = B
+    print(f"B_ext = {B_ext}")
+
+    print("check E_hwl == B_ext")
+    assert np.allclose(E_hwl.ravel(), B_ext.ravel(), 0)
+
+    # time_start = start time of H2D/D2H
+    time_start = np.zeros((height, width)).astype(int)
+    # time_end = end time of H2D/D2H
+    time_end = np.zeros((height, width)).astype(int)
+    word = np.zeros(3).astype(np.uint16)
+    for w in range(width):
+        for h in range(height):
+            hex_t0 = int(float_to_hex(time_memcpy_hwl[(h, w, 0)]), base=16)
+            hex_t1 = int(float_to_hex(time_memcpy_hwl[(h, w, 1)]), base=16)
+            hex_t2 = int(float_to_hex(time_memcpy_hwl[(h, w, 2)]), base=16)
+            word[0] = hex_t0 & 0x0000FFFF
+            word[1] = (hex_t0 >> 16) & 0x0000FFFF
+            word[2] = hex_t1 & 0x0000FFFF
+            time_start[(h, w)] = make_u48(word)
+            word[0] = (hex_t1 >> 16) & 0x0000FFFF
+            word[1] = hex_t2 & 0x0000FFFF
+            word[2] = (hex_t2 >> 16) & 0x0000FFFF
+            time_end[(h, w)] = make_u48(word)
+
+    # time_ref = reference clock
+    time_ref = np.zeros((height, width)).astype(int)
+    word = np.zeros(3).astype(np.uint16)
+    for w in range(width):
+        for h in range(height):
+            hex_t0 = int(float_to_hex(time_ref_hwl[(h, w, 0)]), base=16)
+            hex_t1 = int(float_to_hex(time_ref_hwl[(h, w, 1)]), base=16)
+            word[0] = hex_t0 & 0x0000FFFF
+            word[1] = (hex_t0 >> 16) & 0x0000FFFF
+            word[2] = hex_t1 & 0x0000FFFF
+            time_ref[(h, w)] = make_u48(word)
+    # adjust the reference clock by the propagation delay
+    for py in range(height):
+        for px in range(width):
+            time_ref[(py, px)] = time_ref[(py, px)] - (px + py)
+
+    # shift time_start and time_end by time_ref
+    time_start = time_start - time_ref
+    time_end = time_end - time_ref
+
+    # cycles_send = time_end[(h,w)] - time_start[(h,w)]
+    # 850MHz --> 1 cycle = (1/0.85) ns = (1/0.85)*1.e-3 us
+    # time_send = (cycles_send / 0.85) *1.e-3 us
+    # bandwidth = ((wvlts * 4) / time_send) * loop_count  (bytes/us = MB/s)
+    wvlts = pw * ph * pe_length
+    min_time_start = time_start.min()
+    max_time_end = time_end.max()
+    cycles_send = max_time_end - min_time_start
+    time_send = (cycles_send / 0.85) * 1.0e-3
+    bandwidth = ((wvlts * 4) / time_send) * loop_count
+    print(f"ROI: pw = {pw}, ph= {ph}, pe_length={pe_length}")
+    print(f"wvlts = {wvlts}, loop_count = {loop_count}")
+    print(f"cycles_send = {cycles_send} cycles")
+    print(f"time_send = {time_send} us")
+    print(f"bandwidth = {bandwidth} MB/s")
+
+
+if __name__ == "__main__":
+    main()
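The bandwidth formula at the end of run.py can be checked in isolation. The sketch below is illustrative only: the function name `bandwidth_mb_per_s` is hypothetical, the 0.85 GHz clock comes from the comment in the script, and the cycle count in the example is a made-up value chosen to give round numbers, not a measurement.

```python
def bandwidth_mb_per_s(wavelets, cycles, loop_count=1, clock_ghz=0.85):
    """Bandwidth in MB/s for 32-bit wavelets sent in a given number of cycles.

    Hypothetical helper mirroring the arithmetic in run.py:
    one cycle lasts (1 / clock_ghz) ns, and bytes per microsecond equals MB/s.
    """
    time_us = (cycles / clock_ghz) * 1.0e-3  # cycles -> ns -> us
    return (wavelets * 4 * loop_count) / time_us


# 5 x 5 ROI with 5 elements per PE; 850 cycles is an assumed value
# (about 1 us at 0.85 GHz), so the result is about 500 MB/s.
wvlts = 5 * 5 * 5
print(bandwidth_mb_per_s(wvlts, cycles=850))
```

Note that the elapsed time in run.py spans the earliest `time_start` and the latest `time_end` over all PEs, so the result describes the whole transfer rather than any single PE.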
diff --git a/benchmarks/row-col-broadcast/src/kernel.csl b/benchmarks/row-col-broadcast/src/kernel.csl
new file mode 100644
index 0000000..d4f7597
--- /dev/null
+++ b/benchmarks/row-col-broadcast/src/kernel.csl
@@ -0,0 +1,142 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+
+// constraints: input/output queue ID = 0 is reserved for memcpy module
+// only use microthread 2,3,4,5,6,7
+
+param memcpyParams: comptime_struct;
+
+param syncParams: comptime_struct;
+
+param pe_length: i16;
+
+
+const timestamp = @import_module("<time>");
+// starting time of H2D/D2H
+var tscStartBuffer = @zeros([timestamp.tsc_size_words]u16);
+// ending time of H2D/D2H
+var tscEndBuffer = @zeros([timestamp.tsc_size_words]u16);
+
+
+const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
+
+const sync_mod = @import_module( "sync/pe.csl", @concat_structs(syncParams, .{
+     .f_callback = sys_mod.unblock_cmd_stream,
+     .input_queues=[3]u16{2, 3, 4},
+     .output_queues=[3]u16{2, 3, 4},
+     }));
+
+
+////////////////////////////////////////////////////////////////////////////////
+// Main memory (48KB)
+////////////////////////////////////////////////////////////////////////////////
+
+const size : i16 = 1024*4;
+
+var A = @zeros([size]f32);
+// time_buf_f32[0:2] = {tscStartBuffer, tscEndBuffer}
+var time_buf_f32 = @zeros([3]f32);
+// reference clock inside sync module
+var time_ref_f32 = @zeros([2]f32);
+
+var ptr_A : [*]f32 = &A;
+var ptr_time_memcpy: [*]f32 = &time_buf_f32;
+var ptr_time_ref: [*]f32 = &time_ref_f32;
+
+////////////////////////////////////////////////////////////////////////////////
+// Tasks
+////////////////////////////////////////////////////////////////////////////////
+
+
+fn f_tic() void {
+    timestamp.get_timestamp(&tscStartBuffer);
+
+    // the user must unblock cmd color for every PE
+    sys_mod.unblock_cmd_stream();
+}
+
+fn f_toc() void {
+    timestamp.get_timestamp(&tscEndBuffer);
+
+    // the user must unblock cmd color for every PE
+    sys_mod.unblock_cmd_stream();
+}
+
+fn f_memcpy_timestamps() void {
+    // time_buf_f32[0] = {tscStartBuffer[1], tscStartBuffer[0]}
+    // time_buf_f32[1] = {tscEndBuffer[0], tscStartBuffer[2]}
+    // time_buf_f32[2] = {tscEndBuffer[2], tscEndBuffer[1]}
+    var lo_ : u16 = 0;
+    var hi_ : u16 = 0;
+    var word : u32 = 0;
+
+    lo_ = tscStartBuffer[0];
+    hi_ = tscStartBuffer[1];
+    time_buf_f32[0] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
+
+    lo_ = tscStartBuffer[2];
+    hi_ = tscEndBuffer[0];
+    time_buf_f32[1] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
+
+    lo_ = tscEndBuffer[1];
+    hi_ = tscEndBuffer[2];
+    time_buf_f32[2] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
+
+    // the user must unblock cmd color for every PE
+    sys_mod.unblock_cmd_stream();
+}
+
+fn f_sync() void {
+    // sync all PEs and record the reference clock
+    sync_mod.f_sync();
+}
+
+fn f_reference_timestamps() void {
+    // time_ref_f32[0] = {tscRefBuffer[1], tscRefBuffer[0]}
+    // time_ref_f32[1] = {0, tscRefBuffer[2]}
+    var lo_ : u16 = 0;
+    var hi_ : u16 = 0;
+
+    lo_ = sync_mod.tscRefBuffer[0];
+    hi_ = sync_mod.tscRefBuffer[1];
+    time_ref_f32[0] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
+
+    lo_ = sync_mod.tscRefBuffer[2];
+    hi_ = 0;
+    time_ref_f32[1] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
+
+    // the user must unblock cmd color for every PE
+    sys_mod.unblock_cmd_stream();
+}
+
+
+comptime {
+
+    @comptime_assert( pe_length <= size );
+}
+
+comptime {
+    @export_symbol(ptr_A, "A");
+    @export_symbol(ptr_time_memcpy, "time_memcpy");
+    @export_symbol(ptr_time_ref, "time_ref");
+}
+
+comptime{
+    @export_symbol(f_tic);
+    @export_symbol(f_toc);
+    @export_symbol(f_memcpy_timestamps);
+    @export_symbol(f_sync);
+    @export_symbol(f_reference_timestamps);
+}
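The packing done by `f_memcpy_timestamps` in kernel.csl, and the matching unpacking in run.py, can be illustrated with plain integers. This is a sketch under stated assumptions: the function names are hypothetical, each 48-bit timestamp is represented as three 16-bit words with the least significant word first (as `make_u48` in run.py implies), and the float bit-casting used for the transfer is left out.

```python
def split_u48(t):
    """Split a 48-bit value into three 16-bit words, least significant first."""
    return [(t >> shift) & 0xFFFF for shift in (0, 16, 32)]


def pack_timestamps(start, end):
    """Pack two 3-word timestamps into three 32-bit words (kernel side)."""
    return [
        start[0] | (start[1] << 16),  # {start[1], start[0]}
        start[2] | (end[0] << 16),    # {end[0],   start[2]}
        end[1] | (end[2] << 16),      # {end[2],   end[1]}
    ]


def unpack_timestamps(packed):
    """Recover the two 48-bit timestamps from three 32-bit words (host side)."""
    words = []
    for value in packed:
        words.append(value & 0xFFFF)
        words.append((value >> 16) & 0xFFFF)
    start = words[0] | (words[1] << 16) | (words[2] << 32)
    end = words[3] | (words[4] << 16) | (words[5] << 32)
    return start, end


# Example values are arbitrary; they only exercise all three words.
t0, t1 = 0x000123456789, 0x000123459ABC
packed = pack_timestamps(split_u48(t0), split_u48(t1))
assert unpack_timestamps(packed) == (t0, t1)
print(t1 - t0)  # 13107
```

Working with Python integers avoids any fixed-width overflow when shifting a 16-bit word by 16 or 32 bits.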
diff --git a/benchmarks/row-col-broadcast/src/layout.csl b/benchmarks/row-col-broadcast/src/layout.csl
new file mode 100644
index 0000000..55f0f2e
--- /dev/null
+++ b/benchmarks/row-col-broadcast/src/layout.csl
@@ -0,0 +1,95 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+
+// c0,c1,c2,c3,c4 are internal colors of sync module
+param C0_ID: i16;
+param C1_ID: i16;
+param C2_ID: i16;
+param C3_ID: i16;
+param C4_ID: i16;
+
+param pe_length: i16; // number of wavelets per PE
+param width : i16 ;   // width of the core
+param height: i16 ;   // height of the core
+
+
+const C0 : color = @get_color(C0_ID);
+const C1 : color = @get_color(C1_ID);
+const C2 : color = @get_color(C2_ID);
+const C3 : color = @get_color(C3_ID);
+const C4 : color = @get_color(C4_ID);
+
+// entrypoints of sync module
+const STARTUP: local_task_id = @get_local_task_id(15);
+const SYNC_Y: local_task_id = @get_local_task_id(16);
+const SYNC_BCAST: local_task_id = @get_local_task_id(17);
+const EXIT: local_task_id = @get_local_task_id(18);
+
+
+const memcpy = @import_module( "<memcpy/get_params>", .{
+    .width = width,
+    .height = height,
+    });
+
+const sync = @import_module( "sync/layout.csl", .{
+    .colors = [5]color{C0, C1, C2, C3, C4},
+    .entrypoints = [4]local_task_id{STARTUP, SYNC_Y, SYNC_BCAST, EXIT},
+    .width = width,
+    .height = height
+    });
+
+layout{
+
+    // H2D or D2H colors must be less than 15 (smallest color of entrypoints)
+    @comptime_assert( C0_ID < C1_ID);
+    @comptime_assert( C1_ID < C2_ID);
+    @comptime_assert( C2_ID < C3_ID);
+    @comptime_assert( C3_ID < C4_ID);
+
+    // step 1: configure the rectangle which does not include halo
+    @set_rectangle( width, height );
+
+    // step 2: compile csl code for a set of PEx.y and generate out_x_y.elf
+    //   format: @set_tile_code(x, y, code.csl, param_binding);
+
+    var py: i16 = 0;
+    while(py < height) : (py +=1) {
+        var px: i16 = 0;
+        while( px < width) : (px +=1) {
+
+            const memcpyParams = memcpy.get_params(px);
+            const syncParams = sync.get_params(px, py);
+
+            var params: comptime_struct = .{
+                .memcpyParams = memcpyParams,
+                .pe_length = pe_length,
+
+                .syncParams = syncParams,
+            };
+
+            @set_tile_code(px, py, "kernel.csl", params);
+        }
+    }
+
+    @export_name("A", [*]f32, true);
+    @export_name("time_memcpy", [*]f32, true);
+    @export_name("time_ref", [*]f32, true);
+
+    @export_name("f_tic", fn()void);
+    @export_name("f_toc", fn()void);
+    @export_name("f_memcpy_timestamps", fn()void);
+    @export_name("f_sync", fn()void);
+    @export_name("f_reference_timestamps", fn()void);
+} // end of layout
diff --git a/benchmarks/row-col-broadcast/src/sync/layout.csl b/benchmarks/row-col-broadcast/src/sync/layout.csl
new file mode 100644
index 0000000..ba975f5
--- /dev/null
+++ b/benchmarks/row-col-broadcast/src/sync/layout.csl
@@ -0,0 +1,79 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+
+param colors:[5]color;
+param entrypoints:[4]local_task_id;
+param width : i16 ;   // width of the core
+param height: i16 ;   // height of the core
+
+const C0 : color = colors[0];
+const C1 : color = colors[1];
+const C2 : color = colors[2];
+const C3 : color = colors[3];
+const C4 : color = colors[4];
+
+const STARTUP: local_task_id = entrypoints[0];
+const SYNC_Y: local_task_id = entrypoints[1];
+const SYNC_BCAST: local_task_id = entrypoints[2];
+const EXIT: local_task_id = entrypoints[3];
+
+fn get_params(px:i16, py:i16) comptime_struct {
+
+    var first_py: bool = (0 == py);
+    var last_py: bool = ((height-1) == py);
+    var is_py_even: bool = (0 == (py % 2));
+
+    var first_px: bool = (0 == px);
+    var last_px: bool = ((width-1) == px);
+    var is_px_even: bool = (0 == (px % 2));
+
+    var c_recv_px: color = C0;
+    var c_send_px: color = C1;
+    if (is_px_even){
+        c_recv_px = C0;
+        c_send_px = C1;
+    }else{
+        c_recv_px = C1;
+        c_send_px = C0;
+    }
+
+    var c_recv_py: color = C2;
+    var c_send_py: color = C3;
+    if (is_py_even){
+        c_recv_py = C2;
+        c_send_py = C3;
+    }else{
+        c_recv_py = C3;
+        c_send_py = C2;
+    }
+
+    return .{
+        .c_recv_px = c_recv_px,
+        .c_send_px = c_send_px,
+        .c_recv_py = c_recv_py,
+        .c_send_py = c_send_py,
+        .c_bcast = C4,
+
+        .STARTUP = STARTUP,
+        .SYNC_Y = SYNC_Y,
+        .SYNC_BCAST = SYNC_BCAST,
+        .EXIT = EXIT,
+
+        .first_px = first_px,
+        .last_px = last_px,
+        .first_py = first_py,
+        .last_py = last_py,
+    };
+}
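The color assignment in `get_params` alternates by parity so that neighboring PEs never receive and send on the same color. A minimal Python sketch of that rule follows; the function name `sync_colors` and the use of plain integers for colors are illustrative assumptions, and the default IDs 0 to 4 are the ones passed on the command line in commands_wse3.sh.

```python
def sync_colors(px, py, colors=(0, 1, 2, 3, 4)):
    """Receive/send colors for the PE at (px, py), following sync/layout.csl."""
    c0, c1, c2, c3, c4 = colors
    recv_px, send_px = (c0, c1) if px % 2 == 0 else (c1, c0)
    recv_py, send_py = (c2, c3) if py % 2 == 0 else (c3, c2)
    return {
        "c_recv_px": recv_px,
        "c_send_px": send_px,
        "c_recv_py": recv_py,
        "c_send_py": send_py,
        "c_bcast": c4,
    }


# In a row, the signal travels from east to west: each PE must send on
# the color that its west neighbor receives on.
width = 5
for px in range(1, width):
    here = sync_colors(px, 0)
    west = sync_colors(px - 1, 0)
    assert here["c_send_px"] == west["c_recv_px"]
    assert here["c_send_px"] != here["c_recv_px"]
```

The same check holds along a column with `c_send_py` and `c_recv_py`, where the signal travels from south to north.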
diff --git a/benchmarks/row-col-broadcast/src/sync/pe.csl b/benchmarks/row-col-broadcast/src/sync/pe.csl
new file mode 100644
index 0000000..3a0b391
--- /dev/null
+++ b/benchmarks/row-col-broadcast/src/sync/pe.csl
@@ -0,0 +1,289 @@
+// Copyright 2024 Cerebras Systems.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+
+param c_recv_px: color;
+param c_send_px: color;
+param c_recv_py: color;
+param c_send_py: color;
+param c_bcast: color;
+
+param STARTUP: local_task_id;
+param SYNC_Y: local_task_id;
+param SYNC_BCAST: local_task_id;
+param EXIT: local_task_id;
+
+param first_px: bool;
+param last_px: bool;
+param first_py: bool;
+param last_py: bool;
+
+// f_callback = sys_mod.unblock_cmd_stream, to continue next command
+param f_callback : fn ()void;
+
+// input_queues={2,3,4}
+// output_queues={2,3,4}
+param input_queues:[3]u16;
+param output_queues:[3]u16;
+
+const c_recv_px_iq = @get_input_queue(input_queues[0]);
+const c_send_px_oq = @get_output_queue(output_queues[0]);
+
+const c_recv_py_iq = @get_input_queue(input_queues[1]);
+const c_send_py_oq = @get_output_queue(output_queues[1]);
+
+const c_bcast_iq = @get_input_queue(input_queues[2]);
+const c_bcast_oq = @get_output_queue(output_queues[2]);
+
+const timestamp = @import_module("<time>");
+
+// tsc_size_words = 3
+var tscRefBuffer = @zeros([timestamp.tsc_size_words]u16);
+
+////////////////////////////////////////////////////////////////////////////////
+// Main memory (48KB)
+////////////////////////////////////////////////////////////////////////////////
+
+var buf = @zeros([1]f32);
+
+////////////////////////////////////////////////////////////////////////////////
+// Tasks
+// syntax
+//     task_begin(name, entrypoint, color)
+////////////////////////////////////////////////////////////////////////////////
+
+const mem_buf_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{1} -> buf[i] });
+
+var fab_recv_data_px_wdsd =  @get_dsd(fabin_dsd, .{
+   .extent = 1,
+   .fabric_color = c_recv_px,
+   .input_queue = c_recv_px_iq
+});
+
+var fab_trans_data_px_wdsd = @get_dsd(fabout_dsd, .{
+    .extent = 1,
+    .fabric_color = c_send_px,
+    .output_queue = c_send_px_oq
+});
+
+var fab_recv_data_py_wdsd =  @get_dsd(fabin_dsd, .{
+   .extent = 1,
+   .fabric_color = c_recv_py,
+   .input_queue = c_recv_py_iq
+});
+
+var fab_trans_data_py_wdsd = @get_dsd(fabout_dsd, .{
+    .extent = 1,
+    .fabric_color = c_send_py,
+    .output_queue = c_send_py_oq
+});
+
+var fab_recv_data_bcast_wdsd =  @get_dsd(fabin_dsd, .{
+   .extent = 1,
+   .fabric_color = c_bcast,
+   .input_queue = c_bcast_iq
+});
+
+var fab_trans_data_bcast_wdsd = @get_dsd(fabout_dsd, .{
+    .extent = 1,
+    .fabric_color = c_bcast,
+    .output_queue = c_bcast_oq
+});
+
+
+
+// Each row performs a sync from the last PE to first PE
+fn f_sync() void {
+    // sync a row
+    if (last_px){
+        // px = width-1: send sync signal
+        @mov32(fab_trans_data_px_wdsd, mem_buf_dsd, .{.async=true, .activate = f_sync_y });
+    }else{
+        if (first_px){
+            // px = 0: receive signal
+            @mov32(mem_buf_dsd, fab_recv_data_px_wdsd, .{.async=true, .activate = f_sync_y });
+        }else{
+            // 0 < px < width-1: receive signal and forward it
+            @mov32(fab_trans_data_px_wdsd, fab_recv_data_px_wdsd, .{.async=true, .activate = f_sync_y });
+        }
+    }
+}
+
+
+// prerequisite: row synchronization is done
+//   the first PE is the last one to receive the signal
+// The first column performs a sync from last PE to first PE
+// other PEs wait for bcast signal
+task f_sync_y() void {
+    if (first_px){
+        // 1st column performs a sync
+        if (last_py){
+            // py = height-1: send sync signal
+            @mov32(fab_trans_data_py_wdsd, mem_buf_dsd, .{.async=true, .activate = f_sync_bcast });
+        }else{
+            if (first_py){
+                // py = 0: receive signal
+                @mov32(mem_buf_dsd, fab_recv_data_py_wdsd, .{.async=true, .activate = f_sync_bcast });
+            }else{
+                // 0 < py < height-1: receive signal and forward it
+                @mov32(fab_trans_data_py_wdsd, fab_recv_data_py_wdsd, .{.async=true, .activate = f_sync_bcast });
+            }
+        }
+    }else{
+        // other PEs wait for bcast signal
+        @activate(SYNC_BCAST); // trigger f_sync_bcast
+    }
+}
+
+// prerequisite: sync is done, P0.0 is the last one to receive the sync
+// P0.0 broadcasts the signal, others wait for the bcast signal from P0.0
+task f_sync_bcast() void {
+
+    if ( first_px and first_py ){
+        // P0.0 sends the signal
+        @mov32(fab_trans_data_bcast_wdsd, mem_buf_dsd, .{.async=true, .activate = f_exit });
+    }else{
+        // others wait for bcast from P0.0
+        @mov32(mem_buf_dsd, fab_recv_data_bcast_wdsd, .{.async=true, .activate = f_exit });
+    }
+}
+
+// record reference clock T
+// T is regarded as clock 0 because all PEs sync with P0.0
+task f_exit() void {
+
+    timestamp.get_timestamp(&tscRefBuffer);
+
+    //sys_mod.unblock_cmd_stream();
+    f_callback();
+}
+
+
+task f_startup() void {
+    timestamp.enable_tsc();
+}
+
+comptime {
+    @activate(STARTUP);
+
+    @bind_local_task(f_startup, STARTUP);
+    @bind_local_task(f_sync_y, SYNC_Y);
+    @bind_local_task(f_sync_bcast, SYNC_BCAST);
+    @bind_local_task(f_exit, EXIT);
+
+    // On WSE-3, we must explicitly initialize input and output queues
+    if (@is_arch("wse3")) {
+        @initialize_queue(c_recv_px_iq, .{ .color = c_recv_px });
+        @initialize_queue(c_send_px_oq, .{ .color = c_send_px });
+
+        @initialize_queue(c_recv_py_iq, .{ .color = c_recv_py });
+        @initialize_queue(c_send_py_oq, .{ .color = c_send_py });
+
+        @initialize_queue(c_bcast_iq, .{ .color = c_bcast });
+        @initialize_queue(c_bcast_oq, .{ .color = c_bcast });
+    }
+}
+
+
+// sync a row with C0 and C1
+//
+//     C0     C1     C0     C1
+// P0 <-- P1 <-- P2 <-- P3 <-- P4
+//
+//     C0     C1     C0     C1     C0
+// P0 <-- P1 <-- P2 <-- P3 <-- P4 <-- P5
+//
+// P0: recv C0
+// P_even: recv C0, send C1
+// P_odd: recv C1, send C0
+// P_last: send C0 if odd; send C1 if even
+comptime {
+    if (first_px){
+        // px = 0: receive from east
+        @set_local_color_config(c_recv_px, .{ .routes = .{ .rx = .{EAST}, .tx = .{RAMP} } } );
+    }else{
+        if (last_px){
+           // px = width-1: send to west
+           @set_local_color_config(c_send_px, .{ .routes = .{ .rx = .{RAMP}, .tx = .{WEST} } } );
+        }else{
+           // 0 < px < width-1: receive from east, send to west
+           @set_local_color_config(c_recv_px, .{ .routes = .{ .rx = .{EAST}, .tx = .{RAMP} } } );
+           @set_local_color_config(c_send_px, .{ .routes = .{ .rx = .{RAMP}, .tx = .{WEST} } } );
+        }
+    }
+}
+
+// sync a col with C2 and C3
+//     C2     C3     C2     C3
+// P0 <-- P1 <-- P2 <-- P3 <-- P4
+//
+//     C2     C3     C2     C3     C2
+// P0 <-- P1 <-- P2 <-- P3 <-- P4 <-- P5
+//
+// P0: recv C2
+// P_even: recv C2, send C3
+// P_odd: recv C3, send C2
+// P_last: send C2 if odd; send C3 if even
+comptime {
+    if (first_py){
+        // py = 0 (even): receive from south
+        @set_local_color_config(c_recv_py, .{ .routes = .{ .rx = .{SOUTH}, .tx = .{RAMP} } } );
+    }else{
+        if (last_py){
+           // py = height-1: send to north
+           @set_local_color_config(c_send_py, .{ .routes = .{ .rx = .{RAMP}, .tx = .{NORTH} } } );
+        }else{
+           // 0 < py < height-1: receive from south, send to north
+           @set_local_color_config(c_recv_py, .{ .routes = .{ .rx = .{SOUTH}, .tx = .{RAMP} } } );
+           @set_local_color_config(c_send_py, .{ .routes = .{ .rx = .{RAMP}, .tx = .{NORTH} } } );
+        }
+    }
+}
+
+
+// w > 1 and h > 1
+//  x --> x --> x
+//  |
+//  V
+//  x --> x --> x
+//  |
+//  V
+//  x --> x --> x
+//
+// WARNING: corner case for w=1 or h=1
+comptime {
+    if (first_px){
+        // px = 0
+        if (first_py){
+            // P0,0: send to east and south
+            @set_local_color_config(c_bcast, .{ .routes = .{ .rx = .{RAMP}, .tx = .{EAST, SOUTH} } } );
+        }else{
+            if (last_py){
+                // P0,h-1
+                @set_local_color_config(c_bcast, .{ .routes = .{ .rx = .{NORTH}, .tx = .{EAST, RAMP} } } );
+            }else{
+                // P0,py: 0 < py < height-1
+                @set_local_color_config(c_bcast, .{ .routes = .{ .rx = .{NORTH}, .tx = .{EAST, RAMP, SOUTH} } } );
+            }
+        }
+    }else{
+        if (last_px){
+            // px = width-1
+           @set_local_color_config(c_bcast, .{ .routes = .{ .rx = .{WEST}, .tx = .{RAMP} } } );
+        }else{
+            // 0 < px < width-1
+           @set_local_color_config(c_bcast, .{ .routes = .{ .rx = .{WEST}, .tx = .{EAST, RAMP} } } );
+        }
+    }
+}
diff --git a/benchmarks/single-tile-matvec/commands.sh b/benchmarks/single-tile-matvec/commands_wse2.sh
similarity index 73%
rename from benchmarks/single-tile-matvec/commands.sh
rename to benchmarks/single-tile-matvec/commands_wse2.sh
index 0276133..2f65d62 100755
--- a/benchmarks/single-tile-matvec/commands.sh
+++ b/benchmarks/single-tile-matvec/commands_wse2.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./layout_matvec.csl --arch wse2 --fabric-dims=9,4 \
+cslc ./src/layout_matvec.csl --arch wse2 --fabric-dims=9,4 \
 --fabric-offsets=4,1 \
 --params=width:2,height:2,tile_size:25,iters:1 \
 -o out --memcpy --channels=1
diff --git a/benchmarks/single-tile-matvec/commands_wse3.sh b/benchmarks/single-tile-matvec/commands_wse3.sh
new file mode 100755
index 0000000..a7ce8b5
--- /dev/null
+++ b/benchmarks/single-tile-matvec/commands_wse3.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc ./src/layout_matvec.csl --arch wse3 --fabric-dims=9,4 \
+--fabric-offsets=4,1 \
+--params=width:2,height:2,tile_size:25,iters:1 \
+-o out --memcpy --channels=1
+cs_python ./run.py --name out --verify
diff --git a/benchmarks/single-tile-matvec/compile.appliance.py b/benchmarks/single-tile-matvec/compile.appliance.py
new file mode 100644
index 0000000..49d6c45
--- /dev/null
+++ b/benchmarks/single-tile-matvec/compile.appliance.py
@@ -0,0 +1,37 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import json
+from google.protobuf.json_format import MessageToJson
+
+#from tests.appliance.common.cluster_details_utils import build_cluster_details
+#import cluster_details_utils
+
+from cerebras.sdk.client import (
+        SdkCompiler,
+)
+
+hash_filename = "hash.json"
+
+compiler = SdkCompiler()
+
+hashstr = compiler.compile("./src", "layout_matvec.csl", "--arch wse3 --fabric-dims=9,4 --fabric-offsets=4,1 --params=width:2,height:2,tile_size:25,iters:1 -o latest --memcpy --channels=1")
+
+print("compile artifact:", hashstr)
+
+print(f"dump artifact name to file {hash_filename}")
+with open(hash_filename, "w") as write_file:
+    json.dump(hashstr, write_file)
+
diff --git a/benchmarks/single-tile-matvec/run.appliance.py b/benchmarks/single-tile-matvec/run.appliance.py
new file mode 100644
index 0000000..7759ff6
--- /dev/null
+++ b/benchmarks/single-tile-matvec/run.appliance.py
@@ -0,0 +1,333 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import struct
+import time
+import argparse
+import math
+import csv
+import json
+import numpy as np
+
+from cerebras.sdk.client import (
+        SdkCompiler,
+        SdkRuntime,
+)
+
+from cerebras.appliance.pb.sdk.sdk_common_pb2 import (
+        MemcpyDataType,
+        MemcpyOrder,
+)
+
+hash_filename = "hash.json"
+
+
+def parse_args():
+  """ parse the command line """
+
+  parser = argparse.ArgumentParser(description="single tile matvec run parameters")
+  parser.add_argument("--name", required=False, default="out",
+                      help="prefix of ELF files")
+  parser.add_argument("--simulator", action="store_true",
+                      help="Runs on simulator")
+  parser.add_argument("--verify", action="store_true",
+                      help="Verify Y computation")
+  # The following parameters must match the values passed to the compiler
+  parser.add_argument(
+      "--nb",
+      default=1, type=int,
+      help="tile_size")
+  parser.add_argument(
+      "--width",
+      default=1, type=int,
+      help="width of the core rectangle")
+  parser.add_argument(
+      "--height",
+      default=1, type=int,
+      help="height of the core rectangle")
+  parser.add_argument(
+      "--iters",
+      default=1, type=int,
+      help="number of iterations")
+
+  args = parser.parse_args()
+  return args
+
+
+def float_to_hex(f):
+  return hex(struct.unpack('<I', struct.pack('<f', f))[0])
+
+def make_u48(words):
+  return int(words[0]) + (int(words[1]) << 16) + (int(words[2]) << 32)
+
+def sub_ts(words):
+  return make_u48(words[3:]) - make_u48(words[0:3])
+
+
+# How to compile
+#  python compile.appliance.py
+#
+# How to run
+#  python run.appliance.py --name latest --verify --width=2 --height=2 --nb=25 --iters=1
+#
+def main():
+  """Main method to run the example code."""
+
+  args = parse_args()
+
+  name = args.name
+  verify = args.verify
+
+  # FileNotFoundError: [Errno 2] No such file or directory: '8da5208c2594b1ed64fe066bcb8b03b475a6888b21a2c4b3091c5637802f0c85/latest/out.json'
+  # We would need to RPC the whole tarball of {hashstr} back to the client
+  if False:
+    # Parse the compile metadata
+    with open(f"{hashstr}/out.json", encoding="utf-8") as json_file:
+      compile_data = json.load(json_file)
+
+    nb = int(compile_data["params"]["tile_size"])
+    width = int(compile_data["params"]["width"])
+    height = int(compile_data["params"]["height"])
+    iters = int(compile_data["params"]["iters"])
+  else:
+    nb = args.nb
+    width = args.width
+    height = args.height
+    iters = args.iters
+
+  print(f"nb = {nb}")
+  print(f"width = {width}")
+  print(f"height = {height}")
+  print(f"iters = {iters}")
+
+  # Calculate alignment and padding to avoid bank conflicts
+  align = 16
+  multiple = int(align/4)
+  padded_nb = math.ceil(nb/multiple)*multiple
+
+
+  #############
+  # Run
+  #############
+
+  print(f"load artifact name from file {hash_filename}")
+  with open(hash_filename, "r") as f:
+    hashstr = json.load(f)
+
+  start = time.time()
+
+  # Instantiate runner
+  with SdkRuntime(hashstr, simulator=args.simulator) as runner:
+
+    # Device symbols for memcpy
+    A_symbol = runner.get_id("A")
+    x_symbol = runner.get_id("x")
+    y_symbol = runner.get_id("y")
+    symbol_maxmin_time = runner.get_id("maxmin_time")
+
+    # load() and run() are called by client.SdkRuntime.__enter__
+    #runner.load()
+    #runner.run()
+
+    # Construct A data and copy random A matrix to PE (0,0) for verification
+    A_mat = np.random.rand(nb, nb)
+    A_data = np.zeros(width*height*(nb*padded_nb+1), dtype=np.float32)
+
+    for w in range(width):
+      for h in range(height):
+        for i in range(nb):
+          for j in range(nb):
+            A_data[(h*width + w)*(nb*padded_nb+1) + j*padded_nb + i + 1] = A_mat[i, j]
+
+    print()
+    print("Beginning run.")
+    print("Copy A matrices to device...")
+    runner.memcpy_h2d(A_symbol, A_data, 0, 0, width, height, nb*padded_nb+1,
+      streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
+
+    # Construct x data and copy random x vector to PE (0,0) for verification
+    x_vec = np.random.rand(nb)
+    x_data = np.zeros(width*height*nb, dtype=np.float32)
+    for w in range(width):
+      for h in range(height):
+        x_data[(h*width + w)*nb:(h*width + w)*nb+nb] = x_vec
+
+
+    print("Copy x vectors to device...")
+    runner.memcpy_h2d(x_symbol, x_data, 0, 0, width, height, nb,
+      streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
+
+    # Launch the compute kernel
+    print("Launch kernel...")
+    runner.call("compute", [], nonblock=False)
+
+    # Copy back timestamps from device
+    data = np.zeros((width*height*3, 1), dtype=np.uint32)
+    runner.memcpy_d2h(data, symbol_maxmin_time, 0, 0, width, height, 3,
+      streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
+    maxmin_time_hwl = data.view(np.float32).reshape((height, width, 3))
+    print("Copied back timestamps.")
+
+    # Copy back data array from device
+    if verify:
+      data = np.zeros((width*height*padded_nb, 1), dtype=np.uint32)
+      runner.memcpy_d2h(data, y_symbol, 0, 0, width, height, padded_nb,
+        streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
+      y_device_array = data.view(np.float32).reshape((height, width, padded_nb))
+      print("Copied back Y array.")
+
+    print("Done.")
+
+    # stop() is called by client.SdkRuntime.__exit__
+    #runner.stop()
+
+  # End walltime timer
+  end = time.time()
+  walltime = end-start
+
+
+  ###########
+  # Verify
+  ###########
+
+  if verify:
+    print("Test y result is as expected on each PE...")
+    expected = A_mat @ x_vec
+    for w in range(width):
+      for h in range(height):
+        np.testing.assert_allclose(y_device_array[h, w, :nb], expected, atol=0.0001, rtol=0)
+    print("SUCCESS!")
+
+
+  #################################
+  # Calculate mem accesses and FLOP
+  #################################
+
+  # STANDARD read/writes
+  # Read full x, read each column of V stack, write full y = nb + nb*nb + nb
+  # = nb*nb + 2*nb
+  #
+  # 4 bytes per elem. Mem = 4 * (nb*nb + 2*nb)
+  #                       = 4*nb*nb + 8*nb
+
+  # ACTUAL read/writes
+  # Read full x; read each col of V stack; read, write full y nb times
+  # = nb + nb*nb + 2*nb*nb
+  # = 3*nb*nb + nb
+  #
+  # 4 bytes per elem. Mem = 4 * (3*nb*nb + nb)
+  #                       = 12*nb*nb + 4*nb
+
+  # Floating point operations
+  # Compute A_ij * x_j for each i, j = nb * nb
+  # For each row of A, reduction uses nb - 1 adds = nb * (nb-1)
+  # = nb * nb + nb * (nb-1)
+  # = 2*nb*nb - nb
+
+  total_relative_accesses = width * height * (4*nb*nb + 8*nb)
+  total_absolute_accesses = width * height * (12*nb*nb + 4*nb)
+  total_flop = width * height * (2*nb*nb - nb)
+
+
+  #######################
+  # Calculate cycle count
+  #######################
+
+  tsc_tensor_d2h = np.zeros(6).astype(np.uint16)
+  min_cycles = math.inf
+  max_cycles = 0
+
+  for w in range(width):
+    for h in range(height):
+      hex_t0 = int(float_to_hex(maxmin_time_hwl[(h, w, 0)]), base=16)
+      hex_t1 = int(float_to_hex(maxmin_time_hwl[(h, w, 1)]), base=16)
+      hex_t2 = int(float_to_hex(maxmin_time_hwl[(h, w, 2)]), base=16)
+      tsc_tensor_d2h[0] = hex_t0 & 0x0000ffff
+      tsc_tensor_d2h[1] = (hex_t0 >> 16) & 0x0000ffff
+      tsc_tensor_d2h[2] = hex_t1 & 0x0000ffff
+      tsc_tensor_d2h[3] = (hex_t1 >> 16) & 0x0000ffff
+      tsc_tensor_d2h[4] = hex_t2 & 0x0000ffff
+      tsc_tensor_d2h[5] = (hex_t2 >> 16) & 0x0000ffff
+
+      cycles = sub_ts(tsc_tensor_d2h)
+      if cycles < min_cycles:
+        min_cycles = cycles
+        min_w = w
+        min_h = h
+      if cycles > max_cycles:
+        max_cycles = cycles
+        max_w = w
+        max_h = h
+
+
+  #####################
+  # Calculate bandwidth
+  #####################
+
+  # Calculate in bytes/sec and FLOP/sec for program rectangle
+  secs = max_cycles / 850000000.
+  relative_bw = total_relative_accesses / secs * iters
+  absolute_bw = total_absolute_accesses / secs * iters
+  flops_sec = total_flop / secs
+
+  # Convert to Petabytes/sec and PetaFLOPS
+  relative_bw /= 1.E15
+  absolute_bw /= 1.E15
+  flops_sec /= 1.E15
+
+  # Scale from the program rectangle to a 750 x 994 PE rectangle
+  scale_factor = (994.*750.) / (width*height)
+  scale_relative_bw = relative_bw * scale_factor
+  scale_absolute_bw = absolute_bw * scale_factor
+  scale_flops_sec = flops_sec * scale_factor
+
+
+  #################
+  # Generate output
+  #################
+
+  print()
+  print(f"Real walltime: {walltime}s")
+  print()
+  print("Cycle Counts:")
+  print("Min cycles (", min_w, ", ", min_h, "): ", min_cycles)
+  print("Max cycles (", max_w, ", ", max_h, "): ", max_cycles)
+  print()
+  print("Accesses and FLOP Information:")
+  print("Relative accesses (bytes): ", total_relative_accesses)
+  print("Absolute accesses (bytes): ", total_absolute_accesses)
+  print("FP operations:             ", total_flop)
+  print()
+  print("Bandwidth and FLOPS Information:")
+  print("Relative BW (PB/s): ", relative_bw)
+  print("Absolute BW (PB/s): ", absolute_bw)
+  print("PetaFLOPS:          ", flops_sec)
+  print()
+  print("Scaled (", width, ",", height, ") to (750,994)...")
+  print("Scaled relative BW (PB/s): ", scale_relative_bw)
+  print("Scaled absolute BW (PB/s): ", scale_absolute_bw)
+  print("Scaled PetaFLOPS:          ", scale_flops_sec)
+
+  # Write a CSV
+  csv_name = name + ".csv"
+  with open(csv_name, mode='a') as csv_file:
+    csv_writer = csv.writer(csv_file)
+    csv_writer.writerow(["appliance", width, height, iters, nb, padded_nb, min_cycles, max_cycles,
+      total_relative_accesses, total_absolute_accesses, relative_bw, absolute_bw,
+      scale_relative_bw, scale_absolute_bw, total_flop, flops_sec, scale_flops_sec, walltime])
+
+
+
+if __name__ == "__main__":
+  main()
diff --git a/benchmarks/single-tile-matvec/run.py b/benchmarks/single-tile-matvec/run.py
index a41d367..46c392b 100644
--- a/benchmarks/single-tile-matvec/run.py
+++ b/benchmarks/single-tile-matvec/run.py
@@ -70,6 +70,10 @@ def main():
   height = int(compile_data["params"]["height"])
   iters = int(compile_data["params"]["iters"])
 
+  print(f"nb = {nb}")
+  print(f"width = {width}")
+  print(f"height = {height}")
+  print(f"iters = {iters}")
 
   # Calculate alignment and padding to avoid bank conflicts
   align = 16
diff --git a/benchmarks/single-tile-matvec/layout_matvec.csl b/benchmarks/single-tile-matvec/src/layout_matvec.csl
similarity index 100%
rename from benchmarks/single-tile-matvec/layout_matvec.csl
rename to benchmarks/single-tile-matvec/src/layout_matvec.csl
diff --git a/benchmarks/single-tile-matvec/pe_matvec.csl b/benchmarks/single-tile-matvec/src/pe_matvec.csl
similarity index 100%
rename from benchmarks/single-tile-matvec/pe_matvec.csl
rename to benchmarks/single-tile-matvec/src/pe_matvec.csl
diff --git a/benchmarks/spmv-hypersparse/README.rst b/benchmarks/spmv-hypersparse/README.rst
index 0ca0c1e..fe3b17c 100644
--- a/benchmarks/spmv-hypersparse/README.rst
+++ b/benchmarks/spmv-hypersparse/README.rst
@@ -1,4 +1,4 @@
-spmv-hypersparse
+Hypersparse SpMV
 ================
 
 This example evaluates the performance of sparse matrix-vector multiplication.
diff --git a/benchmarks/spmv-hypersparse/cmd_parser.py b/benchmarks/spmv-hypersparse/cmd_parser.py
index 98d2f41..958409f 100644
--- a/benchmarks/spmv-hypersparse/cmd_parser.py
+++ b/benchmarks/spmv-hypersparse/cmd_parser.py
@@ -25,6 +25,8 @@ def parse_args():
         help='the sparse matrix in MTX format',
         required=True
     )
+    parser.add_argument("--simulator", action="store_true",
+        help="Runs on simulator")
     parser.add_argument(
         '--num_pe_cols',
         type=int,
@@ -78,7 +80,7 @@ def parse_args():
     )
     parser.add_argument(
         "--arch",
-        help="wse1 or wse2. Default is wse1 when not supplied."
+        help="wse2 or wse3. Default is wse2 when not supplied."
     )
     parser.add_argument(
         '--is_invec_one',
@@ -100,4 +102,7 @@ def parse_args():
 
     args = parser.parse_args()
 
+    if args.cmaddr is None:
+        args.simulator = False
+
     return args
diff --git a/benchmarks/spmv-hypersparse/commands.sh b/benchmarks/spmv-hypersparse/commands_wse2.sh
similarity index 87%
rename from benchmarks/spmv-hypersparse/commands.sh
rename to benchmarks/spmv-hypersparse/commands_wse2.sh
index 345a44f..990616f 100755
--- a/benchmarks/spmv-hypersparse/commands.sh
+++ b/benchmarks/spmv-hypersparse/commands_wse2.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./layout.csl --arch wse2 --fabric-dims=11,6 --fabric-offsets=4,1 \
+cslc ./src/layout.csl --arch wse2 --fabric-dims=11,6 --fabric-offsets=4,1 \
 --params=ncols:16 --params=nrows:16 --params=pcols:4 --params=prows:4 --params=max_local_nnz:8 \
 --params=max_local_nnz_cols:4 --params=max_local_nnz_rows:4 --params=local_vec_sz:1 \
 --params=local_out_vec_sz:1 --params=y_pad_start_row_idx:4 -o=out \
diff --git a/benchmarks/spmv-hypersparse/run.appliance.py b/benchmarks/spmv-hypersparse/run.appliance.py
new file mode 100644
index 0000000..025d15b
--- /dev/null
+++ b/benchmarks/spmv-hypersparse/run.appliance.py
@@ -0,0 +1,814 @@
+# Copyright 2024 Cerebras Systems.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" test sparse matrix-vector multiplication
+
+  This example targets a hypersparse matrix whose nonzeros are almost uniformly
+  distributed. The algorithm partitions the sparse matrix into a 2D grid. It may
+  fail if any partition has too many nonzeros to fit in the memory (48KB) of
+  a PE.
+
+  To obtain the best performance, the user may need to reorder the matrix such
+  that the variation in the number of nonzeros across partitions is small.
+
+  To run this example, the user has to provide a file in Matrix Market format
+  with 1-based indices. For example, the user can reorder the matrix A by
+  the permutation matrices P and Q, and write P*A*Q^T to a file. One option is
+  "util/analyze.cpp", which provides a load balancing algorithm.
+
+  This example reads a MTX file, generates the vector x, partitions the matrix,
+  and computes y = A*x.
+
+  The framework is
+  ---
+       sync()  // synchronize all PEs to sample the reference clock
+       tic()   // record start time
+       spmv()  // compute y = A*x
+       toc()   // record end time
+  ---
+
+  The tic() samples "time_start" and toc() samples "time_end". The sync() samples
+  "time_ref" which is used to shift "time_start" and "time_end".
+  The elapsed time is measured by
+       cycles_send = max(time_end) - min(time_start)
+
+  The overall runtime is computed via the following formula
+       time_send = (cycles_send / 0.85) *1.e-3 us
+  where a PE runs with clock speed 850MHz
+
+  The spmv kernel performs y = A * x
+  where A is m-by-n with nnz nonzeros
+
+  The standard measurement counts the number of memory accesses of
+       y[i] = sum{ Aij * xj : Aij is nonzero }
+  - read Aij: nnz
+  - read xj: nnz
+  - write y[i]: m
+  Total number of memory accesses: (2*nnz + m) f32
+
+  Here is the list of parameters:
+    --infile_mtx=<path to mtx file> contains the sparse matrix A
+    --num_pe_rows=<int> specifies the height of the core rectangle
+    --num_pe_cols=<int> specifies the width of the core rectangle
+    --channels=<int> specifies the number of I/O channels, no bigger than 16
+
+  How to compile and run
+     To build a 5-by-4 core rectangle, we need to pass --num_pe_cols=5 --num_pe_rows=4
+     Use the following command to compile
+        python run.py --arch=wse2 --num_pe_cols=5 --num_pe_rows=4 --channels=1
+           --driver=<path to cslc> --compile-only --infile_mtx=<path to mtx file>
+     Use the following command to run
+        python run.py --arch=wse2 --num_pe_cols=5 --num_pe_rows=4 --channels=1
+           --is_weight_one --run-only --infile_mtx=<path to mtx file>
+"""
+
+import os, sys
+import subprocess
+import time
+import math
+import numpy as np
+import scipy.sparse as sparse
+import shutil
+import json
+
+from pathlib import Path
+from datetime import datetime
+from typing import Optional
+
+from cerebras.sdk.client.debug_util import debug_util # pylint: disable=no-name-in-module
+
+from cerebras.sdk.client import (
+        SdkCompiler,
+        SdkRuntime,
+)
+
+from cerebras.appliance.pb.sdk.sdk_common_pb2 import (
+        MemcpyDataType,
+        MemcpyOrder,
+)
+
+from cmd_parser import parse_args
+
+from memory_usage import memory_per_pe
+
+from scipy.io import mmread
+
+from preprocess import preprocess
+
+
+hash_filename = "hash.json"
+
+
+def make_u48(words):
+  return int(words[0]) + (int(words[1]) << 16) + (int(words[2]) << 32)
+
+
+def hwl_to_oned_colmajor(
+    height: int,
+    width: int,
+    pe_length: int,
+    A_hwl: np.ndarray,
+    dtype
+):
+    """
+    Given a 3-D tensor A[height][width][pe_length], transform it to
+    a 1-D array in column-major order
+    """
+    if A_hwl.dtype == np.float32:
+        A_1d = np.zeros(height*width*pe_length, dtype)
+        idx = 0
+        for l in range(pe_length):
+            for w in range(width):
+                for h in range(height):
+                    A_1d[idx] = A_hwl[(h, w, l)]
+                    idx = idx + 1
+    elif A_hwl.dtype == np.uint16:
+        assert dtype == np.uint32, "only support dtype = u32 if A is f16"
+        A_1d = np.zeros(height*width*pe_length, dtype)
+        idx = 0
+        for l in range(pe_length):
+            for w in range(width):
+                for h in range(height):
+                    x = A_hwl[(h, w, l)]
+                    # x can be (np.float16, np.int16, np.uint16)
+                    # convert x to u16
+                    z = x.view(np.uint16)
+                    # zero extension of u16
+                    A_1d[idx] = np.uint32(z)
+                    idx = idx + 1
+    else:
+        raise RuntimeError(f"{A_hwl.dtype} is not supported")
+
+    return A_1d
+
+
+def oned_to_hwl_colmajor(
+    height: int,
+    width: int,
+    pe_length: int,
+    A_1d: np.ndarray,
+    dtype
+):
+    """
+    Given a 1-D tensor A_1d[height*width*pe_length], transform it to
+    a 3-D tensor A[height][width][pe_length] in column-major order
+    """
+    if dtype == np.float32:
+        # only support f32 to f32
+        assert A_1d.dtype == np.float32, "only support f32 to f32"
+        A_hwl = np.reshape(A_1d, (height, width, pe_length), order='F')
+
+    elif dtype == np.uint16:
+        # only support u32 to u16 by dropping upper 16-bit
+        assert A_1d.dtype == np.uint32, "only support u32 to u16"
+        A_hwl = np.zeros((height, width, pe_length), dtype)
+        idx = 0
+        for l in range(pe_length):
+            for w in range(width):
+                for h in range(height):
+                    x = A_1d[idx]
+                    x = x & 0x0000FFFF # drop upper 16-bit
+                    A_hwl[(h, w, l)] = np.uint16(x)
+                    idx = idx + 1
+    else:
+        raise RuntimeError(f"{dtype} is not supported")
+
+    return A_hwl
+
+
+def read_input_vector(IS_INVEC_1, vec_len):
+    if IS_INVEC_1:
+        return np.ones(vec_len).astype(np.float32)
+    else:
+        np.random.seed(0)
+        return np.random.rand(vec_len).astype(np.float32)
+
+
+# x is distributed into the core rectangle by the following steps
+# step 1: distribute x into columns
+#    vec_len_per_pe_col = ceil(vec_len / np_cols)
+# step 2: distribute the column into PEs
+#    vec_len_per_pe = ceil(vec_len_per_pe_col / np_rows)
+#
+# For example, if the core rectangle is 2-by-2 and vec_len is 13
+#    Each column has vec_len_per_pe_col = ceil(13/2) = 7
+#    The size of result is 7*2 = 14 which is bigger than vec_len due to padding
+#    Each PE has vec_len_per_pe = ceil(7/2) = 4
+#
+# If x is {1,2,3,4,5,6,7,8,9,10,11,12,13}, the core has
+#          PE.x=0      PE.x=1
+#    +-------------+-------------+
+#    | {1,2,3,4}   | {8,9,10,11} | PE.y=0
+#    +-------------+-------------+
+#    | {5,6,7,x}   | {12,13,x,x} | PE.y=1
+#    +-------------+-------------+
+# column 0 has 7 elements, {1,2,3,4,5,6,7}
+# column 1 has 6 elements, {8,9,10,11,12,13}
+#
+# The symbol x is DON'T CARE
+#
+def dist_x_to_hwl(ncols, x, local_vec_sz, np_cols, np_rows):
+    # core rectangle is np_cols-by-np_rows
+    #            np_cols
+    #         +----------+
+    # np_rows |  core    |
+    #         +----------+
+    # input vector is distributed into columns, then distributed into rows
+
+    vec_len = ncols
+    vec_len_per_pe_col = math.ceil(vec_len / np_cols)
+    vec_len_per_pe = math.ceil(vec_len_per_pe_col / np_rows)
+    assert(vec_len_per_pe == local_vec_sz)
+
+    pad_len_per_pe_col = (vec_len_per_pe * np_rows) - vec_len_per_pe_col
+
+    pad_len = (vec_len_per_pe_col * np_cols) - vec_len
+    # invec = [x, ones(pad_len)]
+    invec = np.copy(x)
+    ## BIG NOTE: Since this is input vector, padding needs to be 1s
+    if pad_len > 0:
+        invec = np.append(invec, np.ones(pad_len))
+
+    x_hwl = np.zeros( (np_rows, np_cols, vec_len_per_pe), x.dtype)
+    ## now this is equally divided into np_cols
+    for col in range(np_cols):
+        ## get the slice for this col and append padding
+        invec_col = invec[col * vec_len_per_pe_col : (col + 1) * vec_len_per_pe_col]
+        if pad_len_per_pe_col > 0:
+            invec_col = np.append(invec_col, np.ones(pad_len_per_pe_col)).astype(x.dtype)
+        ## now this is equally divided into np_rows
+        for row in range(np_rows):
+            ## get the slice for this row
+            data = invec_col[row * vec_len_per_pe : (row + 1) * vec_len_per_pe]
+            x_hwl[(row, col)] = data
+
+    return x_hwl
+
+# The dimension of out_vec is h-by-w-by-l
+# h = np_rows is the height of the core
+# w = np_cols is the width of the core
+# l = local_out_vec_sz is the size of local vector
+#
+# The out_vec_sz is the length of y = A*x
+#
+# y is distributed into the core rectangle by the following steps
+# step 1: distribute y into rows
+#    vec_len_per_pe_row = math.ceil(out_vec_sz / np_rows)
+# step 2: distribute the row into PEs
+#    vec_len_per_pe = math.ceil(vec_len_per_pe_row / np_cols)
+#
+# If out_vec_sz is smaller than (vec_len_per_pe_row*np_rows), padding is added
+#
+# The function unpad_3d_to_1d returns a result of size (vec_len_per_pe_row*np_rows)
+#
+# For example, if core rectangle is 2-by-2 and out_vec_sz is 13
+#    Each row has vec_len_per_pe_row = ceil(13/2) = 7
+#    The size of result is 7*2 = 14 which is bigger than out_vec_sz due to padding
+#    Each PE has vec_len_per_pe = ceil(7/2) = 4
+#
+# If y is {1,2,3,4,5,6,7,8,9,10,11,12,13}, the core has
+#          PE.x=0      PE.x=1
+#    +-------------+-------------+
+#    | {1,2,3,4}   | {5,6,7,x}   | PE.y=0
+#    +-------------+-------------+
+#    | {8,9,10,11} | {12,13,x,x} | PE.y=1
+#    +-------------+-------------+
+# row 0 has 7 elements, {1,2,3,4,5,6,7}
+# row 1 has 6 elements, {8,9,10,11,12,13}
+#
+# The symbol x is DON'T CARE
+#
+def unpad_3d_to_1d(out_vec_sz, out_vec):
+    assert 3 == out_vec.ndim, "y must be a 3-d tensor of the form h-by-w-by-l"
+    (height, width, local_out_vec_sz) = out_vec.shape
+    # core rectangle is np_cols-by-np_rows
+    #            np_cols
+    #         +----------+
+    # np_rows |  core    |
+    #         +----------+
+    np_rows = height
+    np_cols = width
+
+    vec_len_per_pe_row = math.ceil(out_vec_sz / np_rows)
+    vec_len_per_pe = math.ceil(vec_len_per_pe_row / np_cols)
+    # check if local_out_vec_sz = math.ceil(math.ceil(out_vec_sz / np_rows) / np_cols)
+    assert(vec_len_per_pe == local_out_vec_sz)
+
+    # result includes the padding
+    #    y = result[0:out_vec_sz]
+    # clear result to avoid bogus value outside the range [0, out_vec_sz)
+    result = np.zeros(vec_len_per_pe_row * np_rows, dtype = np.float32)
+    # tmp_buf gathers the data of one whole row of PEs,
+    # including the padding at the end of the row
+    tmp_buf = np.empty(vec_len_per_pe * np_cols, dtype = np.float32)
+    for row in range(np_rows):
+        low_idx = row * vec_len_per_pe_row
+        high_idx = low_idx + vec_len_per_pe_row
+        # gather data into tmp_buf
+        for col in range(np_cols):
+            start = col * vec_len_per_pe
+            end = start + vec_len_per_pe
+            tmp_buf[start:end] = out_vec[(row, col)]
+        result[low_idx:high_idx] = tmp_buf[0:vec_len_per_pe_row]
+    return result
+
+
+def verify_result(ref, res):
+    print(f'Comparing result with reference...')
+    abs_diff = np.sum(abs(ref - res))
+    abs_rel = abs_diff / len(ref)
+    print(f'reference[{len(ref)}]: \n{ref}')
+    print(f'result   [{len(res)}]: \n{res}')
+    print(f'[[ Absolute diff: {abs_diff} ]]')
+    print(f'[[ Average diff : {abs_rel} ]]')
+    atol = 1e-8
+    rtol = 1e-5
+    is_correct = np.allclose(ref, res, rtol, atol)
+    result = 'PASS' if is_correct else 'FAIL'
+    print(f'[[ Tolerances: atol = {atol}, rtol = {rtol} ]]')
+    print(f'[[ Result within tolerance: {result} ]]')
+    if not is_correct:
+        import pandas as pd
+        unequal = ~np.isclose(ref, res)
+        unequal_idx = list(np.where(unequal))
+        mismatches = list(zip(ref[tuple(unequal_idx)], res[tuple(unequal_idx)]))
+        df = pd.DataFrame(mismatches, columns=['reference', 'result'], index=unequal_idx)
+        print(f'{df}')
+
+
+# y = A*x
+# where A is nrows-by-ncols, represented by a CSR triplet
+def generate_reference(nrows, ncols, csrRowPtr, csrColInd, csrVal, x):
+    assert ncols == len(x), "the dimension of x does not match the dimension of A"
+    mat = sparse.csr_matrix((csrVal, csrColInd, csrRowPtr), shape=(nrows, ncols))
+    y = mat.dot(np.array(x).transpose())
+    return y
+
+
+def timing_analysis(height, width, nrows, ncols, nnz, time_memcpy_hwl, time_ref_hwl):
+    time_start = np.zeros((height, width)).astype(int)
+    time_end = np.zeros((height, width)).astype(int)
+    word = np.zeros(3).astype(np.uint16)
+    for w in range(width):
+        for h in range(height):
+            word[0] = time_memcpy_hwl[(h, w, 0)]
+            word[1] = time_memcpy_hwl[(h, w, 1)]
+            word[2] = time_memcpy_hwl[(h, w, 2)]
+            time_start[(h,w)] = make_u48(word)
+            word[0] = time_memcpy_hwl[(h, w, 3)]
+            word[1] = time_memcpy_hwl[(h, w, 4)]
+            word[2] = time_memcpy_hwl[(h, w, 5)]
+            time_end[(h,w)] = make_u48(word)
+
+    # time_ref = reference clock
+    time_ref = np.zeros((height, width)).astype(int)
+    word = np.zeros(3).astype(np.uint16)
+    for w in range(width):
+        for h in range(height):
+            word[0] = time_ref_hwl[(h, w, 0)]
+            word[1] = time_ref_hwl[(h, w, 1)]
+            word[2] = time_ref_hwl[(h, w, 2)]
+            time_ref[(h, w)] = make_u48(word)
+
+    # adjust the reference clock by the propagation delay:
+    # the bottom-right PE signals the other PEs, so the delay at PE (py, px) is
+    #     (height-1) - py + (width-1) - px
+    for py in range(height):
+        for px in range(width):
+            time_ref[(py, px)] = time_ref[(py, px)] - ((width+height-2)-(px + py))
+
+    # shift time_start and time_end by time_ref
+    time_start = time_start - time_ref
+    time_end = time_end - time_ref
+
+    # cycles_send = time_end[(h,w)] - time_start[(h,w)]
+    # 850MHz --> 1 cycle = (1/0.85) ns = (1/0.85)*1.e-3 us
+    # time_send = (cycles_send / 0.85) *1.e-3 us
+    #
+    # The spmv kernel performs y = A * x
+    #   y[i] = sum{ Aij * xj : Aij is nonzero }
+    # where A is m-by-n with nnz nonzeros
+    #
+    # We use the following standard measurement
+    # - read Aij: nnz
+    # - read xj: nnz
+    # - write y[i]: m
+    # Total number of wavelets: (2*nnz + m)
+    #
+    wvlts = 2 * nnz + nrows
+    min_time_start = time_start.min()
+    max_time_end = time_end.max()
+    cycles_send = max_time_end - min_time_start
+    time_send = (cycles_send / 0.85) *1.e-3
+    bandwidth = ((wvlts * 4)/time_send)
+    print(f"cycles_send = {cycles_send} cycles")
+    print(f"time_send = {time_send} us")
+    print(f"bandwidth = {bandwidth} MB/S ")
+
+
+def csl_compile_core(
+    csl_path: str, # path to CSL files
+    file_config : str,
+    elf_dir : str,
+    fabric_width : int,
+    fabric_height : int,
+    core_fabric_offset_x : int, # fabric-offsets of the core
+    core_fabric_offset_y : int,
+    arch : Optional[str],
+    ncols: int,
+    nrows: int,
+    np_cols: int,
+    np_rows: int,
+    max_local_nnz: int,
+    max_local_nnz_cols: int,
+    max_local_nnz_rows: int,
+    local_vec_sz: int,
+    local_out_vec_sz: int,
+    out_pad_start_idx: int,
+    channels:int,
+    width_west_buf: int,
+    width_east_buf:int
+):
+    compiler = SdkCompiler()
+    args = []
+    args.append(f"--fabric-dims={fabric_width},{fabric_height}") # options
+    args.append(f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}") # options
+    args.append(f"--params=ncols:{ncols}") # options
+    args.append(f"--params=nrows:{nrows}") # options
+    args.append(f"--params=pcols:{np_cols}") # options
+    args.append(f"--params=prows:{np_rows}") # options
+    args.append(f"--params=max_local_nnz:{max_local_nnz}") # options
+    args.append(f"--params=max_local_nnz_cols:{max_local_nnz_cols}") # options
+    args.append(f"--params=max_local_nnz_rows:{max_local_nnz_rows}") # options
+    args.append(f"--params=local_vec_sz:{local_vec_sz}") # options
+    args.append(f"--params=local_out_vec_sz:{local_out_vec_sz}") # options
+    args.append(f"--params=y_pad_start_row_idx:{out_pad_start_idx}") # options
+
+    args.append(f"-o={elf_dir}")
+    if arch is not None:
+        args.append(f"--arch={arch}")
+    args.append(f"--memcpy")
+    args.append(f"--channels={channels}")
+    args.append(f"--width-west-buf={width_west_buf}")
+    args.append(f"--width-east-buf={width_east_buf}")
+
+    args_str = " ".join(args)
+    hashstr = compiler.compile(csl_path, file_config, args_str)
+    print("compile artifact (csl_hash/oname):", hashstr)
+    return hashstr
+
+
+# How to compile:
+#  python run.py --arch=wse2 --num_pe_cols=4 --num_pe_rows=4 --channels=1 \
+#    --width-west-buf=0 --width-east-buf=0 --is_weight_one --compile-only \
+#    --infile_mtx=data/rmat4.4x4.lb.mtx
+#
+# How to run:
+#  python run.py --arch=wse2 --num_pe_cols=4 --num_pe_rows=4 --channels=1 \
+#    --width-west-buf=0 --width-east-buf=0 --is_weight_one --run-only \
+#    --infile_mtx=data/rmat4.4x4.lb.mtx
+#
+def main():
+    """Main method to run the example code."""
+
+    args = parse_args()
+
+    width_west_buf = args.width_west_buf
+    width_east_buf = args.width_east_buf
+    channels = args.channels
+    assert channels <= 16, "only support up to 16 I/O channels"
+    assert channels >= 1, "number of I/O channels must be at least 1"
+
+    print(f"width_west_buf = {width_west_buf}")
+    print(f"width_east_buf = {width_east_buf}")
+    print(f"channels = {channels}")
+
+    dirname = args.latestlink
+
+    # core rectangle is np_cols-by-np_rows
+    np_cols = args.num_pe_cols
+    np_rows = args.num_pe_rows
+    IS_INVEC_1 = args.is_invec_one
+
+    width = np_cols
+    height = np_rows
+    print(f"width = {width}, height = {height}")
+
+    start = time.time()
+    infile_mtx = args.infile_mtx
+    print(f"infile_mtx = {infile_mtx}")
+
+    A_coo = mmread(infile_mtx)
+    # the CSR format is 0-based
+    A_csr = A_coo.tocsr(copy=True)
+    # sort column indices
+    A_csr = A_csr.sorted_indices().astype(np.float32)
+    assert 1 == A_csr.has_sorted_indices, "Error: A is not sorted"
+
+    [nrows, ncols] = A_csr.shape
+    nnz = A_csr.nnz
+
+    print(f"Load matrix A, {nrows}-by-{ncols} with {nnz} nonzeros")
+
+    if not args.is_weight_one:
+        print("WARNING: reset the matrix with random values")
+        np.random.seed(123)
+        (A_csr.data)[0:nnz] = np.random.rand(nnz).astype(np.float32)
+
+    csrRowPtr = A_csr.indptr
+    csrColInd = A_csr.indices
+    csrVal    = A_csr.data
+
+    A_csc = A_csr.tocsc(copy=True)
+    # sort row indices
+    A_csc = A_csc.sorted_indices().astype(np.float32)
+    assert 1 == A_csc.has_sorted_indices, "Error: A is not sorted"
+
+    cscColPtr = A_csc.indptr
+    cscRowInd = A_csc.indices
+    cscVal    = A_csc.data
+
+    matrix_info = preprocess(
+        # A is nrows-by-ncols with nnz nonzeros
+        nrows,
+        ncols,
+        nnz,
+        # core rectangle is np_cols-by-np_rows
+        np_cols,
+        np_rows,
+        # (csrRowPtr, csrColInd, csrVal) is the CSR representation
+        csrRowPtr,
+        csrColInd,
+        csrVal,
+        # (cscColPtr, cscRowInd, cscVal) is the CSC representation
+        cscColPtr,
+        cscRowInd,
+        cscVal)
+
+    end = time.time()
+    print(f"prepare the structure for spmv kernel: {end-start}s", flush=True)
+
+    max_local_nnz = matrix_info['max_local_nnz']
+    max_local_nnz_cols = matrix_info['max_local_nnz_cols']
+    max_local_nnz_rows = matrix_info['max_local_nnz_rows']
+    mat_vals_buf = matrix_info['mat_vals_buf']
+    mat_rows_buf = matrix_info['mat_rows_buf']
+    mat_col_idx_buf = matrix_info['mat_col_idx_buf']
+    mat_col_loc_buf = matrix_info['mat_col_loc_buf']
+    mat_col_len_buf = matrix_info['mat_col_len_buf']
+    y_rows_init_buf = matrix_info['y_rows_init_buf']
+    local_nnz = matrix_info['local_nnz']
+    local_nnz_cols = matrix_info['local_nnz_cols']
+    local_nnz_rows = matrix_info['local_nnz_rows']
+
+    x_ref = read_input_vector(IS_INVEC_1, ncols)
+
+    # core rectangle is np_cols-by-np_rows
+    #            np_cols
+    #         +----------+
+    # np_rows |  core    |
+    #         +----------+
+    # input vector is distributed into columns, then distributed into rows
+    # output vector is distributed into rows, then distributed into columns
+    local_vec_sz = math.ceil(math.ceil(ncols / np_cols) / np_rows)
+    local_out_vec_sz = math.ceil(math.ceil(nrows / np_rows) / np_cols)
+
+    x_tx_buf = dist_x_to_hwl(ncols, x_ref, local_vec_sz, np_cols, np_rows)
+
+    print(f'Generating reference y = A*x ...')
+    y_ref = generate_reference(nrows, ncols, csrRowPtr, csrColInd, csrVal, x_ref)
+
+    mem_use_per_pe = memory_per_pe(max_local_nnz, max_local_nnz_cols, max_local_nnz_rows, local_vec_sz, local_out_vec_sz)
+    print(f'Total memory use per PE = {mem_use_per_pe} bytes = {mem_use_per_pe / 1024} KB', flush=True)
+    assert mem_use_per_pe < 46*1024, "exceed maximum memory capacity, increase the core rectangle"
+
+    # fabric-offsets = 1,1
+    fabric_offset_x = 1
+    fabric_offset_y = 1
+    # starting point of the core rectangle = (core_fabric_offset_x, core_fabric_offset_y)
+    # memcpy framework requires 3 columns at the west of the core rectangle
+    # memcpy framework requires 2 columns at the east of the core rectangle
+    core_fabric_offset_x = fabric_offset_x + 3 + width_west_buf
+    core_fabric_offset_y = fabric_offset_y
+    # (min_fabric_width, min_fabric_height) is the minimal dimension to run the app
+    min_fabric_width = (core_fabric_offset_x + width + 2 + 1 + width_east_buf)
+    min_fabric_height = (core_fabric_offset_y + height + 1)
+
+    fabric_width = 0
+    fabric_height = 0
+    if args.fabric_dims:
+        w_str, h_str = args.fabric_dims.split(",")
+        fabric_width = int(w_str)
+        fabric_height = int(h_str)
+
+    if fabric_width == 0 or fabric_height == 0:
+        fabric_width = min_fabric_width
+        fabric_height = min_fabric_height
+
+    assert fabric_width >= min_fabric_width
+    assert fabric_height >= min_fabric_height
+
+    print(f"fabric_width = {fabric_width}, fabric_height = {fabric_height}")
+    print(f"core_fabric_offset_x = {core_fabric_offset_x}, core_fabric_offset_y = {core_fabric_offset_y}")
+
+    # prepare the simulation
+    print('store ELFs and log files in the folder ', dirname)
+
+    # layout of a rectangle
+    code_csl = "layout.csl"
+
+    ## calculate the output vector padding info
+    out_vec_len_per_pe_row = math.ceil(nrows / np_rows)
+    out_pad_start_idx = out_vec_len_per_pe_row
+
+
+    csl_path = "./src"
+
+    if args.compile_only:
+        print("WARNING: compile only; SdkRuntime is not run because the server shuts down after compilation")
+        start = time.time()
+        hashstr = csl_compile_core(
+            csl_path,
+            code_csl,
+            dirname,
+            fabric_width,
+            fabric_height,
+            core_fabric_offset_x, # fabric-offsets of the core
+            core_fabric_offset_y,
+            args.arch,
+            # matrix dimensions: A is nrows-by-ncols
+            ncols, # n, number of columns of the matrix
+            nrows, # m, number of rows of the matrix
+            np_cols, # width
+            np_rows, # height
+            max_local_nnz,
+            max_local_nnz_cols,
+            max_local_nnz_rows,
+            local_vec_sz,
+            local_out_vec_sz,
+            out_pad_start_idx,
+            channels,
+            width_west_buf,
+            width_east_buf
+        )
+        end = time.time()
+        print(f"Compilation done in {end-start}s", flush=True)
+        print(f"dump artifact name to file {hash_filename}")
+        with open(hash_filename, "w") as write_file:
+            json.dump(hashstr, write_file)
+        print("COMPILE ONLY: EXIT")
+        return
+
+    print(f"load artifact name from file {hash_filename}")
+    with open(hash_filename, "r") as f:
+        hashstr = json.load(f)
+
+    start = time.time()
+    with SdkRuntime(hashstr, simulator=args.simulator) as runner:
+
+        sym_mat_vals_buf = runner.get_id("mat_vals_buf")
+        sym_x_tx_buf = runner.get_id("x_tx_buf")
+        sym_y_local_buf = runner.get_id("y_local_buf")
+
+        sym_mat_rows_buf = runner.get_id("mat_rows_buf")
+        sym_mat_col_idx_buf = runner.get_id("mat_col_idx_buf")
+        sym_mat_col_loc_buf = runner.get_id("mat_col_loc_buf")
+        sym_mat_col_len_buf = runner.get_id("mat_col_len_buf")
+        sym_y_rows_init_buf = runner.get_id("y_rows_init_buf")
+        sym_local_nnz = runner.get_id("local_nnz")
+        sym_local_nnz_cols = runner.get_id("local_nnz_cols")
+        sym_local_nnz_rows = runner.get_id("local_nnz_rows")
+        sym_time_buf_u16 = runner.get_id("time_buf_u16")
+        sym_time_ref_u16 = runner.get_id("time_ref_u16")
+
+        # load() and run() are called by client.SdkRuntime.__enter__
+        #runner.load()
+        #runner.run()
+
+        print("step 1: enable tsc counter to sample the clock")
+        runner.launch("f_enable_tsc", nonblock=True)
+
+        print("step 2: copy the structure of A and vector x to the device")
+        # 1. mat_vals_buf[max_local_nnz], type = f32
+        mat_vals_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz, mat_vals_buf, np.float32)
+        runner.memcpy_h2d(sym_mat_vals_buf, mat_vals_buf_1d, 0, 0, width, height, max_local_nnz,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 2: x_tx_buf[local_vec_sz], type = f32
+        x_tx_buf_1d = hwl_to_oned_colmajor(height, width, local_vec_sz, x_tx_buf, np.float32)
+        runner.memcpy_h2d(sym_x_tx_buf, x_tx_buf_1d, 0, 0, width, height, local_vec_sz,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 3: mat_rows_buf[max_local_nnz], type = u16
+        mat_rows_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz, mat_rows_buf, np.uint32)
+        runner.memcpy_h2d(sym_mat_rows_buf, mat_rows_buf_1d, 0, 0, width, height, max_local_nnz,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 4: mat_col_idx_buf[max_local_nnz_cols], type = u16
+        mat_col_idx_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_idx_buf, np.uint32)
+        runner.memcpy_h2d(sym_mat_col_idx_buf, mat_col_idx_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 5: mat_col_loc_buf[max_local_nnz_cols], type = u16
+        mat_col_loc_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_loc_buf, np.uint32)
+        runner.memcpy_h2d(sym_mat_col_loc_buf, mat_col_loc_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 6: mat_col_len_buf[max_local_nnz_cols], type = u16
+        mat_col_len_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_len_buf, np.uint32)
+        runner.memcpy_h2d(sym_mat_col_len_buf, mat_col_len_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 7: y_rows_init_buf[max_local_nnz_rows], type = u16
+        y_rows_init_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_rows, y_rows_init_buf, np.uint32)
+        runner.memcpy_h2d(sym_y_rows_init_buf, y_rows_init_buf_1d, 0, 0, width, height, max_local_nnz_rows,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 8: local_nnz, type = u16
+        local_nnz_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz, np.uint32)
+        runner.memcpy_h2d(sym_local_nnz, local_nnz_1d, 0, 0, width, height, 1,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 9: local_nnz_cols, type = u16
+        local_nnz_cols_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz_cols, np.uint32)
+        runner.memcpy_h2d(sym_local_nnz_cols, local_nnz_cols_1d, 0, 0, width, height, 1,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        # 10: local_nnz_rows, type = u16
+        local_nnz_rows_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz_rows, np.uint32)
+        runner.memcpy_h2d(sym_local_nnz_rows, local_nnz_rows_1d, 0, 0, width, height, 1,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
+
+        print("step 3: sync all PEs to sample the reference clock")
+        runner.launch("f_sync", np.int16(1), nonblock=False)
+
+        print("step 4: tic() records time_start")
+        runner.launch("f_tic", nonblock=True)
+
+        print("step 5: spmv")
+        runner.launch("f_spmv", nonblock=False)
+
+        print("step 5: toc() records time_end")
+        runner.launch("f_toc", nonblock=False)
+
+        print("step 6: prepare (time_start, time_end)")
+        runner.launch("f_memcpy_timestamps", nonblock=False)
+
+        print("step 7: fetch the timing time_buf_u16[6] = (time_start, time_end), type = u16")
+        time_memcpy_hwl_1d = np.zeros(height*width*6, np.uint32)
+        runner.memcpy_d2h(time_memcpy_hwl_1d, sym_time_buf_u16, 0, 0, width, height, 6,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
+        time_memcpy_hwl = oned_to_hwl_colmajor(height, width, 6, time_memcpy_hwl_1d, np.uint16)
+
+        print("step 8: fetch the output vector y of type f32")
+        y_1d = np.zeros(height*width*local_out_vec_sz, np.float32)
+        runner.memcpy_d2h(y_1d, sym_y_local_buf, 0, 0, width, height, local_out_vec_sz,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
+
+        print("step 9: prepare reference clock")
+        runner.launch("f_reference_timestamps", nonblock=False)
+
+        print("step 10: D2H reference clock")
+        time_ref_1d = np.zeros(height*width*3, np.uint32)
+        runner.memcpy_d2h(time_ref_1d, sym_time_ref_u16, 0, 0, width, height, 3,\
+            streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
+        time_ref_hwl = oned_to_hwl_colmajor(height, width, 3, time_ref_1d, np.uint16)
+
+        # stop() is called by client.SdkRuntime.__exit__
+        #runner.stop()
+
+    end = time.time()
+    print(f"*** Run done in {end-start}s")
+
+    timing_analysis(height, width, nrows, ncols, nnz, time_memcpy_hwl, time_ref_hwl)
+
+    # The output y_wse is distributed across np_rows-by-np_cols PEs
+    y_wse = np.reshape(y_1d, (height, width, local_out_vec_sz), order='F')
+    # y_wse is packed into 1d vector with zero padding
+    y_wse = unpad_3d_to_1d(nrows, y_wse)
+    # remove padding of y_wse because y_ref has no padding
+    verify_result(y_ref, y_wse[0:nrows])
+
+    # dump the device memory via debug tool
+    if args.simulator:
+        print(f"time_ref_hwl = \n{time_ref_hwl}")
+        debug_mod = debug_util(hashstr, args.simulator)
+        for py in range(height):
+            for px in range(width):
+                t = debug_mod.get_symbol(core_fabric_offset_x+px, core_fabric_offset_y+py, 'time_ref_u16', np.uint16)
+                print(f"(py, px) = {py, px}, time_ref_u16_ij = {t}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/spmv-hypersparse/run.py b/benchmarks/spmv-hypersparse/run.py
index ee8e154..5afc90f 100644
--- a/benchmarks/spmv-hypersparse/run.py
+++ b/benchmarks/spmv-hypersparse/run.py
@@ -617,7 +617,7 @@ def main():
     print('store ELFs and log files in the folder ', dirname)
 
     # layout of a rectangle
-    code_csl = "layout.csl"
+    code_csl = "src/layout.csl"
 
     ## calculate the output vector padding info
     out_vec_len_per_pe_row = math.ceil(nrows / np_rows)
@@ -655,121 +655,121 @@ def main():
         print("COMPILE ONLY: EXIT")
         return
 
-    simulator = SdkRuntime(dirname, cmaddr=args.cmaddr)
+    runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
 
-    sym_mat_vals_buf = simulator.get_id("mat_vals_buf")
-    sym_x_tx_buf = simulator.get_id("x_tx_buf");
-    sym_y_local_buf = simulator.get_id("y_local_buf");
+    sym_mat_vals_buf = runner.get_id("mat_vals_buf")
+    sym_x_tx_buf = runner.get_id("x_tx_buf")
+    sym_y_local_buf = runner.get_id("y_local_buf")
 
-    sym_mat_rows_buf = simulator.get_id("mat_rows_buf")
-    sym_mat_col_idx_buf = simulator.get_id("mat_col_idx_buf")
-    sym_mat_col_loc_buf = simulator.get_id("mat_col_loc_buf")
-    sym_mat_col_len_buf = simulator.get_id("mat_col_len_buf")
-    sym_y_rows_init_buf = simulator.get_id("y_rows_init_buf")
-    sym_local_nnz = simulator.get_id("local_nnz")
-    sym_local_nnz_cols = simulator.get_id("local_nnz_cols")
-    sym_local_nnz_rows = simulator.get_id("local_nnz_rows")
-    sym_time_buf_u16 = simulator.get_id("time_buf_u16")
-    sym_time_ref_u16 = simulator.get_id("time_ref_u16")
+    sym_mat_rows_buf = runner.get_id("mat_rows_buf")
+    sym_mat_col_idx_buf = runner.get_id("mat_col_idx_buf")
+    sym_mat_col_loc_buf = runner.get_id("mat_col_loc_buf")
+    sym_mat_col_len_buf = runner.get_id("mat_col_len_buf")
+    sym_y_rows_init_buf = runner.get_id("y_rows_init_buf")
+    sym_local_nnz = runner.get_id("local_nnz")
+    sym_local_nnz_cols = runner.get_id("local_nnz_cols")
+    sym_local_nnz_rows = runner.get_id("local_nnz_rows")
+    sym_time_buf_u16 = runner.get_id("time_buf_u16")
+    sym_time_ref_u16 = runner.get_id("time_ref_u16")
 
     start = time.time()
-    simulator.load()
+    runner.load()
     end = time.time()
     print(f"*** Load done in {end-start}s")
 
     start = time.time()
-    simulator.run()
+    runner.run()
 
     print("step 1: enable tsc counter to sample the clock")
-    simulator.launch("f_enable_tsc", nonblock=True)
+    runner.launch("f_enable_tsc", nonblock=True)
 
     print("step 2: copy the structure of A and vector x to the device")
     # 1. mat_vals_buf[max_local_nnz], type = f32
     mat_vals_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz, mat_vals_buf, np.float32)
-    simulator.memcpy_h2d(sym_mat_vals_buf, mat_vals_buf_1d, 0, 0, width, height, max_local_nnz,\
+    runner.memcpy_h2d(sym_mat_vals_buf, mat_vals_buf_1d, 0, 0, width, height, max_local_nnz,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 2: x_tx_buf[local_vec_sz], type = f32
     x_tx_buf_1d = hwl_to_oned_colmajor(height, width, local_vec_sz, x_tx_buf, np.float32)
-    simulator.memcpy_h2d(sym_x_tx_buf, x_tx_buf_1d, 0, 0, width, height, local_vec_sz,\
+    runner.memcpy_h2d(sym_x_tx_buf, x_tx_buf_1d, 0, 0, width, height, local_vec_sz,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 3: mat_rows_buf[max_local_nnz], type = u16
     mat_rows_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz, mat_rows_buf, np.uint32)
-    simulator.memcpy_h2d(sym_mat_rows_buf, mat_rows_buf_1d, 0, 0, width, height, max_local_nnz,\
+    runner.memcpy_h2d(sym_mat_rows_buf, mat_rows_buf_1d, 0, 0, width, height, max_local_nnz,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 4: mat_col_idx_buf[max_local_nnz_cols], type = u16
     mat_col_idx_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_idx_buf, np.uint32)
-    simulator.memcpy_h2d(sym_mat_col_idx_buf, mat_col_idx_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
+    runner.memcpy_h2d(sym_mat_col_idx_buf, mat_col_idx_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 5: mat_col_loc_buf[max_local_nnz_cols], type = u16
     mat_col_loc_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_loc_buf, np.uint32)
-    simulator.memcpy_h2d(sym_mat_col_loc_buf, mat_col_loc_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
+    runner.memcpy_h2d(sym_mat_col_loc_buf, mat_col_loc_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 6: mat_col_len_buf[max_local_nnz_cols], type = u16
     mat_col_len_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_cols, mat_col_len_buf, np.uint32)
-    simulator.memcpy_h2d(sym_mat_col_len_buf, mat_col_len_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
+    runner.memcpy_h2d(sym_mat_col_len_buf, mat_col_len_buf_1d, 0, 0, width, height, max_local_nnz_cols,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 7: y_rows_init_buf[max_local_nnz_rows], type = u16
     y_rows_init_buf_1d = hwl_to_oned_colmajor(height, width, max_local_nnz_rows, y_rows_init_buf, np.uint32)
-    simulator.memcpy_h2d(sym_y_rows_init_buf, y_rows_init_buf_1d, 0, 0, width, height, max_local_nnz_rows,\
+    runner.memcpy_h2d(sym_y_rows_init_buf, y_rows_init_buf_1d, 0, 0, width, height, max_local_nnz_rows,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 8: local_nnz, type = u16
     local_nnz_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz, np.uint32)
-    simulator.memcpy_h2d(sym_local_nnz, local_nnz_1d, 0, 0, width, height, 1,\
+    runner.memcpy_h2d(sym_local_nnz, local_nnz_1d, 0, 0, width, height, 1,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 9: local_nnz_cols, type = u16
     local_nnz_cols_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz_cols, np.uint32)
-    simulator.memcpy_h2d(sym_local_nnz_cols, local_nnz_cols_1d, 0, 0, width, height, 1,\
+    runner.memcpy_h2d(sym_local_nnz_cols, local_nnz_cols_1d, 0, 0, width, height, 1,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     # 10: local_nnz_rows, type = u16
     local_nnz_rows_1d = hwl_to_oned_colmajor(height, width, 1, local_nnz_rows, np.uint32)
-    simulator.memcpy_h2d(sym_local_nnz_rows, local_nnz_rows_1d, 0, 0, width, height, 1,\
+    runner.memcpy_h2d(sym_local_nnz_rows, local_nnz_rows_1d, 0, 0, width, height, 1,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=True)
 
     print("step 3: sync all PEs to sample the reference clock")
-    simulator.launch("f_sync", np.int16(1), nonblock=False)
+    runner.launch("f_sync", np.int16(1), nonblock=False)
 
     print("step 4: tic() records time_start")
-    simulator.launch("f_tic", nonblock=True)
+    runner.launch("f_tic", nonblock=True)
 
     print("step 5: spmv")
-    simulator.launch("f_spmv", nonblock=False)
+    runner.launch("f_spmv", nonblock=False)
 
     print("step 5: toc() records time_end")
-    simulator.launch("f_toc", nonblock=False)
+    runner.launch("f_toc", nonblock=False)
 
     print("step 6: prepare (time_start, time_end)")
-    simulator.launch("f_memcpy_timestamps", nonblock=False)
+    runner.launch("f_memcpy_timestamps", nonblock=False)
 
     print("step 7: fetch the timing time_buf_u16[6] = (time_start, time_end), type = u16")
     time_memcpy_hwl_1d = np.zeros(height*width*6, np.uint32)
-    simulator.memcpy_d2h(time_memcpy_hwl_1d, sym_time_buf_u16, 0, 0, width, height, 6,\
+    runner.memcpy_d2h(time_memcpy_hwl_1d, sym_time_buf_u16, 0, 0, width, height, 6,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
     time_memcpy_hwl = oned_to_hwl_colmajor(height, width, 6, time_memcpy_hwl_1d, np.uint16)
 
     print("step 8: fetch the output vector y of type f32")
     y_1d = np.zeros(height*width*local_out_vec_sz, np.float32)
-    simulator.memcpy_d2h(y_1d, sym_y_local_buf, 0, 0, width, height, local_out_vec_sz,\
+    runner.memcpy_d2h(y_1d, sym_y_local_buf, 0, 0, width, height, local_out_vec_sz,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
 
     print("step 9: prepare reference clock")
-    simulator.launch("f_reference_timestamps", nonblock=False)
+    runner.launch("f_reference_timestamps", nonblock=False)
 
     print("step 10: D2H reference clock")
     time_ref_1d = np.zeros(height*width*3, np.uint32)
-    simulator.memcpy_d2h(time_ref_1d, sym_time_ref_u16, 0, 0, width, height, 3,\
+    runner.memcpy_d2h(time_ref_1d, sym_time_ref_u16, 0, 0, width, height, 3,\
         streaming=False, data_type=MemcpyDataType.MEMCPY_16BIT, order=MemcpyOrder.COL_MAJOR, nonblock=False)
     time_ref_hwl = oned_to_hwl_colmajor(height, width, 3, time_ref_1d, np.uint16)
 
-    simulator.stop()
+    runner.stop()
 
     end = time.time()
     print(f"*** Run done in {end-start}s")
@@ -783,7 +783,7 @@ def main():
     # remove padding of y_wse because y_ref has no padding
     verify_result(y_ref, y_wse[0:nrows])
 
-    if args.cmaddr is None:
+    if args.simulator:
         # move simulation log and core dump to the given folder
         dst_log = Path(f"{dirname}/sim.log")
         src_log = Path("sim.log")
@@ -798,7 +798,7 @@ def main():
             shutil.move(src_trace, dst_trace)
 
     # dump the device memory via debug tool
-    if 0:
+    if args.simulator:
         print(f"time_ref_hwl = \n{time_ref_hwl}")
         debug_mod = debug_util(dirname, cmaddr=args.cmaddr)
         for py in range(height):
diff --git a/benchmarks/spmv-hypersparse/allreduce2R1E/layout.csl b/benchmarks/spmv-hypersparse/src/allreduce2R1E/layout.csl
similarity index 100%
rename from benchmarks/spmv-hypersparse/allreduce2R1E/layout.csl
rename to benchmarks/spmv-hypersparse/src/allreduce2R1E/layout.csl
diff --git a/benchmarks/spmv-hypersparse/allreduce2R1E/pe.csl b/benchmarks/spmv-hypersparse/src/allreduce2R1E/pe.csl
similarity index 100%
rename from benchmarks/spmv-hypersparse/allreduce2R1E/pe.csl
rename to benchmarks/spmv-hypersparse/src/allreduce2R1E/pe.csl
diff --git a/benchmarks/spmv-hypersparse/hypersparse_spmv/layout.csl b/benchmarks/spmv-hypersparse/src/hypersparse_spmv/layout.csl
similarity index 100%
rename from benchmarks/spmv-hypersparse/hypersparse_spmv/layout.csl
rename to benchmarks/spmv-hypersparse/src/hypersparse_spmv/layout.csl
diff --git a/benchmarks/spmv-hypersparse/hypersparse_spmv/pe.csl b/benchmarks/spmv-hypersparse/src/hypersparse_spmv/pe.csl
similarity index 100%
rename from benchmarks/spmv-hypersparse/hypersparse_spmv/pe.csl
rename to benchmarks/spmv-hypersparse/src/hypersparse_spmv/pe.csl
diff --git a/benchmarks/spmv-hypersparse/kernel.csl b/benchmarks/spmv-hypersparse/src/kernel.csl
similarity index 100%
rename from benchmarks/spmv-hypersparse/kernel.csl
rename to benchmarks/spmv-hypersparse/src/kernel.csl
diff --git a/benchmarks/spmv-hypersparse/layout.csl b/benchmarks/spmv-hypersparse/src/layout.csl
similarity index 100%
rename from benchmarks/spmv-hypersparse/layout.csl
rename to benchmarks/spmv-hypersparse/src/layout.csl
diff --git a/benchmarks/stencil-v2/README.rst b/benchmarks/stencil-v2/README.rst
deleted file mode 100644
index f69c3d8..0000000
--- a/benchmarks/stencil-v2/README.rst
+++ /dev/null
@@ -1,6 +0,0 @@
-Stencil
-========
-
-This example is documented in the section entitled `A 3D 25-point Stencil
-<https://sdk.cerebras.net/csl/code-examples/stencil-v2.html/>`_ in the
-``CSL Code Examples`` section of the SDK Documentation website.
diff --git a/benchmarks/stencil-v2/cmd_parser.py b/benchmarks/stencil-v2/cmd_parser.py
deleted file mode 100644
index 0d183c1..0000000
--- a/benchmarks/stencil-v2/cmd_parser.py
+++ /dev/null
@@ -1,112 +0,0 @@
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# This is not a real test, but a module that gets imported in other tests.
-
-"""parse command line for sparse level routines
-
-   -m <int>     number of rows of the matrix A
-   -n <int>     number of columns of the matrix A
-   --local_out_sz <int>  dimension of submatrix in tile approach,
-                         or number of rows in non-tile approach
-   --eps        tolerance
-   --latestlink   working directory
-   --debug      show A, x, and b
-   --sdkgui     prepare data for sdk gui, including source code
-   --driver     path to CSL compiler
-   --autocsl    use get_cslang_dir to find out the path of CSL
-
-"""
-
-
-import argparse
-
-
-SIZE = 10
-ZDIM = 10
-ITERATIONS = 10
-DX = 20
-
-
-def parse_args():
-  parser = argparse.ArgumentParser()
-
-  parser.add_argument('--name', help='the test name')
-  parser.add_argument(
-            '--zDim', type=int, help='size of the Z dimension', default=ZDIM
-            )
-  parser.add_argument(
-            '--size', type=int, help='size of the domain in x and y dims', default=SIZE
-            )
-
-  parser.add_argument(
-            '--skip-compile', action="store_true",
-            help='Skip compilation of the code from python'
-            )
-
-  parser.add_argument(
-            '--skip-run', action="store_true",
-            help='Skip run of the code from python'
-            )
-
-  parser.add_argument(
-            '--iterations',
-            type=int,
-            help='number of timesteps to simulate',
-            default=ITERATIONS
-            )
-
-  parser.add_argument(
-            '--dx',
-            type=int,
-            help='dx value (impacting the boundary)', default=DX
-            )
-
-  parser.add_argument(
-            '--fabric_width',
-            type=int,
-            help='Width of the fabric we are compiling for',
-            )
-
-  parser.add_argument(
-            '--fabric_height',
-            type=int,
-            help='Height of the fabric we are compiling for',
-            )
-
-  parser.add_argument('--cmaddr', help='IP:port for CS system')
-
-  parser.add_argument(
-            "--debug",
-            help="show A, x, and b", action="store_true"
-            )
-
-  parser.add_argument(
-            "--width-west-buf",
-            default=0, type=int,
-            help="width of west buffer")
-  parser.add_argument(
-            "--width-east-buf",
-            default=0, type=int,
-            help="width of east buffer")
-  parser.add_argument(
-            "--n_channels",
-            default=1, type=int,
-            help="Number of memcpy \"channels\" (LVDS/streamers for both input and output)  to use \
-            when memcpy support is compiled with this program. If this argument is not present, \
-            or is 0, then the previous single-LVDS version is compiled.")
-
-  args = parser.parse_args()
-
-  return args
diff --git a/benchmarks/stencil-v2/code_memcpy.csl b/benchmarks/stencil-v2/code_memcpy.csl
deleted file mode 100644
index 7ccbf9b..0000000
--- a/benchmarks/stencil-v2/code_memcpy.csl
+++ /dev/null
@@ -1,248 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-////////////////////////////////////////////////////////////////////////////////
-// The code for this 3D 25-point stencil was inspired by the proprietary code //
-// of TotalEnergies EP Research & Technology US.                              //
-////////////////////////////////////////////////////////////////////////////////
-
-// The core kernel must start at P4.1 so the memcpy infrastructure has enough
-// resources to route the data between the host and the device.
-//
-
-// color map of FD + memcpy:
-//
-// color  var             color  var          color  var              color  var
-//   0                      9    westDataFin   18    northCtrlFin2     27   reserved (memcpy)
-//   1                     10    northDataFin  19    southCtrlFin2     28   reserved (memcpy)
-//   2   f_comp            11    southDataFin  20    eastFin           29   reserved (memcpy)
-//   3   send              12    eastCtrlFin   21    reserved (memcpy) 30   reserved (memcpy)
-//   4   eastChannel       13    westCtrlFin   22    reserved (memcpy) 31   reserved
-//   5   westChannel       14    northCtrlFin  23    reserved (memcpy) 32
-//   6   northChannel      15    southCtrlFin  24    westFin           33
-//   7   southChannel      16    eastCtrlFin2  25    northFin          34
-//   8   eastDataFin       17    westCtrlFin2  26    southFin          35
-//
-
-// Colors
-param eastChannel:  color = @get_color(4);
-param westChannel:  color = @get_color(5);
-param northChannel: color = @get_color(6);
-param southChannel: color = @get_color(7);
-
-// Task IDs
-param send: local_task_id = @get_local_task_id(3);
-param COMP: local_task_id = @get_local_task_id(2);
-
-param eastDataFin:  local_task_id = @get_local_task_id(8);
-param westDataFin:  local_task_id = @get_local_task_id(9);
-param northDataFin: local_task_id = @get_local_task_id(10);
-param southDataFin: local_task_id = @get_local_task_id(11);
-
-param eastCtrlFin:  local_task_id = @get_local_task_id(12);
-param westCtrlFin:  local_task_id = @get_local_task_id(13);
-param northCtrlFin: local_task_id = @get_local_task_id(14);
-param southCtrlFin: local_task_id = @get_local_task_id(15);
-
-// the following four are entrypoints (send control wavelets for switch)
-// we don't need to bind them to 0~23
-param eastCtrlFin2:  local_task_id = @get_local_task_id(16);
-param westCtrlFin2:  local_task_id = @get_local_task_id(17);
-param northCtrlFin2: local_task_id = @get_local_task_id(18);
-param southCtrlFin2: local_task_id = @get_local_task_id(19);
-
-param eastFin:  local_task_id = @get_local_task_id(20);
-// WARNING: ID 21: reserved (memcpy)
-// WARNING: ID 22: reserved (memcpy)
-//          ID 23: reserved (memcpy)
-param westFin:  local_task_id = @get_local_task_id(24);
-param northFin: local_task_id = @get_local_task_id(25);
-param southFin: local_task_id = @get_local_task_id(26);
-
-
-param width: u16;
-param height: u16;
-param zDim: u16;
-param sourceLength: u16;
-param dx: u16;
-
-// Number of neighbors (plus self) that each PE will communicate with in all
-// directions.  The (three-dimensional) stencil size is `6 * (pattern - 1) + 1`.
-const pattern: u16 = 5;
-
-//// The coordinates of the "source" PE, which adds a small value to the wavefield
-//// in each iteration.
-param srcX: u16;
-param srcY: u16;
-param srcZ: u16;
-
-// The following parameters are the same for every PE, so we extract them out of
-// the loop that sets PE-specific parameters below.
-const invariants = .{
-  .send = send,
-  .zDim = zDim,
-  .pattern = pattern,
-  .sourceLength = sourceLength,
-  .dx = dx,
-  .width = width,
-  .height = height,
-  .srcZ = srcZ,
-
-  .eastFin = eastFin,
-  .westFin = westFin,
-  .northFin = northFin,
-  .southFin = southFin,
-
-  .eastDataFin = eastDataFin,
-  .westDataFin = westDataFin,
-  .northDataFin = northDataFin,
-  .southDataFin = southDataFin,
-
-  .eastCtrlFin = eastCtrlFin,
-  .westCtrlFin = westCtrlFin,
-  .northCtrlFin = northCtrlFin,
-  .southCtrlFin = southCtrlFin,
-
-  .eastCtrlFin2 = eastCtrlFin2,
-  .westCtrlFin2 = westCtrlFin2,
-  .northCtrlFin2 = northCtrlFin2,
-  .southCtrlFin2 = southCtrlFin2,
-
-  .eastChannel = eastChannel,
-  .westChannel = westChannel,
-  .northChannel = northChannel,
-  .southChannel = southChannel,
-};
-
-const util = @import_module("util.csl");
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-    .width = width,
-    .height = height,
-    });
-
-layout {
-  @comptime_assert(pattern <= width);
-  @comptime_assert(pattern <= height);
-  @comptime_assert(pattern > 1 and pattern < 8);
-
-  // step 1: configure the rectangle which does not include halo
-  @set_rectangle(width, height);
-
-  // step 2: compile csl code for a set of PEx.y and generate out_x_y.elf
-  //   format: @set_tile_code(x, y, code.csl, param_binding);
-
-  var xId = 0;
-  while (xId < width) : (xId += 1) {
-
-    // We specify the communication pattern in just one
-    // (eastward) direction out of the four cardinal directions (east, west,
-    // north, and south).  We then mirror the communication pattern in all other
-    // directions using relative PE IDs.  For instance, westward communication
-    // is identical to eastward communication with decreasing X coordinates.
-    // Similarly, southward communication is the same as eastward communication,
-    // except using the Y coordinate instead of the X coordinate.
-
-    // Here we compute the relative coordinates for westward and eastward
-    // communication.
-    const westPeId = util.computeRelativePeId(xId, width, WEST);
-    const eastPeId = util.computeRelativePeId(xId, width, EAST);
-
-    const westParams = .{
-      .westFirst = westPeId == 0,
-      .westLast = westPeId == width - 1,
-      .westPatternId = westPeId % pattern,
-      .westNotNeedsPos3 = westPeId < pattern - 1,
-      .westPatternFirst = westPeId % pattern == 0,
-      .westPatternLast = westPeId % pattern == pattern - 1,
-      .westSenderCount = util.min(pattern, westPeId + 1),
-    };
-
-    const eastParams = .{
-      .eastFirst = eastPeId == 0,
-      .eastLast = eastPeId == width - 1,
-      .eastPatternId = eastPeId % pattern,
-      .eastNotNeedsPos3 = eastPeId < pattern - 1,
-      .eastPatternFirst = eastPeId % pattern == 0,
-      .eastPatternLast = eastPeId % pattern == pattern - 1,
-      .eastSenderCount = util.min(pattern, eastPeId + 1),
-    };
-
-    const hParams = @concat_structs(westParams, eastParams);
-
-    var yId = 0;
-    while (yId < height) : (yId += 1) {
-
-      // Here we compute the relative coordinates for northward and southward
-      // communication.
-      const northPeId = util.computeRelativePeId(yId, height, NORTH);
-      const southPeId = util.computeRelativePeId(yId, height, SOUTH);
-
-      const northParams = .{
-        .northFirst = northPeId == 0,
-        .northLast = northPeId == height - 1,
-        .northPatternId = northPeId % pattern,
-        .northNotNeedsPos3 = northPeId < pattern - 1,
-        .northPatternFirst = northPeId % pattern == 0,
-        .northPatternLast = northPeId % pattern == pattern - 1,
-        .northSenderCount = util.min(pattern, northPeId + 1),
-      };
-
-      const southParams = .{
-        .southFirst = southPeId == 0,
-        .southLast = southPeId == height - 1,
-        .southPatternId = southPeId % pattern,
-        .southNotNeedsPos3 = southPeId < pattern - 1,
-        .southPatternFirst = southPeId % pattern == 0,
-        .southPatternLast = southPeId % pattern == pattern - 1,
-        .southSenderCount = util.min(pattern, southPeId + 1),
-      };
-
-      const vParams = @concat_structs(northParams, southParams);
-      const dirParams = @concat_structs(hParams, vParams);
-      const baseParams = @concat_structs(invariants, dirParams);
-
-      const params = @concat_structs(.{
-        .isSourcePe = xId == srcX and yId == srcY,
-        .isTscOutPe = xId == width - 1 and yId == 0,
-      }, baseParams);
-
-
-      // additional colors for memcpy
-      const params_task = @concat_structs( .{
-            .COMP = COMP,
-            ._px=xId,
-      }, params);
-
-      const memcpyParams = memcpy.get_params(xId);
-
-      @set_tile_code(xId, yId, "task_memcpy.csl", @concat_structs( .{
-            .memcpyParams = memcpyParams,
-      }, params_task));
-
-    }
-  }
-
-  // step 3: global and internal routing
-  //  format: @set_color_config(x, y, color, route);
-
-  // export symbol name
-  @export_name("vp", [*]f32, true);
-  @export_name("source", [*]f32, true);
-  @export_name("maxmin_time", [*]f32, true);
-  @export_name("zout", [*]f32, true);
-
-  @export_name("f_activate_comp", fn(u32)void);
-  @export_name("f_prepare_zout", fn()void);
-}
diff --git a/benchmarks/stencil-v2/commands.sh b/benchmarks/stencil-v2/commands.sh
deleted file mode 100755
index 2e38d3d..0000000
--- a/benchmarks/stencil-v2/commands.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-#!/usr/bin/env bash
-
-set -e
-
-cslc ./code_memcpy.csl --fabric-dims=17,12 --fabric-offsets=4,1 \
--o=out_code --params=width:10,height:10,zDim:10,sourceLength:10,dx:20 \
---params=srcX:0,srcY:0,srcZ:0 --verbose --memcpy --channels=1 \
---width-west-buf=0 --width-east-buf=0
-cs_python run.py --name out \
---iterations=10 --dx=20 --skip-compile
diff --git a/benchmarks/stencil-v2/consts.csl b/benchmarks/stencil-v2/consts.csl
deleted file mode 100644
index d51aa53..0000000
--- a/benchmarks/stencil-v2/consts.csl
+++ /dev/null
@@ -1,107 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param pattern: u16;
-param paddedZDim: u16;
-
-const math = @import_module("<math>");
-// We need to allocate space for not just the (padded) size of the problem (in
-// the Z dimension), but also space for ghost cells.
-const zBufferSize = paddedZDim + 2 * (pattern - 1);
-
-fn initBuffer() [2, zBufferSize]f32 {
-  return @zeros([2, zBufferSize]f32);
-}
-
-// Minimig - main.c:15-23, target_3d.c:23, and target_3d.c:30
-fn computeMinimigConsts(dx: u16) [9]f32 {
-  @comptime_assert(pattern == 5);
-  const dx2:f32 = @as(f32, dx * dx);
-  const c0:f32 = -205.0 / 72.0 / dx2;
-  const c1:f32 = 8.0 / 5.0 / dx2;
-  const c2:f32 = -1.0 / 5.0 / dx2;
-  const c3:f32 = 8.0 / 315.0 / dx2;
-  const c4:f32 = -1.0 / 560.0 / dx2;
-
-  return [9]f32 {
-    c4,
-    c3,
-    c2,
-    c1,
-    c0 * 3.0,
-    c1,
-    c2,
-    c3,
-    c4,
-  };
-}
-
-// `computeMinimigConsts()` computes constants in both the positive as well as
-// negative direction of the X, Y, and Z dimensions.  However, for any given
-// axis, our implementation splits communication and computation into two, one
-// for the positive direction and another for the negative direction.  This
-// function extracts the first half of the constants, and optionally includes
-// the center element.
-fn fetchFirstHalfConsts(consts: [2 * pattern - 1]f32, self: bool) [pattern]f32 {
-  var idx: u16 = 0;
-  var result = @zeros([pattern]f32);
-
-  if (!self) {
-    idx += 1;
-  }
-
-  while (idx < pattern) : (idx += 1) {
-    result[idx] = consts[pattern - idx - 1];
-  }
-
-  return result;
-}
-
-fn fetchSecondHalfConsts(consts: [2 * pattern - 1]f32, self: bool) [pattern]f32 {
-  var idx: u16 = 0;
-  var result = @zeros([pattern]f32);
-
-  if (!self) {
-    idx += 1;
-  }
-
-  while (idx < pattern) : (idx += 1) {
-    result[idx] = consts[pattern + idx - 1];
-  }
-
-  return result;
-}
-
-// The sequence in which each PE receives wavelets from its neighbors depends
-// on the relative placement of the PE within each group of `pattern` PEs.  This
-// function reorders the constants to match the sequence of source PE IDs so
-// that we multiply the incoming data with the right constants.
-fn permuteConsts(pattId: u16, originalConsts: [pattern]f32) [pattern]f32 {
-  const start = pattId;
-  var result = @zeros([pattern]f32);
-
-  var idx: u16 = 0;
-  while (idx < pattern) : (idx += 1) {
-    var value: f32 = 0.0;
-    if (start < idx) {
-      value = originalConsts[(start + pattern) - idx];
-    } else {
-      value = originalConsts[start - idx];
-    }
-
-    result[idx] = value;
-  }
-
-  return result;
-}
diff --git a/benchmarks/stencil-v2/ic.py b/benchmarks/stencil-v2/ic.py
deleted file mode 100644
index 63a54fb..0000000
--- a/benchmarks/stencil-v2/ic.py
+++ /dev/null
@@ -1,53 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# This is not a real test, but a module that gets imported in other tests.
-
-import numpy as np
-
-
-def computeGaussianSource(iterations):
-  tau = np.float32(1.0)
-  scale = np.float32(8.0)
-  mscale = np.float32(-8.0)
-  _fmax = np.float32(25.0)
-  dt = np.float32(0.001610153)
-  sigma = np.float32(0.6) * _fmax
-
-  t = np.arange(0, iterations, 1, dtype=np.float32) * np.float32(dt)
-  power = np.power(sigma * t - tau, 2, dtype=np.float32)
-  expf = np.exp(np.multiply(power, np.float32(mscale)))
-  source = (
-      np.float32(-2.0)
-      * scale
-      * sigma
-      * np.multiply(
-          sigma - np.float32(2.0) * sigma * scale * power,
-          expf,
-          dtype=np.float32,
-      )
-  )
-
-  first_zero_idx = np.nonzero(source)[-1][-1] + 1
-  if first_zero_idx < source.shape[-1]:
-    source = source[:first_zero_idx]
-    sourceLength = first_zero_idx
-  else:
-    sourceLength = source.shape[-1]
-
-  print(f"sourceLength = {sourceLength}, first_zero_idx={first_zero_idx}")
-
-  return source, sourceLength
diff --git a/benchmarks/stencil-v2/nop.csl b/benchmarks/stencil-v2/nop.csl
deleted file mode 100644
index 825a1d5..0000000
--- a/benchmarks/stencil-v2/nop.csl
+++ /dev/null
@@ -1,14 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
diff --git a/benchmarks/stencil-v2/oned_exch.csl b/benchmarks/stencil-v2/oned_exch.csl
deleted file mode 100644
index ea5dea9..0000000
--- a/benchmarks/stencil-v2/oned_exch.csl
+++ /dev/null
@@ -1,166 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Colors
-param channel: color;
-
-// Task IDs
-param dataFin_task_id: local_task_id;
-param ctrlFin_task_id: local_task_id;
-
-param pattern: u16;
-param queueId: u16;
-param dir: direction;
-param paddedZDim: u16;
-param senderCount: u16;
-
-param callback: task () void;
-param ctrlCallback: task () void;
-param constants: *const [pattern]f32;
-
-param pos: i16;
-param chunkSize: u16;
-
-const zOffset: i16 = pattern - 1;
-const zBufferSize = paddedZDim + 2 * (pattern - 1);
-
-param zValues: *[2, zBufferSize]f32;
-param buffer: *[4, pattern, chunkSize]f32;
-
-const util = @import_module("util.csl");
-const switches = @import_module("switches.csl");
-
-// Put the PE in receive mode to fetch a single chunk of elements.  Once the
-// (asynchronous) receive operation completes, trigger the callback function.
-fn recvMode() void {
-  const buffDsd = @get_dsd(mem4d_dsd, .{
-    .tensor_access = |i,j|{senderCount, chunkSize} -> buffer[pos, i, j]
-  });
-
-  const fabInDsd = @get_dsd(fabin_dsd, .{
-    .fabric_color = channel,
-    .input_queue = @get_input_queue(4 + queueId),
-    .extent = senderCount * chunkSize,
-  });
-
-  const constDsd = @get_dsd(mem4d_dsd, .{
-    .tensor_access = |i,j|{senderCount, chunkSize} -> constants[i]
-  });
-
-  // Minimig - target_3d.c:4,7,10,13 (or 5,8,11,14) and target_3d.c:30
-  // `vp` is folded into `constants` so this corresponds to one of:
-  // ```
-  // vp * (coefx[1]*(u[IDX3_l(i+1,j,k)]                   ) \
-  //      +coefx[2]*(u[IDX3_l(i+2,j,k)]                   ) \
-  //      +coefx[3]*(u[IDX3_l(i+3,j,k)]                   ) \
-  //      +coefx[4]*(u[IDX3_l(i+4,j,k)]                   ))
-  // ```
-  // or:
-  // ```
-  // vp * (coefx[1]*(                  +u[IDX3_l(i-1,j,k)]) \
-  //      +coefx[2]*(                  +u[IDX3_l(i-2,j,k)]) \
-  //      +coefx[3]*(                  +u[IDX3_l(i-3,j,k)]) \
-  //      +coefx[4]*(                  +u[IDX3_l(i-4,j,k)]))
-  // ```
-  // or:
-  // ```
-  // vp * (coefy[1]*(u[IDX3_l(i,j+1,k)]                   ) \
-  //      +coefy[2]*(u[IDX3_l(i,j+2,k)]                   ) \
-  //      +coefy[3]*(u[IDX3_l(i,j+3,k)]                   ) \
-  //      +coefy[4]*(u[IDX3_l(i,j+4,k)]                   ))
-  // ```
-  // or:
-  // ```
-  // vp * (coefy[1]*(                  +u[IDX3_l(i,j-1,k)]) \
-  //      +coefy[2]*(                  +u[IDX3_l(i,j-2,k)]) \
-  //      +coefy[3]*(                  +u[IDX3_l(i,j-3,k)]) \
-  //      +coefy[4]*(                  +u[IDX3_l(i,j-4,k)]))
-  // ```
-  const props = .{ .async = true, .activate = callback };
-  @fmuls(buffDsd, fabInDsd, constDsd, props);
-}
-
-// The following arrays define values for control wavelets, which update the
-// switch position at the recipient PEs.
-var ctrl0 = [1]u32 { switches.ctrl(switches.firstSwitchCommands(pattern)) };
-var ctrl1 = [1]u32 { switches.ctrl(switches.secondSwitchCommands()) };
-
-// This function is activated when we've finished (asynchronously) sending the
-// `chunkSize` data elements, so now it's time to send the first control
-// wavelet.
-task dataFinTask() void {
-  const fabOutCtrlDsd = @get_dsd(fabout_dsd, .{
-    .extent = 1,
-    .control = true,
-    .fabric_color = channel,
-    .output_queue = @get_output_queue(queueId),
-  });
-
-  const ctrlDsd = @get_dsd(mem1d_dsd, .{
-    .tensor_access = |i|{1} -> ctrl0[i]
-  });
-
-  @mov32(fabOutCtrlDsd, ctrlDsd, .{ .async = true, .activate = ctrlFin_task_id });
-}
-
-// This function is activated after we've finished (asynchronously) sending the
-// first control wavelet, so now we send the second control wavelet.
-task ctrlFinTask() void {
-  const fabOutCtrlDsd = @get_dsd(fabout_dsd, .{
-    .extent = 1,
-    .control = true,
-    .fabric_color = channel,
-    .output_queue = @get_output_queue(queueId),
-  });
-
-  const ctrlDsd = @get_dsd(mem1d_dsd, .{
-    .tensor_access = |i|{1} -> ctrl1[i]
-  });
-
-  const props = .{ .async = true, .activate = ctrlCallback };
-  @mov32(fabOutCtrlDsd, ctrlDsd, props);
-}
-
-comptime {
-  @bind_local_task(dataFinTask, dataFin_task_id);
-  @bind_local_task(ctrlFinTask, ctrlFin_task_id);
-}
-
-// Send data to the appropriate neighbor.  This function accepts an `offset`
-// value, which identifies the next chunk (of `numChunks`) of values to send.
-fn send(iterationCount: u32, offset: i16) void {
-  const fabOutDataDsd = @get_dsd(fabout_dsd, .{
-    .extent = chunkSize,
-    .fabric_color = channel,
-    .output_queue = @get_output_queue(queueId),
-  });
-
-  const props = .{ .async = true, .activate = dataFin_task_id };
-
-  if (iterationCount & 1 == 0) {
-    const __memDsd = @get_dsd(mem1d_dsd, .{
-      .tensor_access = |i|{chunkSize} -> zValues[1, zOffset + i]
-    });
-
-    const memDsd = @increment_dsd_offset(__memDsd, offset, f32);
-    @fmovs(fabOutDataDsd, memDsd, props);
-  } else {
-    const __memDsd = @get_dsd(mem1d_dsd, .{
-      .tensor_access = |i|{chunkSize} -> zValues[0, zOffset + i]
-    });
-
-    const memDsd = @increment_dsd_offset(__memDsd, offset, f32);
-    @fmovs(fabOutDataDsd, memDsd, props);
-  }
-}
diff --git a/benchmarks/stencil-v2/routes.csl b/benchmarks/stencil-v2/routes.csl
deleted file mode 100644
index ab65f0d..0000000
--- a/benchmarks/stencil-v2/routes.csl
+++ /dev/null
@@ -1,126 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-const directions = @import_module("<directions>");
-const util = @import_module("util.csl");
-
-param pattern: u16;
-param peWidth: u16;
-param peHeight: u16;
-
-fn initialSwitchPosition(pattFirst: bool, pattLast: bool) u16 {
-  if (pattern == 2) {
-    if (pattFirst) {
-      return 0;
-    }
-    return 2;
-  }
-  if (pattFirst) {
-    return 1;
-  }
-  if (pattLast) {
-    return 3;
-  }
-  return 0;
-}
-
-fn computeTxDir(dir: direction, isLast: bool) comptime_struct {
-  if (!isLast) {
-    return .{ dir, RAMP };
-  }
-  return .{ RAMP };
-}
-
-// The route when using a single neighbor is quite different from the route for
-// non-unit neighbors.  The next two functions compute routes for these two
-// cases.
-fn twoPatternRoute(dir: direction, pattFirst: bool, pattLast: bool,
-    isLast: bool) comptime_struct {
-  return .{
-    .routes= .{
-        .rx = .{ RAMP },
-        .tx = computeTxDir(dir, isLast),
-    },
-    .switches=.{
-        .pos1 = .{ .tx = RAMP },
-        .pos2 = .{ .rx = directions.flip(dir) },
-        .ring_mode = true,
-        .current_switch_pos = initialSwitchPosition(pattFirst, pattLast),
-        .pop_mode = .{ .always_pop = true },
-    },
-  };
-}
-
-fn genericRoute(dir: direction, notNeedsPos3: bool, pattFirst: bool,
-    pattLast: bool, isLast: bool) comptime_struct {
-  // The first `pattern - 1` PEs always forward, unless they're at the edge.
-  const baseRoute: comptime_struct = .{
-    .routes= .{
-        .rx = .{ directions.flip(dir) },
-        .tx = computeTxDir(dir, isLast),
-    },
-    .switches=.{
-        .pos1 = .{ .rx = RAMP },
-        .pos2 = .{ .tx = RAMP },
-        .ring_mode = true,
-        .pop_mode = .{ .always_pop = true },
-    }
-  };
-
-  if (notNeedsPos3) {
-    return baseRoute;
-  }
-
-  const pos3Route: comptime_struct = .{
-    .routes= .{
-        .rx = .{ directions.flip(dir) },
-        .tx = computeTxDir(dir, isLast),
-    },
-    .switches=.{
-        .pos1 = .{ .rx = RAMP },
-        .pos2 = .{ .tx = RAMP },
-        .pos3 = .{ .rx = directions.flip(dir) },
-        .ring_mode = true,
-        .current_switch_pos = initialSwitchPosition(pattFirst, pattLast),
-        .pop_mode = .{ .always_pop = true },
-    }
-  };
-
-  return pos3Route;
-}
-
-// This is the top-level function for computing the routes and switches.
-fn computeRoute(dir: direction, isFirst: bool, isLast: bool, notNeedsPos3: bool,
-    pattFirst: bool, pattLast: bool) comptime_struct {
-  if (isFirst) {
-    // The first PE only sends, never receives.
-    return .{
-      .routes= .{
-          .rx = .{ RAMP },
-          .tx = .{ dir, RAMP },
-      },
-      .switches=.{
-          .pos1 = .{ .tx = RAMP },
-          .ring_mode = true,
-          .pop_mode = .{ .always_pop = true },
-      }
-    };
-  }
-
-  if (pattern == 2) {
-    return twoPatternRoute(dir, pattFirst, pattLast, isLast);
-  }
-
-  return genericRoute(dir, notNeedsPos3, pattFirst, pattLast, isLast);
-}
diff --git a/benchmarks/stencil-v2/run.py b/benchmarks/stencil-v2/run.py
deleted file mode 100644
index 60db589..0000000
--- a/benchmarks/stencil-v2/run.py
+++ /dev/null
@@ -1,382 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# pylint: disable=too-many-function-args
-
-import struct
-import json
-import os
-import shutil
-import subprocess
-import time
-from glob import glob
-from pathlib import Path
-from typing import List
-from ic import computeGaussianSource
-import numpy as np
-from cmd_parser import parse_args
-
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-SIZE = 10
-ZDIM = 10
-PATTERN = 5
-ITERATIONS = 10
-DX = 20
-arch_default = "wse2"
-# "+5" for infrastructure of memcpy
-# "+2" for a halo of size 1
-FABRIC_WIDTH = SIZE + 2 + 5
-FABRIC_HEIGHT = SIZE + 2
-
-FILE_PATH = os.path.realpath(__file__)
-MEMCPY_DIR = os.path.dirname(FILE_PATH)
-DEPIPELINE_DIR = os.path.dirname(MEMCPY_DIR)
-TEST_DIR = os.path.dirname(DEPIPELINE_DIR)
-HPC_DIR = os.path.dirname(TEST_DIR)
-ROOT_DIR = os.path.dirname(HPC_DIR)
-CSL_DIR = os.path.join(ROOT_DIR, "cslang")
-DRIVER = os.path.join(CSL_DIR, "build") + "/bin/cslc"
-
-
-
-def float_to_hex(f):
-  return hex(struct.unpack('<I', struct.pack('<f', f))[0])
-
-def make_u48(words):
-  return words[0] + (words[1] << 16) + (words[2] << 32)
-
-def sub_ts(words):
-  return make_u48(words[3:]) - make_u48(words[0:3])
-
-
-def cast_uint32(x):
-  if isinstance(x, (np.float16, np.int16, np.uint16)):
-    z = x.view(np.uint16)
-    return np.uint32(z)
-  if isinstance(x, (np.float32, np.int32, np.uint32)):
-    return x.view(np.uint32)
-  if isinstance(x, int):
-    return np.uint32(x)
-  if isinstance(x, float):
-    z = np.float32(x)
-    return z.view(np.uint32)
-
-  raise RuntimeError(f"type of x {type(x)} is not supported")
-
-
-def csl_compile(
-    cslc: str,
-    arch: str,
-    width: int,
-    height: int,
-    core_fabric_offset_x: int, # fabric-offsets of the core
-    core_fabric_offset_y: int,
-    zDim: int,
-    sourceLength: int,
-    dx: int,
-    srcX: int,
-    srcY: int,
-    srcZ: int,
-    fabric_width: int,
-    fabric_height: int,
-    name: str,
-    n_channels: int,
-    width_west_buf: int,
-    width_east_buf: int
-)  -> List[str]:
-  """Generate ELFs for the layout."""
-
-  start = time.time()
-  # CSL Compilation Step
-  args = []
-  args.append(cslc)
-  args.append("code_memcpy.csl")
-  args.append(f"--fabric-dims={fabric_width},{fabric_height}")
-  args.append(f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}")
-  args.append(f"--params=width:{width},height:{height},zDim:{zDim},sourceLength:{sourceLength}")
-  args.append(f"--params=dx:{dx}")
-  args.append(f"--params=srcX:{srcX},srcY:{srcY},srcZ:{srcZ}")
-  args.append("--verbose")
-  args.append(f"-o={name}_code")
-  if arch is not None:
-    args.append(f"--arch={arch}")
-  args.append("--memcpy")
-  args.append(f"--channels={n_channels}")
-  args.append(f"--width-west-buf={width_west_buf}")
-  args.append(f"--width-east-buf={width_east_buf}")
-  print(f"subprocess.check_call(args = {args})")
-  subprocess.check_call(args)
-
-  end = time.time()
-  print(f"Code compiled in {end-start}s")
-
-  elf_paths = glob(f"{name}_code/bin/out_[0-9]*.elf")
-
-  return elf_paths
-
-
-def main():
-  """Main method to run the example code."""
-
-  args = parse_args()
-
-  # Path to the CSLC driver
-  cslc = DRIVER
-  print(f"cslc = {cslc}")
-
-  name = args.name
-  dx = args.dx
-  iterations = args.iterations
-
-  n_channels = args.n_channels
-  width_west_buf = args.width_west_buf
-  width_east_buf = args.width_east_buf
-  print(f"n_channels = {n_channels}")
-  print(f"width_west_buf = {width_west_buf}, width_east_buf = {width_east_buf}")
-
-  source, sourceLength = computeGaussianSource(iterations)
-  print("Gaussian source computed")
-  print(f"sourceLength = {sourceLength}")
-  print(f"source = {source}")
-
-  if args.skip_compile:
-    # Parse the compile metadata
-    with open(f"{name}_code/out.json", encoding="utf-8") as json_file:
-      compile_data = json.load(json_file)
-
-    size = int(compile_data["params"]["width"])
-    zDim = int(compile_data["params"]["zDim"])
-  else:
-    size = args.size
-    zDim = args.zDim
-
-  width = size
-  height = size
-
-  fabric_offset_x = 1
-  fabric_offset_y = 1
-
-  # if WSE is the target, fabric_[width|height] must be the size of WSE
-  if args.fabric_width is not None:
-    fabric_width = args.fabric_width
-  else:
-    fabric_width = fabric_offset_x + 3 + width + 2 + 1 + width_west_buf + width_east_buf
-
-  if args.fabric_height is not None:
-    fabric_height = args.fabric_height
-  else:
-    fabric_height = fabric_offset_y + height + 1
-
-  print(f"width = {width}, height={height}")
-  print(f"fabric_offset_x = {fabric_offset_x}, fabric_offset_y={fabric_offset_y}")
-  print(f"fabric_width = {fabric_width}, fabric_height={fabric_height}")
-
-  assert fabric_width >= (fabric_offset_x + width + 5 + 1 + width_west_buf + width_east_buf)
-  assert fabric_height >= (fabric_offset_y + height + 1)
-
-  srcX = width // 2 - 5
-  srcY = height // 2 - 5
-  srcZ = zDim // 2 - 5
-  assert srcX >= 0
-  assert srcY >= 0
-  assert srcZ >= 0
-  print(f"srcX (x-coordinate of the source) = width/2 - 5  = {srcX}")
-  print(f"srcY (y-coordinate of the source) = height/2 - 5 = {srcY}")
-  print(f"srcZ (z-coordinate of the source) = zdim/2 - 5   = {srcZ}")
-
-  if not args.skip_compile:
-    print("Cleaned up existing elf files before compilation")
-    elf_paths = glob(f"{name}_code_*.elf")
-    for felf in elf_paths:
-      os.remove(felf)
-
-    core_fabric_offset_x = fabric_offset_x + 3 + width_west_buf
-    core_fabric_offset_y = fabric_offset_y
-
-    start = time.time()
-    csl_compile(
-        cslc, arch_default, width, height, core_fabric_offset_x, core_fabric_offset_y,
-        zDim, sourceLength, dx, srcX, srcY, srcZ,
-        fabric_width, fabric_height, name,
-        n_channels,
-        width_west_buf,
-        width_east_buf)
-    end = time.time()
-    print(f"compilation of kernel in {end-start}s")
-  else:
-    print("skip-compile: No compilation, read existing ELFs")
-
-  if args.skip_run:
-    print("skip-run: early return")
-    return
-
-#----------- run the test --------
-
-  # vp[h][w][l] = 10.3703699112
-  vp_all = 10.3703699112
-  vp = np.full(width*height*zDim, vp_all, dtype=np.float32)
-  vp = vp.reshape(height, width, zDim)
-
-  # source_all[h][w][l]
-  source_all = np.zeros(width*height*zDim).reshape(width*height*zDim, 1).astype(np.float32)
-  for tidx in range(sourceLength):
-    #source_all[(srcY, srcX, tidx, 1)] = source[tidx]
-    offset = srcY * width*zDim + srcX * zDim + tidx
-    source_all[offset] = source[tidx]
-  source_all = source_all.reshape(height, width, zDim)
-
-#
-# Step 2: the user creates the SdkRuntime runner
-#
-  memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
-  memcpy_order = MemcpyOrder.ROW_MAJOR
-  dirname = f"{name}_code"
-  runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-  sym_vp = runner.get_id("vp")
-  sym_source = runner.get_id("source")
-  sym_maxmin_time = runner.get_id("maxmin_time")
-  sym_zout = runner.get_id("zout")
-
-  runner.load()
-  runner.run()
-
-  start = time.time()
-#
-# Step 3: The user has to prepare the sequence of H2D/D2H/RPC
-#
-  # H2D vp[h][w][zDim]
-  # vp is h-by-w-by-zDim in row-major
-  runner.memcpy_h2d(sym_vp, vp.ravel(), 0, 0, width, height, zDim,
-                    streaming=False, data_type=memcpy_dtype,
-                    order=memcpy_order, nonblock=False)
-
-  # H2D source[h][w][zDim]
-  runner.memcpy_h2d(sym_source, source_all.ravel(), 0, 0, width, height, zDim,
-                    streaming=False, data_type=memcpy_dtype,
-                    order=memcpy_order, nonblock=False)
-
-  # time marching: call f_activate_comp() to set num iters and start computation
-  runner.launch("f_activate_comp", cast_uint32(iterations), nonblock=False)
-
-  # D2H [h][w][6]
-  maxmin_time_1d = np.zeros(height*width*6, np.float32)
-  runner.memcpy_d2h(maxmin_time_1d, sym_maxmin_time, 0, 0, width, height, 6,
-                    streaming=False, data_type=memcpy_dtype,
-                    order=memcpy_order, nonblock=False)
-  maxmin_time_hwl = maxmin_time_1d.reshape(height, width, 6)
-
-  # prepare zout: call f_prepare_zout()
-  runner.launch("f_prepare_zout", nonblock=False)
-
-  # D2H [h][w][zDim]
-  z_1d = np.zeros(height*width*zDim, np.float32)
-  runner.memcpy_d2h(z_1d, sym_zout, 0, 0, width, height, zDim,
-                    streaming=False, data_type=memcpy_dtype,
-                    order=memcpy_order, nonblock=False)
-  z_hwl = z_1d.reshape(height, width, zDim)
-
-  runner.stop()
-  end = time.time()
-
-  print(f"Run done in {end-start}s")
-
-  if args.cmaddr is None:
-    # move simulation log and core dump to the given folder
-    dst_log = Path(f"{dirname}/sim.log")
-    src_log = Path("sim.log")
-    if src_log.exists():
-      shutil.move(src_log, dst_log)
-
-    dst_trace = Path(f"{dirname}/simfab_traces")
-    src_trace = Path("simfab_traces")
-    if dst_trace.exists():
-      shutil.rmtree(dst_trace)
-    if src_trace.exists():
-      shutil.move(src_trace, dst_trace)
-
-#
-# step 4: verification
-#
-  # D2H(max/min)
-  # d2h_buf_f32[0] = maxValue
-  # d2h_buf_f32[1] = minValue
-  # D2H (timestamps)
-  # d2h_buf_f32[2] = {tscStartBuffer[1], tscStartBuffer[0]}
-  # d2h_buf_f32[3] = {tscEndBuffer[0], tscStartBuffer[2]}
-  # d2h_buf_f32[4] = {tscEndBuffer[2], tscEndBuffer[1]}
-  maxValues_d2h = np.zeros(width*height).reshape(height, width).astype(np.float32)
-  for h in range(height):
-    for w in range(width):
-      maxValues_d2h[(h, w)] = maxmin_time_hwl[(h, w, 0)]
-
-  minValues_d2h = np.zeros(width*height).reshape(height, width).astype(np.float32)
-  for h in range(height):
-    for w in range(width):
-      minValues_d2h[(h, w)] = maxmin_time_hwl[(h, w, 1)]
-
-  computedMax = maxValues_d2h.max()
-  computedMin = minValues_d2h.min()
-  print(f"[computed] min_d2h: {computedMin}, max_d2h: {computedMax}")
-
-  timestamp_d2h = np.zeros(width*height*6).reshape(width, height, 6).astype(np.uint16)
-  for w in range(width):
-    for h in range(height):
-      hex_t0 = int(float_to_hex(maxmin_time_hwl[(h, w, 2)]), base=16)
-      hex_t1 = int(float_to_hex(maxmin_time_hwl[(h, w, 3)]), base=16)
-      hex_t2 = int(float_to_hex(maxmin_time_hwl[(h, w, 4)]), base=16)
-      timestamp_d2h[(w, h, 0)] = hex_t0 & 0x0000ffff
-      timestamp_d2h[(w, h, 1)] = (hex_t0 >> 16) & 0x0000ffff
-      timestamp_d2h[(w, h, 2)] = hex_t1 & 0x0000ffff
-      timestamp_d2h[(w, h, 3)] = (hex_t1 >> 16) & 0x0000ffff
-      timestamp_d2h[(w, h, 4)] = hex_t2 & 0x0000ffff
-      timestamp_d2h[(w, h, 5)] = (hex_t2 >> 16) & 0x0000ffff
-  tsc_tensor_d2h = np.zeros(6).astype(np.uint16)
-  tsc_tensor_d2h[0] = timestamp_d2h[(width-1, 0, 0)]
-  tsc_tensor_d2h[1] = timestamp_d2h[(width-1, 0, 1)]
-  tsc_tensor_d2h[2] = timestamp_d2h[(width-1, 0, 2)]
-  tsc_tensor_d2h[3] = timestamp_d2h[(width-1, 0, 3)]
-  tsc_tensor_d2h[4] = timestamp_d2h[(width-1, 0, 4)]
-  tsc_tensor_d2h[5] = timestamp_d2h[(width-1, 0, 5)]
-
-  print(f"tsc_tensor_d2h = {tsc_tensor_d2h}")
-  cycles = sub_ts(tsc_tensor_d2h)
-  cycles_per_element = cycles / (iterations * zDim)
-  print(f"cycles per element = {cycles_per_element}")
-
-  zMax_d2h = z_hwl.max()
-  zMin_d2h = z_hwl.min()
-  print(f"[computed] zMin_d2h: {zMin_d2h}, zMax_d2h: {zMax_d2h}")
-
-  if zDim == 10 and size == 10 and iterations == 10:
-    print("[verification] w=h=zdim=10, iters = 10, check golden vector")
-    np.testing.assert_allclose(computedMin, -1.3100899, atol=0.01, rtol=0)
-    np.testing.assert_allclose(computedMax, 1200.9414062, atol=0.01, rtol=0)
-    print("\nSUCCESS!")
-  elif zDim == 10 and size == 10 and iterations == 2:
-    print("[verification] w=h=zdim=10, iters = 2, check golden vector")
-    np.testing.assert_allclose(computedMin, -0.0939295, atol=0.01, rtol=0)
-    np.testing.assert_allclose(computedMax, 57.403816, atol=0.01, rtol=0)
-    print("\nSUCCESS!")
-  else:
-    print("Results are not checked for those parameters")
-    assert False
-
-
-if __name__ == "__main__":
-  main()
diff --git a/benchmarks/stencil-v2/switches.csl b/benchmarks/stencil-v2/switches.csl
deleted file mode 100644
index 55e4405..0000000
--- a/benchmarks/stencil-v2/switches.csl
+++ /dev/null
@@ -1,80 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// These are switch commands that tell the receiving router to either do nothing
-// or advance the switch position.
-const sw_nop = 0;
-const sw_adv = 1;
-
-// This function computes the payload field of the control wavelet.  The `cmds`
-// argument contains the switch commands for seven downstream PEs.
-fn ctrl(cmds: [8]u16) u32 {
-  // Tell the router to not forward control wavelets to the PE.
-  const ceFilter = true;
-
-  // Color 31 is special.  It is the hardware's way of saying null color.
-  const colorToActivate = 31;
-
-  var result: u32 = 0;
-  result |= (colorToActivate & 0x1f) << 16;
-
-  result |= @as(u32, cmds[0] & 0b11) << 22;
-  result |= @as(u32, ceFilter) << 24;
-
-  result |= @as(u32, cmds[1] & 0b11) << 25;
-  result |= @as(u32, ceFilter) << 27;
-
-  result |= @as(u32, cmds[2] & 0b11) << 28;
-  result |= @as(u32, ceFilter) << 0;
-
-  result |= @as(u32, cmds[3] & 0b11) << 1;
-  result |= @as(u32, ceFilter) << 3;
-
-  result |= @as(u32, cmds[4] & 0b11) << 4;
-  result |= @as(u32, ceFilter) << 6;
-
-  result |= @as(u32, cmds[5] & 0b11) << 7;
-  result |= @as(u32, ceFilter) << 9;
-
-  result |= @as(u32, cmds[6] & 0b11) << 10;
-  result |= @as(u32, ceFilter) << 12;
-
-  result |= @as(u32, cmds[7] & 0b11) << 13;
-  result |= @as(u32, ceFilter) << 15;
-
-  return result;
-}
-
-// This computes the (first) control wavelet, which advances the switch position
-// at (1) the current PE's router, (2) the next PE's router, and (3) the router
-// of the last PE in the group of `pattern` PEs (if it's different from the next
-// PE).
-fn firstSwitchCommands(pattern: u16) [8]u16 {
-  @comptime_assert(pattern <= 8);
-
-  var cmds = @constants([8]u16, sw_nop);
-  cmds[0] = sw_adv;
-  cmds[1] = sw_adv;
-  cmds[pattern - 1] = sw_adv;
-
-  return cmds;
-}
-
-// Computes the (second) control wavelet, which advances only the current PE's
-// switch position.
-fn secondSwitchCommands() [8]u16 {
-  var cmds = @constants([8]u16, sw_nop);
-  cmds[0] = sw_adv;
-  return cmds;
-}
diff --git a/benchmarks/stencil-v2/task_memcpy.csl b/benchmarks/stencil-v2/task_memcpy.csl
deleted file mode 100644
index d63225c..0000000
--- a/benchmarks/stencil-v2/task_memcpy.csl
+++ /dev/null
@@ -1,804 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-//
-// FD kernel with memcpy
-//
-// The sequence of execution is
-// - H2D(vp) : prepare vp
-// - H2D(source): prepare source
-// - launch(0): trigger time marching
-// - D2H(maxmin_time): record max/min of zValues and time stamps
-// - launch(1): prepare zout which is either zValues[0, zOffset] or zValues[1, zOffset]
-// - D2H(zout)
-//
-param memcpyParams: comptime_struct;
-
-// Colors
-param eastChannel:  color;
-param westChannel:  color;
-param northChannel: color;
-param southChannel: color;
-
-// Task IDs
-param COMP: local_task_id; // start time marching
-param send: local_task_id;
-
-param eastFin:  local_task_id;
-param westFin:  local_task_id;
-param northFin: local_task_id;
-param southFin: local_task_id;
-
-param eastDataFin:  local_task_id;
-param westDataFin:  local_task_id;
-param northDataFin: local_task_id;
-param southDataFin: local_task_id;
-
-param eastCtrlFin:  local_task_id;
-param westCtrlFin:  local_task_id;
-param northCtrlFin: local_task_id;
-param southCtrlFin: local_task_id;
-
-param eastCtrlFin2:  local_task_id;
-param westCtrlFin2:  local_task_id;
-param northCtrlFin2: local_task_id;
-param southCtrlFin2: local_task_id;
-
-param _px: i16;
-
-param isTscOutPe: bool;
-
-param zDim: i16;
-param pattern: u16;
-param isSourcePe: bool;
-param sourceLength: u32;
-param dx: u16;
-param width: u16;
-param height: u16;
-param srcZ: u16;
-
-// The code supports receiving along the 4 cardinal directions only.
-// Anisotropy would require "diagonal" broadcasts.
-const directionCount: u16 = 4;
-
-const timestamp = @import_module("<time>");
-var tscEndBuffer = @zeros([timestamp.tsc_size_words]u16);
-var tscStartBuffer = @zeros([timestamp.tsc_size_words]u16);
-
-var iterations: u32 = 0;
-
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpyParams);
-
-//
-// FD uses input_queue = 4,5,6,7
-// oned_exch.csl:    .input_queue = 4 + queueId,
-//                   .output_queue = queueId
-// task_memcpy.csl:  .queueId = 0,
-// task_memcpy.csl:  .queueId = 1,
-// task_memcpy.csl:  .queueId = 2,
-// task_memcpy.csl:  .queueId = 3,
-//
-// so memcpyH2D with input_queue = 0 does not collide with the others.
-// The D2H uses output_queue = 0.
-// Using output_queue = 0 should not cause any problem
-// because multiple colors can share the same output_queue.
-//
-
-
-const zOffset: i16 = pattern - 1;
-const math = @import_module("<math>");
-
-var recvChunkCounter: i16 = 0;
-var sendChunkCounter: i16 = 0;
-
-const util = @import_module("util.csl");
-
-const numChunks = util.computeChunks(zDim);
-const chunkSize = util.computeChunkSize(zDim, @as(u16, numChunks));
-const paddedZDim = chunkSize * @as(u16, numChunks);
-
-const routes = @import_module("routes.csl", .{
-  .pattern = pattern,
-  .peWidth = width,
-  .peHeight = height,
-});
-
-const consts = @import_module("consts.csl", .{
-  .pattern = pattern,
-  .paddedZDim = paddedZDim,
-});
-
-const xConsts = consts.computeMinimigConsts(dx);
-const yConsts = consts.computeMinimigConsts(dx);
-const zConsts = consts.computeMinimigConsts(dx);
-
-// The `zValues` array determines the seed value of the program.  For now, we
-// use all zeros to match the reference code.
-var zValues = consts.initBuffer();
-var vp = @zeros([zDim]f32);
-
-//var source = @zeros([sourceLength]f32);
-var source = @zeros([zDim]f32);
-
-
-//--- MEMCPY
-const dummy = @zeros([1]f32);
-
-// d2h_buf_f32[0] = max(zValues)
-// d2h_buf_f32[1] = min(zValues)
-// d2h_buf_f32[2:4] = timestamps
-var d2h_buf_f32 = @zeros([5]f32);
-
-// temporary array to hold either zValues[0, zOffset] or zValues[1, zOffset]
-var zout = @zeros([zDim]f32);
-
-
-// WARNING: export pointers, not arrays
-var ptr_vp : [*]f32 = &vp;
-var ptr_source : [*]f32 = &source;
-var ptr_d2h_buf_f32 : [*]f32 = &d2h_buf_f32;
-var ptr_zout : [*]f32 = &zout;
-
-
-var mem_z_buf_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{zDim} -> dummy[i] });
-var mem_zout_buf_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{zDim} -> zout[i] });
-
-//--- END MEMCPY
-
-
-// Broadcast-related parameters
-param westFirst: bool;
-param westLast: bool;
-param westPatternId: u16;
-param westNotNeedsPos3: bool;
-param westPatternFirst: bool;
-param westPatternLast: bool;
-param westSenderCount: u16;
-
-param eastFirst: bool;
-param eastLast: bool;
-param eastPatternId: u16;
-param eastNotNeedsPos3: bool;
-param eastPatternFirst: bool;
-param eastPatternLast: bool;
-param eastSenderCount: u16;
-
-param northFirst: bool;
-param northLast: bool;
-param northPatternId: u16;
-param northNotNeedsPos3: bool;
-param northPatternFirst: bool;
-param northPatternLast: bool;
-param northSenderCount: u16;
-
-param southFirst: bool;
-param southLast: bool;
-param southPatternId: u16;
-param southNotNeedsPos3: bool;
-param southPatternFirst: bool;
-param southPatternLast: bool;
-param southSenderCount: u16;
-
-// Since our code essentially uses the same communication code with parameters
-// for the direction of the communication, we compute the subset of constants
-// that will be used by each instance of the communication code.  The boolean
-// value to `fetch*Consts()` function specifies whether the constant for the
-// element at the center should be included or not.  Since we want to include
-// the center element only once, we pass `true` only for the _first_ invocation
-// of this function, while all other values are false.
-const eastConsts = consts.fetchFirstHalfConsts(xConsts, true);
-const permutedEastConsts = consts.permuteConsts(eastPatternId, eastConsts);
-
-const westConsts = consts.fetchSecondHalfConsts(xConsts, false);
-const permutedWestConsts = consts.permuteConsts(westPatternId, westConsts);
-
-const southConsts = consts.fetchFirstHalfConsts(yConsts, false);
-const permutedSouthConsts = consts.permuteConsts(southPatternId, southConsts);
-
-const northConsts = consts.fetchSecondHalfConsts(yConsts, false);
-const permutedNorthConsts = consts.permuteConsts(northPatternId, northConsts);
-
-var accumulator = @zeros([paddedZDim]f32);
-var buffer = @zeros([directionCount, pattern, chunkSize]f32);
-
-// We import a module that is parameterized on the direction of the
-// communication.  The following module handles eastward communication.
-const eastBus = @import_module("oned_exch.csl", .{
-  .zValues = &zValues,
-  .buffer = &buffer,
-  .pattern = pattern,
-  .chunkSize = chunkSize,
-  .paddedZDim = paddedZDim,
-
-  .pos = 0,
-  .dir = EAST,
-  .queueId = 0,
-  .dataFin_task_id = eastDataFin,
-  .ctrlFin_task_id = eastCtrlFin,
-  .channel = eastChannel,
-  .callback = eastFinTask,
-  .senderCount = eastSenderCount,
-  .ctrlCallback = eastCtrlFinTask,
-  .constants = &permutedEastConsts,
-});
-
-const westBus = @import_module("oned_exch.csl", .{
-  .zValues = &zValues,
-  .buffer = &buffer,
-  .pattern = pattern,
-  .chunkSize = chunkSize,
-  .paddedZDim = paddedZDim,
-
-  .pos = 1,
-  .dir = WEST,
-  .queueId = 1,
-  .dataFin_task_id = westDataFin,
-  .ctrlFin_task_id = westCtrlFin,
-  .channel = westChannel,
-  .callback = westFinTask,
-  .senderCount = westSenderCount,
-  .ctrlCallback = westCtrlFinTask,
-  .constants = &permutedWestConsts,
-});
-
-const southBus = @import_module("oned_exch.csl", .{
-  .zValues = &zValues,
-  .buffer = &buffer,
-  .pattern = pattern,
-  .chunkSize = chunkSize,
-  .paddedZDim = paddedZDim,
-
-  .pos = 2,
-  .dir = SOUTH,
-  .queueId = 2,
-  .dataFin_task_id = southDataFin,
-  .ctrlFin_task_id = southCtrlFin,
-  .channel = southChannel,
-  .callback = southFinTask,
-  .senderCount = southSenderCount,
-  .ctrlCallback = southCtrlFinTask,
-  .constants = &permutedSouthConsts,
-});
-
-const northBus = @import_module("oned_exch.csl", .{
-  .zValues = &zValues,
-  .buffer = &buffer,
-  .pattern = pattern,
-  .chunkSize = chunkSize,
-  .paddedZDim = paddedZDim,
-
-  .pos = 3,
-  .dir = NORTH,
-  .queueId = 3,
-  .dataFin_task_id = northDataFin,
-  .ctrlFin_task_id = northCtrlFin,
-  .channel = northChannel,
-  .callback = northFinTask,
-  .senderCount = northSenderCount,
-  .ctrlCallback = northCtrlFinTask,
-  .constants = &permutedNorthConsts,
-});
-
-var sendCount: u16 = 0;
-var recvCount: u16 = 0;
-var iterationCount: u32 = 0;
-
-var maxValue: f32 = 0.0;
-var minValue: f32 = 0.0;
-
-const accDsd = @get_dsd(mem1d_dsd, .{
-  .tensor_access = |i|{zDim} -> accumulator[i]
-});
-
-const vpDsd = @get_dsd(mem1d_dsd, .{
-  .tensor_access = |k|{zDim} -> vp[k]
-});
-
-const zValuesDsd0 = @get_dsd(mem1d_dsd, .{
-  .tensor_access = |i|{zDim} -> zValues[0, zOffset + i]
-});
-
-const zValuesDsd1 = @get_dsd(mem1d_dsd, .{
-  .tensor_access = |i|{zDim} -> zValues[1, zOffset + i]
-});
-
-// This function is called when the program completes communication in any one
-// of the east, west, north, and south directions.
-fn recvFin() void {
-  recvCount += 1;
-
-  // Don't proceed until we've finished communicating in _all_ four directions.
-  if (recvCount != directionCount) {
-    return;
-  }
-
-  recvCount = 0;
-
-  // Each direction's communication module writes to a separate chunk of the
-  // buffer, so the following function call performs a sum reduction across all
-  // of these chunks.  This enables us to reuse this buffer for the next round
-  // of `chunkSize` communication without forcing us to allocate one large
-  // buffer for all chunks and for all four directions, which may require more
-  // memory than is available at any given PE.
-  reduceBuffer(recvChunkCounter * @as(i16, chunkSize));
-
-  // The above code multiplies the source data with constants for neighbors in
-  // the X and Y dimension, but we still need to multiply with the right
-  // constants in the Z dimension.  Here, we keep track of the number of chunks
-  // we've received so that we know when to start computing over the Z dim.
-  recvChunkCounter += 1;
-
-  // Note the difference in branch predicates below.  We want to continue
-  // receiving until we've received `chunkSize` values `numChunks` number of
-  // times.  However, the condition for calling `epilog()`, which processes
-  // values in the Z dimension, checks whether we've finished _sending_.  This
-  // way, we ensure that _both_ the sending and receiving code is fully complete
-  // before we begin further processing.  This also ensures that only _one_ of
-  // the `recvFin()` or `sendFin()` functions calls the `epilog()` code.
-  if (recvChunkCounter != numChunks) {
-    // Set the PE to again receive `chunkSize` values from all four directions.
-    startReceiving();
-  } else if (sendChunkCounter == numChunks) {
-    // Remainder tasks after exchanging data in all four directions.
-    epilog();
-  }
-}
-
-// Just like the code to receive `chunkSize` elements needs to be called for the
-// total number of chunks, the sending code is also called multiple times so
-// that each call sends `chunkSize` elements to its neighbors.
-fn sendFin() void {
-  sendCount += 1;
-
-  // Don't proceed until we've finished sending to all four neighbors.
-  if (sendCount != directionCount) {
-    return;
-  }
-
-  sendCount = 0;
-  sendChunkCounter += 1;
-
-  // Note the difference in branch predicates below.  We want to continue
-  // sending until we've sent `chunkSize` values `numChunks` number of times.
-  // However, the condition for calling `epilog()`, which processes values in
-  // the Z dimension, checks whether we've finished _receiving_.  This way, we
-  // ensure that _both_ the sending and receiving code is fully complete before
-  // we begin further processing.  This also ensures that only _one_ of the
-  // `recvFin()` or `sendFin()` functions calls the `epilog()` code.
-  if (sendChunkCounter != numChunks) {
-    startSending(sendChunkCounter * @as(i16, chunkSize));
-  } else if (recvChunkCounter == numChunks) {
-    // Remainder tasks after exchanging data in all four directions.
-    epilog();
-  }
-}
-
-
-fn epilog() void {
-  // Multiply shifted versions of zValues with various constants, before
-  // accumulating them into `accumulator`.
-  scaleWithZConsts();
-
-  // Multiply by the velocity field vp
-  //
-  // Minimig - target_3d.c:30
-  // vp[IDX3(i,j,k)]*lap
-  //
-  @fmuls(accDsd, accDsd, vpDsd);
-
-  // Add 2x the value of the previous iteration (referred to as `u`) then
-  // subtract the value from two iterations ago (referred to as `v`).
-  // Since we want to keep track of values for _two_ iterations and not
-  // just the previous iterations, we toggle between `zValues[0, :]`
-  // and `zValues[1, :]`.
-  //
-  // Minimig - target_3d.c:30
-  // If iterationCount is even, `zValues[0, :]` contains `v[IDX3_l(i,j,k)]`
-  // and `zValues[1, :]` contains `2.f*u[IDX3_l(i,j,k)]+vp[IDX3(i,j,k)]*lap`
-  // (and vice-versa if iterationCount is odd).
-  // This operation corresponds to `-v[IDX3_l(i,j,k)]` in:
-  // ```
-  // v[IDX3_l(i,j,k)] = 2.f*u[IDX3_l(i,j,k)]-v[IDX3_l(i,j,k)]+vp[IDX3(i,j,k)]*lap;
-  // ```
-  if (iterationCount & 1 == 0) {
-    //add 2u
-    @fmacs(accDsd, accDsd, zValuesDsd1, 2.0);
-    @fsubs(zValuesDsd0, accDsd, zValuesDsd0);
-  } else {
-    //add 2u
-    @fmacs(accDsd, accDsd, zValuesDsd0, 2.0);
-    @fsubs(zValuesDsd1, accDsd, zValuesDsd1);
-  }
-
-  // At this point, we've finished a single iteration's computation.  We now add
-  // the gaussian value to the wavefield, assuming this is the appropriate PE.
-  //
-  // Minimig - main.c:203 and data_setup.c:21-31
-  // ```
-  // kernel_add_source(grid, v, source, istep, sx, sy, sz);
-  // ```
-  //
-  if (iterationCount < sourceLength) {
-    if (isSourcePe) {
-      const thisIterationIdx = iterationCount & 1;
-      const offset = @as(u16, zOffset) + srcZ;
-      zValues[thisIterationIdx, offset] += source[iterationCount];
-    }
-  }
-
-  iterationCount += 1;
-
-  // Are we done yet?  If not, start the next iteration by triggering the send
-  // operation.
-  if (iterationCount < iterations) {
-    @activate(send);
-  } else {
-    // Now that we've finished executing the program, we have to perform three
-    // things:
-    // ref: hpc_apps/src/cslang/fd/task.csl
-    // 1. Record the value of the timestamp counter, so that the host can
-    // compute the difference and determine the number of cycles per element.
-
-    // 2. Compute the minimum and maximum value of the wavefield for each PE's
-    // local data, so that the host can simply compute the min and max of these
-    // (reduced) values instead of computing the min and max over the entire
-    // wavefield.
-
-    // 3. Assuming this is the top-right PE, send the timestamp values
-    f_checkpoint();
-  }
-}
-
-
-// This function computes the maximum of the computed result.  It switches
-// between the two `zValues` buffers depending on the executed iteration count.
-//
-// Minimig - data_setup.cc:49
-fn computeMaxValue() f32 {
-  var maxValue:f32 = math.NEGATIVE_INF_f32;
-  const lastIterationIdx = 1 - (iterationCount & 1);
-
-  if (lastIterationIdx == 0) {
-    @fmaxs(&maxValue, maxValue, zValuesDsd0);
-  } else {
-    @fmaxs(&maxValue, maxValue, zValuesDsd1);
-  }
-
-  return maxValue;
-}
-
-// This function computes the _minimum_ of the computed result.  Since there is
-// no instruction for computing the minimum and because we want to use DSDs
-// (instead of a software loop), we first negate the result, compute the
-// maximum, and negate the computed maximum (before negating the source values
-// again so as to make this operation idempotent).
-//
-// Minimig - data_setup.cc:48
-fn computeMinValue() f32 {
-  var minValue:f32 = math.NEGATIVE_INF_f32;
-  const lastIterationIdx = 1 - (iterationCount & 1);
-
-  if (lastIterationIdx == 0) {
-    @fnegs(zValuesDsd0, zValuesDsd0);
-    @fmaxs(&minValue, minValue, zValuesDsd0);
-    @fnegs(zValuesDsd0, zValuesDsd0);
-  } else {
-    @fnegs(zValuesDsd1, zValuesDsd1);
-    @fmaxs(&minValue, minValue, zValuesDsd1);
-    @fnegs(zValuesDsd1, zValuesDsd1);
-  }
-
-  return -minValue;
-}
-
-// The following are tasks that are activated when (asynchronous) send and
-// receive operations in various directions complete.  Each task funnels to
-// either the `recvFin()` or the `sendFin()` function.  While it may _seem_
-// better to activate just one task instead of four, we cannot do so since the
-// hardware does not queue activations (instead, the hardware uses a single bit
-// to track task activations).  Thus, depending on the sequence of task
-// activations and executions, activating a task multiple times does not
-// guarantee that the task will execute multiple times.
-task eastFinTask() void {
-  recvFin();
-}
-
-task westFinTask() void {
-  recvFin();
-}
-
-task southFinTask() void {
-  recvFin();
-}
-
-task northFinTask() void {
-  recvFin();
-}
-
-task eastCtrlFinTask() void {
-  sendFin();
-}
-
-task westCtrlFinTask() void {
-  sendFin();
-}
-
-task southCtrlFinTask() void {
-  sendFin();
-}
-
-task northCtrlFinTask() void {
-  sendFin();
-}
-
-fn scaleWithZConsts() void {
-  @comptime_assert(pattern == 5);
-
-  // Ideally, we would express the following statements in a loop.  Since the
-  // loop bound is comptime-known, the compiler would then unroll the loop for
-  // us.  However, the current version of the compiler lacks the ability to
-  // unroll loops if the bounds are comptime-known, so the following code is the
-  // manually-unrolled version of the loop over `2 * pattern - 1`.
-  //
-  // Minimig - target_3d.c:3,6,9,12,15 and target_3d.c:30
-  // `vp` and `2u` are folded into `zConsts` so this corresponds to:
-  // ```
-  //  2.f*u[IDX3_l(i,j,k)] + vp * (coef0*u[IDX3_l(i,j,k)] \
-  //    +coefz[1]*(u[IDX3_l(i,j,k+1)]+u[IDX3_l(i,j,k-1)]) \
-  //    +coefz[2]*(u[IDX3_l(i,j,k+2)]+u[IDX3_l(i,j,k-2)]) \
-  //    +coefz[3]*(u[IDX3_l(i,j,k+3)]+u[IDX3_l(i,j,k-3)]) \
-  //    +coefz[4]*(u[IDX3_l(i,j,k+4)]+u[IDX3_l(i,j,k-4)]))
-  if (iterationCount & 1 != 0) {
-    const srcZ = @get_dsd(mem1d_dsd, .{
-      .tensor_access = |i|{zDim} -> zValues[0, i]
-    });
-    @fmacs(accDsd, accDsd, srcZ, zConsts[0]);
-
-    const srcZ1 = @increment_dsd_offset(srcZ, 1, f32);
-    @fmacs(accDsd, accDsd, srcZ1, zConsts[1]);
-
-    const srcZ2 = @increment_dsd_offset(srcZ, 2, f32);
-    @fmacs(accDsd, accDsd, srcZ2, zConsts[2]);
-
-    const srcZ3 = @increment_dsd_offset(srcZ, 3, f32);
-    @fmacs(accDsd, accDsd, srcZ3, zConsts[3]);
-
-    const srcZ5 = @increment_dsd_offset(srcZ, 5, f32);
-    @fmacs(accDsd, accDsd, srcZ5, zConsts[5]);
-
-    const srcZ6 = @increment_dsd_offset(srcZ, 6, f32);
-    @fmacs(accDsd, accDsd, srcZ6, zConsts[6]);
-
-    const srcZ7 = @increment_dsd_offset(srcZ, 7, f32);
-    @fmacs(accDsd, accDsd, srcZ7, zConsts[7]);
-
-    const srcZ8 = @increment_dsd_offset(srcZ, 8, f32);
-    @fmacs(accDsd, accDsd, srcZ8, zConsts[8]);
-  } else {
-    const srcZ = @get_dsd(mem1d_dsd, .{
-      .tensor_access = |i|{zDim} -> zValues[1, i]
-    });
-    @fmacs(accDsd, accDsd, srcZ, zConsts[0]);
-
-    const srcZ1 = @increment_dsd_offset(srcZ, 1, f32);
-    @fmacs(accDsd, accDsd, srcZ1, zConsts[1]);
-
-    const srcZ2 = @increment_dsd_offset(srcZ, 2, f32);
-    @fmacs(accDsd, accDsd, srcZ2, zConsts[2]);
-
-    const srcZ3 = @increment_dsd_offset(srcZ, 3, f32);
-    @fmacs(accDsd, accDsd, srcZ3, zConsts[3]);
-
-    const srcZ5 = @increment_dsd_offset(srcZ, 5, f32);
-    @fmacs(accDsd, accDsd, srcZ5, zConsts[5]);
-
-    const srcZ6 = @increment_dsd_offset(srcZ, 6, f32);
-    @fmacs(accDsd, accDsd, srcZ6, zConsts[6]);
-
-    const srcZ7 = @increment_dsd_offset(srcZ, 7, f32);
-    @fmacs(accDsd, accDsd, srcZ7, zConsts[7]);
-
-    const srcZ8 = @increment_dsd_offset(srcZ, 8, f32);
-    @fmacs(accDsd, accDsd, srcZ8, zConsts[8]);
-  }
-}
-
-fn reduceBuffer(offset: i16) void {
-  const bufferDsd = @get_dsd(mem4d_dsd, .{
-    .tensor_access = |i,j,k|{directionCount, pattern, chunkSize} -> buffer[i, j, k]
-  });
-
-  const accumulatorDsd = @get_dsd(mem4d_dsd, .{
-    .tensor_access = |i,j,k|{directionCount, pattern, chunkSize} -> accumulator[k]
-  });
-
-  // Minimig - target_3d.c:4-14
-  // This corresponds to the sum between each component of the laplacian
-  // over x and y (buffer contains data received from each neighbor in all
-  // 4 cardinal directions)
-  const dstDsd = @increment_dsd_offset(accumulatorDsd, offset, f32);
-  @fadds(dstDsd, dstDsd, bufferDsd);
-}
-
-fn startReceiving() void {
-  // Put the PE in the receive mode for all four directions.
-  eastBus.recvMode();
-  westBus.recvMode();
-  southBus.recvMode();
-  northBus.recvMode();
-}
-
-fn startSending(offset: i16) void {
-  // Asynchronously send data to neighbors in all four directions.
-  eastBus.send(iterationCount, offset);
-  westBus.send(iterationCount, offset);
-  southBus.send(iterationCount, offset);
-  northBus.send(iterationCount, offset);
-}
-
-fn startExchange() void {
-  // Reset the chunk counters since we will be exchanging all chunks now.
-  sendChunkCounter = 0;
-  recvChunkCounter = 0;
-
-  // We first need to put the PEs in receive mode before sending local data.
-  // Starts Laplacian receive and multiplies on the fly for all 4 directions
-  startReceiving();
-  // Sends data from the previous iterations along all 4 directions
-  startSending(0);
-}
-
-task sendTask() void {
-  // zero out the accumulation buffer
-  @fmovs(accDsd, 0.0);
-  startExchange();
-}
-
-
-
-//----[MEMCPY]
-
-// iteration count, we start the timer and trigger the broadcast of the source
-// data to all the PE's neighbors.  A side effect of this design is that running
-// the code with a different iteration count simply requires sending a new
-// wavelet (with the new iteration count) from the host.
-//
-task f_comp() void {
-
-  // WARNING: iterations is received by fn f_activate_comp called by
-  // RPC mechanism
-
-  timestamp.enable_tsc();
-  timestamp.get_timestamp(&tscStartBuffer);
-  @activate(send);
-}
-//--- END MEMCPY
-
-comptime {
-  @bind_local_task(sendTask, send);
-
-  @bind_local_task(eastFinTask, eastFin);
-
-  const eastRoute = routes.computeRoute(EAST, eastFirst, eastLast,
-      eastNotNeedsPos3, eastPatternFirst, eastPatternLast);
-  @set_local_color_config(eastChannel, eastRoute);
-
-  @bind_local_task(westFinTask, westFin);
-
-  const westRoute = routes.computeRoute(WEST, westFirst, westLast,
-      westNotNeedsPos3, westPatternFirst, westPatternLast);
-  @set_local_color_config(westChannel, westRoute);
-
-  @bind_local_task(southFinTask, southFin);
-
-  const southRoute = routes.computeRoute(SOUTH, southFirst, southLast,
-      southNotNeedsPos3, southPatternFirst, southPatternLast);
-  @set_local_color_config(southChannel, southRoute);
-
-  @bind_local_task(northFinTask, northFin);
-
-  const northRoute = routes.computeRoute(NORTH, northFirst, northLast,
-      northNotNeedsPos3, northPatternFirst, northPatternLast);
-  @set_local_color_config(northChannel, northRoute);
-
-  @bind_local_task(eastCtrlFinTask, eastCtrlFin2);
-  @bind_local_task(westCtrlFinTask, westCtrlFin2);
-  @bind_local_task(northCtrlFinTask, northCtrlFin2);
-  @bind_local_task(southCtrlFinTask, southCtrlFin2);
-}
-
-
-//----[MEMCPY]
-
-// time marching is done, epilog calls f_checkpoint
-// 1. recrod time stamps
-// 2. compute max and min of zValues
-// 3. prepare max, min and time stamps
-fn f_checkpoint() void {
-
-  // 1. Record the value of the timestamp counter, so that the host can
-  // compute the difference and determine the number of cycles per element.
-  timestamp.get_timestamp(&tscEndBuffer);
-  timestamp.disable_tsc();
-
-  // 2. Compute the minimum and maximum value of the wavefield for each PE's
-  // local data, so that the host can simply compute the min and max of these
-  // (reduced) values instead of computing the min and max over the entire
-  // wavefield.
-  maxValue = computeMaxValue();
-  minValue = computeMinValue();
-
-  // 3. prepares d2h_buf_f32[0:4]
-  // D2H max/min
-  d2h_buf_f32[0] = maxValue;
-  d2h_buf_f32[1] = minValue;
-
-  // D2H (timestamps)
-  // d2h_buf_f32[2] = {tscStartBuffer[1], tscStartBuffer[0]}
-  // d2h_buf_f32[3] = {tscEndBuffer[0], tscStartBuffer[2]}
-  // d2h_buf_f32[4] = {tscEndBuffer[2], tscEndBuffer[1]}
-  var lo_ : u16 = 0;
-  var hi_ : u16 = 0;
-  var word : u32 = 0;
-
-  lo_ = tscStartBuffer[0];
-  hi_ = tscStartBuffer[1];
-  d2h_buf_f32[2] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
-
-  lo_ = tscStartBuffer[2];
-  hi_ = tscEndBuffer[0];
-  d2h_buf_f32[3] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
-
-  lo_ = tscEndBuffer[1];
-  hi_ = tscEndBuffer[2];
-  d2h_buf_f32[4] = @bitcast(f32, (@as(u32,hi_) << @as(u16,16)) | @as(u32, lo_) );
-
-  // WARNING: the user must unblock cmd color for every PE
-  sys_mod.unblock_cmd_stream();
-}
-
-// set number of iterations and activate f_comp task
-fn f_activate_comp(iter_cnt: u32) void {
-  iterations = iter_cnt;
-  @activate(COMP);
-}
-
-// copy zValues to zout such that D2H can output zout
-fn f_prepare_zout() void {
-  // toggle = 1 - (iterations % 2)
-  var toggle: i32 = 1 - (@as(i32,iterations) % 2);
-  if (0 == toggle){
-    mem_z_buf_dsd = @set_dsd_base_addr(mem_z_buf_dsd, @ptrcast([*]f32, &(zValues[0, zOffset])));
-  }else{
-    mem_z_buf_dsd = @set_dsd_base_addr(mem_z_buf_dsd, @ptrcast([*]f32, &(zValues[1, zOffset])));
-  }
-  @mov32(mem_zout_buf_dsd, mem_z_buf_dsd);
-
-  // WARNING: the user must unblock cmd color for every PE
-  sys_mod.unblock_cmd_stream();
-}
-
-comptime {
-    @comptime_assert( sourceLength <= @as(u32,zDim));
-
-    @bind_local_task(f_comp, COMP);
-
-    @export_symbol(ptr_vp, "vp");
-    @export_symbol(ptr_source, "source");
-    @export_symbol(ptr_d2h_buf_f32, "maxmin_time");
-    @export_symbol(ptr_zout, "zout");
-
-    @export_symbol(f_activate_comp);
-    @export_symbol(f_prepare_zout);
-}
diff --git a/benchmarks/stencil-v2/util.csl b/benchmarks/stencil-v2/util.csl
deleted file mode 100644
index 88ebdc8..0000000
--- a/benchmarks/stencil-v2/util.csl
+++ /dev/null
@@ -1,52 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-fn min(a: u16, b: u16) u16 {
-  if (a < b) {
-    return a;
-  }
-  return b;
-}
-
-fn computeRelativePeId(peId: u16, peCount: u16, dir: direction) u16 {
-  if (dir == EAST or dir == SOUTH) {
-    return peId;
-  }
-  if (dir == WEST or dir == NORTH) {
-    return peCount - peId - 1;
-  }
-  @comptime_assert(false);
-}
-
-fn computeChunks(zDim: u16) i16 {
-  // We observe that given the memory consumption of the program on chip, the
-  // maximum number of Z-dimension values that we can allocate on chip is about
-  // 400.  The following line splits the requested Z dimension into multiple
-  // chunks of the requested size exceeds 400.
-  return 1 + zDim / 401;
-}
-
-fn computeChunkSize(zDim: u16, numChunks: u16) u16 {
-  // If the number of chunks cleanly divides the number of elements in the Z
-  // dimension, then use the result of the division as the size of the chunks.
-  if (zDim % numChunks == 0) {
-    return zDim / numChunks;
-  }
-
-  // Otherwise, bump up the chunk size by one.  Note that increasing the chunk
-  // size by one is better than increasing the number of chunks by one, since
-  // each new chunk introduces a non-trivial overhead due to the need to perform
-  // another round of communication with each neighbor.
-  return 1 + zDim / numChunks;
-}
diff --git a/benchmarks/wide-multiplication/commands.sh b/benchmarks/wide-multiplication/commands_wse2.sh
similarity index 100%
rename from benchmarks/wide-multiplication/commands.sh
rename to benchmarks/wide-multiplication/commands_wse2.sh
diff --git a/benchmarks/wide-multiplication/commands_wse3.sh b/benchmarks/wide-multiplication/commands_wse3.sh
new file mode 100755
index 0000000..b49818e
--- /dev/null
+++ b/benchmarks/wide-multiplication/commands_wse3.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./code.csl --fabric-dims=8,3 --fabric-offsets=4,1 -o out \
+--params=num_bits:256 --params=MEMCPYH2D_DATA_1_ID:0 \
+--params=MEMCPYD2H_DATA_1_ID:1 \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/gemv-01-complete-program/commands.sh b/tutorials/gemv-01-complete-program/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-01-complete-program/commands.sh
rename to tutorials/gemv-01-complete-program/commands_wse2.sh
diff --git a/tutorials/topic-15-wse3-microthreads/commands.sh b/tutorials/gemv-01-complete-program/commands_wse3.sh
similarity index 67%
rename from tutorials/topic-15-wse3-microthreads/commands.sh
rename to tutorials/gemv-01-complete-program/commands_wse3.sh
index 81bde01..f454b79 100755
--- a/tutorials/topic-15-wse3-microthreads/commands.sh
+++ b/tutorials/gemv-01-complete-program/commands_wse3.sh
@@ -2,6 +2,6 @@
 
 set -e
 
-cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
 --fabric-offsets=4,1 -o out --memcpy --channels 1
 cs_python run.py --name out
diff --git a/tutorials/gemv-02-memory-dsds/README.rst b/tutorials/gemv-02-memory-dsds/README.rst
index 6da42d3..bdf02ef 100644
--- a/tutorials/gemv-02-memory-dsds/README.rst
+++ b/tutorials/gemv-02-memory-dsds/README.rst
@@ -8,9 +8,6 @@ performing operations on entire tensors.
 This program creates three one-dimensional memory DSDs for accessing ``A``,
 ``b``, and ``y``, each of which specifies how to loop over the respective
 arrays.
-The ``tensor_access`` field specifies an induction variable, a loop bound,
-and an affine expression (i.e., a linear function plus a constant) to generate
-various addresses at runtime.
 
 ``b_dsd`` and ``y_dsd`` access the ``M`` contiguous elements of ``b`` and ``y``,
 respectively.
@@ -18,6 +15,15 @@ respectively.
 Because ``A`` is stored in row major format, this means that ``A_dsd``
 initially accesses the 0th column of ``A``.
 
+This program demonstrates two ways of defining DSDs. For ``y_dsd``, we specify
+the base memory address (``&y``) and the number of elements accessed (``M``).
+For ``A_dsd`` and ``b_dsd``, we instead use a ``tensor_access``
+expression.
+The ``tensor_access`` field specifies an induction variable, a loop bound,
+and an affine expression (i.e., a linear function plus a constant) to generate
+various addresses at runtime.
+
+
 These DSDs are used by the DSD operations ``@fmacs`` and ``@fadds`` to
 compute ``Ax + b`` and store it in ``y``.
 The ``gemv`` function first loops over ``N``, with the ``@fmacs`` in iteration
diff --git a/tutorials/gemv-02-memory-dsds/commands.sh b/tutorials/gemv-02-memory-dsds/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-02-memory-dsds/commands.sh
rename to tutorials/gemv-02-memory-dsds/commands_wse2.sh
diff --git a/tutorials/gemv-02-memory-dsds/commands_wse3.sh b/tutorials/gemv-02-memory-dsds/commands_wse3.sh
new file mode 100755
index 0000000..f454b79
--- /dev/null
+++ b/tutorials/gemv-02-memory-dsds/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
+--fabric-offsets=4,1 -o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-02-memory-dsds/pe_program.csl b/tutorials/gemv-02-memory-dsds/pe_program.csl
index 90a0184..da895c0 100644
--- a/tutorials/gemv-02-memory-dsds/pe_program.csl
+++ b/tutorials/gemv-02-memory-dsds/pe_program.csl
@@ -31,11 +31,21 @@ var b = @constants([M]f32, 2.0);
 var y = @zeros([M]f32);
 
 // DSDs for accessing A, b, y
+// b_dsd uses a tensor access expression to access M consecutive elements of b
 var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
+// The above expression is equivalent to:
+// var b_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &b, .extent = M });
+
+// y_dsd uses the base_address and extent fields to access M consecutive elements of y
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
+// The above expression is equivalent to:
+// var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
 
 // A_dsd accesses column of A
+// A_dsd uses a tensor access expression to access every Nth element of A
 var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i*N] });
+// The above expression is equivalent to:
+// var A_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &A, .extent = M, .stride = N });
 
 // ptr to y will be advertised as symbol to host
 const y_ptr: [*]f32 = &y;
diff --git a/tutorials/gemv-03-memcpy/commands.sh b/tutorials/gemv-03-memcpy/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-03-memcpy/commands.sh
rename to tutorials/gemv-03-memcpy/commands_wse2.sh
diff --git a/tutorials/gemv-03-memcpy/commands_wse3.sh b/tutorials/gemv-03-memcpy/commands_wse3.sh
new file mode 100755
index 0000000..f454b79
--- /dev/null
+++ b/tutorials/gemv-03-memcpy/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
+--fabric-offsets=4,1 -o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-03-memcpy/pe_program.csl b/tutorials/gemv-03-memcpy/pe_program.csl
index fe8888e..79989b9 100644
--- a/tutorials/gemv-03-memcpy/pe_program.csl
+++ b/tutorials/gemv-03-memcpy/pe_program.csl
@@ -31,8 +31,8 @@ var y = @zeros([M]f32); // Initialize y to zero
 // DSDs for accessing A, b, y
 // A_dsd accesses column of A
 var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i*N] });
-var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
+var b_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &b, .extent = M });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
 
 // ptrs to A, x, b, y will be advertised as symbols to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/gemv-04-params/commands.sh b/tutorials/gemv-04-params/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-04-params/commands.sh
rename to tutorials/gemv-04-params/commands_wse2.sh
diff --git a/tutorials/gemv-04-params/commands_wse3.sh b/tutorials/gemv-04-params/commands_wse3.sh
new file mode 100755
index 0000000..f2f9b9b
--- /dev/null
+++ b/tutorials/gemv-04-params/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
+--fabric-offsets=4,1 --params=M:4,N:6 -o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-04-params/pe_program.csl b/tutorials/gemv-04-params/pe_program.csl
index d94e71a..824d0ea 100644
--- a/tutorials/gemv-04-params/pe_program.csl
+++ b/tutorials/gemv-04-params/pe_program.csl
@@ -32,8 +32,8 @@ var y = @zeros([M]f32); // Initialize y to zero
 // DSDs for accessing A, b, y
 // A_dsd accesses column of A
 var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i*N] });
-var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
+var b_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &b, .extent = M });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
 
 // ptrs to A, x, b, y will be advertised as symbols to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/gemv-05-multiple-pes/commands.sh b/tutorials/gemv-05-multiple-pes/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-05-multiple-pes/commands.sh
rename to tutorials/gemv-05-multiple-pes/commands_wse2.sh
diff --git a/tutorials/gemv-05-multiple-pes/commands_wse3.sh b/tutorials/gemv-05-multiple-pes/commands_wse3.sh
new file mode 100755
index 0000000..b41e55b
--- /dev/null
+++ b/tutorials/gemv-05-multiple-pes/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 \
+--fabric-offsets=4,1 --params=M:4,N:6,width:4 -o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-05-multiple-pes/pe_program.csl b/tutorials/gemv-05-multiple-pes/pe_program.csl
index 5fee993..4f35fec 100644
--- a/tutorials/gemv-05-multiple-pes/pe_program.csl
+++ b/tutorials/gemv-05-multiple-pes/pe_program.csl
@@ -32,8 +32,8 @@ var y = @zeros([M]f32); // Initialize y to zero
 // DSDs for accessing A, b, y
 // A_dsd accesses column of A
 var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i*N] });
-var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
+var b_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &b, .extent = M });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
 
 // ptrs to A, x, b, y will be advertised as symbols to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/gemv-06-routes-1/commands.sh b/tutorials/gemv-06-routes-1/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-06-routes-1/commands.sh
rename to tutorials/gemv-06-routes-1/commands_wse2.sh
diff --git a/tutorials/gemv-06-routes-1/commands_wse3.sh b/tutorials/gemv-06-routes-1/commands_wse3.sh
new file mode 100755
index 0000000..3b95972
--- /dev/null
+++ b/tutorials/gemv-06-routes-1/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 \
+--fabric-offsets=4,1 --params=M:4,N:6 -o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-06-routes-1/pe_program.csl b/tutorials/gemv-06-routes-1/pe_program.csl
index 0929d44..d4729b6 100644
--- a/tutorials/gemv-06-routes-1/pe_program.csl
+++ b/tutorials/gemv-06-routes-1/pe_program.csl
@@ -44,8 +44,8 @@ var y: [M]f32;
 
 // DSDs for accessing A, b, y
 // A_dsd accesses column of A
-var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
+var A_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &A, .extent = M });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
 
 // ptrs to A, x, b, y will be advertised as symbols to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/gemv-07-routes-2/commands.sh b/tutorials/gemv-07-routes-2/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-07-routes-2/commands.sh
rename to tutorials/gemv-07-routes-2/commands_wse2.sh
diff --git a/tutorials/gemv-07-routes-2/commands_wse3.sh b/tutorials/gemv-07-routes-2/commands_wse3.sh
new file mode 100755
index 0000000..2619e7b
--- /dev/null
+++ b/tutorials/gemv-07-routes-2/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=9,4 \
+--fabric-offsets=4,1 --params=M:4,N:6 -o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-07-routes-2/pe_program.csl b/tutorials/gemv-07-routes-2/pe_program.csl
index ce4eb86..966764d 100644
--- a/tutorials/gemv-07-routes-2/pe_program.csl
+++ b/tutorials/gemv-07-routes-2/pe_program.csl
@@ -56,9 +56,9 @@ var y: [M_per_PE]f32;
 
 // DSDs for accessing A, x, y
 // A_dsd accesses column of A
-var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M_per_PE} -> A[i] });
-var x_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{N_per_PE} -> x[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M_per_PE} -> y[i] });
+var A_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &A, .extent = M_per_PE });
+var x_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &x, .extent = N_per_PE });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M_per_PE });
 
 // ptrs to A, x, b, y will be advertised as symbols to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/gemv-08-routes-3/commands.sh b/tutorials/gemv-08-routes-3/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-08-routes-3/commands.sh
rename to tutorials/gemv-08-routes-3/commands_wse2.sh
diff --git a/tutorials/gemv-08-routes-3/commands_wse3.sh b/tutorials/gemv-08-routes-3/commands_wse3.sh
new file mode 100755
index 0000000..71caa33
--- /dev/null
+++ b/tutorials/gemv-08-routes-3/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,5 \
+--fabric-offsets=4,1 --params=kernel_x_dim:4,kernel_y_dim:3,M:6,N:8 \
+-o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-08-routes-3/pe_program.csl b/tutorials/gemv-08-routes-3/pe_program.csl
index 6a773e0..39b48e6 100644
--- a/tutorials/gemv-08-routes-3/pe_program.csl
+++ b/tutorials/gemv-08-routes-3/pe_program.csl
@@ -61,9 +61,9 @@ var y: [M_per_PE]f32;
 
 // DSDs for accessing A, x, y
 // A_dsd accesses column of A
-var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M_per_PE} -> A[i] });
-var x_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{N_per_PE} -> x[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M_per_PE} -> y[i] });
+var A_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &A, .extent = M_per_PE });
+var x_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &x, .extent = N_per_PE });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M_per_PE });
 
 // ptrs to A, x, b, y will be advertised as symbols to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/gemv-09-streaming/commands.sh b/tutorials/gemv-09-streaming/commands_wse2.sh
similarity index 100%
rename from tutorials/gemv-09-streaming/commands.sh
rename to tutorials/gemv-09-streaming/commands_wse2.sh
diff --git a/tutorials/gemv-09-streaming/commands_wse3.sh b/tutorials/gemv-09-streaming/commands_wse3.sh
new file mode 100755
index 0000000..4322f0a
--- /dev/null
+++ b/tutorials/gemv-09-streaming/commands_wse3.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,5 \
+--fabric-offsets=4,1 --params=kernel_x_dim:4,kernel_y_dim:3,M:6,N:8 \
+--params=MEMCPYH2D_DATA_1_ID:0 \
+--params=MEMCPYH2D_DATA_2_ID:1 \
+--params=MEMCPYD2H_DATA_1_ID:2 \
+-o out --memcpy --channels 1
+cs_python run.py --name out
diff --git a/tutorials/gemv-09-streaming/pe_program.csl b/tutorials/gemv-09-streaming/pe_program.csl
index 6a96c5a..7874852 100644
--- a/tutorials/gemv-09-streaming/pe_program.csl
+++ b/tutorials/gemv-09-streaming/pe_program.csl
@@ -74,8 +74,8 @@ var y: [M_per_PE]f32;
 
 // DSDs for accessing A, x, y
 // A_dsd accesses column of A
-var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M_per_PE} -> A[i] });
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M_per_PE} -> y[i] });
+var A_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &A, .extent = M_per_PE });
+var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M_per_PE });
 
 // ptr to A will be advertised as symbol to host
 var A_ptr: [*]f32 = &A;
diff --git a/tutorials/pipeline-01-basic/commands.sh b/tutorials/pipeline-01-basic/commands_wse2.sh
similarity index 100%
rename from tutorials/pipeline-01-basic/commands.sh
rename to tutorials/pipeline-01-basic/commands_wse2.sh
diff --git a/tutorials/topic-09-map-builtin/commands.sh b/tutorials/pipeline-01-basic/commands_wse3.sh
similarity index 65%
rename from tutorials/topic-09-map-builtin/commands.sh
rename to tutorials/pipeline-01-basic/commands_wse3.sh
index 1128cc7..bbf06e7 100755
--- a/tutorials/topic-09-map-builtin/commands.sh
+++ b/tutorials/pipeline-01-basic/commands_wse3.sh
@@ -2,10 +2,8 @@
 
 set -e
 
-cslc ./layout.csl \
---fabric-dims=8,3 --fabric-offsets=4,1 \
---params=size:5 \
--o out \
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
+--fabric-offsets=4,1 --params=size:12 -o out \
 --params=MEMCPYH2D_DATA_1_ID:0 \
 --params=MEMCPYD2H_DATA_1_ID:1 \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
diff --git a/tutorials/pipeline-02-fifo/commands.sh b/tutorials/pipeline-02-fifo/commands_wse2.sh
similarity index 100%
rename from tutorials/pipeline-02-fifo/commands.sh
rename to tutorials/pipeline-02-fifo/commands_wse2.sh
diff --git a/tutorials/pipeline-02-fifo/commands_wse3.sh b/tutorials/pipeline-02-fifo/commands_wse3.sh
new file mode 100755
index 0000000..77c193e
--- /dev/null
+++ b/tutorials/pipeline-02-fifo/commands_wse3.sh
@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
+--fabric-offsets=4,1 --params=size:32 -o out \
+--params=MEMCPYH2D_DATA_1_ID:0 \
+--params=MEMCPYD2H_DATA_1_ID:1 \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/pipeline-03-multiple/commands.sh b/tutorials/pipeline-03-multiple/commands_wse2.sh
similarity index 100%
rename from tutorials/pipeline-03-multiple/commands.sh
rename to tutorials/pipeline-03-multiple/commands_wse2.sh
diff --git a/tutorials/pipeline-03-multiple/commands_wse3.sh b/tutorials/pipeline-03-multiple/commands_wse3.sh
new file mode 100755
index 0000000..cb6c94f
--- /dev/null
+++ b/tutorials/pipeline-03-multiple/commands_wse3.sh
@@ -0,0 +1,10 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=10,3 \
+--fabric-offsets=4,1 --params=size:32 -o out \
+--params=MEMCPYH2D_DATA_1_ID:0 \
+--params=MEMCPYD2H_DATA_1_ID:1 \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/d2h.csl b/tutorials/pipeline-03-multiple/memcpyEdge/d2h.csl
deleted file mode 100644
index 1224c27..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/d2h.csl
+++ /dev/null
@@ -1,61 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// One streaming D2H:
-// 1st D2H: UT 5 and UT 6
-
-param MEMCPYD2H_1: color = @get_color(32);
-
-// Color along which we expect a wavelet
-param USER_OUT_1: color = @get_color(32);
-
-param rxdir: direction;
-
-const max_fifo_len = 256*40; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-// length=inf
-var fab_recv_wdsd = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = USER_OUT_1,
-   .input_queue = @get_input_queue(6)
-});
-
-// length=inf
-var fab_trans_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = MEMCPYD2H_1,
-    .output_queue = @get_output_queue(5)
-});
-
-// if USER_OUT_1 is not valid, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYD2H_1) < 24) and (@get_int(USER_OUT_1) < 24) ){
-        // receive data from USER_OUT_1
-        @mov32(fifo1, fab_recv_wdsd, .{.async=true} );
-
-        // forward data to MEMCPYD2H_1
-        @mov32(fab_trans_wdsd, fifo1, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_OUT_1) < 24){
-        const d2h_route = .{ .rx = .{ rxdir }, .tx = .{ RAMP } };
-        @set_local_color_config(USER_OUT_1, .{ .routes = d2h_route });
-    }
-}
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/east.csl b/tutorials/pipeline-03-multiple/memcpyEdge/east.csl
deleted file mode 100644
index 7303d8c..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/east.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = WEST
-      });
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/h2d.csl b/tutorials/pipeline-03-multiple/memcpyEdge/h2d.csl
deleted file mode 100644
index 3824fdb..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/h2d.csl
+++ /dev/null
@@ -1,94 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// Two streaming H2Ds:
-// 1st H2D: UT 1 and UT 2
-// 2nd H2D: UT 3 and UT 4
-
-param MEMCPYH2D_1: color = @get_color(32);
-param MEMCPYH2D_2: color = @get_color(32);
-
-// Color along which we send a wavelet to pe_program
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-param txdir: direction;
-
-const max_fifo_len = 256*20; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-var fifo2_buffer = @zeros([max_fifo_len]u32);
-const fifo2 = @allocate_fifo(fifo2_buffer);
-
-// length=inf
-var fab_recv_wdsd_1 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_1,
-   .input_queue = @get_input_queue(1)
-});
-
-// length=inf
-var fab_trans_wdsd_1 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_1,
-    .output_queue = @get_output_queue(2)
-});
-
-// length=inf
-var fab_recv_wdsd_2 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_2,
-   .input_queue = @get_input_queue(3)
-});
-
-// length=inf
-var fab_trans_wdsd_2 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_2,
-    .output_queue = @get_output_queue(4)
-});
-
-// if no user's color is defined, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYH2D_1) < 24) and (@get_int(USER_IN_1) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo1, fab_recv_wdsd_1, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_1, fifo1, .{.async=true} );
-    }
-
-    if ( (@get_int(MEMCPYH2D_2) < 24) and (@get_int(USER_IN_2) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo2, fab_recv_wdsd_2, .{.async=true} );
-
-        // forward data to USER_IN_2
-        @mov32(fab_trans_wdsd_2, fifo2, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_IN_1) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_1, .{ .routes = h2d_route });
-    }
-
-    if (@get_int(USER_IN_2) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_2, .{ .routes = h2d_route });
-    }
-}
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/memcpy_edge.csl b/tutorials/pipeline-03-multiple/memcpyEdge/memcpy_edge.csl
deleted file mode 100644
index 5ebfd5f..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/memcpy_edge.csl
+++ /dev/null
@@ -1,102 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// This is a template of memcpy over the edges.
-// memcpy_edge.csl can be "north", "south", "west" or "east"
-// of the following layout.
-//        +---------+
-//        |  north  |
-// +------+---------+------+
-// | west |  core   | east |
-// +------+---------+------+
-//        |  south  |
-//        +---------+
-// north.csl, south.csl, west.csl and east.csl instantiate
-// memcpy_edge.csl with a proper direction.
-//
-// memcpy_edge.csl supports two streaming H2Ds and one
-// streaming D2H. This constraint comes from the design:
-// the current implementation binds one FIFO to each H2D or D2H,
-// so we can only support 3 in total.
-// We choose 2 H2Ds and 1 D2H.
-// If we replaced the FIFOs with WTTs, we could support more.
-//
-// However the user can instantiate memcpy_edge.csl for each
-// edge. The maximum number of H2Ds is 2*4 = 8 and maximum
-// number of D2Hs is 1*4 = 4.
-//
-// If the user only has an H2D at north, for example, they only
-// need to configure color USER_IN_1, i.e. only a single
-// streaming H2D is used.
-//
-// For example,
-//   @set_tile_code(pe_x, 0, "north.csl", .{
-//      .USER_IN_1 = mainColor,
-//      .STARTUP = STARTUP,
-//      .memcpy_params = memcpy_params,
-//      .MEMCPYH2D_DATA_1 = MEMCPYH2D_DATA_1,
-//      .MEMCPYD2H_DATA_1 = MEMCPYD2H_DATA_1
-//    });
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-param memcpy_params: comptime_struct;
-
-// The direction of "core", for example
-// north.csl has dir = SOUTH
-// south.csl has dir = NORTH
-// west.csl has dir = EAST
-// east.csl has dir = WEST
-param dir: direction;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const h2d_mod = @import_module("h2d.csl", .{
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .MEMCPYH2D_1 = memcpy_params.MEMCPYH2D_1,
-     .MEMCPYH2D_2 = memcpy_params.MEMCPYH2D_2,
-     .txdir = dir
-      });
-
-const d2h_mod = @import_module("d2h.csl", .{
-     .USER_OUT_1 = USER_OUT_1,
-     .MEMCPYD2H_1 = memcpy_params.MEMCPYD2H_1,
-     .rxdir = dir
-      });
-
-task f_startup() void {
-    h2d_mod.f_startup();
-    d2h_mod.f_startup();
-}
-
-comptime {
-    @bind_local_task(f_startup, STARTUP);
-    @activate(STARTUP);
-}
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/north.csl b/tutorials/pipeline-03-multiple/memcpyEdge/north.csl
deleted file mode 100644
index 1452245..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/north.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = SOUTH
-      });
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/south.csl b/tutorials/pipeline-03-multiple/memcpyEdge/south.csl
deleted file mode 100644
index 11b4c43..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/south.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = NORTH
-      });
diff --git a/tutorials/pipeline-03-multiple/memcpyEdge/west.csl b/tutorials/pipeline-03-multiple/memcpyEdge/west.csl
deleted file mode 100644
index 5c7d21a..0000000
--- a/tutorials/pipeline-03-multiple/memcpyEdge/west.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = EAST
-      });
diff --git a/tutorials/topic-01-arrays-and-pointers/commands.sh b/tutorials/topic-01-arrays-and-pointers/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-01-arrays-and-pointers/commands.sh
rename to tutorials/topic-01-arrays-and-pointers/commands_wse2.sh
diff --git a/tutorials/topic-05-switches/commands.sh b/tutorials/topic-01-arrays-and-pointers/commands_wse3.sh
similarity index 53%
rename from tutorials/topic-05-switches/commands.sh
rename to tutorials/topic-01-arrays-and-pointers/commands_wse3.sh
index 2b61380..cfb6e60 100755
--- a/tutorials/topic-05-switches/commands.sh
+++ b/tutorials/topic-01-arrays-and-pointers/commands_wse3.sh
@@ -2,7 +2,6 @@
 
 set -e
 
-cslc ./layout.csl --fabric-dims=10,7 --fabric-offsets=4,1 -o out \
---params=MEMCPYD2H_DATA_1_ID:4 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 -o out \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
 cs_python run.py --name out
diff --git a/tutorials/topic-02-libraries/commands.sh b/tutorials/topic-02-libraries/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-02-libraries/commands.sh
rename to tutorials/topic-02-libraries/commands_wse2.sh
diff --git a/tutorials/topic-06-libraries/commands.sh b/tutorials/topic-02-libraries/commands_wse3.sh
similarity index 64%
rename from tutorials/topic-06-libraries/commands.sh
rename to tutorials/topic-02-libraries/commands_wse3.sh
index 505c67c..9db6ed7 100755
--- a/tutorials/topic-06-libraries/commands.sh
+++ b/tutorials/topic-02-libraries/commands_wse3.sh
@@ -2,8 +2,7 @@
 
 set -e
 
-cslc ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 \
 --params=iterations:200 -o out \
---params=MEMCPYD2H_DATA_1_ID:1 \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
 cs_python run.py --name out --tolerance 0.1
diff --git a/tutorials/topic-02-streaming-wavelet-data/README.rst b/tutorials/topic-02-streaming-wavelet-data/README.rst
deleted file mode 100644
index fd48de4..0000000
--- a/tutorials/topic-02-streaming-wavelet-data/README.rst
+++ /dev/null
@@ -1,17 +0,0 @@
-Topic 2: Streaming Wavelet Data
-===============================
-
-Often, CSL programs contain tasks that are activated in response to the
-arrival of wavelets of specific colors. Such tasks are also called
-Wavelet-Triggered Tasks, or data tasks.
-
-In this example, the ``comptime`` block binds a data task to a ``data_task_id``
-created from a ``memcpy`` streaming color, which receives data from the host.
-The routing of the color ``MEMCPYH2D_DATA_1`` must not be defined.
-The ``memcpy`` module will figure out the routing of ``MEMCPYH2D_DATA_1``.
-
-Given the task and color association and the route, when a wavelet of
-color ``MEMCPYH2D_DATA_1`` arrives at the router, it is forwarded to the CE,
-which then activates ``main_task``.  The wavelet's payload field is received in
-the argument to the task, and the code uses the wavelet data to update a global
-variable.
diff --git a/tutorials/topic-02-streaming-wavelet-data/layout.csl b/tutorials/topic-02-streaming-wavelet-data/layout.csl
deleted file mode 100644
index d696e17..0000000
--- a/tutorials/topic-02-streaming-wavelet-data/layout.csl
+++ /dev/null
@@ -1,53 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var                ID var         ID var                ID var
-//   0 MEMCPY_H2D_DATA_1   9             18                    27 reserved (memcpy)
-//   1 MEMCPY_D2H_DATA_1  10             19                    28 reserved (memcpy)
-//   2                    11             20                    29 reserved
-//   3                    12             21 reserved (memcpy)  30 reserved (memcpy)
-//   4                    13             22 reserved (memcpy)  31 reserved
-//   5                    14             23 reserved (memcpy)  32
-//   6                    15             24                    33
-//   7                    16             25                    34
-//   8                    17             26                    35
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-
-// Task IDs
-const main_task_id: data_task_id = @get_data_task_id(MEMCPYH2D_DATA_1);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-  .width = 1,
-  .height = 1,
-  .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-  .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-});
-
-layout {
-  @set_rectangle(1, 1);
-
-  @set_tile_code(0, 0, "pe_program.csl",  .{
-    .memcpy_params = memcpy.get_params(0),
-    .main_task_id = main_task_id
-  });
-}
diff --git a/tutorials/topic-02-streaming-wavelet-data/pe_program.csl b/tutorials/topic-02-streaming-wavelet-data/pe_program.csl
deleted file mode 100644
index bf17ff4..0000000
--- a/tutorials/topic-02-streaming-wavelet-data/pe_program.csl
+++ /dev/null
@@ -1,40 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is layout.csl.
-
-param memcpy_params: comptime_struct;
-
-// Task IDs
-param main_task_id: data_task_id; // Data task main_task triggered by wlts along MEMCPYH2D_DATA_1
-
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-
-export var global: i16 = 0;
-
-const out_dsd = @get_dsd(fabout_dsd, .{
-   .extent = 1,
-   .fabric_color = sys_mod.MEMCPYD2H_1
-});
-
-task main_task(wavelet_data: i16) void {
-  global = wavelet_data;
-  // The non-async operation works here because only one wavelet is sent
-  // It would be better to use async operation with .{async = true}
-  @mov16(out_dsd, global);
-}
-
-comptime {
-  @bind_data_task(main_task, main_task_id);
-}
diff --git a/tutorials/topic-02-streaming-wavelet-data/run.py b/tutorials/topic-02-streaming-wavelet-data/run.py
deleted file mode 100644
index bf4d082..0000000
--- a/tutorials/topic-02-streaming-wavelet-data/run.py
+++ /dev/null
@@ -1,78 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.sdk_utils import memcpy_view, input_array_to_u32
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYH2D_DATA_1 = int(params["MEMCPYH2D_DATA_1_ID"])
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-print(f"MEMCPYH2D_DATA_1 = {MEMCPYH2D_DATA_1}")
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-input_tensor = np.array([42], dtype=np.int16)
-
-print("step 1: streaming H2D")
-# "input_tensor" is a 1d array
-# The type of input_tensor is int16, we need to extend it to uint32
-# There are two kinds of extension when using the utility function input_array_to_u32
-#    input_array_to_u32(np_arr: np.ndarray, sentinel: Optional[int], fast_dim_sz: int)
-# 1) zero extension:
-#    sentinel = None
-# 2) upper 16 bits hold the index of the array:
-#    sentinel is not None
-#
-# In this example, the upper 16 bits are don't-care because pe_program.csl only defines
-# a WTT that reads the lower 16 bits
-#tensors_u32 = runtime_utils.input_array_to_u32(input_tensor, 1, 1)
-tensors_u32 = input_array_to_u32(input_tensor, 1, 1)
-runner.memcpy_h2d(MEMCPYH2D_DATA_1, tensors_u32, 0, 0, 1, 1, 1, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
-
-print("step 2: streaming D2H")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(1, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, MEMCPYD2H_DATA_1, 0, 0, 1, 1, 1, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-result_tensor = memcpy_view(out_tensors_u32, np.dtype(np.int16))
-
-runner.stop()
-
-# Ensure that the result matches our expectation
-np.testing.assert_equal(result_tensor, [42])
-print("SUCCESS!")
diff --git a/tutorials/topic-03-sparse-tensors/README.rst b/tutorials/topic-03-sparse-tensors/README.rst
deleted file mode 100644
index dd958db..0000000
--- a/tutorials/topic-03-sparse-tensors/README.rst
+++ /dev/null
@@ -1,15 +0,0 @@
-
-Topic 3: Wavelets for Sparse Tensors
-====================================
-
-When tensors are sparse, it is wasteful to send zero values.  Since wavelet
-payloads are 32 bits wide, we can use the lower 16 bits to contain data as
-usual, but we can also use the upper 16 bits to contain the index of the value.
-
-This example illustrates the latter, where each wavelet of the incoming tensor
-has the index field populated in the upper 16 bits.  Accordingly, the task
-definition uses two function arguments: one for the lower 16 bits and
-another for the upper 16 bits.
-
-Optionally, the programmer may also declare a task with just one argument of
-type ``u32`` for receiving 32-bit data.
diff --git a/tutorials/topic-03-sparse-tensors/layout.csl b/tutorials/topic-03-sparse-tensors/layout.csl
deleted file mode 100644
index d696e17..0000000
--- a/tutorials/topic-03-sparse-tensors/layout.csl
+++ /dev/null
@@ -1,53 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var                ID var         ID var                ID var
-//   0 MEMCPY_H2D_DATA_1   9             18                    27 reserved (memcpy)
-//   1 MEMCPY_D2H_DATA_1  10             19                    28 reserved (memcpy)
-//   2                    11             20                    29 reserved
-//   3                    12             21 reserved (memcpy)  30 reserved (memcpy)
-//   4                    13             22 reserved (memcpy)  31 reserved
-//   5                    14             23 reserved (memcpy)  32
-//   6                    15             24                    33
-//   7                    16             25                    34
-//   8                    17             26                    35
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-
-// Task IDs
-const main_task_id: data_task_id = @get_data_task_id(MEMCPYH2D_DATA_1);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-  .width = 1,
-  .height = 1,
-  .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-  .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-});
-
-layout {
-  @set_rectangle(1, 1);
-
-  @set_tile_code(0, 0, "pe_program.csl",  .{
-    .memcpy_params = memcpy.get_params(0),
-    .main_task_id = main_task_id
-  });
-}
diff --git a/tutorials/topic-03-sparse-tensors/pe_program.csl b/tutorials/topic-03-sparse-tensors/pe_program.csl
deleted file mode 100644
index 53f9f8a..0000000
--- a/tutorials/topic-03-sparse-tensors/pe_program.csl
+++ /dev/null
@@ -1,40 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is layout.csl.
-
-param memcpy_params: comptime_struct;
-
-// Task IDs
-param main_task_id: data_task_id; // Data task main_task triggered by wlts along MEMCPYH2D_DATA_1
-
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-
-var result = [4]i16 { 0, 0, 0, 0 };
-
-const out_dsd = @get_dsd(fabout_dsd, .{
-   .extent = 1,
-   .fabric_color = sys_mod.MEMCPYD2H_1
-});
-
-task main_task(wavelet_data: i16, index: i16) void {
-  result[index] = wavelet_data;
-  // The non-async operation works here because only two wavelets are sent
-  // It would be better to use async operation with .{async = true}
-  @mov16(out_dsd, wavelet_data);
-}
-
-comptime {
-  @bind_data_task(main_task, main_task_id);
-}
diff --git a/tutorials/topic-03-sparse-tensors/run.py b/tutorials/topic-03-sparse-tensors/run.py
deleted file mode 100644
index 19f4bde..0000000
--- a/tutorials/topic-03-sparse-tensors/run.py
+++ /dev/null
@@ -1,70 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.sdk_utils import memcpy_view
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYH2D_DATA_1 = int(params["MEMCPYH2D_DATA_1_ID"])
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-print(f"MEMCPYH2D_DATA_1 = {MEMCPYH2D_DATA_1}")
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-# Turn each tuple of two 16-bit integers into one 32-bit integer
-packed = [(idx << 16) + val for idx, val in [(0, 42), (3, 26)]]
-packed_tensor = np.array(packed, dtype=np.int32)
-
-print("step 1: streaming H2D")
-# "packed_tensor" must be a 1d array of type u32
-runner.memcpy_h2d(MEMCPYH2D_DATA_1, packed_tensor, 0, 0, 1, 1, 2, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
-
-print("step 2: streaming D2H")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(2, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, MEMCPYD2H_DATA_1, 0, 0, 1, 1, 2, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-result_tensor = memcpy_view(out_tensors_u32, np.dtype(np.int16))
-
-runner.stop()
-
-# Ensure that the result matches our expectation
-# Since zero wavelets are skipped during transmission, the `@mov16` operation
-# in the code is executed only twice, once for each non-zero wavelet data
-np.testing.assert_equal(result_tensor, [42, 26])
-print("SUCCESS!")
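Reviewer note on the removed topic-03 files above: the wavelet layout they describe (index in the upper 16 bits, data in the lower 16 bits) can be sketched in plain NumPy, independent of the SDK. The helper names `pack_wavelets` and `unpack_wavelets` are hypothetical and do not come from the repository. Unlike the removed `run.py`, this sketch masks each field to 16 bits so that negative values do not spill into the index field.

```python
import numpy as np


def pack_wavelets(pairs):
    """Pack (index, value) pairs into 32-bit words.

    The index goes in the upper 16 bits and the value in the lower 16 bits.
    Both fields are masked to 16 bits (an assumption of this sketch).
    """
    words = [((idx & 0xFFFF) << 16) | (val & 0xFFFF) for idx, val in pairs]
    return np.array(words, dtype=np.uint32)


def unpack_wavelets(words):
    """Split 32-bit words back into (indices, values).

    Indices are returned as uint16; values are reinterpreted as int16.
    """
    words = np.asarray(words, dtype=np.uint32)
    idx = (words >> 16).astype(np.uint16)
    val = (words & 0xFFFF).astype(np.uint16).view(np.int16)
    return idx, val


if __name__ == "__main__":
    # Same (index, value) pairs as the removed run.py
    packed = pack_wavelets([(0, 42), (3, 26)])
    indices, values = unpack_wavelets(packed)
    print(packed, indices, values)
```

On the device side, the removed `pe_program.csl` receives the two halves as separate task arguments (`wavelet_data` and `index`), which corresponds to what `unpack_wavelets` does on the host.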
diff --git a/tutorials/topic-03-streaming-wavelet-data/commands.sh b/tutorials/topic-03-streaming-wavelet-data/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-03-streaming-wavelet-data/commands.sh
rename to tutorials/topic-03-streaming-wavelet-data/commands_wse2.sh
diff --git a/tutorials/topic-03-sparse-tensors/commands.sh b/tutorials/topic-03-streaming-wavelet-data/commands_wse3.sh
similarity index 80%
rename from tutorials/topic-03-sparse-tensors/commands.sh
rename to tutorials/topic-03-streaming-wavelet-data/commands_wse3.sh
index 2dc7f66..b5df57a 100755
--- a/tutorials/topic-03-sparse-tensors/commands.sh
+++ b/tutorials/topic-03-streaming-wavelet-data/commands_wse3.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./layout.csl --fabric-dims=8,3 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
 --fabric-offsets=4,1 -o out \
 --params=MEMCPYH2D_DATA_1_ID:0 \
 --params=MEMCPYD2H_DATA_1_ID:1 \
diff --git a/tutorials/topic-04-sentinels/README.rst b/tutorials/topic-04-sentinels/README.rst
deleted file mode 100644
index a021166..0000000
--- a/tutorials/topic-04-sentinels/README.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-
-Topic 4: Sentinels
-==================
-
-In previous programs, we used so-called routable colors, which
-are associated with a route to direct the flow of wavelets.
-On WSE-2, task IDs which can be associated with routable colors
-are in the range 0 through 23.
-This example demonstrates the use of a non-routable control task ID
-to signal the end of an input tensor, thus giving it the name *sentinel*.
-
-In this example, the host sends a sentinel wavelet at the end of the
-wavelets for the input tensor. Since sentinel control task IDs are not
-routable colors, the programmer should not specify a route for them,
-but they do need to bind the control task ID to a control task,
-which will be activated upon receipt of the sentinel wavelet.
-
-Here, the sentinel activates the ``send_result`` task, which relays the
-result of the sum reduction back to the host.
diff --git a/tutorials/topic-04-sentinels/layout.csl b/tutorials/topic-04-sentinels/layout.csl
deleted file mode 100644
index 6084f5e..0000000
--- a/tutorials/topic-04-sentinels/layout.csl
+++ /dev/null
@@ -1,128 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var          ID var     ID var               ID var               ID var
-//   0 main_color    9 STARTUP 18                   27 reserved (memcpy) 36
-//   1 output_color 10         19                   28 reserved (memcpy) 37
-//   2 H2D_1        11 IN_1    20                   29 reserved          38
-//   3 H2D_2        12 IN_2    21 reserved (memcpy) 30 reserved (memcpy) 39
-//   4 D2H          13         22 reserved (memcpy) 31 reserved          40
-//   5              14         23 reserved (memcpy) 32                   41
-//   6              15         24                   33                   42
-//   7              16         25                   34                   43 send_result_task_id
-//   8              17         26                   35                   44
-
-//  +------+---------+------+------+
-//  | west |sentinel | core | east |
-//  +------+---------+------+------+
-
-//            +-------+              +-----------+
-//  H2D_1 --> | west  | --> IN_1 --> | sentinel  |
-//  H2D_2 --> |       | --> IN_2 --> |           |
-//            +-------+              +-----------+
-//
-//           +---------------+                        +-------+
-//  IN_1 --> | sentinel.csl  | --> OUT_1 (main_color) | core  |
-//  IN_2 --> |               |                        +-------+
-//           +---------------+
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYH2D_DATA_2_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-// number of PEs in a column
-param size: i16;
-
-// Sentinel to tell PE that it is time to send the result to the host
-const end_computation: u16 = 43;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYH2D_DATA_2: color = @get_color(MEMCPYH2D_DATA_2_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-
-const main_color:   color = @get_color(0);
-const output_color: color = @get_color(1);
-
-const IN_1: color = @get_color(11);
-const IN_2: color = @get_color(12);
-
-// Task IDs
-const STARTUP:             local_task_id   = @get_local_task_id(9);
-const main_task_id:        data_task_id    = @get_data_task_id(main_color);
-const send_result_task_id: control_task_id = @get_control_task_id(end_computation);
-const IN_1_task_id:        data_task_id    = @get_data_task_id(IN_1);
-const IN_2_task_id:        data_task_id    = @get_data_task_id(IN_2);
-
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-    .width = 4,
-    .height = size,
-    .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-    .MEMCPYH2D_2 = MEMCPYH2D_DATA_2,
-    .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-    });
-
-layout {
-  @set_rectangle(4, size);
-
-  const input_route  = .{ .rx = .{ WEST }, .tx = .{ RAMP } };
-  const output_route = .{ .rx = .{ RAMP }, .tx = .{ EAST } };
-
-  var idx :i16 = 0;
-  while (idx < size) {
-
-    // west.csl has two H2Ds
-    @set_tile_code(0, idx, "memcpyEdge/west.csl", .{
-      .memcpy_params = memcpy.get_params(0),
-      .USER_IN_1 = IN_1,
-      .USER_IN_2 = IN_2,
-      .STARTUP = STARTUP,
-    });
-
-    @set_tile_code(1, idx, "sentinel.csl", .{
-      .memcpy_params = memcpy.get_params(1),
-      .wtt_in_1_task_id = IN_1_task_id,
-      .wtt_in_2_task_id = IN_2_task_id,
-      .OUT_1 = main_color,
-      .SENTINEL = end_computation,
-    });
-
-    @set_color_config(1, idx, IN_1,       .{ .routes = input_route });
-    @set_color_config(1, idx, IN_2,       .{ .routes = input_route });
-    @set_color_config(1, idx, main_color, .{ .routes = output_route });
-
-    @set_tile_code(2, idx, "pe_program.csl", .{
-      .memcpy_params = memcpy.get_params(2),
-      .output_color = output_color,
-      .main_task_id = main_task_id,
-      .send_result_task_id = send_result_task_id
-    });
-
-    @set_color_config(2, idx, main_color,   .{ .routes = input_route });
-    @set_color_config(2, idx, output_color, .{ .routes = output_route });
-
-    // east.csl only has a D2H
-    @set_tile_code(3, idx, "memcpyEdge/east.csl", .{
-      .memcpy_params = memcpy.get_params(3),
-      .USER_OUT_1 = output_color,
-      .STARTUP = STARTUP
-    });
-
-    idx += 1;
-  }
-}
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/d2h.csl b/tutorials/topic-04-sentinels/memcpyEdge/d2h.csl
deleted file mode 100644
index 1224c27..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/d2h.csl
+++ /dev/null
@@ -1,61 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// One streaming D2H:
-// 1st D2H: UT 5 and UT 6
-
-param MEMCPYD2H_1: color = @get_color(32);
-
-// Color along which we expect a wavelet
-param USER_OUT_1: color = @get_color(32);
-
-param rxdir: direction;
-
-const max_fifo_len = 256*40; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-// length=inf
-var fab_recv_wdsd = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = USER_OUT_1,
-   .input_queue = @get_input_queue(6)
-});
-
-// length=inf
-var fab_trans_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = MEMCPYD2H_1,
-    .output_queue = @get_output_queue(5)
-});
-
-// if USER_OUT_1 is not valid, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYD2H_1) < 24) and (@get_int(USER_OUT_1) < 24) ){
-        // receive data from USER_OUT_1
-        @mov32(fifo1, fab_recv_wdsd, .{.async=true} );
-
-        // forward data to MEMCPYD2H_1
-        @mov32(fab_trans_wdsd, fifo1, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_OUT_1) < 24){
-        const d2h_route = .{ .rx = .{ rxdir }, .tx = .{ RAMP } };
-        @set_local_color_config(USER_OUT_1, .{ .routes = d2h_route });
-    }
-}
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/east.csl b/tutorials/topic-04-sentinels/memcpyEdge/east.csl
deleted file mode 100644
index 7303d8c..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/east.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = WEST
-      });
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/h2d.csl b/tutorials/topic-04-sentinels/memcpyEdge/h2d.csl
deleted file mode 100644
index 017c3df..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/h2d.csl
+++ /dev/null
@@ -1,93 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// Two streaming H2Ds:
-// 1st H2D: UT 1 and UT 2
-// 2nd H2D: UT 3 and UT 4
-
-param MEMCPYH2D_1: color = @get_color(32);
-param MEMCPYH2D_2: color = @get_color(32);
-
-// Color along which we send a wavelet to pe_program
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-param txdir: direction;
-
-const max_fifo_len = 256*20; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-var fifo2_buffer = @zeros([max_fifo_len]u32);
-const fifo2 = @allocate_fifo(fifo2_buffer);
-
-// length=inf
-var fab_recv_wdsd_1 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_1,
-   .input_queue = @get_input_queue(1)
-});
-
-// length=inf
-var fab_trans_wdsd_1 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_1,
-    .output_queue = @get_output_queue(2)
-});
-
-// length=inf
-var fab_recv_wdsd_2 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_2,
-   .input_queue = @get_input_queue(3)
-});
-
-// length=inf
-var fab_trans_wdsd_2 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_2,
-    .output_queue = @get_output_queue(4)
-});
-
-// if no user's color is defined, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYH2D_1) < 24) and (@get_int(USER_IN_1) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo1, fab_recv_wdsd_1, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_1, fifo1, .{.async=true} );
-    }
-
-    if ( (@get_int(MEMCPYH2D_2) < 24) and (@get_int(USER_IN_2) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo2, fab_recv_wdsd_2, .{.async=true} );
-
-        // forward data to USER_IN_2
-        @mov32(fab_trans_wdsd_2, fifo2, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_IN_1) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_1, .{ .routes = h2d_route });
-    }
-    if (@get_int(USER_IN_2) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_2, .{ .routes = h2d_route });
-    }
-}
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/memcpy_edge.csl b/tutorials/topic-04-sentinels/memcpyEdge/memcpy_edge.csl
deleted file mode 100644
index 5ebfd5f..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/memcpy_edge.csl
+++ /dev/null
@@ -1,102 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// This is a template of memcpy over the edges.
-// memcpy_edge.csl can be "north", "south", "west" or "east"
-// of the following layout.
-//        +---------+
-//        |  north  |
-// +------+---------+------+
-// | west |  core   | east |
-// +------+---------+------+
-//        |  south  |
-//        +---------+
-// north.csl, south.csl, west.csl and east.csl instantiate
-// memcpy_edge.csl with a proper direction.
-//
-// memcpy_edge.csl supports 2 streaming H2Ds and one
-// streaming D2H. Such constraint depends on the design.
-// The current implementation binds a FIFO for a H2D or D2H,
-// so we can only support 3 in total.
-// We choose 2 H2Ds and 1 D2H.
-// if we replace FIFO by WTT, we could support more.
-//
-// However the user can instantiate memcpy_edge.csl for each
-// edge. The maximum number of H2Ds is 2*4 = 8 and maximum
-// number of D2Hs is 1*4 = 4.
-//
-// If the user only has a H2D at north, for example, he only
-// needs to configure color USER_IN_1, i.e. only a single
-// streaming H2D is used.
-//
-// For example,
-//   @set_tile_code(pe_x, 0, "north.csl", .{
-//      .USER_IN_1 = mainColor,
-//      .STARTUP = STARTUP,
-//      .memcpy_params = memcpy_params,
-//      .MEMCPYH2D_DATA_1 = MEMCPYH2D_DATA_1,
-//      .MEMCPYD2H_DATA_1 = MEMCPYD2H_DATA_1
-//    });
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-param memcpy_params: comptime_struct;
-
-// The direction of "core", for example
-// north.csl has dir = SOUTH
-// south.csl has dir = NORTH
-// west.csl has dir = EAST
-// east.csl has dir = WEST
-param dir: direction;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const h2d_mod = @import_module("h2d.csl", .{
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .MEMCPYH2D_1 = memcpy_params.MEMCPYH2D_1,
-     .MEMCPYH2D_2 = memcpy_params.MEMCPYH2D_2,
-     .txdir = dir
-      });
-
-const d2h_mod = @import_module("d2h.csl", .{
-     .USER_OUT_1 = USER_OUT_1,
-     .MEMCPYD2H_1 = memcpy_params.MEMCPYD2H_1,
-     .rxdir = dir
-      });
-
-task f_startup() void {
-    h2d_mod.f_startup();
-    d2h_mod.f_startup();
-}
-
-comptime {
-    @bind_local_task(f_startup, STARTUP);
-    @activate(STARTUP);
-}
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/north.csl b/tutorials/topic-04-sentinels/memcpyEdge/north.csl
deleted file mode 100644
index 1452245..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/north.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = SOUTH
-      });
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/south.csl b/tutorials/topic-04-sentinels/memcpyEdge/south.csl
deleted file mode 100644
index 11b4c43..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/south.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = NORTH
-      });
diff --git a/tutorials/topic-04-sentinels/memcpyEdge/west.csl b/tutorials/topic-04-sentinels/memcpyEdge/west.csl
deleted file mode 100644
index 5c7d21a..0000000
--- a/tutorials/topic-04-sentinels/memcpyEdge/west.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = EAST
-      });
diff --git a/tutorials/topic-04-sentinels/pe_program.csl b/tutorials/topic-04-sentinels/pe_program.csl
deleted file mode 100644
index 8a70205..0000000
--- a/tutorials/topic-04-sentinels/pe_program.csl
+++ /dev/null
@@ -1,47 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param memcpy_params: comptime_struct;
-
-// Colors
-param output_color:     color;
-
-// Task IDs
-param main_task_id:        data_task_id;    // data task receives data along main_color
-param send_result_task_id: control_task_id; // sentinel tells PE to send result to host
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-var result: f16 = 0.0;
-
-const out_dsd = @get_dsd(fabout_dsd, .{.fabric_color = output_color, .extent = 1});
-
-task main_task(data: f16) void {
-  result = result + data;
-}
-
-task send_result() void {
-  @fmovh(out_dsd, result);
-}
-
-comptime {
-  @bind_data_task(main_task, main_task_id);
-  @bind_control_task(send_result, send_result_task_id);
-}
diff --git a/tutorials/topic-04-sentinels/run.py b/tutorials/topic-04-sentinels/run.py
deleted file mode 100644
index 70eb3b2..0000000
--- a/tutorials/topic-04-sentinels/run.py
+++ /dev/null
@@ -1,95 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.sdk_utils import memcpy_view, input_array_to_u32
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYH2D_DATA_1 = int(params["MEMCPYH2D_DATA_1_ID"])
-MEMCPYH2D_DATA_2 = int(params["MEMCPYH2D_DATA_2_ID"])
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-size = int(params["size"])
-print(f"MEMCPYH2D_DATA_1 = {MEMCPYH2D_DATA_1}")
-print(f"MEMCPYH2D_DATA_2 = {MEMCPYH2D_DATA_2}")
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-print(f"size = number of PEs in a column = {size}")
-
-# memcpy_dtype is DON'T care under streaming mode
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-num_wvlts = 11
-print(f"num_wvlts = number of wavelets for each PE = {num_wvlts}")
-
-print("step 1: streaming H2D_1 sends number of input wavelets to P0.0")
-h2d1_u32 = np.ones(size).astype(np.uint32) * num_wvlts
-runner.memcpy_h2d(MEMCPYH2D_DATA_1, h2d1_u32.ravel(), 0, 0, 1, size, 1, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=True)
-
-# Use a deterministic seed so that CI results are predictable
-np.random.seed(seed=7)
-
-# Setup a {size}x11 input tensor that is reduced along the second dimension
-input_tensor = np.random.rand(size, num_wvlts).astype(np.float16)
-expected = np.sum(input_tensor, axis=1)
-
-print("step 2: streaming H2D_2 to P0.0")
-# "input_tensor" is a 1d array
-# The type of input_tensor is float16, we need to extend it to uint32
-# There are two kinds of extension when using the utility function input_array_to_u32
-#    input_array_to_u32(np_arr: np.ndarray, sentinel: Optional[int], fast_dim_sz: int)
-# 1) zero extension:
-#    sentinel = None
-# 2) upper 16-bit is the index of the array:
-#    sentinel is Not None
-#
-# In this example, the upper 16-bit is don't care because pe_program.csl only
-# reads lower 16-bit
-tensors_u32 = input_array_to_u32(input_tensor.ravel(), 1, num_wvlts)
-runner.memcpy_h2d(MEMCPYH2D_DATA_2, tensors_u32, 0, 0, 1, size, num_wvlts, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=True)
-
-print("step 3: streaming D2H at P3.0")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(size, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, MEMCPYD2H_DATA_1, 3, 0, 1, size, 1, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-result_tensor = memcpy_view(out_tensors_u32, np.dtype(np.float16))
-
-runner.stop()
-
-# Ensure that the result matches our expectation
-np.testing.assert_allclose(result_tensor, expected, atol=0.05, rtol=0)
-print("SUCCESS!")
diff --git a/tutorials/topic-04-sentinels/sentinel.csl b/tutorials/topic-04-sentinels/sentinel.csl
deleted file mode 100644
index 466698d..0000000
--- a/tutorials/topic-04-sentinels/sentinel.csl
+++ /dev/null
@@ -1,81 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-//
-// sentinel.csl appends a variable length input with a sentinel
-// Here is the layout
-//           +---------------+              +-------+
-//  IN_1 --> | sentinel.csl  | --> OUT_1 -> | core  |
-//  IN_2 --> |               |              +-------+
-//           +---------------+
-//
-// IN_1 receives the number of wavelets of IN_2
-// IN_2 receives the data
-// OUT_1 forwards data from IN_2 and appends a sentinel at the end
-param memcpy_params: comptime_struct;
-
-// Sentinel to signal end of data
-param SENTINEL: u16;
-
-// Colors
-param OUT_1: color; // forwards data from IN_2 with appended sentinel
-
-// Task IDs
-param wtt_in_1_task_id: data_task_id; // Data task triggered by IN_1
-param wtt_in_2_task_id: data_task_id; // Data task triggered by IN_2
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-var num_wvlts:i16 = 0;
-var index: i16 = 0;
-
-const fab_trans_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 1,
-    .fabric_color = OUT_1
-});
-
-const fab_trans_ctrl_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 1,
-    .control = true,
-    .fabric_color = OUT_1
-});
-
-// IN_1 receives number of wavelets of IN_2
-task wtt_in_1(data: u32) void {
-  num_wvlts = @as(i16, data);
-}
-
-// IN_2 forwards data to OUT_1 and appends a sentinel
-// at the end.
-task wtt_in_2(data: u32) void {
-  @mov32(fab_trans_wdsd, data);
-  index = index + 1;
-  if (index >= num_wvlts){
-     // append a sentinel
-     const ctrl_wvlt = @as(u32, SENTINEL) << 16;
-     @mov32(fab_trans_ctrl_wdsd, ctrl_wvlt);
-     index = 0;
-  }
-}
-
-comptime {
-  @bind_data_task(wtt_in_1, wtt_in_1_task_id);
-  @bind_data_task(wtt_in_2, wtt_in_2_task_id);
-}
diff --git a/tutorials/topic-04-sparse-tensors/commands.sh b/tutorials/topic-04-sparse-tensors/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-04-sparse-tensors/commands.sh
rename to tutorials/topic-04-sparse-tensors/commands_wse2.sh
diff --git a/tutorials/topic-02-streaming-wavelet-data/commands.sh b/tutorials/topic-04-sparse-tensors/commands_wse3.sh
similarity index 80%
rename from tutorials/topic-02-streaming-wavelet-data/commands.sh
rename to tutorials/topic-04-sparse-tensors/commands_wse3.sh
index 2dc7f66..b5df57a 100755
--- a/tutorials/topic-02-streaming-wavelet-data/commands.sh
+++ b/tutorials/topic-04-sparse-tensors/commands_wse3.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./layout.csl --fabric-dims=8,3 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=8,3 \
 --fabric-offsets=4,1 -o out \
 --params=MEMCPYH2D_DATA_1_ID:0 \
 --params=MEMCPYD2H_DATA_1_ID:1 \
diff --git a/tutorials/topic-05-sentinels/commands.sh b/tutorials/topic-05-sentinels/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-05-sentinels/commands.sh
rename to tutorials/topic-05-sentinels/commands_wse2.sh
diff --git a/tutorials/topic-04-sentinels/commands.sh b/tutorials/topic-05-sentinels/commands_wse3.sh
similarity index 77%
rename from tutorials/topic-04-sentinels/commands.sh
rename to tutorials/topic-05-sentinels/commands_wse3.sh
index ebb7fa8..3c9b738 100755
--- a/tutorials/topic-04-sentinels/commands.sh
+++ b/tutorials/topic-05-sentinels/commands_wse3.sh
@@ -2,11 +2,11 @@
 
 set -e
 
-cslc ./layout.csl --fabric-dims=11,12 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,12 \
 --fabric-offsets=4,1 -o out \
---params=size:10 \
 --params=MEMCPYH2D_DATA_1_ID:2 \
 --params=MEMCPYH2D_DATA_2_ID:3 \
 --params=MEMCPYD2H_DATA_1_ID:4 \
+--params=size:4 \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
 cs_python run.py --name out
diff --git a/tutorials/topic-05-switches/README.rst b/tutorials/topic-05-switches/README.rst
deleted file mode 100644
index 10106d7..0000000
--- a/tutorials/topic-05-switches/README.rst
+++ /dev/null
@@ -1,18 +0,0 @@
-Topic 5: Switches
-=================
-
-Fabric switches permit limited runtime control of routes.
-
-In this example, the ``layout`` block initializes the default route to receive
-wavelets from the ramp and forward them to the PE's north neighbor.  However, it
-also defines routes for switch positions 1, 2, and 3.  The hardware updates the
-route according to the specified switch positions when it receives a so-called
-Control Wavelet.
-
-For the payload of the control wavelet, the code creates a special wavelet using
-the helper function ``ctrl()``.
-
-Switches can be helpful not just to change the routing configuration in limited
-ways at runtime, but also to save the number of colors used.  For instance, this
-same example could be re-written to use four colors and four routes, but by
-using fabric switches, this example uses just one color.
diff --git a/tutorials/topic-05-switches/empty.csl b/tutorials/topic-05-switches/empty.csl
deleted file mode 100644
index 779ba2c..0000000
--- a/tutorials/topic-05-switches/empty.csl
+++ /dev/null
@@ -1,25 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is code.csl.
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-param memcpy_params: comptime_struct;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
diff --git a/tutorials/topic-05-switches/layout.csl b/tutorials/topic-05-switches/layout.csl
deleted file mode 100644
index 8f6ff1d..0000000
--- a/tutorials/topic-05-switches/layout.csl
+++ /dev/null
@@ -1,140 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var           ID var             ID var                ID var
-//   0                9 STARTUP         18                    27 reserved (memcpy)
-//   1 channel       10                 19                    28 reserved (memcpy)
-//   2 out           11                 20                    29 reserved
-//   3               12                 21 reserved (memcpy)  30 reserved (memcpy)
-//   4 D2H           13                 22 reserved (memcpy)  31 reserved
-//   5               14                 23 reserved (memcpy)  32
-//   6               15                 24                    33
-//   7               16                 25                    34
-//   8 main_task_id  17                 26                    35
-//
-
-//  +---------------+
-//  | north (d2h)   |
-//  +---------------+
-//  | core (3-by-3) |
-//  +---------------+
-//  | south (nop)   |
-//  +---------------+
-
-param MEMCPYD2H_DATA_1_ID: i16; // ID for memcpy streaming color
-
-const colorValue = 1; // ID of color used to transmit from send.csl
-
-// Colors
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-const channel:          color = @get_color(colorValue);
-const out:              color = @get_color(2);
-
-// Task IDs
-const main_task_id:    local_task_id = @get_local_task_id(8);
-const STARTUP:         local_task_id = @get_local_task_id(9);
-const channel_task_id: data_task_id  = @get_data_task_id(channel);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-    .width = 3,
-    .height = 5,
-    .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-    });
-
-layout {
-  @set_rectangle(3, 5);
-
-  // north only runs D2H which receives data from pe_program
-  // and forwards it to the host
-  for (@range(i16, 3)) |pe_x| {
-    const memcpy_params = memcpy.get_params(pe_x);
-    @set_tile_code(pe_x, 0, "memcpyEdge/north.csl", .{
-      .memcpy_params = memcpy_params,
-      .USER_OUT_1 = out,
-      .STARTUP = STARTUP,
-    });
-  }
-
-  const memcpy_params_0 = memcpy.get_params(0);
-  const memcpy_params_1 = memcpy.get_params(1);
-  const memcpy_params_2 = memcpy.get_params(2);
-
-  // The core has 3-by-3 PEs starting at row 1 where row 0 is "north".
-  // The py coordinate of each PE is therefore offset by 1.
-
-  // Out of the nine PEs, the PE in the center (PE #1,1) will send four
-  // control wavelets to the PE's four adjacent neighbors.  These four
-  // adjacent neighbors are programmed to receive the control wavelets, whereas
-  // all other PEs (i.e. the PEs at the corners of the rectangle) are
-  // programmed to contain no instructions or routes.
-  @set_tile_code(1, 1+1, "send.csl", .{
-    .memcpy_params = memcpy_params_1,
-    .txColor = channel,
-    .main_task_id = main_task_id,
-    .colorValue = colorValue
-  });
-
-  @set_tile_code(1, 0+1, "recv.csl", .{
-    // Make this PE send the final message back to the host signaling completion
-    .memcpy_params = memcpy_params_1,
-    .rxColor = channel, .outColor = out,
-    .rx_task_id = channel_task_id,
-    .inDir = SOUTH, .fin = true
-  });
-
-  @set_tile_code(0, 1+1, "recv.csl", .{
-    .memcpy_params = memcpy_params_0,
-    .rxColor = channel, .outColor = out,
-    .rx_task_id = channel_task_id,
-    .inDir = EAST, .fin = false
-  });
-
-  @set_tile_code(2, 1+1, "recv.csl", .{
-    .memcpy_params = memcpy_params_2,
-    .rxColor = channel, .outColor = out,
-    .rx_task_id = channel_task_id,
-    .inDir = WEST, .fin = false
-  });
-
-  @set_tile_code(1, 2+1, "recv.csl", .{
-    .memcpy_params = memcpy_params_1,
-    .rxColor = channel, .outColor = out,
-    .rx_task_id = channel_task_id,
-    .inDir = NORTH, .fin = false
-  });
-
-  // south does NOP
-  for (@range(i16, 3)) |pe_x| {
-    const memcpy_params = memcpy.get_params(pe_x);
-    @set_tile_code(pe_x, 4, "memcpyEdge/south.csl", .{
-      .memcpy_params = memcpy_params,
-      .STARTUP = STARTUP
-    });
-  }
-
-  @set_tile_code(0, 0+1, "empty.csl", .{
-    .memcpy_params = memcpy_params_0,
-  });
-  @set_tile_code(2, 0+1, "empty.csl", .{
-    .memcpy_params = memcpy_params_2,
-  });
-  @set_tile_code(0, 2+1, "empty.csl", .{
-    .memcpy_params = memcpy_params_0,
-  });
-  @set_tile_code(2, 2+1, "empty.csl", .{
-    .memcpy_params = memcpy_params_2,
-  });
-}
diff --git a/tutorials/topic-05-switches/memcpyEdge/d2h.csl b/tutorials/topic-05-switches/memcpyEdge/d2h.csl
deleted file mode 100644
index 1224c27..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/d2h.csl
+++ /dev/null
@@ -1,61 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// One streaming D2H:
-// 1st D2H: UT 5 and UT 6
-
-param MEMCPYD2H_1: color = @get_color(32);
-
-// Color along which we expect a wavelet
-param USER_OUT_1: color = @get_color(32);
-
-param rxdir: direction;
-
-const max_fifo_len = 256*40; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-// length=inf
-var fab_recv_wdsd = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = USER_OUT_1,
-   .input_queue = @get_input_queue(6)
-});
-
-// length=inf
-var fab_trans_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = MEMCPYD2H_1,
-    .output_queue = @get_output_queue(5)
-});
-
-// if USER_OUT_1 is not valid, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYD2H_1) < 24) and (@get_int(USER_OUT_1) < 24) ){
-        // receive data from USER_OUT_1
-        @mov32(fifo1, fab_recv_wdsd, .{.async=true} );
-
-        // forward data to MEMCPYD2H_1
-        @mov32(fab_trans_wdsd, fifo1, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_OUT_1) < 24){
-        const d2h_route = .{ .rx = .{ rxdir }, .tx = .{ RAMP } };
-        @set_local_color_config(USER_OUT_1, .{ .routes = d2h_route });
-    }
-}
diff --git a/tutorials/topic-05-switches/memcpyEdge/east.csl b/tutorials/topic-05-switches/memcpyEdge/east.csl
deleted file mode 100644
index 7303d8c..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/east.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = WEST
-      });
diff --git a/tutorials/topic-05-switches/memcpyEdge/h2d.csl b/tutorials/topic-05-switches/memcpyEdge/h2d.csl
deleted file mode 100644
index 017c3df..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/h2d.csl
+++ /dev/null
@@ -1,93 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// Two streaming H2Ds:
-// 1st H2D: UT 1 and UT 2
-// 2nd H2D: UT 3 and UT 4
-
-param MEMCPYH2D_1: color = @get_color(32);
-param MEMCPYH2D_2: color = @get_color(32);
-
-// Color along which we send a wavelet to pe_program
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-param txdir: direction;
-
-const max_fifo_len = 256*20; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-var fifo2_buffer = @zeros([max_fifo_len]u32);
-const fifo2 = @allocate_fifo(fifo2_buffer);
-
-// length=inf
-var fab_recv_wdsd_1 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_1,
-   .input_queue = @get_input_queue(1)
-});
-
-// length=inf
-var fab_trans_wdsd_1 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_1,
-    .output_queue = @get_output_queue(2)
-});
-
-// length=inf
-var fab_recv_wdsd_2 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_2,
-   .input_queue = @get_input_queue(3)
-});
-
-// length=inf
-var fab_trans_wdsd_2 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_2,
-    .output_queue = @get_output_queue(4)
-});
-
-// if no user's color is defined, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYH2D_1) < 24) and (@get_int(USER_IN_1) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo1, fab_recv_wdsd_1, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_1, fifo1, .{.async=true} );
-    }
-
-    if ( (@get_int(MEMCPYH2D_2) < 24) and (@get_int(USER_IN_2) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo2, fab_recv_wdsd_2, .{.async=true} );
-
-        // forward data to USER_IN_2
-        @mov32(fab_trans_wdsd_2, fifo2, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_IN_1) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_1, .{ .routes = h2d_route });
-    }
-    if (@get_int(USER_IN_2) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_2, .{ .routes = h2d_route });
-    }
-}
diff --git a/tutorials/topic-05-switches/memcpyEdge/memcpy_edge.csl b/tutorials/topic-05-switches/memcpyEdge/memcpy_edge.csl
deleted file mode 100644
index 5ebfd5f..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/memcpy_edge.csl
+++ /dev/null
@@ -1,102 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// This is a template of memcpy over the edges.
-// memcpy_edge.csl can be "north", "south", "west" or "east"
-// of the following layout.
-//        +---------+
-//        |  north  |
-// +------+---------+------+
-// | west |  core   | east |
-// +------+---------+------+
-//        |  south  |
-//        +---------+
-// north.csl, south.csl, west.csl and east.csl instantiate
-// memcpy_edge.csl with a proper direction.
-//
-// memcpy_edge.csl supports two streaming H2Ds and one
-// streaming D2H. This limit comes from the design:
-// the current implementation binds one FIFO to each H2D or D2H,
-// so we can only support three streams in total.
-// We choose two H2Ds and one D2H.
-// If we replaced the FIFOs with WTT, we could support more.
-//
-// However, the user can instantiate memcpy_edge.csl for each
-// edge, so the maximum number of H2Ds is 2*4 = 8 and the maximum
-// number of D2Hs is 1*4 = 4.
-//
-// If the user only has an H2D at north, for example, they only
-// need to configure color USER_IN_1, i.e. only a single
-// streaming H2D is used.
-//
-// For example,
-//   @set_tile_code(pe_x, 0, "north.csl", .{
-//      .USER_IN_1 = mainColor,
-//      .STARTUP = STARTUP,
-//      .memcpy_params = memcpy_params,
-//      .MEMCPYH2D_DATA_1 = MEMCPYH2D_DATA_1,
-//      .MEMCPYD2H_DATA_1 = MEMCPYD2H_DATA_1
-//    });
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-param memcpy_params: comptime_struct;
-
-// The direction of "core", for example
-// north.csl has dir = SOUTH
-// south.csl has dir = NORTH
-// west.csl has dir = EAST
-// east.csl has dir = WEST
-param dir: direction;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const h2d_mod = @import_module("h2d.csl", .{
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .MEMCPYH2D_1 = memcpy_params.MEMCPYH2D_1,
-     .MEMCPYH2D_2 = memcpy_params.MEMCPYH2D_2,
-     .txdir = dir
-      });
-
-const d2h_mod = @import_module("d2h.csl", .{
-     .USER_OUT_1 = USER_OUT_1,
-     .MEMCPYD2H_1 = memcpy_params.MEMCPYD2H_1,
-     .rxdir = dir
-      });
-
-task f_startup() void {
-    h2d_mod.f_startup();
-    d2h_mod.f_startup();
-}
-
-comptime {
-    @bind_local_task(f_startup, STARTUP);
-    @activate(STARTUP);
-}
diff --git a/tutorials/topic-05-switches/memcpyEdge/north.csl b/tutorials/topic-05-switches/memcpyEdge/north.csl
deleted file mode 100644
index 1452245..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/north.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = SOUTH
-      });
diff --git a/tutorials/topic-05-switches/memcpyEdge/south.csl b/tutorials/topic-05-switches/memcpyEdge/south.csl
deleted file mode 100644
index 11b4c43..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/south.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = NORTH
-      });
diff --git a/tutorials/topic-05-switches/memcpyEdge/west.csl b/tutorials/topic-05-switches/memcpyEdge/west.csl
deleted file mode 100644
index 5c7d21a..0000000
--- a/tutorials/topic-05-switches/memcpyEdge/west.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = EAST
-      });
diff --git a/tutorials/topic-05-switches/recv.csl b/tutorials/topic-05-switches/recv.csl
deleted file mode 100644
index 24be398..0000000
--- a/tutorials/topic-05-switches/recv.csl
+++ /dev/null
@@ -1,54 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is code.csl.
-param memcpy_params: comptime_struct;
-
-param fin: bool;
-param inDir: direction;
-
-// Colors
-param rxColor:          color;
-param outColor:         color;
-
-// Task IDs
-param rx_task_id: data_task_id; // Data task receives data along rxColor
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const dsd = @get_dsd(fabout_dsd, .{.fabric_color = outColor, .extent = 1});
-
-export var global:u16 = 0;
-
-task rxTask(data: u16) void {
-  global = data;
-
-  if (fin) {
-    @mov16(dsd, 0);
-  }
-}
-
-comptime {
-  @bind_data_task(rxTask, rx_task_id);
-  @set_local_color_config(rxColor, .{.routes = .{ .rx = .{ inDir }, .tx = .{ RAMP } } });
-
-  const outRoute = .{ .rx = .{ RAMP }, .tx = .{ NORTH } };
-  @set_local_color_config(outColor, .{.routes = outRoute});
-}
diff --git a/tutorials/topic-05-switches/run.py b/tutorials/topic-05-switches/run.py
deleted file mode 100644
index 3b35988..0000000
--- a/tutorials/topic-05-switches/run.py
+++ /dev/null
@@ -1,74 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.debug.debug_util import debug_util
-from cerebras.sdk.sdk_utils import memcpy_view
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-print("step 1: streaming D2H at P1.0 (end of communication)")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(1, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, MEMCPYD2H_DATA_1, 1, 0, 1, 1, 1, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove the upper 16 bits of each u32
-result_tensor = memcpy_view(out_tensors_u32, np.dtype(np.int16))
-
-runner.stop()
-
-debug_mod = debug_util(dirname, cmaddr=args.cmaddr)
-core_offset_x = 4
-core_offset_y = 1
-print(f"=== core rectangle starts at {core_offset_x}, {core_offset_y}")
-# sender PE is P1.2
-# top PE of sender PE is P1.1
-result_top = debug_mod.get_symbol(core_offset_x+1, core_offset_y+1, "global", np.uint16)
-# left PE of sender PE is P0.2
-result_left = debug_mod.get_symbol(core_offset_x+0, core_offset_y+2, "global", np.uint16)
-# right PE of sender PE is P2.2
-result_right = debug_mod.get_symbol(core_offset_x+2, core_offset_y+2, "global", np.uint16)
-# bottom PE of sender PE is P1.3
-result_bottom = debug_mod.get_symbol(core_offset_x+1, core_offset_y+3, "global", np.uint16)
-
-np.testing.assert_allclose(result_top, 0xdd)
-np.testing.assert_allclose(result_left, 0xaa)
-np.testing.assert_allclose(result_right, 0xbb)
-np.testing.assert_allclose(result_bottom, 0xcc)
-print("SUCCESS!")
diff --git a/tutorials/topic-05-switches/send.csl b/tutorials/topic-05-switches/send.csl
deleted file mode 100644
index b621f59..0000000
--- a/tutorials/topic-05-switches/send.csl
+++ /dev/null
@@ -1,116 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is code.csl.
-param memcpy_params: comptime_struct;
-
-param colorValue;
-
-// Colors
-param txColor:          color;
-
-// Task IDs
-param main_task_id: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const dsd = @get_dsd(fabout_dsd, .{
-  .extent = 1,
-  .fabric_color = txColor,
-
-  // Specify that this wavelet is a control wavelet
-  .control = true,
-});
-
-// Opcodes for potentially updating switches
-const opcode_nop = 0;
-const opcode_switch_advance = 1;
-const opcode_switch_reset = 2;
-const opcode_teardown = 3;
-
-// Helper function to construct the payload of the control wavelet.
-// args:
-//    ce_filter: a filter bit to disable transmission from the destination
-//               router to the destination CE,
-//    opcode: switching opcode (see comment above), and
-//    data: 16-bit wavelet data
-fn ctrl(ce_filter: bool, opcode: i16, data: u16) u32 {
-  const six = @as(u32, 6);
-  const eight = @as(u32, 8);
-  const sixteen = @as(u32, 16);
-
-  const hi_word = @as(u32, colorValue) |
-                  @as(u32, opcode) << six |
-                  @as(u32, ce_filter) << eight;
-
-  const lo_word = @as(u32, data);
-  return hi_word << sixteen | lo_word;
-}
-
-task mainTask() void {
-  // Now we can reuse a single color to send four different values to the four
-  // neighbors of this PE.  The four wavelets will be sent over four
-  // consecutive cycles.
-
-  // Send 0xaa along the first (WEST) direction
-  // Since all arguments to this function are known at compile time, we make
-  // this a `comptime` function call.
-  @mov32(dsd, comptime ctrl(false, opcode_switch_advance, 0xaa));
-
-  // Send 0xbb along the second (EAST) direction
-  @mov32(dsd, comptime ctrl(false, opcode_switch_advance, 0xbb));
-
-  // Send 0xcc along the third (SOUTH) direction
-  @mov32(dsd, comptime ctrl(false, opcode_switch_advance, 0xcc));
-
-  // Send 0xdd along the fourth (NORTH) direction
-  @mov32(dsd, comptime ctrl(false, opcode_switch_advance, 0xdd));
-}
-
-comptime {
-  @bind_local_task(mainTask, main_task_id);
-  @activate(main_task_id);
-
-  const routes = .{
-    // The default route, which is to receive from ramp and send to north
-    .rx = .{ RAMP },
-    .tx = .{ NORTH }
-  };
-
-  const switches = .{
-
-    // Upon a control wavelet, change the transmit direction to west
-    .pos1 = .{ .tx = WEST },
-
-    // Upon another control wavelet, change the transmit direction to east
-    .pos2 = .{ .tx = EAST },
-
-    // Upon yet another control wavelet, change the transmit direction to south
-    .pos3 = .{ .tx = SOUTH },
-
-    // Send to west PE first, then east PE, then south PE, and then north PE
-    .current_switch_pos = 1,
-
-    // Wrap around from position 3 to position 0 after receiving control wavelet
-    .ring_mode = true,
-  };
-
-  @set_local_color_config(txColor, .{.routes = routes, .switches = switches});
-}
diff --git a/tutorials/topic-06-libraries/README.rst b/tutorials/topic-06-libraries/README.rst
deleted file mode 100644
index 61dd8d6..0000000
--- a/tutorials/topic-06-libraries/README.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-
-Topic 6: Libraries
-==================
-
-The CSL compiler comes bundled with a few standard libraries, which can be
-imported into the user's program using the ``@import_module()`` builtin.  This
-example shows three such compiler-bundled libraries:
-
-
-* the ``random`` library for generating uniform random numbers,
-* the ``time`` library for reading the on-chip timestamp counter, and
-* the ``math`` library for square root.
diff --git a/tutorials/topic-06-libraries/layout.csl b/tutorials/topic-06-libraries/layout.csl
deleted file mode 100644
index 94d6610..0000000
--- a/tutorials/topic-06-libraries/layout.csl
+++ /dev/null
@@ -1,57 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/task ID map
-//
-//  ID var           ID var     ID var                ID var
-//   0 D2H_1          9         18                    27 reserved (memcpy)
-//   1               10         19                    28 reserved (memcpy)
-//   2               11         20                    29 reserved
-//   3               12         21 reserved (memcpy)  30 reserved (memcpy)
-//   4               13         22 reserved (memcpy)  31 reserved
-//   5               14         23 reserved (memcpy)  32
-//   6               15         24                    33
-//   7               16         25                    34
-//   8 main_task_id  17         26                    35
-//
-
-param MEMCPYD2H_DATA_1_ID: i16; // ID for memcpy streaming colors
-
-param iterations: u32;
-
-// Colors
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-
-// Task IDs
-const main_task_id: local_task_id = @get_local_task_id(8);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-  .width = 1,
-  .height = 1,
-  .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-});
-
-layout {
-  @set_rectangle(1, 1);
-
-  @set_tile_code(0, 0, "pe_program.csl", .{
-    .memcpy_params = memcpy.get_params(0),
-    .main_task_id = main_task_id,
-    .iterations = iterations
-  });
-
-  // export symbol name
-  @export_name("f_run", fn()void);
-  @export_name("f_send_timestamps", fn()void);
-}
diff --git a/tutorials/topic-06-libraries/pe_program.csl b/tutorials/topic-06-libraries/pe_program.csl
deleted file mode 100644
index 1dee7a9..0000000
--- a/tutorials/topic-06-libraries/pe_program.csl
+++ /dev/null
@@ -1,124 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is layout.csl.
-param memcpy_params: comptime_struct;
-
-param iterations: u32;
-
-// Task IDs
-param main_task_id: local_task_id;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-
-// Import compiler-bundled libraries, which are identified by names surrounded
-// by angle brackets ('<' and '>').
-const random = @import_module("<random>");
-const tsc = @import_module("<time>");
-const math = @import_module("<math>");
-
-// Declare variables for storing the timestamp counter at the start and the end
-// of the core computation.
-var startBuffer = @zeros([tsc.tsc_size_words]u16);
-var finishBuffer = @zeros([tsc.tsc_size_words]u16);
-var timeBuffer = @zeros([tsc.tsc_size_words*2]u16);
-
-/// Send the final result to the host.
-fn sendResult(result: f32) void {
-  const resultDsd = @get_dsd(fabout_dsd, .{
-    .extent = 1,
-    .fabric_color = sys_mod.MEMCPYD2H_1,
-    .output_queue = @get_output_queue(1)
-  });
-  // The sync operation works here because the length is 1.
-  // It would be better to use {.async=true}.
-  @fmovs(resultDsd, result);
-}
-
-/// Send the begin and end timestamp counters to the host, which then performs a
-/// 48-bit subtraction to get the final cycle count.
-fn sendTimeStampCounters() void {
-  timeBuffer[0] = startBuffer[0];
-  timeBuffer[1] = startBuffer[1];
-  timeBuffer[2] = startBuffer[2];
-
-  timeBuffer[3] = finishBuffer[0];
-  timeBuffer[4] = finishBuffer[1];
-  timeBuffer[5] = finishBuffer[2];
-
-  const timeBufferDsd = @get_dsd(mem1d_dsd, .{
-    .tensor_access = |i|{tsc.tsc_size_words*2} -> timeBuffer[i]
-  });
-
-  const timeStampDsd = @get_dsd(fabout_dsd, .{
-    .extent = tsc.tsc_size_words*2,
-    .fabric_color = sys_mod.MEMCPYD2H_1,
-    .output_queue = @get_output_queue(1)
-  });
-
-  @mov16(timeStampDsd, timeBufferDsd, .{.async=true});
-}
-
-task mainTask() void {
-  var idx: u32 = 0;
-  var hitCount: u32 = 0;
-
-  tsc.enable_tsc();
-  tsc.get_timestamp(&startBuffer);
-
-  // For each iteration, compute two random values between -1 and +1, and check
-  // whether they are inside the circle of unit radius.
-  while (idx < iterations) : (idx += 1) {
-    var x = random.random_f32(-1.0, 1.0);
-    var y = random.random_f32(-1.0, 1.0);
-    var distanceFromOrigin = math.sqrt_f32(x * x + y * y);
-
-    if (distanceFromOrigin <= 1.0) {
-      hitCount += 1;
-    }
-  }
-
-  tsc.get_timestamp(&finishBuffer);
-  sendResult(4.0 * @as(f32, hitCount) / @as(f32, iterations));
-}
-
-comptime {
-  @bind_local_task(mainTask, main_task_id);
-}
-
-fn f_run() void {
-  @activate(main_task_id);
-
-  // RPC returns early before the data is sent out via D2H color
-  // The host must wait for streaming D2H
-
-  // WARNING: the user must unblock cmd color for every PE
-  sys_mod.unblock_cmd_stream();
-}
-
-fn f_send_timestamps() void {
-  sendTimeStampCounters();
-
-  // RPC returns early before the data is sent out via D2H color
-  // The host must wait for streaming D2H
-
-  // WARNING: the user must unblock cmd color for every PE
-  sys_mod.unblock_cmd_stream();
-}
-
-comptime{
-  @export_symbol(f_run);
-  @export_symbol(f_send_timestamps);
-}
diff --git a/tutorials/topic-06-libraries/run.py b/tutorials/topic-06-libraries/run.py
deleted file mode 100644
index ff946a9..0000000
--- a/tutorials/topic-06-libraries/run.py
+++ /dev/null
@@ -1,82 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.sdk_utils import memcpy_view
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-parser.add_argument("--tolerance", type=float, help="tolerance for result")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-
-print("The simfab may take 25 sec more")
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-print("step 1: call f_run to start streaming D2H (result)")
-runner.launch("f_run", nonblock=False)
-
-print("step 2: streaming D2H (result)")
-# The D2H buffer must be of type u32
-result = np.zeros(1, np.float32)
-runner.memcpy_d2h(result, MEMCPYD2H_DATA_1, 0, 0, 1, 1, 1, \
-    streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT, \
-    order=MemcpyOrder.COL_MAJOR, nonblock=False)
-
-print("step 3: call f_send_timestamps to start streaming D2H (timestamp)")
-runner.launch("f_send_timestamps", nonblock=False)
-
-print("step 4: streaming D2H (timestamps)")
-# The D2H buffer must be of type u32
-timestamps_u32 = np.zeros(6, np.uint32)
-runner.memcpy_d2h(timestamps_u32, MEMCPYD2H_DATA_1, 0, 0, 1, 1, 6, \
-    streaming=True, data_type=MemcpyDataType.MEMCPY_16BIT, \
-    order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-timestamps = memcpy_view(timestamps_u32, np.dtype(np.uint16))
-
-runner.stop()
-
-# Helper functions for computing the delta in the cycle count
-def make_u48(words):
-  return words[0] + (words[1] << 16) + (words[2] << 32)
-
-def subtract_timestamps(words):
-  return make_u48(words[3:]) - make_u48(words[0:3])
-
-cycles = subtract_timestamps(timestamps)
-print("cycle count:", cycles)
-
-print(f"result = {result}, np.pi = {np.pi}, tol = {args.tolerance}")
-np.testing.assert_allclose(result, np.pi, atol=args.tolerance, rtol=0)
-print("SUCCESS!")
diff --git a/tutorials/topic-06-switches/commands.sh b/tutorials/topic-06-switches/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-06-switches/commands.sh
rename to tutorials/topic-06-switches/commands_wse2.sh
diff --git a/tutorials/topic-06-switches/commands_wse3.sh b/tutorials/topic-06-switches/commands_wse3.sh
new file mode 100755
index 0000000..2fa2c6a
--- /dev/null
+++ b/tutorials/topic-06-switches/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=10,5 --fabric-offsets=4,1 -o out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/topic-07-filters/README.rst b/tutorials/topic-07-filters/README.rst
deleted file mode 100644
index ead85d5..0000000
--- a/tutorials/topic-07-filters/README.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-
-Topic 7: Filters
-================
-
-Fabric filters allow a PE to selectively accept incoming wavelets.  This example
-shows the use of so-called range filters, which specify the wavelets to allow to
-be forwarded to the CE based on the upper 16 bits of the wavelet contents.
-Specifically, PE #0 sends all 12 wavelets to the other PEs, while each recipient
-PE receives and processes only a quarter of the incoming wavelets.
-See :ref:`language-builtins-filters` for other possible filter configurations.
diff --git a/tutorials/topic-07-filters/commands.sh b/tutorials/topic-07-filters/commands.sh
deleted file mode 100755
index ad2c75c..0000000
--- a/tutorials/topic-07-filters/commands.sh
+++ /dev/null
@@ -1,9 +0,0 @@
-#!/usr/bin/env bash
-
-set -e
-
-cslc ./layout.csl --fabric-dims=11,5 --fabric-offsets=4,1 -o out \
---params=MEMCPYH2D_DATA_1_ID:3 \
---params=MEMCPYD2H_DATA_1_ID:4 \
---memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
-cs_python run.py --name out
diff --git a/tutorials/topic-07-filters/layout.csl b/tutorials/topic-07-filters/layout.csl
deleted file mode 100644
index 544b1b7..0000000
--- a/tutorials/topic-07-filters/layout.csl
+++ /dev/null
@@ -1,109 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var           ID var            ID var                ID var
-//   0                9 STARTUP        18                    27 reserved (memcpy)
-//   1 dataColor     10                19                    28 reserved (memcpy)
-//   2 resultColor   11                20                    29 reserved
-//   3 H2D           12                21 reserved (memcpy)  30 reserved (memcpy)
-//   4 D2H           13                22 reserved (memcpy)  31 reserved
-//   5               14                23 reserved (memcpy)  32
-//   6               15                24                    33
-//   7               16                25                    34
-//   8 main_task_id  17                26                    35
-
-//  +-------------+
-//  | north(d2H)  |
-//  +-------------+
-//  | core        |
-//  +-------------+
-//  | south(nop)  |
-//  +-------------+
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-const dataColor:        color = @get_color(1);
-const resultColor:      color = @get_color(2);
-
-// Task IDs
-const STARTUP:      local_task_id = @get_local_task_id(9);
-const main_task_id: local_task_id = @get_local_task_id(8);
-const recv_task_id: data_task_id  = @get_data_task_id(dataColor);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-    .width = 4,
-    .height = 3,
-    .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-    .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-    });
-
-layout {
-  @set_rectangle(4, 3);
-
-  for (@range(i16, 4)) |pe_x| {
-    const memcpy_params = memcpy.get_params(pe_x);
-
-    // north PE only runs d2h
-    @set_tile_code(pe_x, 0, "memcpyEdge/north.csl", .{
-      .memcpy_params = memcpy_params,
-      .USER_OUT_1 = resultColor,
-      .STARTUP = STARTUP,
-    });
-  }
-
-  const memcpy_params_0 = memcpy.get_params(0);
-  const memcpy_params_1 = memcpy.get_params(1);
-  const memcpy_params_2 = memcpy.get_params(2);
-  const memcpy_params_3 = memcpy.get_params(3);
-
-  @set_tile_code(0, 1, "send.csl", .{
-    .peId = 0,
-    .memcpy_params = memcpy_params_0,
-    .exchColor = dataColor,
-    .resultColor = resultColor,
-    .main_task_id = main_task_id
-  });
-
-  const recvStruct = .{ .recvColor    = dataColor,
-                        .resultColor  = resultColor,
-                        .recv_task_id = recv_task_id };
-  @set_tile_code(1, 1, "recv.csl", @concat_structs(recvStruct, .{
-    .peId = 1,
-    .memcpy_params = memcpy_params_1,
-  }));
-  @set_tile_code(2, 1, "recv.csl", @concat_structs(recvStruct, .{
-    .peId = 2,
-    .memcpy_params = memcpy_params_2,
-  }));
-  @set_tile_code(3, 1, "recv.csl", @concat_structs(recvStruct, .{
-    .peId = 3,
-    .memcpy_params = memcpy_params_3,
-  }));
-
-  for (@range(i16, 4)) |pe_x| {
-    const memcpy_params = memcpy.get_params(pe_x);
-    // south does nothing
-    @set_tile_code(pe_x, 2, "memcpyEdge/south.csl", .{
-      .memcpy_params = memcpy_params,
-      .STARTUP = STARTUP
-    });
-  }
-}
diff --git a/tutorials/topic-07-filters/memcpyEdge/d2h.csl b/tutorials/topic-07-filters/memcpyEdge/d2h.csl
deleted file mode 100644
index 1224c27..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/d2h.csl
+++ /dev/null
@@ -1,61 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// One streaming D2H:
-// 1st D2H: UT 5 and UT 6
-
-param MEMCPYD2H_1: color = @get_color(32);
-
-// Color along which we expect a wavelet
-param USER_OUT_1: color = @get_color(32);
-
-param rxdir: direction;
-
-const max_fifo_len = 256*40; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-// length=inf
-var fab_recv_wdsd = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = USER_OUT_1,
-   .input_queue = @get_input_queue(6)
-});
-
-// length=inf
-var fab_trans_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = MEMCPYD2H_1,
-    .output_queue = @get_output_queue(5)
-});
-
-// if USER_OUT_1 is not valid, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYD2H_1) < 24) and (@get_int(USER_OUT_1) < 24) ){
-        // receive data from USER_OUT_1
-        @mov32(fifo1, fab_recv_wdsd, .{.async=true} );
-
-        // forward data to MEMCPYD2H_1
-        @mov32(fab_trans_wdsd, fifo1, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_OUT_1) < 24){
-        const d2h_route = .{ .rx = .{ rxdir }, .tx = .{ RAMP } };
-        @set_local_color_config(USER_OUT_1, .{ .routes = d2h_route });
-    }
-}
diff --git a/tutorials/topic-07-filters/memcpyEdge/east.csl b/tutorials/topic-07-filters/memcpyEdge/east.csl
deleted file mode 100644
index 7303d8c..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/east.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = WEST
-      });
diff --git a/tutorials/topic-07-filters/memcpyEdge/h2d.csl b/tutorials/topic-07-filters/memcpyEdge/h2d.csl
deleted file mode 100644
index 017c3df..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/h2d.csl
+++ /dev/null
@@ -1,93 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// Two streaming H2Ds:
-// 1st H2D: UT 1 and UT 2
-// 2nd H2D: UT 3 and UT 4
-
-param MEMCPYH2D_1: color = @get_color(32);
-param MEMCPYH2D_2: color = @get_color(32);
-
-// Color along which we send a wavelet to pe_program
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-param txdir: direction;
-
-const max_fifo_len = 256*20; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-var fifo2_buffer = @zeros([max_fifo_len]u32);
-const fifo2 = @allocate_fifo(fifo2_buffer);
-
-// length=inf
-var fab_recv_wdsd_1 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_1,
-   .input_queue = @get_input_queue(1)
-});
-
-// length=inf
-var fab_trans_wdsd_1 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_1,
-    .output_queue = @get_output_queue(2)
-});
-
-// length=inf
-var fab_recv_wdsd_2 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_2,
-   .input_queue = @get_input_queue(3)
-});
-
-// length=inf
-var fab_trans_wdsd_2 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_2,
-    .output_queue = @get_output_queue(4)
-});
-
-// if no user's color is defined, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYH2D_1) < 24) and (@get_int(USER_IN_1) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo1, fab_recv_wdsd_1, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_1, fifo1, .{.async=true} );
-    }
-
-    if ( (@get_int(MEMCPYH2D_2) < 24) and (@get_int(USER_IN_2) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo2, fab_recv_wdsd_2, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_2, fifo2, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_IN_1) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_1, .{ .routes = h2d_route });
-    }
-    if (@get_int(USER_IN_2) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_2, .{ .routes = h2d_route });
-    }
-}
diff --git a/tutorials/topic-07-filters/memcpyEdge/memcpy_edge.csl b/tutorials/topic-07-filters/memcpyEdge/memcpy_edge.csl
deleted file mode 100644
index 5ebfd5f..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/memcpy_edge.csl
+++ /dev/null
@@ -1,102 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// This is a template of memcpy over the edges.
-// memcpy_edge.csl can be "north", "south", "west" or "east"
-// of the following layout.
-//        +---------+
-//        |  north  |
-// +------+---------+------+
-// | west |  core   | east |
-// +------+---------+------+
-//        |  south  |
-//        +---------+
-// north.csl, south.csl, west.csl and east.csl instantiate
-// memcpy_edge.csl with a proper direction.
-//
-// memcpy_edge.csl supports 2 streaming H2Ds and one
-// streaming D2H. Such constraint depends on the design.
-// The current implementation binds a FIFO for a H2D or D2H,
-// so we can only support 3 in total.
-// We choose 2 H2Ds and 1 D2H.
-// if we replace FIFO by WTT, we could support more.
-//
-// However the user can instantiate memcpy_edge.csl for each
-// edge. The maximum number of H2Ds is 2*4 = 8 and maximum
-// number of D2Hs is 1*4 = 4.
-//
-// If the user only has a H2D at north, for example, he only
-// needs to configure color USER_IN_1, i.e. only a single
-// streaming H2D is used.
-//
-// For example,
-//   @set_tile_code(pe_x, 0, "north.csl", .{
-//      .USER_IN_1 = mainColor,
-//      .STARTUP = STARTUP,
-//      .memcpy_params = memcpy_params,
-//      .MEMCPYH2D_DATA_1 = MEMCPYH2D_DATA_1,
-//      .MEMCPYD2H_DATA_1 = MEMCPYD2H_DATA_1
-//    });
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-param memcpy_params: comptime_struct;
-
-// The direction of "core", for example
-// north.csl has dir = SOUTH
-// south.csl has dir = NORTH
-// west.csl has dir = EAST
-// east.csl has dir = WEST
-param dir: direction;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const h2d_mod = @import_module("h2d.csl", .{
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .MEMCPYH2D_1 = memcpy_params.MEMCPYH2D_1,
-     .MEMCPYH2D_2 = memcpy_params.MEMCPYH2D_2,
-     .txdir = dir
-      });
-
-const d2h_mod = @import_module("d2h.csl", .{
-     .USER_OUT_1 = USER_OUT_1,
-     .MEMCPYD2H_1 = memcpy_params.MEMCPYD2H_1,
-     .rxdir = dir
-      });
-
-task f_startup() void {
-    h2d_mod.f_startup();
-    d2h_mod.f_startup();
-}
-
-comptime {
-    @bind_local_task(f_startup, STARTUP);
-    @activate(STARTUP);
-}
diff --git a/tutorials/topic-07-filters/memcpyEdge/north.csl b/tutorials/topic-07-filters/memcpyEdge/north.csl
deleted file mode 100644
index 1452245..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/north.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = SOUTH
-      });
diff --git a/tutorials/topic-07-filters/memcpyEdge/south.csl b/tutorials/topic-07-filters/memcpyEdge/south.csl
deleted file mode 100644
index 11b4c43..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/south.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = NORTH
-      });
diff --git a/tutorials/topic-07-filters/memcpyEdge/west.csl b/tutorials/topic-07-filters/memcpyEdge/west.csl
deleted file mode 100644
index 5c7d21a..0000000
--- a/tutorials/topic-07-filters/memcpyEdge/west.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = EAST
-      });
diff --git a/tutorials/topic-07-filters/recv.csl b/tutorials/topic-07-filters/recv.csl
deleted file mode 100644
index 8650999..0000000
--- a/tutorials/topic-07-filters/recv.csl
+++ /dev/null
@@ -1,80 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param memcpy_params: comptime_struct;
-
-param peId: u16;
-
-// Colors
-param recvColor:        color;
-param resultColor:      color;
-
-// Task IDs
-param recv_task_id: data_task_id; // data task receives data along recvColor
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-/// The recipient simply halves the value in the incoming wavelet and sends the
-/// result to the north neighbor (halo PE).
-var buf = @zeros([1]f16);
-task recvTask(data: f16) void {
-  @block(recvColor);
-  buf[0] = data / 2.0;
-  const outDsd = @get_dsd(fabout_dsd, .{
-    .extent = 1,
-    .fabric_color = resultColor,
-    .output_queue = @get_output_queue(1)
-  });
-  const bufDsd = @get_dsd(mem1d_dsd, .{
-    .tensor_access = |i|{1} -> buf[i]
-  });
-  // WARNING: nonblock is necessary otherwise CE has no resource
-  // to run memcpy kernel
-  @fmovh(outDsd, bufDsd, .{.async = true, .unblock = recv_task_id});
-}
-
-comptime {
-  @bind_data_task(recvTask, recv_task_id);
-
-  const baseRoute = .{
-    .rx = .{ WEST }
-  };
-
-  const filter = .{
-      // Each PE should only accept three wavelets starting with the one whose
-      // index field contains the value peId * 3.
-      .kind = .{ .range = true },
-      .min_idx = peId * 3,
-      .max_idx = peId * 3 + 2,
-    };
-
-  if (peId == 3) {
-    // This is the last PE, don't forward the wavelet further to the east.
-    const txRoute = @concat_structs(baseRoute, .{ .tx = .{ RAMP } });
-    @set_local_color_config(recvColor, .{.routes = txRoute, .filter = filter});
-  } else {
-    // Otherwise, forward incoming wavelets to both CE and to the east neighbor.
-    const txRoute = @concat_structs(baseRoute, .{ .tx = .{ RAMP, EAST } });
-    @set_local_color_config(recvColor, .{.routes = txRoute, .filter = filter});
-  }
-
-  // Send result wavelets to the north neighbor (i.e. the halo PEs).
-  @set_local_color_config(resultColor, .{ .routes = .{ .rx = .{ RAMP }, .tx = .{ NORTH } } });
-}
diff --git a/tutorials/topic-07-filters/run.py b/tutorials/topic-07-filters/run.py
deleted file mode 100644
index 5efeebc..0000000
--- a/tutorials/topic-07-filters/run.py
+++ /dev/null
@@ -1,57 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.sdk_utils import memcpy_view
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-print("step 1: streaming D2H at P0.0")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(4*3, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, MEMCPYD2H_DATA_1, 0, 0, 4, 1, 3, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-result = memcpy_view(out_tensors_u32, np.dtype(np.float16))
-
-runner.stop()
-
-oracle = [5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5]
-np.testing.assert_allclose(result, oracle, atol=0.0001, rtol=0)
-print("SUCCESS!")
diff --git a/tutorials/topic-07-filters/send.csl b/tutorials/topic-07-filters/send.csl
deleted file mode 100644
index 168f34c..0000000
--- a/tutorials/topic-07-filters/send.csl
+++ /dev/null
@@ -1,103 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param memcpy_params: comptime_struct;
-
-param peId: u16;
-
-// Colors
-param exchColor:        color;
-param resultColor:      color;
-
-// Task IDs
-param main_task_id: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-/// Helper function to pack 16-bit index and 16-bit float value into one 32-bit
-/// wavelet.
-fn pack(index: u16, data: f16) u32 {
-  return (@as(u32, index) << 16) | @as(u32, @bitcast(u16, data));
-}
-
-const size = 12;
-const data = [size]u32 {
-  pack(0, 10.0),  pack( 1, 11.0), pack( 2, 12.0),
-  pack(3, 13.0),  pack( 4, 14.0), pack( 5, 15.0),
-  pack(6, 16.0),  pack( 7, 17.0), pack( 8, 18.0),
-  pack(9, 19.0),  pack(10, 20.0), pack(11, 21.0),
-};
-
-/// Function to send all data values to all east neighbors.
-fn sendDataToEastTiles() void {
-  const inDsd = @get_dsd(mem1d_dsd, .{
-    .tensor_access = |i|{size} -> data[i]
-  });
-
-  const outDsd = @get_dsd(fabout_dsd, .{
-    .extent = size,
-    .fabric_color = exchColor,
-    .output_queue = @get_output_queue(2)
-  });
-
-  // WARNING: "async" is necessary otherwise CE has no resource
-  // to run memcpy kernel
-  @mov32(outDsd, inDsd, .{.async=true});
-}
-
-/// Function to process (divide by 2) the first three values and send result to
-/// the north neighbor (halo PE).
-const num_wvlts: u16 = 3;
-var buf = @zeros([num_wvlts]f16);
-var ptr_buf : [*]f16 = &buf;
-
-fn processAndSendSubset() void {
-  const outDsd = @get_dsd(fabout_dsd, .{
-    .extent = num_wvlts,
-    .fabric_color = resultColor,
-    .output_queue = @get_output_queue(1)
-  });
-  const bufDsd = @get_dsd(mem1d_dsd, .{
-    .tensor_access = |i|{num_wvlts} -> buf[i]
-  });
-
-  var idx: u16 = 0;
-  while (idx < num_wvlts) : (idx += 1) {
-    const payload = @as(u16, data[idx] & 0xffff);
-    const floatValue = @bitcast(f16, payload);
-    buf[idx] = floatValue / 2.0;
-  }
-  // WARNING: nonblock is necessary otherwise CE has no resource
-  // to run memcpy kernel
-  @fmovh(outDsd, bufDsd, .{.async = true});
-}
-
-task mainTask() void {
-  sendDataToEastTiles();
-  processAndSendSubset();
-}
-
-comptime {
-  @activate(main_task_id);
-  @bind_local_task(mainTask, main_task_id);
-
-  @set_local_color_config(exchColor, .{ .routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } } });
-  @set_local_color_config(resultColor, .{ .routes = .{ .rx = .{ RAMP }, .tx = .{ NORTH } } });
-}
diff --git a/tutorials/topic-07-switches-entrypt/commands.sh b/tutorials/topic-07-switches-entrypt/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-07-switches-entrypt/commands.sh
rename to tutorials/topic-07-switches-entrypt/commands_wse2.sh
diff --git a/tutorials/topic-07-switches-entrypt/commands_wse3.sh b/tutorials/topic-07-switches-entrypt/commands_wse3.sh
new file mode 100755
index 0000000..2fa2c6a
--- /dev/null
+++ b/tutorials/topic-07-switches-entrypt/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=10,5 --fabric-offsets=4,1 -o out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/topic-08-fifos/README.rst b/tutorials/topic-08-fifos/README.rst
deleted file mode 100644
index 0d819b9..0000000
--- a/tutorials/topic-08-fifos/README.rst
+++ /dev/null
@@ -1,23 +0,0 @@
-Topic 8: FIFOs
-==============
-
-A FIFO DSD is useful to buffer input going into or out of a PE, as a way to
-extend the small hardware queues used for fabric communication. In particular,
-this may prevent stalls in the communication fabric when input or output
-happens in bursts. It is also possible to operate on the values while they flow
-through the FIFO, as this code sample demonstrates.
-
-This example illustrates a typical pattern in the use of FIFOs, where a
-receiver receives wavelets from the fabric and forwards them to a task that
-performs some computation. Specifically, incoming data from the host is stored
-in the FIFO, thus relieving the sender from being blocked until the receiver
-has received all wavelets. While the incoming wavelets are being asynchronously
-received into the FIFO buffer, we also start a second asynchronous DSD
-operation that pulls data from the FIFO and forwards it to a wavelet-triggered
-task.
-
-This example also illustrates another common pattern, where a PE starts a
-wavelet-triggered task using its own wavelets, by sending them to the router
-which immediately sends them back to the compute element. In our example, this
-wavelet-triggered task simply computes the cube of the wavelet's data, before
-sending the result to the host.
diff --git a/tutorials/topic-08-fifos/buffer.csl b/tutorials/topic-08-fifos/buffer.csl
deleted file mode 100644
index 3005ef7..0000000
--- a/tutorials/topic-08-fifos/buffer.csl
+++ /dev/null
@@ -1,78 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param memcpy_params: comptime_struct;
-
-param num_elements_to_process: i16;
-
-// Colors
-param in_color:         color;
-param out_color:        color;
-param result_color:     color;
-
-// Task IDs
-param process_task_id: data_task_id;  // Data task process_task triggered by out_color wlts
-param main_task_id:    local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-var fifo_buffer = @zeros([1024]f16);
-const fifo = @allocate_fifo(fifo_buffer);
-
-const in_queue = @get_input_queue(0);
-const in_dsd = @get_dsd(fabin_dsd, .{.extent = num_elements_to_process,
-                                     .fabric_color = in_color,
-                                     .input_queue = in_queue});
-comptime {
-  @set_local_color_config(in_color, .{.routes = .{.rx = .{WEST}, .tx = .{RAMP}}});
-}
-
-const out_queue = @get_output_queue(1);
-const out_dsd = @get_dsd(fabout_dsd, .{.extent = num_elements_to_process,
-                                       .fabric_color = out_color,
-                                       .output_queue = out_queue});
-
-const ten = [1]f16 {10.0};
-const dsd_ten = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{num_elements_to_process} -> ten[0]});
-
-task main_task() void {
-  // Move from the fabric to the FIFO
-  // adding 10.0 to each element at the same time
-  @faddh(fifo, in_dsd, dsd_ten, .{.async = true});
-
-  // Move from the FIFO to a process_task
-  // negating values at the same time
-  @fnegh(out_dsd, fifo, .{.async = true});
-}
-
-const result_dsd = @get_dsd(fabout_dsd, .{.extent = 1, .fabric_color = result_color});
-
-task process_task(element:f16) void {
-  @fmovh(result_dsd, element * element * element);
-}
-
-comptime {
-  @bind_data_task(process_task, process_task_id); // data task receives wlts along out_color
-  @bind_local_task(main_task, main_task_id);
-  @activate(main_task_id);
-
-  @set_local_color_config(out_color, .{.routes = .{.rx = .{RAMP}, .tx = .{RAMP}}});
-  @set_local_color_config(result_color, .{.routes = .{.rx = .{RAMP}, .tx = .{EAST}}});
-}
diff --git a/tutorials/topic-08-fifos/layout.csl b/tutorials/topic-08-fifos/layout.csl
deleted file mode 100644
index 04e1702..0000000
--- a/tutorials/topic-08-fifos/layout.csl
+++ /dev/null
@@ -1,88 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var           ID var      ID var                ID var
-//   0 in_color       9 STARTUP  18                    27 reserved (memcpy)
-//   1 out_color     10          19                    28 reserved (memcpy)
-//   2               11          20                    29 reserved
-//   3 result_color  12          21 reserved (memcpy)  30 reserved (memcpy)
-//   4 H2D           13          22 reserved (memcpy)  31 reserved
-//   5 D2H           14          23 reserved (memcpy)  32
-//   6               15          24                    33
-//   7               16          25                    34
-//   8 main_task_id  17          26                    35
-//
-
-//  +------+------+------+
-//  | west | core | east |
-//  +------+------+------+
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-param num_elements_to_process: i16;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-const in_color:         color = @get_color(0);
-const out_color:        color = @get_color(1);
-const result_color:     color = @get_color(3);
-
-// Task IDs
-const main_task_id:    local_task_id = @get_local_task_id(8);
-const STARTUP:         local_task_id = @get_local_task_id(9);
-const process_task_id: data_task_id  = @get_data_task_id(out_color);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-    .width = 3,
-    .height = 1,
-    .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-    .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-    });
-
-
-layout {
-  @set_rectangle(3,1);
-
-  // west.csl has a H2D
-  const memcpy_params_0 = memcpy.get_params(0);
-  @set_tile_code(0, 0, "memcpyEdge/west.csl", .{
-    .memcpy_params = memcpy_params_0,
-    .USER_IN_1 = in_color,
-    .STARTUP = STARTUP
-  });
-
-  const memcpy_params_1 = memcpy.get_params(1);
-  @set_tile_code(1, 0, "buffer.csl", .{
-    .memcpy_params = memcpy_params_1,
-    .in_color = in_color,
-    .out_color = out_color,
-    .result_color = result_color,
-    .main_task_id = main_task_id,
-    .process_task_id = process_task_id,
-    .num_elements_to_process = num_elements_to_process
-  });
-
-  // east.csl only has a D2H
-  const memcpy_params_2 = memcpy.get_params(2);
-  @set_tile_code(2, 0, "memcpyEdge/east.csl", .{
-    .memcpy_params = memcpy_params_2,
-    .USER_OUT_1 = result_color,
-    .STARTUP = STARTUP
-  });
-}
diff --git a/tutorials/topic-08-fifos/memcpyEdge/d2h.csl b/tutorials/topic-08-fifos/memcpyEdge/d2h.csl
deleted file mode 100644
index 1224c27..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/d2h.csl
+++ /dev/null
@@ -1,61 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// One streaming D2H:
-// 1st D2H: UT 5 and UT 6
-
-param MEMCPYD2H_1: color = @get_color(32);
-
-// Color along which we expect a wavelet
-param USER_OUT_1: color = @get_color(32);
-
-param rxdir: direction;
-
-const max_fifo_len = 256*40; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-// length=inf
-var fab_recv_wdsd = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = USER_OUT_1,
-   .input_queue = @get_input_queue(6)
-});
-
-// length=inf
-var fab_trans_wdsd = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = MEMCPYD2H_1,
-    .output_queue = @get_output_queue(5)
-});
-
-// if USER_OUT_1 is not valid, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYD2H_1) < 24) and (@get_int(USER_OUT_1) < 24) ){
-        // receive data from USER_OUT_1
-        @mov32(fifo1, fab_recv_wdsd, .{.async=true} );
-
-        // forward data to MEMCPYD2H_1
-        @mov32(fab_trans_wdsd, fifo1, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_OUT_1) < 24){
-        const d2h_route = .{ .rx = .{ rxdir }, .tx = .{ RAMP } };
-        @set_local_color_config(USER_OUT_1, .{ .routes = d2h_route });
-    }
-}
diff --git a/tutorials/topic-08-fifos/memcpyEdge/east.csl b/tutorials/topic-08-fifos/memcpyEdge/east.csl
deleted file mode 100644
index 7303d8c..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/east.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = WEST
-      });
diff --git a/tutorials/topic-08-fifos/memcpyEdge/h2d.csl b/tutorials/topic-08-fifos/memcpyEdge/h2d.csl
deleted file mode 100644
index 017c3df..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/h2d.csl
+++ /dev/null
@@ -1,93 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// Two streaming H2Ds:
-// 1st H2D: UT 1 and UT 2
-// 2nd H2D: UT 3 and UT 4
-
-param MEMCPYH2D_1: color = @get_color(32);
-param MEMCPYH2D_2: color = @get_color(32);
-
-// Color along which we send a wavelet to pe_program
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-param txdir: direction;
-
-const max_fifo_len = 256*20; // maximum length of the fifo
-
-var fifo1_buffer = @zeros([max_fifo_len]u32);
-const fifo1 = @allocate_fifo(fifo1_buffer);
-
-var fifo2_buffer = @zeros([max_fifo_len]u32);
-const fifo2 = @allocate_fifo(fifo2_buffer);
-
-// length=inf
-var fab_recv_wdsd_1 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_1,
-   .input_queue = @get_input_queue(1)
-});
-
-// length=inf
-var fab_trans_wdsd_1 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_1,
-    .output_queue = @get_output_queue(2)
-});
-
-// length=inf
-var fab_recv_wdsd_2 = @get_dsd(fabin_dsd, .{
-   .extent = 0x7fff,
-   .fabric_color = MEMCPYH2D_2,
-   .input_queue = @get_input_queue(3)
-});
-
-// length=inf
-var fab_trans_wdsd_2 = @get_dsd(fabout_dsd, .{
-    .extent = 0x7fff,
-    .fabric_color = USER_IN_2,
-    .output_queue = @get_output_queue(4)
-});
-
-// if no user's color is defined, f_startup() is empty
-fn f_startup() void {
-    if ( (@get_int(MEMCPYH2D_1) < 24) and (@get_int(USER_IN_1) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo1, fab_recv_wdsd_1, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_1, fifo1, .{.async=true} );
-    }
-
-    if ( (@get_int(MEMCPYH2D_2) < 24) and (@get_int(USER_IN_2) < 24) ){
-        // receive data from streaming H2D
-        @mov32(fifo2, fab_recv_wdsd_2, .{.async=true} );
-
-        // forward data to USER_IN_1
-        @mov32(fab_trans_wdsd_2, fifo2, .{.async=true} );
-    }
-}
-
-comptime {
-    if (@get_int(USER_IN_1) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_1, .{ .routes = h2d_route });
-    }
-    if (@get_int(USER_IN_2) < 24){
-        const h2d_route = .{ .rx = .{ RAMP }, .tx = .{ txdir } };
-        @set_local_color_config(USER_IN_2, .{ .routes = h2d_route });
-    }
-}
diff --git a/tutorials/topic-08-fifos/memcpyEdge/memcpy_edge.csl b/tutorials/topic-08-fifos/memcpyEdge/memcpy_edge.csl
deleted file mode 100644
index 5ebfd5f..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/memcpy_edge.csl
+++ /dev/null
@@ -1,102 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// This is a template of memcpy over the edges.
-// memcpy_edge.csl can be "north", "south", "west" or "east"
-// of the following layout.
-//        +---------+
-//        |  north  |
-// +------+---------+------+
-// | west |  core   | east |
-// +------+---------+------+
-//        |  south  |
-//        +---------+
-// north.csl, south.csl, west.csl and east.csl instantiate
-// memcpy_edge.csl with a proper direction.
-//
-// memcpy_edge.csl supports 2 streaming H2Ds and one
-// streaming D2H. Such constraint depends on the design.
-// The current implementation binds a FIFO for a H2D or D2H,
-// so we can only support 3 in total.
-// We choose 2 H2Ds and 1 D2H.
-// if we replace FIFO by WTT, we could support more.
-//
-// However the user can instantiate memcpy_edge.csl for each
-// edge. The maximum number of H2Ds is 2*4 = 8 and maximum
-// number of D2Hs is 1*4 = 4.
-//
-// If the user only has a H2D at north, for example, he only
-// needs to configure color USER_IN_1, i.e. only a single
-// streaming H2D is used.
-//
-// For example,
-//   @set_tile_code(pe_x, 0, "north.csl", .{
-//      .USER_IN_1 = mainColor,
-//      .STARTUP = STARTUP,
-//      .memcpy_params = memcpy_params,
-//      .MEMCPYH2D_DATA_1 = MEMCPYH2D_DATA_1,
-//      .MEMCPYD2H_DATA_1 = MEMCPYD2H_DATA_1
-//    });
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-// ----------
-// Every PE needs to import memcpy module otherwise the I/O cannot
-// propagate the data to the destination.
-
-param memcpy_params: comptime_struct;
-
-// The direction of "core", for example
-// north.csl has dir = SOUTH
-// south.csl has dir = NORTH
-// west.csl has dir = EAST
-// east.csl has dir = WEST
-param dir: direction;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-// ----------
-
-const h2d_mod = @import_module("h2d.csl", .{
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .MEMCPYH2D_1 = memcpy_params.MEMCPYH2D_1,
-     .MEMCPYH2D_2 = memcpy_params.MEMCPYH2D_2,
-     .txdir = dir
-      });
-
-const d2h_mod = @import_module("d2h.csl", .{
-     .USER_OUT_1 = USER_OUT_1,
-     .MEMCPYD2H_1 = memcpy_params.MEMCPYD2H_1,
-     .rxdir = dir
-      });
-
-task f_startup() void {
-    h2d_mod.f_startup();
-    d2h_mod.f_startup();
-}
-
-comptime {
-    @bind_local_task(f_startup, STARTUP);
-    @activate(STARTUP);
-}
diff --git a/tutorials/topic-08-fifos/memcpyEdge/north.csl b/tutorials/topic-08-fifos/memcpyEdge/north.csl
deleted file mode 100644
index 1452245..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/north.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = SOUTH
-      });
diff --git a/tutorials/topic-08-fifos/memcpyEdge/south.csl b/tutorials/topic-08-fifos/memcpyEdge/south.csl
deleted file mode 100644
index 11b4c43..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/south.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = NORTH
-      });
diff --git a/tutorials/topic-08-fifos/memcpyEdge/west.csl b/tutorials/topic-08-fifos/memcpyEdge/west.csl
deleted file mode 100644
index 5c7d21a..0000000
--- a/tutorials/topic-08-fifos/memcpyEdge/west.csl
+++ /dev/null
@@ -1,35 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-
-// send data to the "core"
-param USER_IN_1: color = @get_color(32);
-param USER_IN_2: color = @get_color(32);
-
-// receive data from the "core"
-param USER_OUT_1: color = @get_color(32);
-
-// entrypoint
-param STARTUP: local_task_id;
-
-param memcpy_params: comptime_struct;
-
-const edge_mod = @import_module( "memcpy_edge.csl", .{
-     .memcpy_params = memcpy_params,
-     .USER_IN_1 = USER_IN_1,
-     .USER_IN_2 = USER_IN_2,
-     .USER_OUT_1 = USER_OUT_1,
-     .STARTUP = STARTUP,
-     .dir = EAST
-      });
diff --git a/tutorials/topic-08-fifos/run.py b/tutorials/topic-08-fifos/run.py
deleted file mode 100644
index 33921f5..0000000
--- a/tutorials/topic-08-fifos/run.py
+++ /dev/null
@@ -1,87 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.sdk_utils import memcpy_view, input_array_to_u32
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYH2D_DATA_1 = int(params["MEMCPYH2D_DATA_1_ID"])
-MEMCPYD2H_DATA_1 = int(params["MEMCPYD2H_DATA_1_ID"])
-size = int(params["num_elements_to_process"])
-print(f"MEMCPYH2D_DATA_1 = {MEMCPYH2D_DATA_1}")
-print(f"MEMCPYD2H_DATA_1 = {MEMCPYD2H_DATA_1}")
-print(f"size = {size}")
-
-# maximum length of the fifo
-max_fifo_len = 256*20
-print(f"maximum size of the buffer in the artificial halo is {max_fifo_len}")
-assert size < max_fifo_len, "input size exceeds max. capacity, may stall"
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-runner.load()
-runner.run()
-
-np.random.seed(seed=7)
-
-input_tensor = np.random.random(size).astype(np.float16)
-print("step 1: streaming H2D to P0.0")
-# "input_tensor" is an 1d array
-# The type of input_tensor is f16, we need to extend it to uint32
-# There are two kind of extension when using the utility function input_array_to_u32
-#    input_array_to_u32(np_arr: np.ndarray, sentinel: Optional[int], fast_dim_sz: int)
-# 1) zero extension:
-#    sentinel = None
-# 2) upper 16-bit is the index of the array:
-#    sentinel is Not None
-#
-# In this example, the upper 16-bit is don't care because buffer.csl only
-# reads lower 16-bit
-tensors_u32 = input_array_to_u32(input_tensor, 1, size)
-runner.memcpy_h2d(MEMCPYH2D_DATA_1, tensors_u32, 0, 0, 1, 1, size, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
-
-print("step 2: streaming D2H at P2.0")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(size, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, MEMCPYD2H_DATA_1, 2, 0, 1, 1, size, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-result_tensor = memcpy_view(out_tensors_u32, np.dtype(np.float16))
-
-runner.stop()
-
-add_ten_negate = -(input_tensor + 10.0)
-expected = add_ten_negate * add_ten_negate * add_ten_negate
-
-np.testing.assert_equal(result_tensor, expected)
-print("SUCCESS!")
diff --git a/tutorials/topic-08-filters/commands.sh b/tutorials/topic-08-filters/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-08-filters/commands.sh
rename to tutorials/topic-08-filters/commands_wse2.sh
diff --git a/tutorials/topic-08-filters/commands_wse3.sh b/tutorials/topic-08-filters/commands_wse3.sh
new file mode 100755
index 0000000..20f75a3
--- /dev/null
+++ b/tutorials/topic-08-filters/commands_wse3.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 --fabric-offsets=4,1 -o out \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/topic-09-fifos/commands.sh b/tutorials/topic-09-fifos/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-09-fifos/commands.sh
rename to tutorials/topic-09-fifos/commands_wse2.sh
diff --git a/tutorials/topic-08-fifos/commands.sh b/tutorials/topic-09-fifos/commands_wse3.sh
similarity index 63%
rename from tutorials/topic-08-fifos/commands.sh
rename to tutorials/topic-09-fifos/commands_wse3.sh
index 0002459..5232067 100755
--- a/tutorials/topic-08-fifos/commands.sh
+++ b/tutorials/topic-09-fifos/commands_wse3.sh
@@ -2,11 +2,11 @@
 
 set -e
 
-cslc ./layout.csl \
---fabric-dims=10,3 --fabric-offsets=4,1 \
---params=num_elements_to_process:2048 \
--o out \
+cslc --arch=wse3 ./layout.csl \
+--fabric-dims=8,3 --fabric-offsets=4,1 \
+--params=num_elems_to_process:512 \
 --params=MEMCPYH2D_DATA_1_ID:4 \
 --params=MEMCPYD2H_DATA_1_ID:5 \
+-o out \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
 cs_python run.py --name out
diff --git a/tutorials/topic-09-map-builtin/README.rst b/tutorials/topic-09-map-builtin/README.rst
deleted file mode 100644
index 0c52310..0000000
--- a/tutorials/topic-09-map-builtin/README.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-
-Topic 9: @map Builtin
-=====================
-
-The ``@map`` builtin can be used to perform custom operations on the data
-elements of one or more DSDs. In other words, it is a
-*customizable DSD operation* that allows us to go beyond the
-:ref:`fixed list <language-builtins-for-dsd-operations>` of
-natively supported DSD operations.
-
-This example demonstrates three use-cases of the ``@map`` builtin:
-
-1. In the first use-case, ``@map`` is used to compute the square-root of the
-   diagonal elements of a 2D tensor.
-2. In the second use-case ``@map`` is used to perform a custom calculation with
-   a mix of input DSDs of various kinds (``mem1d_dsd`` and ``fabin_dsd``) and
-   scalar values while the result is stored to a ``mem1d_dsd``. It shows how we
-   can use arbitrary callbacks combined with a variety of input and output DSDs.
-3. Finally, we demonstrate how ``@map`` can be used to compute a reduction like
-   the sum of all elements in a tensor.
-
-Without ``@map``, we would have to write explicit loops iterating over each
-element involved in these computations. With ``@map`` we can avoid writing such
-loops by utilizing the DSD descriptions which specify the loop structure
-implicitly. Since DSDs are supported natively by the hardware, using ``@map``
-can lead to significant performance gains compared to writing explicit loops.
diff --git a/tutorials/topic-09-map-builtin/layout.csl b/tutorials/topic-09-map-builtin/layout.csl
deleted file mode 100644
index 78c5ea8..0000000
--- a/tutorials/topic-09-map-builtin/layout.csl
+++ /dev/null
@@ -1,62 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var           ID var      ID var                ID var
-//   0 H2D            9          18                    27 reserved (memcpy)
-//   1 D2H           10          19                    28 reserved (memcpy)
-//   2               11          20                    29 reserved
-//   3               12          21 reserved (memcpy)  30 reserved (memcpy)
-//   4               13          22 reserved (memcpy)  31 reserved
-//   5               14          23 reserved (memcpy)  32
-//   6               15          24                    33
-//   7               16          25                    34
-//   8 main_task_id  17          26                    35
-//
-
-param size: i16;
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-
-// Task IDs
-const main_task_id: local_task_id = @get_local_task_id(8);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-  .width = 1,
-  .height = 1,
-  .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-  .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-});
-
-layout {
-  @set_rectangle(1, 1);
-
-  @set_tile_code(0, 0, "pe_program.csl", .{
-    .memcpy_params = memcpy.get_params(0),
-    .main_task_id = main_task_id,
-    .size = size,
-  });
-
-  // export symbol name
-  @export_name("weight", [*]f16, true);
-  @export_name("sqrt_diag_A", [*]f16, true);
-  @export_name("f_run", fn()void);
-}
diff --git a/tutorials/topic-09-map-builtin/pe_program.csl b/tutorials/topic-09-map-builtin/pe_program.csl
deleted file mode 100644
index 9168f66..0000000
--- a/tutorials/topic-09-map-builtin/pe_program.csl
+++ /dev/null
@@ -1,85 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is layout.csl.
-param memcpy_params: comptime_struct;
-
-param size: i16;
-
-// Task IDs
-param main_task_id: local_task_id;
-
-// memcpy module reserves input queue 0 and output queue 0
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-
-export const A = @constants([size, size]f16, 42.0);
-const B = [size]i16{10, 20, 30, 40, 50};
-
-const math_lib = @import_module("<math>");
-
-var sqrt_diag_A = @zeros([size]f16);
-var weight = @zeros([size]f16);
-
-var ptr_weight: [*]f16 = &weight;
-var ptr_sqrt_diag_A: [*]f16 = &sqrt_diag_A;
-
-// The loop structure is implicitly specified by the memory DSD descriptions
-const dsdA = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{size} -> A[i, i]});
-const dsdB = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{size} -> B[i]});
-
-const dsd_sqrt_diag_A = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{size} -> sqrt_diag_A[i]});
-const dsd_weight = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{size} -> weight[i]});
-
-export var sum : i16 = 0;
-
-fn transformation(value : f16, coeff1 : f16, coeff2 : f16, weight : f16) f16 {
-  return value * (coeff1 + weight) + value * (coeff2 + weight);
-}
-
-fn reduction(value : i16, sum : *i16) i16 {
-  return sum.* + value;
-}
-
-task main_task() void {
-  // Compute the square root of each element of `dsdA` and
-  // write it to `dsd_sqrt_diag_A`.
-  //
-  // Notice how we avoid writing an explicit loop and rely
-  // on the DSD description instead.
-  @map(math_lib.sqrt_f16, dsdA, dsd_sqrt_diag_A);
-
-  // Transform tensor A in-place through a custom calculation.
-  @map(transformation, dsdA, 2.0, 6.0, dsd_weight, dsdA);
-
-  // Compute the sum of all elements in tensor B.
-  @map(reduction, dsdB, &sum, &sum);
-
-  // WARNING: the user must unblock cmd color for every PE
-  sys_mod.unblock_cmd_stream();
-}
-
-comptime {
-  @bind_local_task(main_task, main_task_id);
-}
-
-fn f_run() void {
-  @activate(main_task_id);
-  // terminate when main_task() finishes
-}
-
-comptime{
-  @export_symbol(ptr_weight, "weight");
-  @export_symbol(ptr_sqrt_diag_A, "sqrt_diag_A");
-  @export_symbol(f_run);
-}
diff --git a/tutorials/topic-09-map-builtin/run.py b/tutorials/topic-09-map-builtin/run.py
deleted file mode 100644
index 1782314..0000000
--- a/tutorials/topic-09-map-builtin/run.py
+++ /dev/null
@@ -1,102 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.debug.debug_util import debug_util
-from cerebras.sdk.sdk_utils import memcpy_view, input_array_to_u32
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument('--cmaddr', help='IP:port for CS system')
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-size = int(params["size"])
-print(f"size = {size}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-sym_weight = runner.get_id("weight")
-sym_sqrt_diag_A = runner.get_id("sqrt_diag_A")
-
-runner.load()
-runner.run()
-
-A = np.array([[42.0, 42.0, 42.0, 42.0, 42.0],
-              [42.0, 42.0, 42.0, 42.0, 42.0],
-              [42.0, 42.0, 42.0, 42.0, 42.0],
-              [42.0, 42.0, 42.0, 42.0, 42.0],
-              [42.0, 42.0, 42.0, 42.0, 42.0]]).astype(np.float16)
-B = np.array([10, 20, 30, 40, 50]).astype(np.int16)
-
-def transformation(value: np.array, coeff1: float, coeff2: float, weight: np.array):
-  return np.multiply(value, coeff1 + weight) + np.multiply(value, coeff2 + weight)
-
-def reduction(array):
-  return sum(array)
-
-np.random.seed(seed=7)
-
-print("step 1: copy mode H2D")
-weights = np.random.random(size).astype(np.float16)
-tensors_u32 = input_array_to_u32(weights, 0, size)
-runner.memcpy_h2d(sym_weight, tensors_u32, 0, 0, 1, 1, size, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
-
-print("step 2: call f_run to test @map")
-runner.launch("f_run", nonblock=False)
-
-print("step 3: copy mode D2H")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(size, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, sym_sqrt_diag_A, 0, 0, 1, 1, size, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# remove upper 16-bit of each u32
-sqrt_result = memcpy_view(out_tensors_u32, np.dtype(np.float16))
-
-runner.stop()
-
-expected = np.sqrt(np.diag(A))
-np.testing.assert_equal(sqrt_result, expected)
-
-debug_mod = debug_util(dirname, cmaddr=args.cmaddr)
-core_offset_x = 4
-core_offset_y = 1
-print(f"=== dump core: core rectangle starts at {core_offset_x}, {core_offset_y}")
-
-# Transformation example
-expected = transformation(np.diag(A), 2.0, 6.0, weights)
-np.fill_diagonal(A, expected)
-actual = debug_mod.get_symbol(core_offset_x, core_offset_y, "A", np.float16)
-np.testing.assert_equal(actual.reshape((5, 5)), A)
-
-# Reduction example
-sum_result = np.array([reduction(B)], dtype=np.int16)
-expected = debug_mod.get_symbol(core_offset_x, core_offset_y, "sum", np.int16)
-np.testing.assert_equal(sum_result, expected)
-
-print("SUCCESS!")
diff --git a/tutorials/topic-10-collectives/README.rst b/tutorials/topic-10-collectives/README.rst
deleted file mode 100644
index 0a81a4d..0000000
--- a/tutorials/topic-10-collectives/README.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-
-Topic 10: Collective Communications
-===================================
-
-The ``<collectives_2d>`` library can be used for communication between PEs in
-the same row or column. It mimics the capabilities provided by
-`message passing interface <https://www.open-mpi.org/>`_ (MPI)
-collective operations found in other programming languages.
-
-This example showcases each of the currently available communication primitives
-while using the library across two independent dimensions. The communication
-tasks are executed asynchronously.
-
-``task_x`` uses the ``broadcast`` primitive to transmit data from the first PE
-in every row to every other PE in the same row. After the data is received,
-``reduce_fadds`` computes the vector sum of the ``broadcast_recv``. The result
-is transmitted back to the first PE in every row.
-
-``task_y`` operates concurrently along every column of PEs. The task first
-uses ``scatter`` to distribute ``chunk_size`` slices of ``scatter_data``
-across every other PE in the same column. The task uses ``gather`` to collect
-``chunk_size`` slices of data distributed by ``scatter``. Because ``scatter``
-is the inverse of ``gather``, we have used collective communications to
-transmit the data from ``scatter_data`` to ``gather_recv``.
diff --git a/tutorials/topic-10-collectives/layout.csl b/tutorials/topic-10-collectives/layout.csl
deleted file mode 100644
index 7f9f90b..0000000
--- a/tutorials/topic-10-collectives/layout.csl
+++ /dev/null
@@ -1,86 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var              ID var             ID var                ID var
-//   0 c2d_x_color_0     9                 18                    27 reserved (memcpy)
-//   1 c2d_x_color_1    10 c2d_x_entrypt_0 19                    28 reserved (memcpy)
-//   2                  11 c2d_x_entrypt_1 20                    29 reserved
-//   3                  12 c2d_y_entrypt_0 21 reserved (memcpy)  30 reserved (memcpy)
-//   4 c2d_y_color_0    13 c2d_y_entrypt_1 22 reserved (memcpy)  31 reserved
-//   5 c2d_y_color_1    14                 23 reserved (memcpy)  32
-//   6                  15 task_x_id       24                    33
-//   7                  16 task_y_id       25                    34
-//   8                  17                 26                    35
-//
-
-param Pw:         u16; // kernel width
-param Ph:         u16; // kernel height
-param chunk_size: u16; // Num elements to send/recv in collectives
-
-// Colors
-const c2d_x_color_0: color = @get_color(0);
-const c2d_x_color_1: color = @get_color(1);
-const c2d_y_color_0: color = @get_color(4);
-const c2d_y_color_1: color = @get_color(5);
-
-// Task IDs
-const c2d_x_entrypt_0: local_task_id = @get_local_task_id(10);
-const c2d_x_entrypt_1: local_task_id = @get_local_task_id(11);
-const c2d_y_entrypt_0: local_task_id = @get_local_task_id(12);
-const c2d_y_entrypt_1: local_task_id = @get_local_task_id(13);
-const task_x_id:       local_task_id = @get_local_task_id(15);
-const task_y_id:       local_task_id = @get_local_task_id(16);
-
-const c2d = @import_module("<collectives_2d/params>");
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-  .width = Pw,
-  .height = Ph
-});
-
-layout {
-  @set_rectangle(Pw, Ph);
-
-  var Px: u16 = 0;
-  while (Px < Pw) : (Px += 1) {
-    var Py: u16 = 0;
-    while (Py < Ph) : (Py += 1) {
-      const params = c2d.get_params(Px, Py, .{
-        .x_colors      = .{ c2d_x_color_0,   c2d_x_color_1 },
-        .x_entrypoints = .{ c2d_x_entrypt_0, c2d_x_entrypt_1 },
-        .y_colors      = .{ c2d_y_color_0,   c2d_y_color_1 },
-        .y_entrypoints = .{ c2d_y_entrypt_0, c2d_y_entrypt_1 },
-      });
-      const memcpy_params = memcpy.get_params(Px);
-      @set_tile_code(Px, Py, "pe_program.csl", .{
-        .memcpy_params = memcpy_params,
-        .c2d_params = params,
-        .chunk_size = chunk_size,
-        .task_x_id = task_x_id,
-        .task_y_id = task_y_id });
-    }
-  }
-
-  // export symbol name
-  @export_name("broadcast_data", [*]u32, true);
-  @export_name("scatter_data", [*]u32, true);
-  @export_name("broadcast_recv", [*]u32, true);
-  @export_name("faddh_result", [*]u32, true);
-  @export_name("gather_recv", [*]u32, true);
-
-  @export_name("f_run_x", fn()void);
-  @export_name("f_run_y", fn()void);
-}
diff --git a/tutorials/topic-10-collectives/pe_program.csl b/tutorials/topic-10-collectives/pe_program.csl
deleted file mode 100644
index a8b7417..0000000
--- a/tutorials/topic-10-collectives/pe_program.csl
+++ /dev/null
@@ -1,147 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param c2d_params: comptime_struct;
-param memcpy_params: comptime_struct;
-
-param chunk_size: u16; // Number of elements to send/recv in collectives
-
-// Task IDs
-param task_x_id: local_task_id; // Task ID for callback for collectives in x direction
-param task_y_id: local_task_id; // Task ID for callback for collectives in y direction
-
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-
-const rect_height = @get_rectangle().height;
-const rect_width = @get_rectangle().width;
-
-const mpi_x = @import_module("<collectives_2d/pe>", .{
-    .dim_params = c2d_params.x,
-    .queues = [2]u16{2,4},
-    .dest_dsr_ids = [1]u16{1},
-    .src0_dsr_ids = [1]u16{1},
-    .src1_dsr_ids = [1]u16{1}
-    });
-const mpi_y = @import_module("<collectives_2d/pe>", .{
-    .dim_params = c2d_params.y,
-    .queues = [2]u16{3,5},
-    .dest_dsr_ids = [1]u16{2},
-    .src0_dsr_ids = [1]u16{2},
-    .src1_dsr_ids = [1]u16{2}
-    });
-
-
-const Nx = chunk_size * rect_width;
-const Ny = chunk_size * rect_height;
-
-// broadcast_data and scatter_data supplied by run.py
-var broadcast_data = @zeros([Nx]u32);
-var broadcast_recv = @zeros([Nx]u32);
-var faddh_result = @zeros([Nx]u32);
-
-var scatter_data = @zeros([Ny]u32);
-var scatter_recv = @zeros([Ny]u32);
-var gather_recv = @zeros([Ny]u32);
-
-var ptr_broadcast_data: [*]u32 = &broadcast_data;
-var ptr_scatter_data: [*]u32 = &scatter_data;
-var ptr_broadcast_recv: [*]u32 = &broadcast_recv;
-var ptr_faddh_result: [*]u32 = &faddh_result;
-var ptr_gather_recv: [*]u32 = &gather_recv;
-
-var task_x_state: u16 = 0;
-task task_x() void {
-   switch (task_x_state) {
-      0 => {
-         mpi_x.init();
-         var send_buf = @ptrcast([*]u32, &broadcast_data);
-         var recv_buf = @ptrcast([*]u32, &broadcast_recv);
-         if (mpi_x.pe_id == 0) {
-            mpi_x.broadcast(0, send_buf, Nx, task_x_id);
-         } else {
-            mpi_x.broadcast(0, recv_buf, Nx, task_x_id);
-         }
-
-         task_x_state += 1;
-      },
-      1 => {
-         var send_buf = @ptrcast([*]f32, &broadcast_recv);
-         var recv_buf = @ptrcast([*]f32, &faddh_result);
-
-         mpi_x.reduce_fadds(0, send_buf, recv_buf, Nx, task_x_id);
-
-         task_x_state += 1;
-      },
-      else => {
-         // WARNING: the user must unblock cmd color for every PE
-         sys_mod.unblock_cmd_stream();
-         return;
-      }
-   }
-}
-
-var task_y_state: u16 = 0;
-task task_y() void {
-   switch (task_y_state) {
-      0 => {
-         mpi_y.init();
-         var send_buf = @ptrcast([*]u32, &scatter_data);
-         var recv_buf = @ptrcast([*]u32, &scatter_recv);
-
-         mpi_y.scatter(0, send_buf, recv_buf, chunk_size, task_y_id);
-
-         task_y_state += 1;
-      },
-      1 => {
-         var send_buf = @ptrcast([*]u32, &scatter_recv);
-         var recv_buf = @ptrcast([*]u32, &gather_recv);
-
-         mpi_y.gather(0, send_buf, recv_buf, chunk_size, task_y_id);
-
-         task_y_state += 1;
-      },
-      else => {
-         // WARNING: the user must unblock cmd color for every PE
-         sys_mod.unblock_cmd_stream();
-         return;
-      }
-   }
-}
-
-comptime {
-   @bind_local_task(task_x, task_x_id);
-   @bind_local_task(task_y, task_y_id);
-}
-
-fn f_run_x() void {
-   @activate(task_x_id);
-
-   // terminate when task_x finishes
-}
-
-fn f_run_y() void {
-   @activate(task_y_id);
-
-   // terminate when task_y finishes
-}
-
-comptime{
-  @export_symbol(ptr_broadcast_data, "broadcast_data");
-  @export_symbol(ptr_scatter_data, "scatter_data");
-  @export_symbol(ptr_broadcast_recv, "broadcast_recv");
-  @export_symbol(ptr_faddh_result, "faddh_result");
-  @export_symbol(ptr_gather_recv, "gather_recv");
-  @export_symbol(f_run_x);
-  @export_symbol(f_run_y);
-}
diff --git a/tutorials/topic-10-collectives/run.py b/tutorials/topic-10-collectives/run.py
deleted file mode 100644
index 1298f31..0000000
--- a/tutorials/topic-10-collectives/run.py
+++ /dev/null
@@ -1,120 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument('--cmaddr', help='IP:port for CS system')
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-Pw = int(params["Pw"])
-Ph = int(params["Ph"])
-chunk_size = int(params["chunk_size"])
-print(f"Pw = width of the core = {Pw}")
-print(f"Ph = height of the core = {Ph}")
-print(f"chunk_size = {chunk_size}")
-
-Nx = Pw*chunk_size
-Ny = Ph*chunk_size
-
-print(f"Nx = {Nx}, Ny = {Ny}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-sym_broadcast_data = runner.get_id("broadcast_data")
-sym_scatter_data = runner.get_id("scatter_data")
-sym_broadcast_recv = runner.get_id("broadcast_recv")
-sym_faddh_result = runner.get_id("faddh_result")
-sym_gather_recv = runner.get_id("gather_recv")
-
-runner.load()
-runner.run()
-
-print("step 1: copy mode H2D(broadcast_data) to 1st column PEs")
-broadcast_data = np.ones((Ph, 1, Nx)).astype(np.float32)
-runner.memcpy_h2d(sym_broadcast_data, broadcast_data.ravel(), 0, 0, 1, Ph, Nx, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=True)
-
-print("step 2: copy mode H2D(scatter_data) to 1st row PEs")
-scatter_data = np.ones((1, Pw, Ny)).astype(np.int32)
-runner.memcpy_h2d(sym_scatter_data, scatter_data.ravel(), 0, 0, Pw, 1, Ny, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=True)
-
-print("step 3: call f_run_x to test broadcast and reduction")
-runner.launch("f_run_x", nonblock=False)
-
-print("step 4: call f_run_y to test scatter and gather")
-runner.launch("f_run_y", nonblock=False)
-
-print("step 5: copy mode D2H(broadcast_recv)")
-# broadcast on x: Px=0 broadcasts data to all other PEs
-# broadcast_recv(y, x=0) = 0
-# broadcast_recv(y, x !=0) = ones
-broadcast_recv_1d = np.zeros(Ph*Pw*Nx, np.float32)
-runner.memcpy_d2h(broadcast_recv_1d, sym_broadcast_recv, 0, 0, Pw, Ph, Nx, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
-broadcast_recv = broadcast_recv_1d.reshape((Ph, Pw, Nx))
-
-print("step 6: copy mode D2H(faddh_result) from 1st column PEs")
-# reduce(broadcast_recv) to Px=0
-faddh_result_1d = np.zeros(Ph*Nx, np.float32)
-runner.memcpy_d2h(faddh_result_1d, sym_faddh_result, 0, 0, 1, Ph, Nx, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
-faddh_result = faddh_result_1d.reshape((Ph, 1, Nx))
-
-print("step 7: copy mode D2H(gather_recv) from 1st row PEs")
-gather_recv_1d = np.zeros(Pw*Ny, np.int32)
-runner.memcpy_d2h(gather_recv_1d, sym_gather_recv, 0, 0, Pw, 1, Ny, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.ROW_MAJOR, nonblock=False)
-gather_recv = gather_recv_1d.reshape((1, Pw, Ny))
-
-runner.stop()
-
-# verify broadcast on x-direction
-correct_broadcast_recv = np.ones(Nx).astype(np.float32)
-for y in range(Ph):
-  for x in range(Pw):
-    if x == 0:
-      continue
-    np.testing.assert_equal(broadcast_recv[y, x], correct_broadcast_recv)
-
-# verify faddh_result at 1st column PEs
-# reduce on x: reduce(broadcast_recvs) to Px=0
-# where broadcast_recvs(y, x=0) = 0
-#       broadcast_recvs(y, x != 0) = ones
-correct_faddh_result = np.full(Nx, (Pw-1), dtype=np.float32)
-for y in range(Ph):
-  np.testing.assert_equal(faddh_result[y, 0], correct_faddh_result)
-
-# verify gather_recv at 1st row PEs
-correct_gather_recv = np.ones(Ny).astype(np.int32)
-for x in range(Pw):
-  np.testing.assert_equal(gather_recv[0, x], correct_gather_recv)
-
-print("SUCCESS")
diff --git a/tutorials/topic-10-map-builtin/commands.sh b/tutorials/topic-10-map-builtin/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-10-map-builtin/commands.sh
rename to tutorials/topic-10-map-builtin/commands_wse2.sh
diff --git a/tutorials/topic-10-map-builtin/commands_wse3.sh b/tutorials/topic-10-map-builtin/commands_wse3.sh
new file mode 100755
index 0000000..ce3b8c8
--- /dev/null
+++ b/tutorials/topic-10-map-builtin/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl \
+--fabric-dims=8,3 --fabric-offsets=4,1 --params=size:5 \
+-o out --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/topic-11-collectives/commands.sh b/tutorials/topic-11-collectives/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-11-collectives/commands.sh
rename to tutorials/topic-11-collectives/commands_wse2.sh
diff --git a/tutorials/topic-10-collectives/commands.sh b/tutorials/topic-11-collectives/commands_wse3.sh
similarity index 68%
rename from tutorials/topic-10-collectives/commands.sh
rename to tutorials/topic-11-collectives/commands_wse3.sh
index 7061f5e..28c1572 100755
--- a/tutorials/topic-10-collectives/commands.sh
+++ b/tutorials/topic-11-collectives/commands_wse3.sh
@@ -2,7 +2,7 @@
 
 set -e
 
-cslc ./layout.csl --fabric-dims=22,17 --fabric-offsets=4,1 \
+cslc --arch=wse3 ./layout.csl --fabric-dims=22,17 --fabric-offsets=4,1 \
 --params=Pw:15,Ph:15,chunk_size:3 -o out \
 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
 cs_python run.py --name out
diff --git a/tutorials/topic-11-debug-library/README.rst b/tutorials/topic-11-debug-library/README.rst
deleted file mode 100644
index 0bd1400..0000000
--- a/tutorials/topic-11-debug-library/README.rst
+++ /dev/null
@@ -1,43 +0,0 @@
-
-Topic 11: Debug Library
-=======================
-
-This example shows a program that uses the tracing mechanism of the
-``<debug>`` library to record variable values and compile time strings
-as well as timestamps, for inspection by the host code.
-
-This program uses a row of four contiguous PEs.
-Two colors, ``red`` (color 0) and ``blue`` (color 1), are used.
-On all PEs, the routing associated with these colors receives
-from the ``WEST`` and sends down the ``RAMP`` and ``EAST``.
-Additionally, for both colors, ``swap_color_x`` is set to ``true``.
-Because these colors differ only in their lowest bit, when a
-``red`` wavelet comes into a router from ``WEST``, it leaves the
-router to the ``EAST`` as a ``blue`` wavelet, and vice versa.
-
-The host code sends four wavelets along the color ``MEMCPYH2D_DATA_1``
-into the first PE. The WTT of ``MEMCPYH2D_DATA_1`` forwards this data
-to color ``blue``. When a PE receives a ``red`` wavelet, the task
-``red_task`` is activated, and when a PE receives a ``blue`` wavelet,
-the task ``blue_task`` is activated.
-
-Each PE program contains a global variable named ``global``,
-initialized to zero.
-When a ``red_task`` is activated by an incoming wavelet ``in_data``,
-``global`` is incremented by an amount ``in_data``.
-When a ``blue_task`` is activated by an incoming wavelet ``in_data``,
-``global`` is incremented by an amount ``2 * in_data``.
-
-The programs running on each PE import two instances of the
-``<debug>`` library. Each time a task activates, the instance
-named ``trace`` logs a compile time string noting the color
-of the task, and the updated value of ``global``.
-The instance named ``times`` logs a timestamp at the beginning
-of a task, and at the end of a task.
-
-The host code uses the function ``read_trace`` from
-``cerebras.sdk.debug.debug_util`` to read the logged
-values after execution of the device code finishes.
-Note that the PE coordinates passed to ``read_trace`` start
-from the northwest corner of the fabric, not from the
-northwest corner of the program rectangle.
diff --git a/tutorials/topic-11-debug-library/commands.sh b/tutorials/topic-11-debug-library/commands.sh
deleted file mode 100755
index c3d34a2..0000000
--- a/tutorials/topic-11-debug-library/commands.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-#!/usr/bin/env bash
-
-set -e
-
-cslc ./layout.csl --fabric-dims=11,3 \
---fabric-offsets=4,1 --params=width:4 -o out  \
---params=MEMCPYH2D_DATA_1_ID:6 \
---params=MEMCPYD2H_DATA_1_ID:7 \
---memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
-cs_python run.py --name out
diff --git a/tutorials/topic-11-debug-library/layout.csl b/tutorials/topic-11-debug-library/layout.csl
deleted file mode 100644
index d9e18cb..0000000
--- a/tutorials/topic-11-debug-library/layout.csl
+++ /dev/null
@@ -1,90 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// color/ task ID map
-//
-//  ID var         ID var     ID var                ID var
-//   0 red          9         18                    27 reserved (memcpy)
-//   1 blue        10         19                    28 reserved (memcpy)
-//   2             11         20                    29 reserved
-//   3             12         21 reserved (memcpy)  30 reserved (memcpy)
-//   4             13         22 reserved (memcpy)  31 reserved
-//   5             14         23 reserved (memcpy)  32
-//   6 H2D         15         24                    33
-//   7 D2H         16         25                    34
-//   8             17         26                    35
-//
-
-param width : u16;
-
-// IDs for memcpy streaming colors
-param MEMCPYH2D_DATA_1_ID: i16;
-param MEMCPYD2H_DATA_1_ID: i16;
-
-// Colors
-const MEMCPYH2D_DATA_1: color = @get_color(MEMCPYH2D_DATA_1_ID);
-const MEMCPYD2H_DATA_1: color = @get_color(MEMCPYD2H_DATA_1_ID);
-
-const red:              color = @get_color(0);
-const blue:             color = @get_color(1);
-
-// Task IDs
-const h2d_task_id:  data_task_id = @get_data_task_id(MEMCPYH2D_DATA_1);
-const red_task_id:  data_task_id = @get_data_task_id(red);
-const blue_task_id: data_task_id = @get_data_task_id(blue);
-
-const memcpy = @import_module( "<memcpy/get_params>", .{
-  .width = width,
-  .height = 1,
-  .MEMCPYH2D_1 = MEMCPYH2D_DATA_1,
-  .MEMCPYD2H_1 = MEMCPYD2H_DATA_1
-});
-
-layout {
-  @set_rectangle(width, 1);
-
-  for (@range(u16, width)) |pe_x| {
-
-    const memcpy_params = memcpy.get_params(pe_x);
-
-    @set_tile_code(pe_x, 0, "pe_program.csl", .{
-      .memcpy_params = memcpy_params,
-      .red = red,
-      .blue = blue,
-      .wtt_h2d_task_id = h2d_task_id,
-      .red_task_id = red_task_id,
-      .blue_task_id = blue_task_id,
-    });
-
-    const routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST }, .color_swap_x = true };
-    const end = .{ .rx = .{ WEST }, .tx = .{ RAMP }, .color_swap_x = true };
-    const start = .{ .rx = .{ RAMP }, .tx = .{ RAMP, EAST }, .color_swap_x = true };
-
-    if (pe_x == 0){
-      // 1st PE receives data from streaming H2D, then forwards it to color "red"
-      // (WTT(H2D) forwards data to color "blue", not color "red")
-      @set_color_config(pe_x, 0, blue, .{ .routes = start });
-      @set_color_config(pe_x, 0, red, .{ .routes = start });
-    }else if (pe_x == width - 1) {
-      @set_color_config(pe_x, 0, blue, .{ .routes = end });
-      @set_color_config(pe_x, 0, red, .{ .routes = end });
-    } else {
-      @set_color_config(pe_x, 0, blue, .{ .routes = routes });
-      @set_color_config(pe_x, 0, red, .{ .routes = routes });
-    }
-  }
-
-  // export symbol name
-  @export_name("buf", [*]i16, true);
-}
diff --git a/tutorials/topic-11-debug-library/pe_program.csl b/tutorials/topic-11-debug-library/pe_program.csl
deleted file mode 100644
index 2d049f0..0000000
--- a/tutorials/topic-11-debug-library/pe_program.csl
+++ /dev/null
@@ -1,116 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Not a complete program; the top-level source file is layout.csl.
-
-param memcpy_params: comptime_struct;
-
-//Colors
-param red:  color;
-param blue: color;
-
-// Task IDs
-param wtt_h2d_task_id: data_task_id; // Data task wtt_h2d triggered by MEMCPYH2D_DATA_1 wlts
-param red_task_id:     data_task_id; // Data task red_task triggered by red wlts
-param blue_task_id:    data_task_id; // Data task blue_task triggered by blue wlts
-
-const sys_mod = @import_module( "<memcpy/memcpy>", memcpy_params);
-
-// Import two instances of <debug>:
-// `trace` records comptime string and value of 'global'
-// `times` records timestamps at begin and end of tasks
-const trace = @import_module(
-  "<debug>",
-  .{ .key = "trace",
-     .buffer_size = 100,
-   }
-);
-const times = @import_module(
-  "<debug>",
-  .{ .key = "times",
-     .buffer_size = 100,
-   }
-);
-
-
-// Variable whose value we update in our tasks
-var global : i16 = 0;
-
-// Task that will be triggered by red wavelet
-task red_task(in_data : i16) void {
-  // Record timestamp for beginning of task in `times`
-  times.trace_timestamp();
-
-  // Record string denoting task color in `trace`
-  trace.trace_string("Start red task");
-
-  // Update global variable
-  global += in_data;
-
-  // Record updated value of global in `trace`
-  trace.trace_i16(global);
-
-  // Record timestamp for end of task in `times`
-  times.trace_timestamp();
-}
-
-// Task that will be triggered by blue wavelet
-task blue_task(in_data : i16) void {
-  // Record timestamp for beginning of task in `times`
-  times.trace_timestamp();
-
-  // Record string denoting task color in `trace`
-  trace.trace_string("Start blue task");
-
-  // Update global variable
-  global += in_data * 2;
-
-  // Record updated value of global in `trace`
-  trace.trace_i16(global);
-
-  // Record timestamp for end of task in `times`
-  times.trace_timestamp();
-}
-
-comptime {
-  // Associate the appropriate task with the wavelet's color
-  @bind_data_task(red_task, red_task_id);
-  @bind_data_task(blue_task, blue_task_id);
-}
-
-
-var buf = @zeros([1]i16);
-var ptr_buf: [*]i16 = &buf;
-
-const bufDsd = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{1} -> buf[i]});
-
-// PEs 0, 2 activate blue task; 1, 3 activate red task.
-const outDsd = @get_dsd(fabout_dsd, .{
-  .extent = 1,
-  .fabric_color = blue,
-  .output_queue = @get_output_queue(1)
-});
-
-// receive data from streaming H2D and forward it to color red
-task wtt_h2d(data: i16) void {
-  @block(wtt_h2d_task_id);
-  buf[0] = data;
-  @mov16(outDsd, bufDsd, .{.async=true, .unblock=wtt_h2d_task_id} );
-}
-
-comptime {
-  @bind_data_task(wtt_h2d, wtt_h2d_task_id);
-
-  @export_symbol(ptr_buf, "buf");
-}
diff --git a/tutorials/topic-11-debug-library/run.py b/tutorials/topic-11-debug-library/run.py
deleted file mode 100644
index 518e1eb..0000000
--- a/tutorials/topic-11-debug-library/run.py
+++ /dev/null
@@ -1,104 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import json
-import numpy as np
-
-from cerebras.sdk.debug.debug_util import debug_util
-from cerebras.sdk.sdk_utils import memcpy_view, input_array_to_u32
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
-from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help='the test name')
-parser.add_argument("--cmaddr", help="IP:port for CS system")
-args = parser.parse_args()
-dirname = args.name
-
-# Parse the compile metadata
-with open(f"{dirname}/out.json", encoding="utf-8") as json_file:
-  compile_data = json.load(json_file)
-params = compile_data["params"]
-MEMCPYH2D_DATA_1 = int(params["MEMCPYH2D_DATA_1_ID"])
-width = int(params["width"])
-print(f"MEMCPYH2D_DATA_1 = {MEMCPYH2D_DATA_1}")
-print(f"width = {width}")
-
-memcpy_dtype = MemcpyDataType.MEMCPY_16BIT
-runner = SdkRuntime(dirname, cmaddr=args.cmaddr)
-
-sym_buf = runner.get_id("buf")
-
-runner.load()
-runner.run()
-
-num_entries = 4
-x = np.arange(num_entries, dtype=np.int16)
-
-print("step 1: streaming H2D to 1st PE")
-tensors_u32 = input_array_to_u32(x, 0, num_entries)
-runner.memcpy_h2d(MEMCPYH2D_DATA_1, tensors_u32, 0, 0, 1, 1, num_entries, \
-    streaming=True, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=True)
-
-print("step 2: copy mode D2H buf (need at least one D2H)")
-# The D2H buffer must be of type u32
-out_tensors_u32 = np.zeros(1, np.uint32)
-runner.memcpy_d2h(out_tensors_u32, sym_buf, 0, 0, 1, 1, 1, \
-    streaming=False, data_type=memcpy_dtype, order=MemcpyOrder.COL_MAJOR, nonblock=False)
-# Remove the upper 16 bits of each u32
-buf_result = memcpy_view(out_tensors_u32, np.dtype(np.int16))
-
-runner.stop()
-
-debug_mod = debug_util(dirname, cmaddr=args.cmaddr)
-core_offset_x = 4
-core_offset_y = 1
-print(f"=== dump core: core rectangle starts at {core_offset_x}, {core_offset_y}")
-
-result = np.zeros([width, num_entries])
-for idx in range(width):
-  # Get traces recorded in 'trace'
-  trace_output = debug_mod.read_trace(core_offset_x + idx, core_offset_y, 'trace')
-
-  # Copy all recorded trace values of variable 'global'
-  result[idx, :] = trace_output[1::2]
-
-  # Get timestamp traces recorded in 'times'
-  timestamp_output = debug_mod.read_trace(core_offset_x + idx, core_offset_y, 'times')
-
-  # Print out all traces for PE
-  print("PE (", idx, ", 0): ")
-  print("Trace: ", trace_output)
-  print("Times: ", timestamp_output)
-  print()
-
-# In order, the host streams in 0, 1, 2, 3 from the West.
-# Red tasks add values to running global sum on its PE.
-# Blue tasks add 2*values to running global sum on its PE.
-# Value of global var is recorded after each update.
-# PEs 0, 2 activate blue task; 1, 3 activate red task.
-# Trace values of global var on even PEs will be: 0, 2, 6, 12
-# Trace values of global var on odd PEs will be: 0, 1, 3, 6
-oracle = np.empty([width, num_entries])
-for i in range(width):
-  for j in range(num_entries):
-    oracle[i, j] = ((i+1) % 2 + 1) * j * (j+1) / 2
-
-# Assert that all trace values of 'global' are as expected
-np.testing.assert_equal(result, oracle)
-print("SUCCESS!")
diff --git a/tutorials/topic-12-debug-library/commands.sh b/tutorials/topic-12-debug-library/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-12-debug-library/commands.sh
rename to tutorials/topic-12-debug-library/commands_wse2.sh
diff --git a/tutorials/topic-12-debug-library/commands_wse3.sh b/tutorials/topic-12-debug-library/commands_wse3.sh
new file mode 100755
index 0000000..e502830
--- /dev/null
+++ b/tutorials/topic-12-debug-library/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 \
+--fabric-offsets=4,1 --params=width:4,num_elems:5 -o out  \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/topic-12-wse3-features/README.rst b/tutorials/topic-12-wse3-features/README.rst
deleted file mode 100644
index 76fbc68..0000000
--- a/tutorials/topic-12-wse3-features/README.rst
+++ /dev/null
@@ -1,41 +0,0 @@
-Topic 12: WSE-3 Features
-========================
-
-Unlike WSE-2, the WSE-3 architecture exposes microthread IDs.
-This example demonstrates the use of explicit microthread IDs
-on the WSE-3 architecture.
-
-On WSE-2, the queue ID of an input or output fabric DSD corresponds to the
-ID of the microthread in which that operation executes.
-On WSE-3, queue IDs and microthreads can be decoupled, so that any
-microthread ID 0 to 7 can be used with any of queues 0 to 7.
-
-In this example, the left PE sends ``M`` wavelets to the right PE over
-the color ``send_color``.
-These wavelets are sent in an asynchronous ``@fmovs`` operation which
-copies from the ``y`` array via ``y_dsd`` into ``out_dsd``.
-``out_dsd`` is a ``fabout_dsd`` associated with the color ``send_color``,
-and the output queue with ID 2.
-The ``@fmovs`` operation is launched using microthread ID 4.
-
-The right PE receives these ``M`` wavelets on the same color (called
-``right_color`` in ``right_pe.csl``) via ``in_dsd``, which uses input
-queue with ID 2.
-The asynchronous ``@fmovs`` operation which receives these wavelets
-and copies them into ``y`` is launched using microthread ID 5.
-
-Decoupling microthread IDs from queue IDs can provide valuable flexibility
-in managing program resource usage, and conserve microthreads.
-
-By using explicit microthread IDs, we allow CSL's DSR allocator to use fewer
-DSRs in situations where fabric DSD operands are not known at compile time.
-
-Additionally, on the WSE-3, output queues cannot be re-used with a different
-color if they have not yet been drained, and CSL does not yet support a
-mechanism for guaranteeing that a given queue is empty.
-This may force the programmer to use more output queues than needed, which in
-turn can lead to overusing microthread IDs (if they are not explicitly
-specified, they default to the respective queue IDs).
-By allowing explicit microthread IDs, a programmer can share microthreads
-between output queues, and thus conserve microthreads for other operations.
-Note, however, that two operations cannot concurrently use the same microthread.
diff --git a/tutorials/topic-12-wse3-features/layout.csl b/tutorials/topic-12-wse3-features/layout.csl
deleted file mode 100644
index 3aaf019..0000000
--- a/tutorials/topic-12-wse3-features/layout.csl
+++ /dev/null
@@ -1,42 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-// Colors
-const send_color: color = @get_color(0); // Color used to send/recv data between PEs
-
-// This example only uses 2 PEs
-const memcpy = @import_module("<memcpy/get_params>", .{ .width = 2, .height = 1 });
-
-layout {
-  // PE coordinates are (column, row)
-  @set_rectangle(2, 1);
-
-  // Left PE (0, 0)
-  @set_tile_code(0, 0, "left_pe.csl", .{
-    .memcpy_params = memcpy.get_params(0), .send_color = send_color });
-
-  // Left PE sends to the right
-  @set_color_config(0, 0, send_color, .{.routes = .{ .rx = .{RAMP}, .tx = .{EAST} }});
-
-  // Right PE (1, 0)
-  @set_tile_code(1, 0, "right_pe.csl", .{
-    .memcpy_params = memcpy.get_params(1), .recv_color = send_color });
-
-  // Right PE receives from left PE
-  @set_color_config(1, 0, send_color, .{.routes = .{ .rx = .{WEST}, .tx = .{RAMP} }});
-
-  // export symbol names
-  @export_name("y", [*]f32, true);
-  @export_name("compute", fn()void);
-}
diff --git a/tutorials/topic-12-wse3-features/left_pe.csl b/tutorials/topic-12-wse3-features/left_pe.csl
deleted file mode 100644
index c043e87..0000000
--- a/tutorials/topic-12-wse3-features/left_pe.csl
+++ /dev/null
@@ -1,54 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param memcpy_params: comptime_struct;
-
-param send_color: color;
-
-const M: i16 = 10;
-
-// Task IDs
-const exit_task_id: local_task_id = @get_local_task_id(9);
-
-// Queue and microthread IDs
-const send_color_oq = @get_output_queue(2);
-const send_color_ut = @get_ut_id(4);
-
-const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
-
-var y: [M]f32;
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
-var y_ptr: [*]f32 = &y;
-
-fn compute() void {
-  const out_dsd = @get_dsd(fabout_dsd, .{
-                    .fabric_color = send_color, .extent = M,
-                    .output_queue = send_color_oq
-                  });
-  @fmovs(out_dsd, y_dsd, .{ .async = true, .ut_id = send_color_ut,
-                            .activate = exit_task_id });
-}
-
-task exit_task() void {
-  sys_mod.unblock_cmd_stream();
-}
-
-comptime {
-  @bind_local_task(exit_task, exit_task_id);
-
-  @initialize_queue(send_color_oq, .{ .color = send_color });
-
-  @export_symbol(y_ptr, "y");
-  @export_symbol(compute);
-}
diff --git a/tutorials/topic-12-wse3-features/right_pe.csl b/tutorials/topic-12-wse3-features/right_pe.csl
deleted file mode 100644
index 39455e5..0000000
--- a/tutorials/topic-12-wse3-features/right_pe.csl
+++ /dev/null
@@ -1,54 +0,0 @@
-// Copyright 2024 Cerebras Systems.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
-//
-//     http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-param memcpy_params: comptime_struct;
-
-param recv_color: color;
-
-const M: i16 = 10;
-
-// Task IDs
-const exit_task_id: local_task_id = @get_local_task_id(9);
-
-// Queue and microthread IDs
-const recv_color_iq = @get_input_queue(2);
-const recv_color_ut = @get_ut_id(5);
-
-const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
-
-var y: [M]f32;
-var y_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> y[i] });
-var y_ptr: [*]f32 = &y;
-
-fn compute() void {
-  const in_dsd = @get_dsd(fabin_dsd, .{
-                   .fabric_color = recv_color, .extent = M,
-                   .input_queue = recv_color_iq
-                 });
-  @fmovs(y_dsd, in_dsd, .{ .async = true, .ut_id = recv_color_ut,
-                           .activate = exit_task_id });
-}
-
-task exit_task() void {
-  sys_mod.unblock_cmd_stream();
-}
-
-comptime {
-  @bind_local_task(exit_task, exit_task_id);
-
-  @initialize_queue(recv_color_iq, .{ .color = recv_color });
-
-  @export_symbol(y_ptr, "y");
-  @export_symbol(compute);
-}
diff --git a/tutorials/topic-12-wse3-features/run.py b/tutorials/topic-12-wse3-features/run.py
deleted file mode 100644
index c216e21..0000000
--- a/tutorials/topic-12-wse3-features/run.py
+++ /dev/null
@@ -1,61 +0,0 @@
-#!/usr/bin/env cs_python
-
-# Copyright 2024 Cerebras Systems.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-import argparse
-import numpy as np
-
-from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder # pylint: disable=no-name-in-module
-
-# Read arguments
-parser = argparse.ArgumentParser()
-parser.add_argument('--name', help="the test compile output dir")
-parser.add_argument('--cmaddr', help="IP:port for CS system")
-args = parser.parse_args()
-
-M = 10
-y = np.arange(M, dtype=np.float32)
-y_expected = y
-
-# Construct a runner using SdkRuntime
-runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
-
-# Get symbol for y on device
-y_symbol = runner.get_id('y')
-
-# Load and run the program
-runner.load()
-runner.run()
-
-
-# Copy y into PE (0, 0)
-runner.memcpy_h2d(y_symbol, y, 0, 0, 1, 1, M, streaming=False,
-  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)
-
-# Launch the compute function on device
-runner.launch('compute', nonblock=False)
-
-# Copy y back from PE (1, 0)
-y_result = np.zeros([M], dtype=np.float32)
-runner.memcpy_d2h(y_result, y_symbol, 1, 0, 1, 1, M, streaming=False,
-  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)
-
-# Stop the program
-runner.stop()
-
-# Ensure that the result matches our expectation
-np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
-print("SUCCESS!")
diff --git a/tutorials/topic-13-simprint/commands.sh b/tutorials/topic-13-simprint/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-13-simprint/commands.sh
rename to tutorials/topic-13-simprint/commands_wse2.sh
diff --git a/tutorials/topic-13-simprint/commands_wse3.sh b/tutorials/topic-13-simprint/commands_wse3.sh
new file mode 100755
index 0000000..e502830
--- /dev/null
+++ b/tutorials/topic-13-simprint/commands_wse3.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+
+set -e
+
+cslc --arch=wse3 ./layout.csl --fabric-dims=11,3 \
+--fabric-offsets=4,1 --params=width:4,num_elems:5 -o out  \
+--memcpy --channels=1 --width-west-buf=0 --width-east-buf=0
+cs_python run.py --name out
diff --git a/tutorials/topic-14-color-swap/commands.sh b/tutorials/topic-14-color-swap/commands_wse2.sh
similarity index 100%
rename from tutorials/topic-14-color-swap/commands.sh
rename to tutorials/topic-14-color-swap/commands_wse2.sh
diff --git a/tutorials/topic-12-wse3-features/commands.sh b/tutorials/topic-15-wse3-microthreads/commands_wse3.sh
similarity index 100%
rename from tutorials/topic-12-wse3-features/commands.sh
rename to tutorials/topic-15-wse3-microthreads/commands_wse3.sh