More documentation on cpu monitoring
wbjin committed Sep 8, 2024
1 parent 0ae517e commit 6d4a7b5
Showing 2 changed files with 39 additions and 3 deletions.
8 changes: 5 additions & 3 deletions docs/getting_started/index.md
@@ -47,16 +47,18 @@ The default command would be:

``` { .sh .annotate }
docker run -it \
--gpus all \ # (1)!
--cap-add SYS_ADMIN \ # (2)!
--ipc host \ # (3)!
-v /sys/class/powercap/intel-rapl:/zeus_sys/class/powercap/intel-rapl \ # (4)!
mlenergy/zeus:latest \
bash
```

1. Mounts all GPUs into the Docker container.
2. `SYS_ADMIN` capability is needed to change the GPU's power limit or frequency. See [here](#system-privileges).
3. PyTorch DataLoader workers need enough shared memory for IPC. Without this, they may run out of shared memory and die.
4. Mounts the `intel-rapl` directory so that it can be read inside the Docker container. This mount can be removed if the CPU is not being monitored.

!!! Tip "Overriding Zeus installation"
    Inside the container, `zeus`'s installation is editable (`pip install -e`).
34 changes: 34 additions & 0 deletions docs/measure/index.md
@@ -76,6 +76,40 @@ Depending on the Deep Learning framework you're using (currently PyTorch and JAX
This is usually what you want, except when using more advanced device partitioning (e.g., using `--xla_force_host_platform_device_count` in JAX to partition CPUs into more pieces).
In such cases, you probably want to opt out from using this function and handle synchronization manually at the appropriate granularity.

## Further on CPU measurements using Intel RAPL

The RAPL interface is available on Intel and AMD CPUs, but DRAM measurements are not guaranteed to be available. Which measurements are available depends on the device, so to see what your machine supports, we recommend initializing [`ZeusMonitor`][zeus.monitor.ZeusMonitor] with `cpu_indices=None`.
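
As a rough illustration of why availability is device-specific, the sketch below lists the RAPL domains that the Linux powercap sysfs tree exposes. It is not part of Zeus: `list_rapl_domains` is a hypothetical helper, and it assumes the standard layout in which each `intel-rapl:N` zone directory has a `name` file and optional `intel-rapl:N:M` subzones (for example `core` or `dram`).

```python
from pathlib import Path


def list_rapl_domains(root: str = "/sys/class/powercap/intel-rapl") -> dict[str, list[str]]:
    """Map each top-level RAPL zone name to the names of its subzones.

    Hypothetical helper for illustration; not a Zeus API.
    """
    domains: dict[str, list[str]] = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # RAPL is not exposed here (or was not mounted into the container).
        return domains

    for zone in sorted(root_path.glob("intel-rapl:*")):
        name_file = zone / "name"
        if not name_file.is_file():
            continue
        zone_name = name_file.read_text().strip()  # e.g. "package-0"
        # Subzones are nested directories such as intel-rapl:0:0.
        domains[zone_name] = [
            (sub / "name").read_text().strip()
            for sub in sorted(zone.glob("intel-rapl:*:*"))
            if (sub / "name").is_file()
        ]
    return domains
```

If no zone lists a `dram` subzone, DRAM energy is most likely not reported on that machine. Inside the Docker container described in Getting Started, the same helper could be pointed at the mounted path, `/zeus_sys/class/powercap/intel-rapl`.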

To measure CPU metrics for a specific CPU index, you can use the [`get_current_cpu_index`][zeus.device.cpu.get_current_cpu_index] function, which returns the index of the CPU on which the process with the given PID is running. If no PID is provided, or if `pid="current"`, the function returns the CPU index of the current process.
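
For intuition, here is a minimal sketch of how such a lookup can be done on Linux using only the standard library. It is an assumption about one possible approach, not a description of Zeus's implementation, and the helper names (`parse_last_cpu`, `last_cpu_of`, `socket_of_cpu`) are hypothetical. It relies on two documented kernel interfaces: field 39 of `/proc/<pid>/stat`, which holds the logical CPU the process last ran on, and the `physical_package_id` topology file, which maps a logical CPU to its socket.

```python
from pathlib import Path


def parse_last_cpu(stat_line: str) -> int:
    """Extract field 39 (CPU last executed on) from a /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split after the last ")" before tokenizing the rest.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is rest[36].
    return int(rest[36])


def last_cpu_of(pid: str = "self") -> int:
    """Return the logical CPU that the given process last ran on (Linux only)."""
    return parse_last_cpu(Path(f"/proc/{pid}/stat").read_text())


def socket_of_cpu(cpu: int) -> int:
    """Return the physical package (socket) ID of a logical CPU (Linux only)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/physical_package_id")
    return int(path.read_text().strip())
```

On a Linux host, `socket_of_cpu(last_cpu_of())` would give the socket the current process last ran on. Note that the scheduler may move a process between CPUs, so the value is only a snapshot.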

To disable CPU or GPU measurements, you can pass in `cpu_indices=[]` or `gpu_indices=[]` to [`ZeusMonitor`][zeus.monitor.ZeusMonitor].

```python hl_lines="5 12-14"
from zeus.monitor import ZeusMonitor
from zeus.device.cpu import get_current_cpu_index

if __name__ == "__main__":
    # Get the CPU index of the current process
    current_cpu_index = get_current_cpu_index()
    monitor = ZeusMonitor(cpu_indices=[current_cpu_index], gpu_indices=[])

    for epoch in range(100):
        monitor.begin_window("epoch")

        steps = []
        for x, y in train_loader:
            monitor.begin_window("step")
            train_one_step(x, y)
            result = monitor.end_window("step")
            steps.append(result)

        mes = monitor.end_window("epoch")
        print(f"Epoch {epoch} consumed {mes.time} s and {mes.total_energy} J.")

        avg_time = sum(map(lambda m: m.time, steps)) / len(steps)
        avg_energy = sum(map(lambda m: m.total_energy, steps)) / len(steps)
        print(f"One step took {avg_time} s and {avg_energy} J on average.")
```

## CLI power and energy monitor
