IOMMU support episode II (#393)
### Issue
#370  

### Description
Adds IOMMU support for Blackhole in a way that should be transparent to
the application.

### List of the changes
* Allow Blackhole to have multiple hugepages / host memory channels
* Add an API on TTDevice for iATU programming (a usage sketch follows this list)
* Rehome Blackhole iATU programming code to blackhole_tt_device.cpp
* Remove unnecessary logic for determining hugepage quantity (just use what the application passes to the Cluster constructor)
* Add sysmem tests for Blackhole.
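
As a usage sketch (not part of the diff), here is roughly how the new TTDevice iATU API is meant to be driven, mirroring what `Cluster::init_pcie_iatus()` now does for Blackhole. The `channel_dma_addrs` vector and the `map_sysmem_channels` helper are hypothetical stand-ins; the per-channel DMA addresses are IOVAs when the IOMMU is enabled and physical addresses otherwise:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include "umd/device/tt_device/tt_device.h"

// Sketch only: program one 1 GB iATU region per host memory channel so the
// device-side sysmem window [channel * 1 GB, (channel + 1) * 1 GB) reaches the
// host buffer backing that channel.
void map_sysmem_channels(tt::umd::TTDevice& device, const std::vector<uint64_t>& channel_dma_addrs) {
    constexpr uint64_t kOneGiB = uint64_t{1} << 30;
    for (size_t channel = 0; channel < channel_dma_addrs.size(); ++channel) {
        const uint64_t base = channel * kOneGiB;             // device-side offset into sysmem
        const uint64_t target = channel_dma_addrs[channel];  // IOVA (IOMMU on) or PA (hugepage)
        device.configure_iatu_region(channel, base, target, kOneGiB);
    }
}
```

Each region covers exactly 1 GB, matching the hugepage-sized channels the rest of the stack already assumes.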

### Testing
Manual testing was performed for both the IOMMU-on and IOMMU-off cases, using the newly added sysmem tests for Blackhole.

With IOMMU on:
```
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from SiliconDriverBH
[ RUN      ] SiliconDriverBH.SysmemTestWithPcie
  Detecting chips (found 1)
2024-12-10 20:40:07.019 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.020 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.083 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:40:07.083 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:40:07.083 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-10 20:40:07.170 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x40000000).
2024-12-10 20:40:07.417 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3ffffff80000000.
2024-12-10 20:40:07.418 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0x3ffffff80000000
[       OK ] SiliconDriverBH.SysmemTestWithPcie (658 ms)
[ RUN      ] SiliconDriverBH.RandomSysmemTestWithPcie
2024-12-10 20:40:07.672 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.672 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:07.731 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:40:07.731 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:40:07.731 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-10 20:40:07.818 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x40000000).
2024-12-10 20:40:08.081 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3ffffff80000000.
2024-12-10 20:40:08.327 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:08.327 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:40:08.387 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:40:08.387 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:40:08.387 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: enabled
2024-12-10 20:40:08.474 | INFO     | SiliconDriver   - Allocating sysmem without hugepages (size: 0x100000000).
2024-12-10 20:40:09.453 | INFO     | SiliconDriver   - Mapped sysmem without hugepages to IOVA 0x3fffffe00000000.
2024-12-10 20:40:09.453 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0x3fffffe00000000
2024-12-10 20:40:09.454 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 1 from 0x40000000 to 0x7fffffff to 0x3fffffe40000000
2024-12-10 20:40:09.454 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 2 from 0x80000000 to 0xbfffffff to 0x3fffffe80000000
2024-12-10 20:40:09.454 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 3 from 0xc0000000 to 0xffffffff to 0x3fffffec0000000
[       OK ] SiliconDriverBH.RandomSysmemTestWithPcie (7754 ms)
[----------] 2 tests from SiliconDriverBH (8413 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (8413 ms total)
[  PASSED  ] 2 tests.
```
With IOMMU in passthrough (the driver reports it as disabled):
```
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from SiliconDriverBH
[ RUN      ] SiliconDriverBH.SysmemTestWithPcie
  Detecting chips (found 1)
2024-12-10 20:59:03.744 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:03.745 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:03.812 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:59:03.812 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:59:03.813 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: disabled
2024-12-10 20:59:03.928 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0xe00000000
[       OK ] SiliconDriverBH.SysmemTestWithPcie (383 ms)
[ RUN      ] SiliconDriverBH.RandomSysmemTestWithPcie
2024-12-10 20:59:04.121 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.121 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.177 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:59:04.177 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:59:04.177 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: disabled
2024-12-10 20:59:04.380 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.380 | WARNING  | SiliconDriver   - Unknown board type for chip 0. This might happen because chip is running old firmware. Defaulting to UNKNOWN
2024-12-10 20:59:04.435 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2024-12-10 20:59:04.435 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2024-12-10 20:59:04.436 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.30.0, IOMMU: disabled
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 0 from 0x0 to 0x3fffffff to 0xe00000000
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 1 from 0x40000000 to 0x7fffffff to 0xe40000000
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 2 from 0x80000000 to 0xbfffffff to 0xe80000000
2024-12-10 20:59:04.513 | INFO     | SiliconDriver   - Device: 0 Mapping iATU region 3 from 0xc0000000 to 0xffffffff to 0xec0000000
[       OK ] SiliconDriverBH.RandomSysmemTestWithPcie (11055 ms)
[----------] 2 tests from SiliconDriverBH (11438 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (11438 ms total)
[  PASSED  ] 2 tests.
```

### API Changes
There are no API changes in this PR.
joelsmithTT authored Dec 11, 2024
1 parent 91cc73b commit bf740bd
Showing 11 changed files with 320 additions and 192 deletions.
4 changes: 0 additions & 4 deletions device/api/umd/device/hugepage.h
@@ -16,10 +16,6 @@ namespace tt::umd {
// Get number of 1GB host hugepages installed.
uint32_t get_num_hugepages();

// Dynamically figure out how many host memory channels (based on hugepages installed) for each device, based on arch.
uint32_t get_available_num_host_mem_channels(
const uint32_t num_channels_per_device_target, const uint16_t device_id, const uint16_t revision_id);

// Looks for hugetlbfs inside /proc/mounts matching desired pagesize (typically 1G)
std::string find_hugepage_dir(std::size_t pagesize);

4 changes: 2 additions & 2 deletions device/api/umd/device/pci_device.hpp
@@ -142,8 +142,8 @@ class PCIDevice {
*/
bool init_iommu(size_t size);

int get_num_host_mem_channels() const;
hugepage_mapping get_hugepage_mapping(int channel) const;
size_t get_num_host_mem_channels() const;
hugepage_mapping get_hugepage_mapping(size_t channel) const;

/**
* Map a buffer for DMA access by the device.
8 changes: 8 additions & 0 deletions device/api/umd/device/tt_device/blackhole_tt_device.h
@@ -6,12 +6,20 @@

#pragma once

#include <set>

#include "umd/device/tt_device/tt_device.h"

namespace tt::umd {
class BlackholeTTDevice : public TTDevice {
public:
BlackholeTTDevice(std::unique_ptr<PCIDevice> pci_device);
~BlackholeTTDevice();

void configure_iatu_region(size_t region, uint64_t base, uint64_t target, size_t size) override;

private:
static constexpr uint64_t ATU_OFFSET_IN_BH_BAR2 = 0x1200;
std::set<size_t> iatu_regions_;
};
} // namespace tt::umd
29 changes: 29 additions & 0 deletions device/api/umd/device/tt_device/tt_device.h
@@ -85,6 +85,35 @@ class TTDevice {
tt_xy_pair end,
std::uint64_t ordering = tt::umd::tlb_data::Relaxed);

/**
* Configures a PCIe Address Translation Unit (iATU) region.
*
* Device software expects to be able to access memory that is shared with
* the host using the following NOC addresses at the PCIe core:
* - GS: 0x0
* - WH: 0x8_0000_0000
* - BH: 0x1000_0000_0000_0000
* Without iATU configuration, these map to host PA 0x0.
*
* While modern hardware supports IOMMU with flexible IOVA mapping, we must
* maintain the iATU configuration to satisfy software that has hard-coded
* the above NOC addresses rather than using driver-provided IOVAs.
*
* This interface is only intended to be used for configuring sysmem with
* either 1GB hugepages or a compatible scheme.
*
* @param region iATU region index (0-15)
* @param base region * (1 << 30)
* @param target DMA address (PA or IOVA) to map to
* @param size size of the mapping window; must be (1 << 30)
*
* NOTE: Programming the iATU from userspace is architecturally incorrect:
* - iATU should be managed by KMD to ensure proper cleanup on process exit
* - Multiple processes can corrupt each other's iATU configurations
* We should fix this!
*/
virtual void configure_iatu_region(size_t region, uint64_t base, uint64_t target, size_t size);

protected:
std::unique_ptr<PCIDevice> pci_device_;
std::unique_ptr<architecture_implementation> architecture_impl_;
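As a quick worked example of the contract documented in the comment above (illustrative only, not part of the diff): with a 1 GB window, region 2 must start at `2 * (1 << 30)` and end just below 3 GiB, which matches the "Mapping iATU region 2 from 0x80000000 to 0xbfffffff" lines in the test output.

```cpp
#include <cstdint>

// Illustrative check of the documented contract: base = region * (1 << 30),
// size = (1 << 30), so the last device address covered is base + size - 1.
constexpr uint64_t kWindow = uint64_t{1} << 30;
constexpr uint64_t kRegion = 2;
constexpr uint64_t kBase = kRegion * kWindow;     // 0x80000000
constexpr uint64_t kLimit = kBase + kWindow - 1;  // 0xBFFFFFFF
static_assert(kBase == 0x80000000ULL, "region 2 starts at 2 GiB");
static_assert(kLimit == 0xBFFFFFFFULL, "region 2 ends just below 3 GiB");
```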
179 changes: 71 additions & 108 deletions device/cluster.cpp
@@ -240,18 +240,7 @@ void Cluster::create_device(
}
auto pci_device = m_tt_device_map.at(logical_device_id)->get_pci_device();

uint16_t pcie_device_id = pci_device->get_pci_device_id();
uint32_t pcie_revision = pci_device->get_pci_revision();
// TODO: get rid of this, it doesn't make any sense.
int num_host_mem_channels =
get_available_num_host_mem_channels(num_host_mem_ch_per_mmio_device, pcie_device_id, pcie_revision);
if (pci_device->get_arch() == tt::ARCH::BLACKHOLE && num_host_mem_channels > 1) {
// TODO: Implement support for multiple host channels on BLACKHOLE.
log_warning(
LogSiliconDriver,
"Forcing a single channel for Blackhole device. Multiple host channels not supported.");
num_host_mem_channels = 1;
}
int num_host_mem_channels = num_host_mem_ch_per_mmio_device;

log_debug(
LogSiliconDriver,
@@ -268,11 +257,6 @@
// MT: Initial BH - hugepages will fail init
// For using silicon driver without workload to query mission mode params, no need for hugepage.
if (!skip_driver_allocs) {
// TODO: Implement support for multiple host channels on BLACKHOLE.
log_assert(
!(arch_name == tt::ARCH::BLACKHOLE && num_host_mem_channels > 1),
"More channels are not yet supported for Blackhole");
// Same number of host channels per device for now
bool hugepages_initialized = pci_device->init_hugepage(num_host_mem_channels);
// Large writes to remote chips require hugepages to be initialized.
// Conservative assert - end workload if remote chips present but hugepages not initialized (failures caused
@@ -1403,43 +1387,68 @@ void Cluster::set_fallback_tlb_ordering_mode(const std::string& fallback_tlb, ui
dynamic_tlb_ordering_modes.at(fallback_tlb) = ordering;
}

// TT<->TT P2P support removed in favor of increased Host memory.
// TODO: this is in the wrong place, it should be in the PCIDevice.
// TODO: this is in the wrong place, it should be in the TTDevice.
// It should also happen at the same time the huge pages or sysmem buffers are
// allocated/pinned/mapped.
void Cluster::init_pcie_iatus() {
int num_enabled_devices = m_tt_device_map.size();
log_debug(LogSiliconDriver, "Cluster::init_pcie_iatus() num_enabled_devices: {}", num_enabled_devices);

for (auto& src_device_it : m_tt_device_map) {
int logical_id = src_device_it.first;
PCIDevice* src_pci_device = src_device_it.second->get_pci_device();
for (auto& [logical_id, tt_device] : m_tt_device_map) {
PCIDevice* pci_device = tt_device->get_pci_device();

// TODO: with the IOMMU case, I think we can get away with using just
// one iATU region for WH. (On BH, we don't need iATU). We can only
// cover slightly less than 4GB with WH, and the iATU can cover 4GB.
// Splitting it into multiple regions is fine, but it's not necessary.
//
// ... something to consider when this code is refactored into PCIDevice
// where it belongs.

// Device to Host (multiple channels)
for (int channel_id = 0; channel_id < src_pci_device->get_num_host_mem_channels(); channel_id++) {
hugepage_mapping hugepage_map = src_pci_device->get_hugepage_mapping(channel_id);
if (hugepage_map.mapping) {
std::uint32_t region_size = hugepage_map.mapping_size;
if (channel_id == 3) {
// Update: unfortunately this turned out to be unrealistic. For the
// IOMMU case, the easiest thing to do is fake that we have hugepages
// so we can support the hugepage-inspired API that the user application
// has come to rely on. In that scenario, it's simpler to treat such
// fake hugepages the same way we treat real ones -- even if underneath
// there is only a single buffer. Simple is good.
//
// With respect to BH: it turns out that Metal has hard-coded NOC
// addressing assumptions for sysmem access. First step to fix this is
// have Metal ask us where sysmem is at runtime, and use that value in
// on-device code. Until then, we're stuck programming iATU. A more
// forward-looking solution is to abandon the sysmem API entirely, and
// have the application assume a more active role in managing the memory
// shared between host and device. UMD would be relegated to assisting
// the application set up and tear down the mappings. This is probably
// unrealistic for GS/WH, but it's a good goal for BH.
//
// Until then...
//
// For every 1GB channel of memory mapped for DMA, program an iATU
// region to map it to the underlying buffer's IOVA (IOMMU case) or PA
// (non-IOMMU case).
for (size_t channel = 0; channel < pci_device->get_num_host_mem_channels(); channel++) {
hugepage_mapping hugepage_map = pci_device->get_hugepage_mapping(channel);
size_t region_size = hugepage_map.mapping_size;

if (!hugepage_map.mapping) {
throw std::runtime_error(
fmt::format("Hugepages are not allocated for logical device id: {} ch: {}", logical_id, channel));
}

if (arch_name == tt::ARCH::BLACKHOLE) {
uint64_t base = channel * region_size;
uint64_t target = hugepage_map.physical_address;
tt_device->configure_iatu_region(channel, base, target, region_size);
} else {
// TODO: stop doing this. The intent was good, but it's not
// documented and nothing takes advantage of it.
if (channel == 3) {
region_size = HUGEPAGE_CHANNEL_3_SIZE_LIMIT;
}

// This log message doesn't look right.
log_debug(
LogSiliconDriver, "Configuring ATU channel {} to point to hugepage {}.", channel_id, logical_id);
iatu_configure_peer_region(logical_id, channel_id, hugepage_map.physical_address, region_size);

} else {
throw std::runtime_error(fmt::format(
"init_pcie_iatus: Hugepages are not allocated for logical device id: {} ch: {}",
logical_id,
channel_id));
// TODO: remove this and the Blackhole special case after ARC
// messaging is lowered to the TTDevice layer and we have a
// configure_iatu_region that works for GS/WH. Longer term it'd
// be nice to have KMD deal with iATU for us...
iatu_configure_peer_region(logical_id, channel, hugepage_map.physical_address, region_size);
}
}
}
Expand Down Expand Up @@ -1596,86 +1605,40 @@ int Cluster::pcie_arc_msg(
return exit_code;
}

// TODO: this method should be lowered into TTDevice, where a common
// implementation can be shared between GS/WH. The major obstacle to doing it
// (and the reason I'm leaving it alone for now) is the lack of ARC messaging
// support at that layer of abstraction.
int Cluster::iatu_configure_peer_region(
int logical_device_id, uint32_t peer_region_id, uint64_t bar_addr_64, uint32_t region_size) {
if (arch_name == tt::ARCH::BLACKHOLE) {
throw std::runtime_error("Don't call this for Blackhole");
}

uint32_t dest_bar_lo = bar_addr_64 & 0xffffffff;
uint32_t dest_bar_hi = (bar_addr_64 >> 32) & 0xffffffff;
std::uint32_t region_id_to_use = peer_region_id;

// TODO: stop doing this. It's related to HUGEPAGE_CHANNEL_3_SIZE_LIMIT.
if (peer_region_id == 3) {
region_id_to_use = 4; // Hack use region 4 for channel 3..this ensures that we have a smaller chan 3 address
// space with the correct start offset
}

TTDevice* tt_device = get_tt_device(logical_device_id);
PCIDevice* pci_device = tt_device->get_pci_device();
auto architecture_implementation = tt_device->get_architecture_implementation();

// BR: ARC doesn't work yet on Blackhole, so programming ATU directly. Should be removed when arc starts working.
// TODO: Remove when ARC is implemented on BH.
if (arch_name == tt::ARCH::BLACKHOLE) {
uint64_t base_addr = region_id_to_use * region_size;
uint64_t base_size = (region_id_to_use + 1) * region_size;
uint64_t limit_address = base_addr + base_size - 1;

uint32_t region_ctrl_1 = 1 << 13; // INCREASE_REGION_SIZE = 1
uint32_t region_ctrl_2 = 1 << 31; // REGION_EN = 1
uint32_t region_ctrl_3 = 0;
uint32_t base_addr_lo = base_addr & 0xffffffff;
uint32_t base_addr_hi = (base_addr >> 32) & 0xffffffff;
uint32_t limit_address_lo = limit_address & 0xffffffff;
uint32_t limit_address_hi = (limit_address >> 32) & 0xffffffff;

uint64_t iatu_index = 0;
uint64_t iatu_base = UNROLL_ATU_OFFSET_BAR + iatu_index * 0x200;

tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x00),
&region_ctrl_1,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x04),
&region_ctrl_2,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x08),
&base_addr_lo,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x0c),
&base_addr_hi,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x10),
&limit_address_lo,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x14),
&dest_bar_lo,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x18),
&dest_bar_hi,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x1c),
&region_ctrl_3,
1);
tt_device->write_regs(
reinterpret_cast<std::uint32_t*>(static_cast<uint8_t*>(pci_device->bar2_uc) + iatu_base + 0x20),
&limit_address_hi,
1);
} else {
bar_write32(
logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 0 * 4, region_id_to_use);
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 1 * 4, dest_bar_lo);
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 2 * 4, dest_bar_hi);
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 3 * 4, region_size);
arc_msg(
logical_device_id,
0xaa00 | architecture_implementation->get_arc_message_setup_iatu_for_peer_to_peer(),
true,
0,
0);
}
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 0 * 4, region_id_to_use);
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 1 * 4, dest_bar_lo);
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 2 * 4, dest_bar_hi);
bar_write32(logical_device_id, architecture_implementation->get_arc_csm_mailbox_offset() + 3 * 4, region_size);
arc_msg(
logical_device_id,
0xaa00 | architecture_implementation->get_arc_message_setup_iatu_for_peer_to_peer(),
true,
0,
0);

// Print what just happened
uint32_t peer_region_start = region_id_to_use * region_size;
62 changes: 0 additions & 62 deletions device/hugepage.cpp
@@ -37,68 +37,6 @@ uint32_t get_num_hugepages() {
return num_hugepages;
}

uint32_t get_available_num_host_mem_channels(
const uint32_t num_channels_per_device_target, const uint16_t device_id, const uint16_t revision_id) {
// To minimally support hybrid dev systems with mix of ARCH, get only devices matching current ARCH's device_id.
uint32_t total_num_tt_mmio_devices = tt::cpuset::tt_cpuset_allocator::get_num_tt_pci_devices();
uint32_t num_tt_mmio_devices_for_arch =
tt::cpuset::tt_cpuset_allocator::get_num_tt_pci_devices_by_pci_device_id(device_id, revision_id);
uint32_t total_hugepages = get_num_hugepages();

// This shouldn't happen on silicon machines.
if (num_tt_mmio_devices_for_arch == 0) {
log_warning(
LogSiliconDriver,
"No TT devices found that match PCI device_id: 0x{:x} revision: {}, returning NumHostMemChannels:0",
device_id,
revision_id);
return 0;
}

// GS will use P2P + 1 channel, others may support 4 host channels. Apply min of 1 to not completely break setups
// that were incomplete ie fewer hugepages than devices, which would partially work previously for some devices.
uint32_t num_channels_per_device_available =
std::min(num_channels_per_device_target, std::max((uint32_t)1, total_hugepages / num_tt_mmio_devices_for_arch));

// Perform some helpful assertion checks to guard against common pitfalls that would show up as runtime issues later
// on.
if (total_num_tt_mmio_devices > num_tt_mmio_devices_for_arch) {
log_warning(
LogSiliconDriver,
"Hybrid system mixing different TTDevices - this is not well supported. Ensure sufficient "
"Hugepages/HostMemChannels per device.");
}

if (total_hugepages < num_tt_mmio_devices_for_arch) {
log_warning(
LogSiliconDriver,
"Insufficient NumHugepages: {} should be at least NumMMIODevices: {} for device_id: 0x{:x} revision: {}. "
"NumHostMemChannels would be 0, bumping to 1.",
total_hugepages,
num_tt_mmio_devices_for_arch,
device_id,
revision_id);
}

if (num_channels_per_device_available < num_channels_per_device_target) {
log_warning(
LogSiliconDriver,
"NumHostMemChannels: {} used for device_id: 0x{:x} less than target: {}. Workload will fail if it exceeds "
"NumHostMemChannels. Increase Number of Hugepages.",
num_channels_per_device_available,
device_id,
num_channels_per_device_target);
}

log_assert(
num_channels_per_device_available <= g_MAX_HOST_MEM_CHANNELS,
"NumHostMemChannels: {} exceeds supported maximum: {}, this is unexpected.",
num_channels_per_device_available,
g_MAX_HOST_MEM_CHANNELS);

return num_channels_per_device_available;
}

std::string find_hugepage_dir(std::size_t pagesize) {
static const std::regex hugetlbfs_mount_re(
fmt::format("^(nodev|hugetlbfs) ({}) hugetlbfs ([^ ]+) 0 0$", hugepage_dir));
6 changes: 3 additions & 3 deletions device/pcie/pci_device.cpp
@@ -526,10 +526,10 @@ bool PCIDevice::init_iommu(size_t size) {
return true;
}

int PCIDevice::get_num_host_mem_channels() const { return hugepage_mapping_per_channel.size(); }
size_t PCIDevice::get_num_host_mem_channels() const { return hugepage_mapping_per_channel.size(); }

hugepage_mapping PCIDevice::get_hugepage_mapping(int channel) const {
if (channel < 0 || hugepage_mapping_per_channel.size() <= channel) {
hugepage_mapping PCIDevice::get_hugepage_mapping(size_t channel) const {
if (hugepage_mapping_per_channel.size() <= channel) {
return {nullptr, 0, 0};
} else {
return hugepage_mapping_per_channel[channel];
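
As a usage note (sketch only, assuming the same includes and namespace context as cluster.cpp, with `all_channels_mapped` being a hypothetical helper): with the size_t signature, a caller can iterate channels and rely on the empty mapping as an "unallocated" sentinel, which is how `Cluster::init_pcie_iatus()` detects missing sysmem.

```cpp
// Sketch only: verify that every host memory channel has a backing buffer
// (a real hugepage, or the IOMMU-mapped buffer standing in for one).
bool all_channels_mapped(const PCIDevice& pci_device) {
    for (size_t channel = 0; channel < pci_device.get_num_host_mem_channels(); ++channel) {
        hugepage_mapping map = pci_device.get_hugepage_mapping(channel);
        if (map.mapping == nullptr) {
            return false;  // channel exists but nothing is mapped behind it
        }
    }
    return true;
}
```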