Detecting arch through cluster descriptor (#15564)
### Ticket
Related to #13948

### Problem description
`detect_arch` lived in the global namespace. This change moves that
functionality into the cluster descriptor, where it belongs. It also now
works with logical IDs rather than PCI device enumeration IDs.
Related UMD change: tenstorrent/tt-umd#345

### What's changed
- Replaced the global `detect_arch` with `tt_ClusterDescriptor::get_arch()`.
- Renamed `BoardType::DEFAULT` to `BoardType::UNKNOWN`.
- In `get_platform_architecture()`, replaced `detect_arch` with
`PCIDevice::enumerate_devices_info()`, because
`tt_ClusterDescriptor::create()` does not work there.
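
As an illustration of the consistency check that both `get_umd_arch_name()` and `get_platform_architecture()` perform after this change, here is a minimal standalone sketch. `Arch`, `DeviceInfo`, and `common_arch` are hypothetical stand-ins invented for this example, not the UMD or tt-metal types. The real code uses `tt::ARCH`, reports mismatches through `TT_FATAL`, and iterates the container returned by `PCIDevice::enumerate_devices_info()` or `get_all_chips()`.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for tt::ARCH (assumption, not the real enum).
enum class Arch { Invalid, Grayskull, WormholeB0, Blackhole };

// Hypothetical stand-in for the per-device PCI info entry.
struct DeviceInfo {
    Arch arch;
    Arch get_arch() const { return arch; }
};

// Returns the architecture shared by every device in the map.
// Returns Arch::Invalid when no devices are present, and throws
// when two devices disagree (the real code calls TT_FATAL instead).
inline Arch common_arch(const std::map<int, DeviceInfo>& devices_info) {
    Arch arch = Arch::Invalid;
    if (devices_info.empty()) {
        return arch;
    }
    // Take the first device as the reference architecture.
    arch = devices_info.begin()->second.get_arch();
    for (const auto& [device_id, device_info] : devices_info) {
        if (device_info.get_arch() != arch) {
            throw std::runtime_error(
                "Expected all devices to share one architecture, but device " +
                std::to_string(device_id) + " differs");
        }
    }
    return arch;
}
```

Iterating the whole container, including the first entry, keeps the loop simple; comparing the reference device with itself is harmless.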

### Checklist
- [x] All post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139731818
- [x] Blackhole post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139733483
- [ ] (Single-card) Model perf tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139735518
- [x] (Single-card) Device perf regressions :
https://github.com/tenstorrent/tt-metal/actions/runs/12139737448
- [x] (T3K) T3000 unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139739342
- [x] (T3K) T3000 demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139741041
- [x] (TG) TG unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139742806
- [x] (TG) TG demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139744773
- [ ] (TGG) TGG unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139746353
- [x] (TGG) TGG demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/12139748257
- [x] All post-commit tests on last commit:
https://github.com/tenstorrent/tt-metal/actions/runs/12178722655
broskoTT authored Dec 5, 2024
1 parent a318130 commit 5c53639
Showing 5 changed files with 20 additions and 17 deletions.
11 changes: 6 additions & 5 deletions tests/tt_metal/test_utils/env_vars.hpp
@@ -6,6 +6,7 @@
 #include "common/utils.hpp"

 #include "umd/device/device_api_metal.h"
+#include "umd/device/tt_cluster_descriptor.h"

 #include <string>

@@ -43,11 +44,11 @@ inline std::string get_umd_arch_name() {
         return get_env_arch_name();
     }

-    std::vector<chip_id_t> physical_mmio_device_ids = tt::umd::Cluster::detect_available_device_ids();
-    tt::ARCH arch = detect_arch(physical_mmio_device_ids.at(0));
-    for (int dev_index = 1; dev_index < physical_mmio_device_ids.size(); dev_index++) {
-        chip_id_t device_id = physical_mmio_device_ids.at(dev_index);
-        tt::ARCH detected_arch = detect_arch(device_id);
+    auto cluster_desc = tt_ClusterDescriptor::create();
+    const std::unordered_set<chip_id_t> &device_ids = cluster_desc->get_all_chips();
+    tt::ARCH arch = cluster_desc->get_arch(*device_ids.begin());
+    for (auto device_id : device_ids) {
+        tt::ARCH detected_arch = cluster_desc->get_arch(device_id);
         TT_FATAL(
             arch == detected_arch,
             "Expected all devices to be {} but device {} is {}",
2 changes: 1 addition & 1 deletion tt_metal/common/metal_soc_descriptor.cpp
@@ -357,7 +357,7 @@ void metal_SocDescriptor::update_pcie_cores(const BoardType& board_type) {
         return;
     }
     switch (board_type) {
-        case DEFAULT: { // Workaround for BHs running FW that does not return board type in the cluster yaml
+        case UNKNOWN: { // Workaround for BHs running FW that does not return board type in the cluster yaml
             this->pcie_cores = {CoreCoord(11, 0)};
         } break;
         case P150A: {
20 changes: 11 additions & 9 deletions tt_metal/llrt/get_platform_architecture.hpp
@@ -8,7 +8,8 @@

 #include "tt_metal/common/tt_backend_api_types.hpp"
 #include "tt_metal/common/assert.hpp"
-#include "umd/device/cluster.h"
+#include "umd/device/pci_device.hpp"
+#include "umd/device/tt_soc_descriptor.h"

 namespace tt::tt_metal {

@@ -47,8 +48,7 @@ namespace tt::tt_metal {
  * @endcode
  *
  * @see tt::get_arch_from_string
- * @see tt::umd::Cluster::detect_available_device_ids
- * @see detect_arch
+ * @see PCIDevice::enumerate_devices_info
  */
 inline tt::ARCH get_platform_architecture() {
     auto arch = tt::ARCH::Invalid;
@@ -57,12 +57,14 @@ inline tt::ARCH get_platform_architecture() {
         TT_FATAL(arch_env, "ARCH_NAME env var needed for VCS");
         arch = tt::get_arch_from_string(arch_env);
     } else {
-        std::vector<chip_id_t> physical_mmio_device_ids = tt::umd::Cluster::detect_available_device_ids();
-        if (!physical_mmio_device_ids.empty()) {
-            arch = detect_arch(physical_mmio_device_ids.at(0));
-            for (int i = 1; i < physical_mmio_device_ids.size(); ++i) {
-                chip_id_t device_id = physical_mmio_device_ids.at(i);
-                tt::ARCH detected_arch = detect_arch(device_id);
+
+        // Issue tt_umd#361: tt_ClusterDescriptor::create() won't work here.
+        // This map holds PCI info for each mmio chip.
+        auto devices_info = PCIDevice::enumerate_devices_info();
+        if (devices_info.size() > 0) {
+            arch = devices_info.begin()->second.get_arch();
+            for (auto &[device_id, device_info] : devices_info) {
+                tt::ARCH detected_arch = device_info.get_arch();
                 TT_FATAL(
                     arch == detected_arch,
                     "Expected all devices to be {} but device {} is {}",
2 changes: 1 addition & 1 deletion tt_metal/third_party/umd
Submodule umd updated 40 files
+9 −21 cmake/dependencies.cmake
+22 −7 device/CMakeLists.txt
+1 −0 device/api/umd/device/architecture_implementation.h
+39 −0 device/api/umd/device/blackhole_coordinate_manager.h
+44 −4 device/api/umd/device/blackhole_implementation.h
+0 −5 device/api/umd/device/cluster.h
+158 −52 device/api/umd/device/coordinate_manager.h
+30 −0 device/api/umd/device/grayskull_coordinate_manager.h
+31 −4 device/api/umd/device/grayskull_implementation.h
+30 −1 device/api/umd/device/pci_device.hpp
+5 −1 device/api/umd/device/tt_cluster_descriptor.h
+61 −0 device/api/umd/device/tt_core_coordinates.h
+14 −39 device/api/umd/device/tt_soc_descriptor.h
+0 −56 device/api/umd/device/tt_xy_pair.h
+31 −0 device/api/umd/device/wormhole_coordinate_manager.h
+37 −4 device/api/umd/device/wormhole_implementation.h
+176 −18 device/blackhole/blackhole_coordinate_manager.cpp
+0 −23 device/blackhole/blackhole_coordinate_manager.h
+10 −33 device/cluster.cpp
+481 −121 device/coordinate_manager.cpp
+55 −0 device/grayskull/grayskull_coordinate_manager.cpp
+0 −16 device/grayskull/grayskull_coordinate_manager.h
+109 −3 device/pcie/pci_device.cpp
+19 −10 device/simulation/tt_simulation_device.cpp
+36 −5 device/tt_cluster_descriptor.cpp
+11 −55 device/tt_soc_descriptor.cpp
+59 −19 device/wormhole/wormhole_coordinate_manager.cpp
+0 −27 device/wormhole/wormhole_coordinate_manager.h
+2 −5 tests/CMakeLists.txt
+3 −3 tests/api/CMakeLists.txt
+23 −13 tests/api/test_cluster_descriptor.cpp
+463 −0 tests/api/test_core_coord_translation_bh.cpp
+244 −0 tests/api/test_core_coord_translation_gs.cpp
+283 −0 tests/api/test_core_coord_translation_wh.cpp
+0 −210 tests/api/test_soc_descriptor_bh.cpp
+0 −151 tests/api/test_soc_descriptor_gs.cpp
+0 −181 tests/api/test_soc_descriptor_wh.cpp
+2 −3 tests/microbenchmark/device_fixture.hpp
+0 −18 tests/test_utils/soc_desc_test_utils.hpp
+5 −0 tests/wormhole/test_silicon_driver_wh.cpp
@@ -43,7 +43,7 @@ operation::ProgramWithCallbacks Prod_op::create_program(
 Tensor prod_all(const Tensor& input, const MemoryConfig& output_mem_config) {
     Tensor result = ttnn::tiled_prod(
         operation::run(Prod_op{.output_mem_config = output_mem_config}, {input}).at(0), output_mem_config);
-    auto arch_env = detect_arch();
+    auto arch_env = tt_ClusterDescriptor::detect_arch((chip_id_t)0);
     if (arch_env == tt::ARCH::WORMHOLE_B0) {
         return ttnn::numpy::prod_result_computation_WH_B0<bfloat16>(
             result, result.get_dtype(), result.get_layout(), result.device(), output_mem_config);
