
Uniformly populate tensor specs per device, when allocating a mesh tensor #18833

Merged

Conversation

omilyutin-tt
Contributor

Ticket

N/A

Problem description

Several tests that exercise the path that copies a tensor into pre-allocated storage are failing.

What's changed

This aligns the behavior with that of the existing MultiDeviceStorage on main. See this in particular:

Tensor allocate_tensor_on_devices(const TensorSpec& tensor_spec, const std::vector<IDevice*>& devices) {
    // Top level wrapper to asynchronously create a device tensor (single- or multi-device).
    Tensor device_tensor = Tensor(devices);
    // Save the ref count to later re-set it:
    // 1. device_tensor is copied in the lambda by the main thread, which increments the ref count.
    // 2. The destruction happens in a worker thread, which doesn't decrement the ref count.
    const uint32_t device_tensor_ref_count = device_tensor.tensor_attributes->record_main_thread_ref_count();
    const auto& workers_in_use = device_tensor.get_workers();
    uint32_t num_workers = workers_in_use.size();
    for (int worker_index = 0; worker_index < num_workers; ++worker_index) {
        auto& worker = devices[worker_index];
        worker->push_work([worker, device_tensor, tensor_spec, worker_index]() mutable {
            auto local_tensor = create_device_tensor(tensor_spec, worker);
            insert_buffer_and_shape_for_device(worker, local_tensor, device_tensor, worker_index);
            uint32_t num_workers_completed = (device_tensor.tensor_attributes->num_workers_completed)++;
            if (not num_workers_completed) {
                device_tensor.set_tensor_spec(tensor_spec);
            }
        });
    }
    device_tensor.tensor_attributes->update_main_thread_ref_count(workers_in_use.at(0), device_tensor_ref_count);
    return device_tensor;
}
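
For context on what "uniformly populate tensor specs per device" means for the mesh path, here is a small self-contained sketch. The types below are toy stand-ins, not the real tt-metal MeshCoordinate/TensorSpec classes, and the helper name is hypothetical; it only illustrates the idea of giving every mesh coordinate its own spec entry at allocation time instead of leaving the map empty until a write happens.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// Toy stand-ins for the real tt-metal types; only here to make the sketch compile.
struct MeshCoordinate {
    uint32_t row = 0;
    uint32_t col = 0;
    bool operator<(const MeshCoordinate& other) const {
        return std::tie(row, col) < std::tie(other.row, other.col);
    }
};

struct TensorSpec {
    std::string shape;  // e.g. "(1,1,32,32)"
};

// Hypothetical helper: build one spec entry per mesh coordinate, so the storage
// never has "missing" specs for devices that happen not to be written.
std::map<MeshCoordinate, TensorSpec> populate_specs_uniformly(
    uint32_t rows, uint32_t cols, const TensorSpec& spec) {
    std::map<MeshCoordinate, TensorSpec> specs;
    for (uint32_t r = 0; r < rows; ++r) {
        for (uint32_t c = 0; c < cols; ++c) {
            specs.emplace(MeshCoordinate{r, c}, spec);
        }
    }
    return specs;
}

int main() {
    // A 2x4 mesh (e.g. a T3K) where every device gets the same (1,1,32,32) spec.
    const auto specs = populate_specs_uniformly(2, 4, TensorSpec{"(1,1,32,32)"});
    std::cout << "populated " << specs.size() << " per-device specs\n";
    return 0;
}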

@@ -387,6 +387,7 @@ void MeshDevice::reshape(const MeshShape& new_shape) {
 }

 bool MeshDevice::close() {
+    mesh_command_queues_.clear();
Contributor Author

@tt-asaigal added your fix here as well. That was it, right?

Contributor

yup, thank you!

Comment on lines 53 to +57
    DeviceStorage(std::shared_ptr<Buffer> buffer_);
    DeviceStorage(std::shared_ptr<distributed::MeshBuffer> mesh_buffer_);
    DeviceStorage(
        std::shared_ptr<distributed::MeshBuffer> mesh_buffer_,
        std::map<distributed::MeshCoordinate, TensorSpec> specs_,
        DistributedTensorConfig strategy_);
Contributor Author

This is not the final structure. For the single device path with std::shared_ptr<Buffer>, we never populate the specs or the strategy, and we don't touch them either.

@tt-asaigal
Contributor

I'm sorry, I don't fully understand this comment. The single device Buffer path will be deleted with TT-Mesh in the picture. Don't we need per-device specs for uneven cases? Or are you saying that this will be refactored?

Contributor Author

The single device Buffer path will be deleted with TT-Mesh in the picture.

Yep. I was saying that if there is a single Buffer we don't populate specs. But we need these specs for MeshBuffer to support uneven sharding - right now the structure doesn't capture this representation correctly, since you can access / manually set specs even for a single Buffer.
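
To make the concern above concrete, here is one hypothetical way (a sketch with toy placeholder types, not the PR's code) to structure the storage so that per-coordinate specs and the distribution strategy simply do not exist on the single-Buffer path:

#include <map>
#include <memory>
#include <tuple>
#include <variant>

// Toy placeholders for the real types discussed above.
struct Buffer {};
struct MeshBuffer {};
struct TensorSpec {};
struct DistributedTensorConfig {};
struct MeshCoordinate {
    int row = 0;
    int col = 0;
    bool operator<(const MeshCoordinate& other) const {
        return std::tie(row, col) < std::tie(other.row, other.col);
    }
};

// Single-device storage: just a buffer, with no specs or strategy to misuse.
struct SingleDeviceStorage {
    std::shared_ptr<Buffer> buffer;
};

// Mesh storage: per-coordinate specs (needed for uneven sharding) plus strategy.
struct MeshDeviceStorage {
    std::shared_ptr<MeshBuffer> mesh_buffer;
    std::map<MeshCoordinate, TensorSpec> specs;
    DistributedTensorConfig strategy;
};

// A variant captures the invariant in the type system: specs can only be
// accessed or set when the storage actually wraps a MeshBuffer.
using DeviceStorageSketch = std::variant<SingleDeviceStorage, MeshDeviceStorage>;

int main() {
    DeviceStorageSketch storage = SingleDeviceStorage{std::make_shared<Buffer>()};
    // Touching specs here is impossible; only the MeshDeviceStorage
    // alternative carries them.
    return std::holds_alternative<SingleDeviceStorage>(storage) ? 0 : 1;
}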

@omilyutin-tt merged commit c97a8cf into jchu/ttnn-integration-with-mesh on Mar 8, 2025
1 check passed
@omilyutin-tt deleted the omilyutin/mesh-storage-fix branch on March 8, 2025 at 06:44
@tt-asaigal
Contributor

thanks @omilyutin-tt! To confirm: the issue was with specs not being initialized for MeshBuffer-backed DeviceStorage?

@omilyutin-tt
Contributor Author

To confirm: the issue was with specs not being initialized for MeshBuffer-backed DeviceStorage?

Yeah, and the reason we have to do it is that copy_host_to_device_tensor / write_tensor accept the destination tensor by value. Right now, inside these two functions we modify the destination specs based on the source tensor specs, but this modification is a no-op because we are working on a copy.

This PR resolves the issue by pre-populating all of the specs on the destination tensor, even if the tensor is "empty". This is what we do on main.

However, a cleaner and more robust approach would be to pass the destination by reference and then modify the specs based on the source tensor. Because of the existing behavior on main and on this branch, we have the following bug (pseudocode):

// Allocate a tensor on the mesh. Uniform specs assume each device holds a (1,1,32,32) tensor.
t1 = allocate_mesh_tensor(shape=(1,1,32,32))
// Uneven sharding on T3K: the first 7 devices get (1,1,32,32) shards; the 8th device is empty.
t2 = from_torch(shape=(1, 7, 32, 32), ... mesh_mapper=shard_tensor_to_mesh(dim=1))
copy_host_to_device_tensor(t2, t1)
// Bug! According to the original specs on t1, the last device has data, but we never wrote to it.
t1.cpu()

I'm going to play with this more and put together a test case for a repro. We don't run into this today because it requires 1) going through the Python copy_host_to_device_tensor path and 2) using uneven sharding; it seems we just don't use that combination.
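
The by-value pitfall described above is easy to see in isolation. The following is a minimal toy sketch (plain C++, no ttnn types): a function that takes its destination by value silently drops any spec updates, while the by-reference version keeps them.

#include <cassert>
#include <map>

// Toy tensor whose "specs" are keyed by device index.
struct ToyTensor {
    std::map<int, int> specs;
};

// Destination taken by value: the update lands on a copy and is lost,
// which mirrors the no-op spec modification described above.
void write_by_value(ToyTensor dst) { dst.specs[0] = 42; }

// Destination taken by reference: the update is visible to the caller.
void write_by_reference(ToyTensor& dst) { dst.specs[0] = 42; }

int main() {
    ToyTensor t;
    write_by_value(t);
    assert(t.specs.empty());        // the by-value write was a no-op
    write_by_reference(t);
    assert(t.specs.count(0) == 1);  // the by-reference write stuck
    return 0;
}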

@tt-asaigal
Contributor

Okay this makes sense, thanks Oleg!
