
Uniformly populate tensor specs per device, when allocating a mesh tensor #18833

Merged

Conversation

omilyutin-tt
Contributor

Ticket

N/A

Problem description

Several tests that exercise the path that copies a tensor into pre-allocated storage are failing.

What's changed

This aligns the behavior with that of the existing MultiDeviceStorage on main. See this in particular:

Tensor allocate_tensor_on_devices(const TensorSpec& tensor_spec, const std::vector<IDevice*>& devices) {
    // Top level wrapper to asynchronously create a device tensor (single- or multi-device).
    Tensor device_tensor = Tensor(devices);
    // Save the ref count to later re-set it:
    // 1. device_tensor is copied in the lambda by the main thread, which increments the ref count.
    // 2. The destruction happens in a worker thread, which doesn't decrement the ref count.
    const uint32_t device_tensor_ref_count = device_tensor.tensor_attributes->record_main_thread_ref_count();
    const auto& workers_in_use = device_tensor.get_workers();
    uint32_t num_workers = workers_in_use.size();
    for (int worker_index = 0; worker_index < num_workers; ++worker_index) {
        auto& worker = devices[worker_index];
        worker->push_work([worker, device_tensor, tensor_spec, worker_index]() mutable {
            auto local_tensor = create_device_tensor(tensor_spec, worker);
            insert_buffer_and_shape_for_device(worker, local_tensor, device_tensor, worker_index);
            uint32_t num_workers_completed = (device_tensor.tensor_attributes->num_workers_completed)++;
            if (not num_workers_completed) {
                device_tensor.set_tensor_spec(tensor_spec);
            }
        });
    }
    device_tensor.tensor_attributes->update_main_thread_ref_count(workers_in_use.at(0), device_tensor_ref_count);
    return device_tensor;
}
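
For context on what "uniformly populate tensor specs per device" means for the mesh path, here is a small self-contained sketch. The types below are toy stand-ins, not the real tt-metal MeshCoordinate/TensorSpec classes, and the helper name is hypothetical; it only illustrates the idea of giving every mesh coordinate its own spec entry at allocation time instead of leaving the map empty until a write happens.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>

// Toy stand-ins for the real tt-metal types; only here to make the sketch compile.
struct MeshCoordinate {
    uint32_t row = 0;
    uint32_t col = 0;
    bool operator<(const MeshCoordinate& other) const {
        return std::tie(row, col) < std::tie(other.row, other.col);
    }
};

struct TensorSpec {
    std::string shape;  // e.g. "(1,1,32,32)"
};

// Hypothetical helper: build one spec entry per mesh coordinate, so the storage
// never has "missing" specs for devices that happen not to be written.
std::map<MeshCoordinate, TensorSpec> populate_specs_uniformly(
    uint32_t rows, uint32_t cols, const TensorSpec& spec) {
    std::map<MeshCoordinate, TensorSpec> specs;
    for (uint32_t r = 0; r < rows; ++r) {
        for (uint32_t c = 0; c < cols; ++c) {
            specs.emplace(MeshCoordinate{r, c}, spec);
        }
    }
    return specs;
}

int main() {
    // A 2x4 mesh (e.g. a T3K) where every device gets the same (1,1,32,32) spec.
    const auto specs = populate_specs_uniformly(2, 4, TensorSpec{"(1,1,32,32)"});
    std::cout << "populated " << specs.size() << " per-device specs\n";
    return 0;
}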

@@ -387,6 +387,7 @@ void MeshDevice::reshape(const MeshShape& new_shape) {
 }

 bool MeshDevice::close() {
+    mesh_command_queues_.clear();
Contributor Author

@tt-asaigal added your fix here as well. That was it, right?

Contributor

yup, thank you!

Comment on lines 53 to +57
    DeviceStorage(std::shared_ptr<Buffer> buffer_);
    DeviceStorage(std::shared_ptr<distributed::MeshBuffer> mesh_buffer_);
    DeviceStorage(
        std::shared_ptr<distributed::MeshBuffer> mesh_buffer_,
        std::map<distributed::MeshCoordinate, TensorSpec> specs_,
        DistributedTensorConfig strategy_);
Contributor Author

This is not the final structure. For the single device path with std::shared_ptr<Buffer>, we never populate the specs or the strategy, and we don't touch them either.

@tt-asaigal
Contributor

I'm sorry, I don't fully understand this comment. The single device Buffer path will be deleted with TT-Mesh in the picture. Don't we need per-device specs for uneven cases? Or are you saying that this will be refactored?

Contributor Author

The single device Buffer path will be deleted with TT-Mesh in the picture.

Yep. I was saying that if there is a single Buffer we don't populate specs. But we need these specs for MeshBuffer to support uneven sharding - right now the structure doesn't capture this representation correctly, since you can access / manually set specs even for a single Buffer.
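
To make the concern above concrete, here is one hypothetical way (a sketch with toy placeholder types, not the PR's code) to structure the storage so that per-coordinate specs and the distribution strategy simply do not exist on the single-Buffer path:

#include <map>
#include <memory>
#include <tuple>
#include <variant>

// Toy placeholders for the real types discussed above.
struct Buffer {};
struct MeshBuffer {};
struct TensorSpec {};
struct DistributedTensorConfig {};
struct MeshCoordinate {
    int row = 0;
    int col = 0;
    bool operator<(const MeshCoordinate& other) const {
        return std::tie(row, col) < std::tie(other.row, other.col);
    }
};

// Single-device storage: just a buffer, with no specs or strategy to misuse.
struct SingleDeviceStorage {
    std::shared_ptr<Buffer> buffer;
};

// Mesh storage: per-coordinate specs (needed for uneven sharding) plus strategy.
struct MeshDeviceStorage {
    std::shared_ptr<MeshBuffer> mesh_buffer;
    std::map<MeshCoordinate, TensorSpec> specs;
    DistributedTensorConfig strategy;
};

// A variant captures the invariant in the type system: specs can only be
// accessed or set when the storage actually wraps a MeshBuffer.
using DeviceStorageSketch = std::variant<SingleDeviceStorage, MeshDeviceStorage>;

int main() {
    DeviceStorageSketch storage = SingleDeviceStorage{std::make_shared<Buffer>()};
    // Touching specs here is impossible; only the MeshDeviceStorage
    // alternative carries them.
    return std::holds_alternative<SingleDeviceStorage>(storage) ? 0 : 1;
}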

@omilyutin-tt merged commit c97a8cf into jchu/ttnn-integration-with-mesh on Mar 8, 2025
1 check passed
@omilyutin-tt deleted the omilyutin/mesh-storage-fix branch on March 8, 2025 at 06:44
@tt-asaigal
Contributor

thanks @omilyutin-tt! To confirm: the issue was with specs not being initialized for MeshBuffer-backed DeviceStorage?

@omilyutin-tt
Contributor Author

To confirm: the issue was with specs not being initialized for MeshBuffer-backed DeviceStorage?

Yeah, and the reason we have to do it is that copy_host_to_device_tensor / write_tensor accept the destination tensor by value. Right now, inside these two functions we modify the destination specs based on the source tensor specs, but this modification is a no-op because we are working on a copy.

This PR resolves the issue by pre-populating all of the specs on the destination tensor, even if the tensor is "empty". This is what we do on main.

However, a cleaner and more robust approach would be to pass the destination by reference and then modify the specs based on the source tensor. Because of the existing behavior on main and on this branch, we have the following bug (pseudocode):

// Allocate a tensor on the mesh. Uniform specs assume each device holds a (1,1,32,32) tensor.
t1 = allocate_mesh_tensor(shape=(1,1,32,32))
// Uneven sharding on T3K: the first 7 devices get (1,1,32,32) shards; the 8th device is empty.
t2 = from_torch(shape=(1, 7, 32, 32), ... mesh_mapper=shard_tensor_to_mesh(dim=1))
copy_host_to_device_tensor(t2, t1)
// Bug! According to the original specs on t1, the last device has data, but we never wrote to it.
t1.cpu()

I'm going to play with this more and put together a test case for a repro. We don't run into this today because it requires 1) going through the Python copy_host_to_device_tensor path and 2) using uneven sharding; it seems we just don't use that combination.
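
The by-value pitfall described above is easy to see in isolation. The following is a minimal toy sketch (plain C++, no ttnn types): a function that takes its destination by value silently drops any spec updates, while the by-reference version keeps them.

#include <cassert>
#include <map>

// Toy tensor whose "specs" are keyed by device index.
struct ToyTensor {
    std::map<int, int> specs;
};

// Destination taken by value: the update lands on a copy and is lost,
// which mirrors the no-op spec modification described above.
void write_by_value(ToyTensor dst) { dst.specs[0] = 42; }

// Destination taken by reference: the update is visible to the caller.
void write_by_reference(ToyTensor& dst) { dst.specs[0] = 42; }

int main() {
    ToyTensor t;
    write_by_value(t);
    assert(t.specs.empty());        // the by-value write was a no-op
    write_by_reference(t);
    assert(t.specs.count(0) == 1);  // the by-reference write stuck
    return 0;
}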

@tt-asaigal
Contributor

Okay this makes sense, thanks Oleg!
