Uniformly populate tensor specs per device, when allocating a mesh tensor #18833
Conversation
@@ -387,6 +387,7 @@ void MeshDevice::reshape(const MeshShape& new_shape) {
}

bool MeshDevice::close() {
    mesh_command_queues_.clear();
@tt-asaigal added your fix here as well. That was it, right?
yup, thank you!
DeviceStorage(std::shared_ptr<Buffer> buffer_);
DeviceStorage(std::shared_ptr<distributed::MeshBuffer> mesh_buffer_);
DeviceStorage(
    std::shared_ptr<distributed::MeshBuffer> mesh_buffer_,
    std::map<distributed::MeshCoordinate, TensorSpec> specs_,
    DistributedTensorConfig strategy_);
This is not the final structure. For the single-device path with std::shared_ptr<Buffer>, we never populate the specs and the strategy, and we don't touch them either.
I'm sorry, I don't fully understand this comment. The single device Buffer path will be deleted with TT-Mesh in the picture. Don't we need per-device specs for uneven cases? Or are you saying that this will be refactored?
> The single device Buffer path will be deleted with TT-Mesh in the picture.
Yep. I was saying that if there is a single Buffer, we don't populate specs. But we need these specs in MeshBuffer to support uneven sharding. Right now the structure doesn't capture this representation correctly, since you can access / manually set specs for a single Buffer.
thanks @omilyutin-tt! To confirm: the issue was with specs not being initialized for
Yeah. This PR resolves the issue by pre-populating all of the specs on the destination tensor, even if the tensor is "empty". This is what we do on main. However, a cleaner and more robust way would be to pass the destination by reference, and then modify the specs as per the source tensor. Because of the existing behavior on main and on this branch, we have the following bug (pseudo code):
I'm gonna play with this more, and have a test case for repro. We don't run into this because we need to 1) go through Python
Okay, this makes sense, thanks Oleg!
Ticket
N/A
Problem description
Several failing tests exercise a path that copies a tensor into pre-allocated storage.
What's changed
This aligns the behavior with the existing MultiDeviceStorage on main. See this in particular: tt-metal/ttnn/cpp/ttnn/tensor/tensor.cpp, lines 1016 to 1041 in bed8ae9.