Create specific TTNN op for empty tensor on host #2006
Open
vwellsTT wants to merge 28 commits into main from vwells/ttnn_empty_op_for_host
Conversation
(#2127) Implemented conversion patterns between dialects and flatbuffer. Added test cases for all valid multichip mesh configurations (1x2, 1x8, 2x4, 1x32, 8x4).
Add a script that splits a stablehlo module into sub-modules representing each op individually. You need to run `pip install -r tools/stablehlo_splitter/requirements.txt` to install dependencies.

### Ticket
#2317

### Problem description
Front-ends (tt-xla, tt-torch) want to run stablehlo graphs op-by-op. We need a mechanism to break down stablehlo modules into standalone sub-modules corresponding to each op individually.

### What's changed
Added `tools/stablehlo_splitter/shlo_split.py` and `tools/stablehlo_splitter/requirements.txt`

```
# SPDX-FileCopyrightText: (c) 2024 Tenstorrent AI ULC
#
# SPDX-License-Identifier: Apache-2.0

## pip install stablehlo -f https://github.com/openxla/stablehlo/releases/expanded_assets/dev-wheels
from mlir.ir import Context, Module
import mlir.dialects.stablehlo as stablehlo


def parse_module_from_str(module_str):
    module = None
    with Context() as ctx:
        stablehlo.register_dialect(ctx)
        module = Module.parse(module_str)
    return module


class StablehloSplitter:
    def __init__(self, module: str):
        self.module = module
        self.parsed_module = parse_module_from_str(module)
        self.sub_ops = []
        self.get_ops_in_module()

    def get_ops_in_module(self):
        for func_op in self.parsed_module.body.operations:
            for block in func_op.regions[0].blocks:
                for op in block.operations:
                    if op.name.startswith(("func.", "return")):
                        continue

                    inputs = {
                        operand.get_name(): str(operand.type)
                        for operand in op.operands
                    }
                    args_str = ", ".join(f"{key}: {typ}" for key, typ in inputs.items())

                    # Handle multiple results in the operation
                    result_names = [str(result.get_name()) for result in op.results]
                    result_types = [str(result.type) for result in op.results]

                    # Construct the function signature based on the number of results
                    if len(result_names) == 1:
                        result_str = f"{result_types[0]}"
                        return_stmt = f"return {result_names[0]} : {result_types[0]}"
                    else:
                        result_str = f"({', '.join(result_types)})"
                        return_stmt = f"return ({', '.join(result_names)}) : ({', '.join(result_types)})"

                    # Build the new module string
                    new_module_str = f"""module {{
  func.func @main({args_str}) -> {result_str} {{
    {str(op)}
    {return_stmt}
  }}
}}"""

                    dict_item = {
                        "op_id": ", ".join(result_names),
                        "op": str(op),
                        "module": new_module_str,
                    }
                    self.sub_ops.append(dict_item)
```

### How to use the script?
From each front-end, we need to run `pip install -r third_party/tt-mlir/src/tt-mlir/tools/stablehlo_splitter/requirements.txt`. I have tested in both tt-xla and tt-mlir that the script works.
[tt-xla test branch](https://github.com/tenstorrent/tt-xla/tree/try_shlo_splitter)
[tt-torch test branch](https://github.com/tenstorrent/tt-torch/tree/try_shlo_splitter)

The test script is as follows:

```
# SPDX-FileCopyrightText: (c) 2024 Tenstorrent AI ULC
#
# SPDX-License-Identifier: Apache-2.0

import importlib.util
import sys

# Define the path to the script
script_path = "third_party/tt-mlir/src/tt-mlir/tools/stablehlo_splitter/shlo_split.py"

# Load the module
spec = importlib.util.spec_from_file_location("shlo_split", script_path)
shlo_split = importlib.util.module_from_spec(spec)
sys.modules["shlo_split"] = shlo_split
spec.loader.exec_module(shlo_split)

mlir_path = "./Autoencoder.mlir"
module_str = ""
with open(mlir_path, "r") as file:
    module_str = file.read()

# Now you can use the StablehloSplitter class
splitter = shlo_split.StablehloSplitter(module_str)
print(splitter.sub_ops)
```

This script reads the stablehlo module coming from the autoencoder model in tt-torch, which is saved as `Autoencoder.mlir`. If you run `python temp.py` you can see the sub-ops printed, as attached to this PR: [Autoencoder_sub_ops.txt](https://github.com/user-attachments/files/19015699/Autoencoder_sub_ops.txt)

I am creating this as a draft PR to start a discussion on:
- Are we happy with the way that each individual op is represented right now? Currently, each op is a dictionary: `op_id` is the id of the op in the original mlir string, `op` is how it was referenced in the original mlir string, and `module` is the new standalone module that can be compiled/executed standalone, i.e. `{'op_id': '%33', 'op': '%33 = stablehlo.dot_general %32, %arg7, contracting_dims = [1] x [0] : (tensor<1x12xf32>, tensor<12x3xf32>) -> tensor<1x3xf32>', 'module': 'module {\n func.func @main(%32: tensor<1x12xf32>, %arg7: tensor<12x3xf32>) -> tensor<1x3xf32> {\n %33 = stablehlo.dot_general %32, %arg7, contracting_dims = [1] x [0] : (tensor<1x12xf32>, tensor<12x3xf32>) -> tensor<1x3xf32>\n return %33 : tensor<1x3xf32>\n }\n }'}`
- Is it ok if this script returns a list of dictionaries, each dictionary representing an op? tt-torch has its own class representing an op which wouldn't translate into a generic op easily, so I thought each front-end could use this dictionary and populate its internal data structures on its own (see the sketch below for what that might look like).
- Any concerns regarding the structure/usability of the script for front-ends?
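As a rough illustration of how a front-end might consume the splitter's output: the snippet below is hypothetical driver code, not part of this PR; `compile_stablehlo_module` is a placeholder for whatever compile/execute entry point the front-end already has, and the loop only relies on the `op_id`/`op`/`module` keys described above.

```python
# Hypothetical front-end loop: compile and run each standalone sub-module separately.
def run_op_by_op(splitter, compile_stablehlo_module):
    results = {}
    for sub_op in splitter.sub_ops:
        # sub_op["module"] is a self-contained stablehlo module wrapping a single op.
        print(f"compiling {sub_op['op_id']}: {sub_op['op']}")
        results[sub_op["op_id"]] = compile_stablehlo_module(sub_op["module"])
    return results
```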
### Ticket
closes #2057

### Problem description
tt-metal has some restrictions for the `ttnn.argmax` op. Details can be found here: https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.argmax.html

### What's changed
- Add data type, layout, and reshape workarounds (a conceptual sketch of the reshape workaround follows below).

### Checklist
- [X] New/Existing tests provide coverage for changes
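For intuition only, here is a minimal NumPy sketch of what a reshape-style workaround for a dimension-restricted argmax can look like. It assumes, purely for illustration, that the underlying op only reduces over the last dimension; this is not tt-mlir code and not a statement of the exact tt-metal restriction.

```python
import numpy as np

def argmax_with_reshape_workaround(x: np.ndarray, dim: int) -> np.ndarray:
    """Assume the backend op can only reduce over the last dimension:
    move `dim` to the end, collapse the leading dims, reduce, then
    restore the leading shape."""
    moved = np.moveaxis(x, dim, -1)              # put the reduction dim last
    flat = moved.reshape(-1, moved.shape[-1])    # collapse leading dims
    idx = flat.argmax(axis=-1)                   # last-dim-only argmax
    return idx.reshape(moved.shape[:-1])         # restore the leading shape

x = np.random.rand(2, 3, 4)
assert np.array_equal(argmax_with_reshape_workaround(x, 1), x.argmax(axis=1))
```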
This change adds a few new passes:

## `linalg.generic` to affine passes
Simply calls the upstream pass for converting a `linalg.generic` into an affine loop nest.

## Lower affine pass
Again, just uses an upstream pass for converting affine into SCF and arith.

## Linearize memref accesses pass
A custom TTIR pass that takes a nested loop structure over n-dimensional memrefs and linearizes the accesses into a single dimension. This is useful because circular buffers in metal are only one-dimensional (a small Python sketch of the resulting index math appears after this description). For example, this pass will convert the following code:
```mlir
affine.for %arg5 = 0 to 2 {
  affine.for %arg6 = 0 to 4 {
    %0 = affine.load %arg2[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>
    %1 = affine.load %arg3[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>
    %2 = "ttir.tile_maximum"(%0, %1) : (!tt.tile<32x32, f32>, !tt.tile<32x32, f32>) -> !tt.tile<32x32, f32>
    affine.store %2, %arg4[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>
  }
}
```
Into:
```mlir
%collapse_shape = memref.collapse_shape %arg2 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>
%collapse_shape_0 = memref.collapse_shape %arg3 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>
%collapse_shape_1 = memref.collapse_shape %arg4 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>
affine.for %arg5 = 0 to 2 {
  affine.for %arg6 = 0 to 4 {
    %0 = affine.load %collapse_shape[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>
    %1 = affine.load %collapse_shape_0[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>
    %2 = "ttir.tile_maximum"(%0, %1) : (!tt.tile<32x32, f32>, !tt.tile<32x32, f32>) -> !tt.tile<32x32, f32>
    affine.store %2, %collapse_shape_1[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>
  }
}
```
Closes #1910
Closes #1911
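The index rewrite shown above (`%arg5 * 4 + %arg6` over the collapsed 8-element memref) is plain row-major linearization. A minimal Python sketch of that index math, for illustration only and not part of the pass:

```python
def linearize(index, shape):
    """Row-major linearization of an n-D index: for the 2x4 memref above,
    (d0, d1) maps to d0 * 4 + d1 in the collapsed 8-element memref."""
    flat = 0
    for i, n in zip(index, shape):
        flat = flat * n + i
    return flat

assert linearize((1, 3), (2, 4)) == 1 * 4 + 3  # == 7
```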
This change adds 2 new TTIR layout-related ops and makes a few refactors to better share common interface and verifier code between them. The verifiers are also significantly improved and check for many more illegal cases.

## StreamLayout Operation
The StreamLayout operation is similar to the ToLayout operation, but with the difference that this op is not eagerly evaluated and is instead used as a means for defining a stream. Its primary use cases are streaming a large tensor out of DRAM via a small L1 buffer, and forming reduce or gather multicast operations. A stream definition includes:
- The tensor to be streamed.
- The storage buffer to be used for streaming.
- Backing memory for a list of DMA transactions to be filled in by the backend.
- A result, which is also able to take a view over the input, i.e. the same semantics as the ViewLayout op.

Additional constraints:
- It is not capable of changing the data type nor the memory space of the tensor.
```llvm
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>
%stream = "ttir.stream_layout"(%arg0, %alloc_0) : (memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>, memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>) -> memref<2x4x4x6x!tt.tile<32x32, f32>, #tt.stream<(d0, d1, d2, d3)
```

## ViewLayout Operation
The ViewLayout operation is nearly identical to the ToLayout operation, but with the difference that this op is not eagerly evaluated. Its primary use case is to allow reinterpreting the layout of a tensor without actually moving the data (a loose analogy follows after this description).

Additional notes/constraints:
- It is not capable of changing the data type nor the memory space of the tensor.
- All ViewLayout ops can trivially be converted to ToLayout ops.
```llvm
#layout = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.view_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
```
Closes #587
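As a loose analogy only (this is not tt-mlir code): "reinterpreting the layout without moving the data" is the same idea as a NumPy view, which changes shape/strides over the same underlying buffer without copying.

```python
import numpy as np

# A 64x128 row-major buffer, loosely analogous to the 64x128xf32 memref above.
a = np.arange(64 * 128, dtype=np.float32).reshape(64, 128)

# Reinterpret it as a flat view: same storage, new shape/strides, no data movement.
v = a.reshape(-1)

assert np.shares_memory(a, v)      # no copy was made
assert v[1 * 128 + 1] == a[1, 1]   # element (1, 1) linearizes to offset 129
```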
### Ticket
#2254 More specifically: #2178 #2247 #2249

### Problem description
Want to be able to write the reader and writer kernels used in the matmul programming examples.

### What's changed
`InterleavedAddrGenFast` cpp struct mapping:
- the struct itself is exposed to Python through a TTKernelOp
- added `noc_async_read_tile` ttkernel op
- added `noc_async_write_tile` ttkernel op
### Ticket
Related to #2277

### Problem description
The optimizer needs constraints and a runtime model for every op in order to shard them properly. This commit adds support for the multiply op in that regard.

### What's changed
- Implementation of GetOpConstraints & GetOpRuntime for MultiplyOp.
- Unit tests for the MultiplyOp API, unified with the existing AddOp tests.

### Checklist
- [x] New/Existing tests provide coverage for changes
### Problem
As a quality-of-life improvement for CPU hoisting, I want a way to lower `tensor.empty` to run on host. Currently, `TTNN::EmptyOp` does not work because the `tt_metal` API for it requires a `device` and a `memory_config`, which don't have meaningful values for on-host tensors. Existing tests work around the issue by using, say, `ttnn.zeros` instead, but this is a bit clunky; I propose an op that directly invokes a tensor constructor instead.

### Changes
Create a new ConstructTensor op which maps to a Tensor constructor invocation at runtime. Lower `tensor.empty` ops in `system_memory` into this instead of `ttnn.empty_op`, and add support in the flatbuffer and runtime.

### Example
input
output