Create specific TTNN op for empty tensor on host #2006

Open · vwellsTT wants to merge 28 commits into main from vwells/ttnn_empty_op_for_host
Conversation

@vwellsTT (Contributor) commented on Jan 28, 2025

Problem

As a quality-of-life improvement for CPU hoisting, I want a way to lower tensor.empty to a tensor on host. Currently, TTNN::EmptyOp does not work because the tt_metal API for it requires a device and a memory_config, neither of which has a meaningful value for on-host tensors. Existing tests work around the issue by using, say, ttnn.zeros instead, but this is a bit clunky; I propose an op that directly invokes the Tensor constructor instead.

Changes

Create a new ConstructTensor op that maps to a Tensor constructor invocation at runtime. Lower tensor.empty ops in system_memory into this op instead of ttnn.empty_op, and add support in the flatbuffer and runtime.

Example

Input

```mlir
func.func @add(%arg0: tensor<32x32xf32>, %arg1: tensor<32x32xf32>) -> tensor<32x32xf32> {
  %0 = tensor.empty() : tensor<32x32xf32>
  %1 = "ttir.add"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>}> {should_hoist} : (tensor<32x32xf32>, tensor<32x32xf32>, tensor<32x32xf32>) -> tensor<32x32xf32>
  return %1 : tensor<32x32xf32>
}
```

Output

```mlir
func.func @add(%arg0: tensor<32x32xf32, #ttnn_layout>, %arg1: tensor<32x32xf32, #ttnn_layout>) -> tensor<32x32xf32, #ttnn_layout> {
  // input data movement
  %5 = "ttnn.construct_tensor"() <{dtype = #tt.supportedDataTypes<f32>, layout = #ttnn.layout<row_major>, shape = #ttnn.shape<32x32>}> : () -> tensor<32x32xf32, #ttnn_layout2>
  %6 = call @hoisted_ttir_add_32x32_32x32_32x32_func_decl(%2, %4, %5) : (tensor<32x32xf32, #ttnn_layout2>, tensor<32x32xf32, #ttnn_layout2>, tensor<32x32xf32, #ttnn_layout2>) -> tensor<32x32xf32, #ttnn_layout2>
  // output data motion
  return %8 : tensor<32x32xf32, #ttnn_layout>
}
```

@vwellsTT changed the title from "first stab at creating host_empty ops e2e" to "Create specific TTNN op for empty tensor on host" on Jan 28, 2025
@vwellsTT force-pushed the vwells/ttnn_empty_op_for_host branch from 8f9b450 to 342f06b on March 4, 2025
tapspatel and others added 9 commits on March 6, 2025
#2127)

Implemented conversion patterns between dialects and flatbuffer. Added test cases for all valid multichip mesh configurations (1x2, 1x8, 2x4, 1x32, 8x4).

Add a script that splits a stablehlo module into sub-modules representing each op individually. Run `pip install -r tools/stablehlo_splitter/requirements.txt` to install dependencies.

### Ticket
#2317

### Problem description
Front-ends (tt-xla, tt-torch) want to run stablehlo graphs op-by-op. We
need a mechanism to break down stablehlo modules into standalone
sub-modules corresponding to each op individually.

### What's changed
Added `tools/stablehlo_splitter/shlo_split.py` and
`tools/stablehlo_splitter/requirements.txt`
```python
# SPDX-FileCopyrightText: (c) 2024 Tenstorrent AI ULC
#
# SPDX-License-Identifier: Apache-2.0

## pip install stablehlo -f https://github.com/openxla/stablehlo/releases/expanded_assets/dev-wheels

from mlir.ir import Context, Module
import mlir.dialects.stablehlo as stablehlo


def parse_module_from_str(module_str):
    module = None
    with Context() as ctx:
        stablehlo.register_dialect(ctx)
        module = Module.parse(module_str)
    return module


class StablehloSplitter:
    def __init__(self, module: str):
        self.module = module
        self.parsed_module = parse_module_from_str(module)
        self.sub_ops = []
        self.get_ops_in_module()

    def get_ops_in_module(self):
        for func_op in self.parsed_module.body.operations:
            for block in func_op.regions[0].blocks:
                for op in block.operations:
                    if op.name.startswith(("func.", "return")):
                        continue

                    inputs = {
                        operand.get_name(): str(operand.type) for operand in op.operands
                    }
                    args_str = ", ".join(f"{key}: {typ}" for key, typ in inputs.items())

                    # Handle multiple results in the operation
                    result_names = [str(result.get_name()) for result in op.results]
                    result_types = [str(result.type) for result in op.results]

                    # Construct the function signature based on the number of results
                    if len(result_names) == 1:
                        result_str = f"{result_types[0]}"
                        return_stmt = f"return {result_names[0]} : {result_types[0]}"
                    else:
                        result_str = f"({', '.join(result_types)})"
                        return_stmt = f"return ({', '.join(result_names)}) : ({', '.join(result_types)})"
                    # Build the new module string
                    new_module_str = f"""module {{
        func.func @main({args_str}) -> {result_str} {{
            {str(op)}
            {return_stmt}
        }}
    }}"""
                    dict_item = {
                        "op_id": ", ".join(result_names),
                        "op": str(op),
                        "module": new_module_str,
                    }
                    self.sub_ops.append(dict_item)

```

### How to use the script?
From each front-end, run `pip install -r third_party/tt-mlir/src/tt-mlir/tools/stablehlo_splitter/requirements.txt`.
I have tested that the script works in both tt-xla and tt-mlir:
- [tt-xla test branch](https://github.com/tenstorrent/tt-xla/tree/try_shlo_splitter)
- [tt-torch test branch](https://github.com/tenstorrent/tt-torch/tree/try_shlo_splitter)

The test script is as follows: 
```python
# SPDX-FileCopyrightText: (c) 2024 Tenstorrent AI ULC
#
# SPDX-License-Identifier: Apache-2.0
import importlib.util
import sys

# Define the path to the script
script_path = "third_party/tt-mlir/src/tt-mlir/tools/stablehlo_splitter/shlo_split.py"

# Load the module
spec = importlib.util.spec_from_file_location("shlo_split", script_path)
shlo_split = importlib.util.module_from_spec(spec)
sys.modules["shlo_split"] = shlo_split
spec.loader.exec_module(shlo_split)

mlir_path = "./Autoencoder.mlir"
module_str = ""
with open(mlir_path, "r") as file:
    module_str = file.read()
# Now you can use the StablehloSplitter class
splitter = shlo_split.StablehloSplitter(module_str)
print(splitter.sub_ops)
```
This file reads the stablehlo module coming from the Autoencoder model in tt-torch, which is saved as `Autoencoder.mlir`. If you run `python temp.py`, you can see the sub-ops printed, as attached to this PR.

[Autoencoder_sub_ops.txt](https://github.com/user-attachments/files/19015699/Autoencoder_sub_ops.txt)

I am creating this as a draft PR to start a discussion on:
- Are we happy with the way each individual op is represented right now? Currently, each op is a dictionary:
  - `op_id` is the id of the op in the original mlir string
  - `op` is how it was referenced in the original mlir string
  - `module` is the new standalone module that can be compiled/executed standalone

  e.g. `{'op_id': '%33', 'op': '%33 = stablehlo.dot_general %32, %arg7, contracting_dims = [1] x [0] : (tensor<1x12xf32>, tensor<12x3xf32>) -> tensor<1x3xf32>', 'module': 'module {\n func.func @main(%32: tensor<1x12xf32>, %arg7: tensor<12x3xf32>) -> tensor<1x3xf32> {\n %33 = stablehlo.dot_general %32, %arg7, contracting_dims = [1] x [0] : (tensor<1x12xf32>, tensor<12x3xf32>) -> tensor<1x3xf32>\n return %33 : tensor<1x3xf32>\n }\n }'}`
- Is it OK if this script returns a list of dictionaries, each dictionary representing an op? tt-torch has its own class representing an op which wouldn't translate into a generic op easily, so I thought each front-end could use this dictionary and populate its internal data structures on its own (see the sketch after this list).
- Any concerns regarding the structure/usability of the script for front-ends?
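As an illustration only, here is a minimal sketch of how a front-end might consume this list of dictionaries. It assumes `shlo_split` has been loaded as in the test script above; `compile_and_run` is a hypothetical stand-in for whatever op-by-op execution path the front-end already has.
```python
# Hypothetical front-end loop over the splitter output. Only StablehloSplitter
# and the op_id/op/module dictionary keys come from the script above;
# compile_and_run stands in for the front-end's own op-by-op execution path.
def run_op_by_op(module_str, compile_and_run):
    splitter = shlo_split.StablehloSplitter(module_str)
    results = {}
    for sub_op in splitter.sub_ops:
        # Each entry carries a standalone module that can be compiled on its own.
        results[sub_op["op_id"]] = compile_and_run(sub_op["module"])
    return results
```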
…p configurations (#2339)

### Ticket
[Link to Github
Issue](#2022)

### Problem description
Increase coverage for data and tensor parallel tests for n300, t3000 and
tg systems for all supported mesh types.
### Ticket
closes #2057

### Problem description
tt-metal has some restrictions for the `ttnn.argmax` op. Details can be found here:
https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.argmax.html

### What's changed
- Add data type, layout, and reshape workarounds.

### Checklist
- [X] New/Existing tests provide coverage for changes

This change adds a few new passes:

## `linalg.generic` to affine passes

Simply calls the upstream pass for converting a `linalg.generic` into an
affine loop nest.

## Lower affine pass

Again, just uses an upstream pass for converting affine into SCF and
arith.

## Linearize memref accesses pass

A custom TTIR pass that takes a nested loop structure over n-dimensional memrefs and linearizes them into a single dimension. This is useful because circular buffers in metal are only one-dimensional.

For example, this pass will convert the following code:
```mlir        
  affine.for %arg5 = 0 to 2 {        
    affine.for %arg6 = 0 to 4 {        
      %0 = affine.load %arg2[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>        
      %1 = affine.load %arg3[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>        
      %2 = "ttir.tile_maximum"(%0, %1) : (!tt.tile<32x32, f32>, !tt.tile<32x32, f32>) -> !tt.tile<32x32, f32>        
      affine.store %2, %arg4[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>                                                       
    }        
  }                                                                              
```
                                                             
Into:
```mlir                                                                                    
  %collapse_shape = memref.collapse_shape %arg2 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>        
  %collapse_shape_0 = memref.collapse_shape %arg3 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>        
  %collapse_shape_1 = memref.collapse_shape %arg4 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>        
  affine.for %arg5 = 0 to 2 {                                    
    affine.for %arg6 = 0 to 4 {                                                       
      %0 = affine.load %collapse_shape[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>        
      %1 = affine.load %collapse_shape_0[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>                                                                         
      %2 = "ttir.tile_maximum"(%0, %1) : (!tt.tile<32x32, f32>, !tt.tile<32x32, f32>) -> !tt.tile<32x32, f32>        
      affine.store %2, %collapse_shape_1[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>        
    }                                                         
  }                                     
```
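For reference, a minimal Python sketch (illustration only, not the MLIR pass itself) of the row-major index arithmetic shown above:
```python
# Row-major linearization: a multi-dimensional index into a memref of shape
# [d0, ..., dk] collapses to a single offset into the 1-D collapsed memref.
def linearize(indices, shape):
    offset = 0
    for idx, dim in zip(indices, shape):
        offset = offset * dim + idx
    return offset

# For the 2x4 memref in the example, [%arg5, %arg6] maps to %arg5 * 4 + %arg6.
assert linearize([1, 2], [2, 4]) == 1 * 4 + 2
```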

Closes #1910
Closes #1911

This change adds two new TTIR layout-related ops and makes a few refactors to better share common interface and verifier code between them. The verifiers are also significantly improved and check for many more illegal cases.

## StreamLayout Operation

The StreamLayout operation is similar to the ToLayout operation, but with the difference that this op is not eagerly evaluated; it is instead used as a means for defining a stream. The primary use cases are streaming a large tensor out of DRAM via a small L1 buffer, and forming reduce or gather multicast operations. A stream definition includes:

- The tensor to be streamed.
- The storage buffer to be used for streaming.
- Backing memory for a list of DMA transactions to be filled in by the
backend.
- A result, which is also able to take a view over the input, i.e. same
semantics as the ViewLayout op.

Additional constraints:
- It is not capable of changing the data type or the memory space of the tensor.

```mlir
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>
%stream = "ttir.stream_layout"(%arg0, %alloc_0) : (memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>, memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>) -> memref<2x4x4x6x!tt.tile<32x32, f32>, #tt.stream<(d0, d1, d2, d3)
```

## ViewLayout Operation

The ViewLayout operation is nearly identical to the ToLayout operation, but with the difference that this op is not eagerly evaluated. Its primary use case is to allow reinterpreting the layout of a tensor without actually moving the data.

Additional notes/constraints:
- It is not capable of changing the data type or the memory space of the tensor.
- All ViewLayout ops can trivially be converted to ToLayout ops.

```mlir
#layout = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.view_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
```

Closes #587
### Ticket
#2254

More specifically: 
#2178
#2247
#2249

### Problem description
We want to be able to write the reader and writer kernels used in the matmul programming examples.

### What's changed
`InterleavedAddrGenFast` C++ struct mapping:
- the struct itself is exposed to Python through TTKernelOp
- added `noc_async_read_tile` ttkernel op
- added `noc_async_write_tile` ttkernel op
### Ticket
Related to #2277 

### Problem description
The optimizer needs constraints and a runtime model for every op in order to shard them properly. This commit adds support for the multiply op in that regard.

### What's changed
Implementation of GetOpConstraints and GetOpRuntime for MultiplyOp.
Unit tests for the MultiplyOp API, unified with the existing AddOp tests.

### Checklist

- [x]  New/Existing tests provide coverage for changes