Create specific TTNN op for empty tensor on host #2006

Open · vwellsTT wants to merge 28 commits into main from vwells/ttnn_empty_op_for_host
Conversation

@vwellsTT (Contributor) commented on Jan 28, 2025

Problem

As a quality-of-life improvement for CPU hoisting, I want a way to lower tensor.empty to a tensor on host. Currently, TTNN::EmptyOp does not work because the tt_metal API for it requires a device and a memory_config, neither of which has a meaningful value for on-host tensors. Existing tests work around the issue by using, say, ttnn.zeros instead, but this is a bit clunky; I propose an op that directly invokes the Tensor constructor instead.

Changes

Create a new ConstructTensor op that maps to a Tensor constructor invocation at runtime. Lower tensor.empty ops in system_memory into this op instead of ttnn.empty_op, and add support in the flatbuffer and runtime.

Example

Input

```mlir
func.func @add(%arg0: tensor<32x32xf32>, %arg1: tensor<32x32xf32>) -> tensor<32x32xf32> {
  %0 = tensor.empty() : tensor<32x32xf32>
  %1 = "ttir.add"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>}> {should_hoist} : (tensor<32x32xf32>, tensor<32x32xf32>, tensor<32x32xf32>) -> tensor<32x32xf32>
  return %1 : tensor<32x32xf32>
}
```

Output

```mlir
func.func @add(%arg0: tensor<32x32xf32, #ttnn_layout>, %arg1: tensor<32x32xf32, #ttnn_layout>) -> tensor<32x32xf32, #ttnn_layout> {
  // input data movement
  %5 = "ttnn.construct_tensor"() <{dtype = #tt.supportedDataTypes<f32>, layout = #ttnn.layout<row_major>, shape = #ttnn.shape<32x32>}> : () -> tensor<32x32xf32, #ttnn_layout2>
  %6 = call @hoisted_ttir_add_32x32_32x32_32x32_func_decl(%2, %4, %5) : (tensor<32x32xf32, #ttnn_layout2>, tensor<32x32xf32, #ttnn_layout2>, tensor<32x32xf32, #ttnn_layout2>) -> tensor<32x32xf32, #ttnn_layout2>
  // output data motion
  return %8 : tensor<32x32xf32, #ttnn_layout>
}
```

@vwellsTT changed the title from "first stab at creating host_empty ops e2e" to "Create specific TTNN op for empty tensor on host" on Jan 28, 2025
@vwellsTT force-pushed the vwells/ttnn_empty_op_for_host branch from 8f9b450 to 342f06b on March 4, 2025
tapspatel and others added 9 commits on March 6, 2025
#2127)

Implemented conversion patterns between dialects and flatbuffer. Added test cases for all valid multichip mesh configurations (1x2, 1x8, 2x4, 1x32, 8x4).

Add a script that splits a stablehlo module into sub-modules representing each op individually. Run `pip install -r tools/stablehlo_splitter/requirements.txt` to install dependencies.

### Ticket
#2317

### Problem description
Front-ends (tt-xla, tt-torch) want to run stablehlo graphs op-by-op. We
need a mechanism to break down stablehlo modules into standalone
sub-modules corresponding to each op individually.

### What's changed
Added `tools/stablehlo_splitter/shlo_split.py` and
`tools/stablehlo_splitter/requirements.txt`
```python
# SPDX-FileCopyrightText: (c) 2024 Tenstorrent AI ULC
#
# SPDX-License-Identifier: Apache-2.0

## pip install stablehlo -f https://github.com/openxla/stablehlo/releases/expanded_assets/dev-wheels

from mlir.ir import Context, Module
import mlir.dialects.stablehlo as stablehlo


def parse_module_from_str(module_str):
    module = None
    with Context() as ctx:
        stablehlo.register_dialect(ctx)
        module = Module.parse(module_str)
    return module


class StablehloSplitter:
    def __init__(self, module: str):
        self.module = module
        self.parsed_module = parse_module_from_str(module)
        self.sub_ops = []
        self.get_ops_in_module()

    def get_ops_in_module(self):
        for func_op in self.parsed_module.body.operations:
            for block in func_op.regions[0].blocks:
                for op in block.operations:
                    if op.name.startswith(("func.", "return")):
                        continue

                    inputs = {
                        operand.get_name(): str(operand.type) for operand in op.operands
                    }
                    args_str = ", ".join(f"{key}: {typ}" for key, typ in inputs.items())

                    # Handle multiple results in the operation
                    result_names = [str(result.get_name()) for result in op.results]
                    result_types = [str(result.type) for result in op.results]

                    # Construct the function signature based on the number of results
                    if len(result_names) == 1:
                        result_str = f"{result_types[0]}"
                        return_stmt = f"return {result_names[0]} : {result_types[0]}"
                    else:
                        result_str = f"({', '.join(result_types)})"
                        return_stmt = f"return ({', '.join(result_names)}) : ({', '.join(result_types)})"
                    # Build the new module string
                    new_module_str = f"""module {{
        func.func @main({args_str}) -> {result_str} {{
            {str(op)}
            {return_stmt}
        }}
    }}"""
                    dict_item = {
                        "op_id": ", ".join(result_names),
                        "op": str(op),
                        "module": new_module_str,
                    }
                    self.sub_ops.append(dict_item)

```

### How to use the script?
From each front-end, run `pip install -r third_party/tt-mlir/src/tt-mlir/tools/stablehlo_splitter/requirements.txt`.
I have tested that the script works in both tt-xla and tt-mlir:
- [tt-xla test branch](https://github.com/tenstorrent/tt-xla/tree/try_shlo_splitter)
- [tt-torch test branch](https://github.com/tenstorrent/tt-torch/tree/try_shlo_splitter)

The test script is as follows: 
```python
# SPDX-FileCopyrightText: (c) 2024 Tenstorrent AI ULC
#
# SPDX-License-Identifier: Apache-2.0
import importlib.util
import sys

# Define the path to the script
script_path = "third_party/tt-mlir/src/tt-mlir/tools/stablehlo_splitter/shlo_split.py"

# Load the module
spec = importlib.util.spec_from_file_location("shlo_split", script_path)
shlo_split = importlib.util.module_from_spec(spec)
sys.modules["shlo_split"] = shlo_split
spec.loader.exec_module(shlo_split)

mlir_path = "./Autoencoder.mlir"
module_str = ""
with open(mlir_path, "r") as file:
    module_str = file.read()
# Now you can use the StablehloSplitter class
splitter = shlo_split.StablehloSplitter(module_str)
print(splitter.sub_ops)
```
This file reads the stablehlo module coming from the Autoencoder model in tt-torch, which is saved as `Autoencoder.mlir`. If you run `python temp.py`, you can see the sub-ops printed, as attached to this PR.

[Autoencoder_sub_ops.txt](https://github.com/user-attachments/files/19015699/Autoencoder_sub_ops.txt)

I am creating this as a draft PR to start a discussion on:
- Are we happy with the way each individual op is represented right now? Currently, each op is a dictionary:
  - `op_id` is the id of the op in the original mlir string
  - `op` is how it was referenced in the original mlir string
  - `module` is the new standalone module that can be compiled/executed standalone

  e.g. `{'op_id': '%33', 'op': '%33 = stablehlo.dot_general %32, %arg7, contracting_dims = [1] x [0] : (tensor<1x12xf32>, tensor<12x3xf32>) -> tensor<1x3xf32>', 'module': 'module {\n func.func @main(%32: tensor<1x12xf32>, %arg7: tensor<12x3xf32>) -> tensor<1x3xf32> {\n %33 = stablehlo.dot_general %32, %arg7, contracting_dims = [1] x [0] : (tensor<1x12xf32>, tensor<12x3xf32>) -> tensor<1x3xf32>\n return %33 : tensor<1x3xf32>\n }\n }'}`
- Is it OK if this script returns a list of dictionaries, each dictionary representing an op? tt-torch has its own class representing an op which wouldn't translate into a generic op easily, so I thought each front-end could use this dictionary and populate its internal data structures on its own (see the sketch after this list).
- Any concerns regarding the structure/usability of the script for front-ends?
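As an illustration only, here is a minimal sketch of how a front-end might consume this list of dictionaries. It assumes `shlo_split` has been loaded as in the test script above; `compile_and_run` is a hypothetical stand-in for whatever op-by-op execution path the front-end already has.
```python
# Hypothetical front-end loop over the splitter output. Only StablehloSplitter
# and the op_id/op/module dictionary keys come from the script above;
# compile_and_run stands in for the front-end's own op-by-op execution path.
def run_op_by_op(module_str, compile_and_run):
    splitter = shlo_split.StablehloSplitter(module_str)
    results = {}
    for sub_op in splitter.sub_ops:
        # Each entry carries a standalone module that can be compiled on its own.
        results[sub_op["op_id"]] = compile_and_run(sub_op["module"])
    return results
```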
…p configurations (#2339)

### Ticket
[Link to Github
Issue](#2022)

### Problem description
Increase coverage for data and tensor parallel tests for n300, t3000 and
tg systems for all supported mesh types.
### Ticket
closes #2057

### Problem description
tt-metal has some restrictions for the `ttnn.argmax` op. Details can be found here:
https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api/ttnn.argmax.html

### What's changed
- Add data type, layout, and reshape workarounds.

### Checklist
- [X] New/Existing tests provide coverage for changes

This change adds a few new passes:

## `linalg.generic` to affine passes

Simply calls the upstream pass for converting a `linalg.generic` into an
affine loop nest.

## Lower affine pass

Again, just uses an upstream pass for converting affine into SCF and
arith.

## Linearize memref accesses pass

A custom TTIR pass that takes a nested loop structure over n-dimensional memrefs and linearizes them into a single dimension. This is useful because circular buffers in metal are only one-dimensional.

For example, this pass will convert the following code:
```mlir        
  affine.for %arg5 = 0 to 2 {        
    affine.for %arg6 = 0 to 4 {        
      %0 = affine.load %arg2[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>        
      %1 = affine.load %arg3[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>        
      %2 = "ttir.tile_maximum"(%0, %1) : (!tt.tile<32x32, f32>, !tt.tile<32x32, f32>) -> !tt.tile<32x32, f32>        
      affine.store %2, %arg4[%arg5, %arg6] : memref<2x4x!tt.tile<32x32, f32>, #l1_>                                                       
    }        
  }                                                                              
```
                                                             
Into:
```mlir                                                                                    
  %collapse_shape = memref.collapse_shape %arg2 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>        
  %collapse_shape_0 = memref.collapse_shape %arg3 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>        
  %collapse_shape_1 = memref.collapse_shape %arg4 [[0, 1]] : memref<2x4x!tt.tile<32x32, f32>, #l1_> into memref<8x!tt.tile<32x32, f32>, #l1_>        
  affine.for %arg5 = 0 to 2 {                                    
    affine.for %arg6 = 0 to 4 {                                                       
      %0 = affine.load %collapse_shape[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>        
      %1 = affine.load %collapse_shape_0[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>                                                                         
      %2 = "ttir.tile_maximum"(%0, %1) : (!tt.tile<32x32, f32>, !tt.tile<32x32, f32>) -> !tt.tile<32x32, f32>        
      affine.store %2, %collapse_shape_1[%arg5 * 4 + %arg6] : memref<8x!tt.tile<32x32, f32>, #l1_>        
    }                                                         
  }                                     
```
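For reference, a minimal Python sketch (illustration only, not the MLIR pass itself) of the row-major index arithmetic shown above:
```python
# Row-major linearization: a multi-dimensional index into a memref of shape
# [d0, ..., dk] collapses to a single offset into the 1-D collapsed memref.
def linearize(indices, shape):
    offset = 0
    for idx, dim in zip(indices, shape):
        offset = offset * dim + idx
    return offset

# For the 2x4 memref in the example, [%arg5, %arg6] maps to %arg5 * 4 + %arg6.
assert linearize([1, 2], [2, 4]) == 1 * 4 + 2
```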

Closes #1910
Closes #1911

This change adds two new TTIR layout-related ops and makes a few refactors to better share common interface and verifier code between them. The verifiers are also significantly improved and check for many more illegal cases.

## StreamLayout Operation

The StreamLayout operation is similar to the ToLayout operation, but with the difference that this op is not eagerly evaluated; it is instead used as a means for defining a stream. The primary use cases are streaming a large tensor out of DRAM via a small L1 buffer, and forming reduce or gather multicast operations. A stream definition includes:

- The tensor to be streamed.
- The storage buffer to be used for streaming.
- Backing memory for a list of DMA transactions to be filled in by the
backend.
- A result, which is also able to take a view over the input, i.e. same
semantics as the ViewLayout op.

Additional constraints:
- It is not capable of changing the data type or the memory space of the tensor.

```mlir
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>
%stream = "ttir.stream_layout"(%arg0, %alloc_0) : (memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>, memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>) -> memref<2x4x4x6x!tt.tile<32x32, f32>, #tt.stream<(d0, d1, d2, d3)
```

## ViewLayout Operation

The ViewLayout operation is nearly identical to the ToLayout operation, but with the difference that this op is not eagerly evaluated. Its primary use case is to allow reinterpreting the layout of a tensor without actually moving the data.

Additional notes/constraints:
- It is not capable of changing the data type or the memory space of the tensor.
- All ViewLayout ops can trivially be converted to ToLayout ops.

```mlir
#layout = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.view_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
```

Closes #587
### Ticket
#2254

More specifically: 
#2178
#2247
#2249

### Problem description
We want to be able to write the reader and writer kernels used in the matmul programming examples.

### What's changed
`InterleavedAddrGenFast` C++ struct mapping:
- the struct itself is exposed to Python through TTKernelOp
- added `noc_async_read_tile` ttkernel op
- added `noc_async_write_tile` ttkernel op
### Ticket
Related to #2277 

### Problem description
The optimizer needs constraints and a runtime model for every op in order to shard them properly. This commit adds support for the multiply op in that regard.

### What's changed
Implementation of GetOpConstraints and GetOpRuntime for MultiplyOp.
Unit tests for the MultiplyOp API, unified with the existing AddOp tests.

### Checklist

- [x]  New/Existing tests provide coverage for changes