
Operation very slow to compile since jax 0.4.36 #26162

Open
btaba opened this issue Jan 28, 2025 · 12 comments
Labels
bug Something isn't working

Comments

@btaba

btaba commented Jan 28, 2025

Description

Hi folks,

We've been having slow compilation issues with some of our JAX code since jax 0.4.36. The slow compilation (compilation takes O(hours) instead of under 2 minutes) pops up for certain devices and certain environments, triggered by seemingly benign code changes. It has been quite tricky to pin down over the last couple of months, and I haven't been able to create a smaller MRE than the following (on my local RTX 4090):

Install packages:

pip install --upgrade "jax[cuda]" jaxlib
pip install --upgrade mujoco
pip install --upgrade mujoco_mjx
pip install --upgrade brax

Installed versions:

mujoco==3.2.7
mujoco-mjx==3.2.7
brax==0.12.1

Run this Python code:

import functools
from mujoco_playground import registry
from mujoco_playground import wrapper
from mujoco_playground.config import manipulation_params
from brax.training.agents.ppo import train as ppo
from brax.training.agents.ppo import networks as ppo_networks


env_name = 'PandaRobotiqPushCube'
env = registry.load(env_name)
env_cfg = registry.get_default_config(env_name)

ppo_params = manipulation_params.brax_ppo_config(env_name)
ppo_training_params = dict(ppo_params)
network_factory = ppo_networks.make_ppo_networks
if "network_factory" in ppo_params:
  del ppo_training_params["network_factory"]
  network_factory = functools.partial(
      ppo_networks.make_ppo_networks,
      **ppo_params.network_factory
  )

train_fn = functools.partial(
    ppo.train, **dict(ppo_training_params),
    network_factory=network_factory,
)
make_inference_fn, params, metrics = train_fn(
    environment=env,
    wrap_env_fn=wrapper.wrap_for_brax_training,
)

The corresponding XLA dump is attached.

I also reran the same script with num_evals=0 passed to train_fn, and the code runs fine, so the slow compilation is triggered somewhere in the evaluation path. Both the working and non-working XLA dumps are attached below. We would really appreciate any help with this issue.
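
For reference, the working variant looks like this (a sketch of what was run; num_evals is forwarded to ppo.train, and keyword arguments passed at the call site override those captured in the functools.partial above):

make_inference_fn, params, metrics = train_fn(
    environment=env,
    wrap_env_fn=wrapper.wrap_for_brax_training,
    num_evals=0,  # skip evaluation; with this, compilation finishes quickly
)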

xla_dump_hanging_compilation.tar.gz
xla_dump_working.tar.gz

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.5.0
jaxlib: 0.5.0
numpy:  1.26.4
python: 3.12.3 (main, Sep 10 2024, 15:47:39) [GCC 13.2.0]
device info: NVIDIA GeForce RTX 4090-1, 1 local devices
process_count: 1
platform: uname_result(system='Linux', node='btaba.mtv.corp.google.com', release='6.10.11-1rodete2-amd64', version='#1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1rodete2 (2024-10-16)', machine='x86_64')


$ nvidia-smi
Tue Jan 28 13:30:18 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:61:00.0 Off |                  Off |
|  0%   50C    P2              22W / 450W |    397MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2663746      C   .../.pyenv/versions/mjx-312/bin/python      390MiB |
+---------------------------------------------------------------------------------------+
@btaba
Author

btaba commented Jan 28, 2025

@kevinzakka

@mattjj Would appreciate your help here 🙏

@jreiffers
Contributor

jreiffers commented Jan 31, 2025

HLO reproducer: https://gist.github.com/jreiffers/b6b8427ef64c00e688e79fd5af25b571

bazel run --config=cuda //xla/tools:run_hlo_module -c opt  -- --xla_disable_all_hlo_passes --input_format=hlo --random_init_input_literals --platform=CUDA /tmp/repro.hlo

I'll take a look at which pass is blowing up. Hopefully it's not in LLVM :-).

@jreiffers
Contributor

I initially suspected the unnecessary concatenate at the end, but the slow compilation happens even when that's fixed manually. There are at least two issues here:

  1. codegen/emitters/computation_partitioner.cc does a really poor job partitioning this, because of the bitcasts above the concatenate. The partitioner does not track the real indexing, so after the bitcast it considers the inputs to the bitcasts (e.g. add.56926.1.clone.1) to have users with inconsistent indexing (the root tuple and the bitcast), which cascades all the way to the parameters. We end up with lots of functions even though ~everything is elementwise.
  2. Afterwards, inlining breaks, generating a huge function. I stopped it after a few hundred thousand ops. Not entirely sure yet what's happening here, maybe failed/insufficient CSE or canonicalization.

The correct fix for 1. is to use proper indexing maps in the partitioner. A quick hack is to change all_users_elementwise in computation_partitioner.cc to this:

    // Quick hack: also treat bitcast (and tuple) users as elementwise, so the
    // bitcasts above the concatenate no longer force function boundaries.
    bool all_users_elementwise =
        absl::c_all_of(instr->users(), [&](const HloInstruction* user) {
          return HloInstruction::IsOpElementwise(user->opcode()) ||
                 user->opcode() == HloOpcode::kTuple ||
                 user->opcode() == HloOpcode::kBitcast;
        });

However, I'm not sure that's safe. @pifon2a could you give that a try and see what breaks?
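
To make point 1 concrete, here is a purely illustrative Python model of that partitioning decision (this is not XLA code; the opcode names and the helper are made up). A value whose users don't all appear to index it consistently gets outlined into its own function, and because the partitioner doesn't see through bitcast, a value feeding both the root tuple and a bitcast triggers the split:

def all_users_elementwise(user_opcodes, bitcast_is_elementwise):
    ok = {"add", "multiply", "tuple"}
    if bitcast_is_elementwise:
        ok.add("bitcast")
    return all(op in ok for op in user_opcodes)

# add.56926.1.clone.1 is used by the root tuple and a bitcast:
users = ["tuple", "bitcast"]
print(all_users_elementwise(users, bitcast_is_elementwise=False))  # False -> outlined
print(all_users_elementwise(users, bitcast_is_elementwise=True))   # True  -> stays fused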

@pifon2a
Contributor

pifon2a commented Jan 31, 2025

@jreiffers Thank you for the reproducer. Trying.

@pifon2a
Contributor

pifon2a commented Jan 31, 2025

The quick hack did not work. I'll try to migrate us to the indexing maps next week.

@btaba
Author

btaba commented Jan 31, 2025

Thanks @jreiffers and @pifon2a for taking a look, really appreciate it!

@kevinzakka

+1, thank you!

@pifon2a
Contributor

pifon2a commented Feb 14, 2025

So, I have a new version of the computation partitioner that relies entirely on indexing maps. That did not help; it even outlines the same number of functions. Disabling the inliner helps, and compilation is fast. I will check what's happening there.

@pifon2a
Contributor

pifon2a commented Feb 18, 2025

Ok, the issue was within the inliner interface itself (XlaInlinerInterface::isLegalToInline). I will upload the fix today after checking that I did not regress anything.

copybara-service bot pushed a commit to openxla/xla that referenced this issue Feb 18, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Feb 18, 2025
@pifon2a
Contributor

pifon2a commented Feb 18, 2025

The fix makes compilation of one of the JAX tests slow. Everything else got better... Looking into it.

copybara-service bot pushed a commit to openxla/xla that referenced this issue Feb 21, 2025
At the moment we inline a callee that calls one of the functions called by the caller. This PR adjusts the logic to inline only such callees that call a subset of the caller's functions. That way we can be sure that after inlining the caller calls the same set of functions, excluding the inlined one.

Background: jax-ml/jax#26162 contains an example of a MoF fusion that takes forever to compile.

The [indexing-based partitioner](openxla/xla@44bc816) in combination with this change fixes the issue.

PiperOrigin-RevId: 729079659
copybara-service bot pushed a commit to openxla/xla that referenced this issue Feb 24, 2025
Inline only if there is more than one call to the callee in the caller.

Background: jax-ml/jax#26162 contains an example of a MoF fusion that takes forever to compile.

The [indexing-based partitioner](openxla/xla@44bc816) in combination with this change fixes the issue.

PiperOrigin-RevId: 730436982
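
For reference, a minimal Python sketch (illustrative only, not the actual XlaInlinerInterface::isLegalToInline code) of the two legality rules described in the commit messages above; call graphs are modeled as a mapping from a function name to the list of functions it calls:

def is_legal_to_inline(call_graph, caller, callee):
    # Rule 1: the callee may only call functions the caller already calls,
    # so inlining never introduces new callees into the caller.
    if not set(call_graph[callee]) <= set(call_graph[caller]):
        return False
    # Rule 2: inline only if the caller contains more than one call site
    # of the callee.
    return call_graph[caller].count(callee) > 1

graph = {"caller": ["f", "f", "g"], "f": ["g"], "g": []}
print(is_legal_to_inline(graph, "caller", "f"))  # True: subset of callees, 2 call sites
print(is_legal_to_inline(graph, "caller", "g"))  # False: only one call site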
@pifon2a
Contributor

pifon2a commented Feb 24, 2025

The indexing-map-based partitioner and the tweaked inliner fixed the issue. Let me know if you still have problems with this.

saisindhuri91 added a commit to linux-on-ibm-z/tensorflow that referenced this issue Feb 26, 2025