
Operation very slow to compile since jax 0.4.36 #26162

Open
btaba opened this issue Jan 28, 2025 · 12 comments
Labels
bug Something isn't working

Comments

@btaba

btaba commented Jan 28, 2025

Description

Hi folks,

We've been having slow compilation issues with some of our JAX code since jax 0.4.36. The slow compilation (compilation takes O(hours) instead of under 2 minutes) pops up for certain devices and certain environments, triggered by seemingly benign code changes. It has been quite tricky to pin down over the last couple of months, and I haven't been able to create a smaller MRE than the following (on my local RTX 4090):

Install packages:

pip install --upgrade "jax[cuda]" jaxlib
pip install --upgrade mujoco
pip install --upgrade mujoco_mjx
pip install --upgrade brax

Installed versions:

mujoco==3.2.7
mujoco-mjx==3.2.7
brax==0.12.1

Run this Python code:

import functools
from mujoco_playground import registry
from mujoco_playground import wrapper
from mujoco_playground.config import manipulation_params
from brax.training.agents.ppo import train as ppo
from brax.training.agents.ppo import networks as ppo_networks


env_name = 'PandaRobotiqPushCube'
env = registry.load(env_name)
env_cfg = registry.get_default_config(env_name)

ppo_params = manipulation_params.brax_ppo_config(env_name)
ppo_training_params = dict(ppo_params)
network_factory = ppo_networks.make_ppo_networks
if "network_factory" in ppo_params:
  del ppo_training_params["network_factory"]
  network_factory = functools.partial(
      ppo_networks.make_ppo_networks,
      **ppo_params.network_factory
  )

train_fn = functools.partial(
    ppo.train, **dict(ppo_training_params),
    network_factory=network_factory,
)
make_inference_fn, params, metrics = train_fn(
    environment=env,
    wrap_env_fn=wrapper.wrap_for_brax_training,
)

The corresponding XLA dump is attached.

I also reran the same script with num_evals=0 passed to train_fn, and the code runs fine, so the slow compilation is triggered somewhere in the evaluation path. Both the working and non-working XLA dumps are attached below. We would really appreciate any help with this issue.
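
For reference, the working variant looks like this (a sketch of what was run; num_evals is forwarded to ppo.train, and keyword arguments passed at the call site override those captured in the functools.partial above):

make_inference_fn, params, metrics = train_fn(
    environment=env,
    wrap_env_fn=wrapper.wrap_for_brax_training,
    num_evals=0,  # skip evaluation; with this, compilation finishes quickly
)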

xla_dump_hanging_compilation.tar.gz
xla_dump_working.tar.gz

System info (python version, jaxlib version, accelerator, etc.)

jax:    0.5.0
jaxlib: 0.5.0
numpy:  1.26.4
python: 3.12.3 (main, Sep 10 2024, 15:47:39) [GCC 13.2.0]
device info: NVIDIA GeForce RTX 4090-1, 1 local devices
process_count: 1
platform: uname_result(system='Linux', node='btaba.mtv.corp.google.com', release='6.10.11-1rodete2-amd64', version='#1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1rodete2 (2024-10-16)', machine='x86_64')


$ nvidia-smi
Tue Jan 28 13:30:18 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:61:00.0 Off |                  Off |
|  0%   50C    P2              22W / 450W |    397MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2663746      C   .../.pyenv/versions/mjx-312/bin/python      390MiB |
+---------------------------------------------------------------------------------------+
@btaba
Author

btaba commented Jan 28, 2025

@kevinzakka

@mattjj Would appreciate your help here 🙏

@jreiffers
Contributor

jreiffers commented Jan 31, 2025

HLO reproducer: https://gist.github.com/jreiffers/b6b8427ef64c00e688e79fd5af25b571

bazel run --config=cuda //xla/tools:run_hlo_module -c opt  -- --xla_disable_all_hlo_passes --input_format=hlo --random_init_input_literals --platform=CUDA /tmp/repro.hlo

I'll take a look at which pass is blowing up. Hopefully it's not in LLVM :-).

@jreiffers
Contributor

I initially suspected the unnecessary concatenate at the end, but the slow compilation happens even when that's fixed manually. There are at least two issues here:

  1. codegen/emitters/computation_partitioner.cc does a really poor job partitioning this, because of the bitcasts above the concatenate. The partitioner does not track the real indexing, so after the bitcast it considers the inputs to the bitcasts (e.g. add.56926.1.clone.1) to have users with inconsistent indexing (the root tuple and the bitcast), which cascades all the way to the parameters. We end up with lots of functions even though ~everything is elementwise.
  2. Afterwards, inlining breaks, generating a huge function. I stopped it after a few hundred thousand ops. Not entirely sure yet what's happening here, maybe failed/insufficient CSE or canonicalization.

The correct fix for 1. is to use proper indexing maps in the partitioner. A quick hack is to change all_users_elementwise in computation_partitioner.cc to this:

    // Quick hack: also treat bitcast (and tuple) users as elementwise, so the
    // bitcasts above the concatenate no longer force function boundaries.
    bool all_users_elementwise =
        absl::c_all_of(instr->users(), [&](const HloInstruction* user) {
          return HloInstruction::IsOpElementwise(user->opcode()) ||
                 user->opcode() == HloOpcode::kTuple ||
                 user->opcode() == HloOpcode::kBitcast;
        });

However, I'm not sure that's safe. @pifon2a could you give that a try and see what breaks?
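
To make point 1 concrete, here is a purely illustrative Python model of that partitioning decision (this is not XLA code; the opcode names and the helper are made up). A value whose users don't all appear to index it consistently gets outlined into its own function, and because the partitioner doesn't see through bitcast, a value feeding both the root tuple and a bitcast triggers the split:

def all_users_elementwise(user_opcodes, bitcast_is_elementwise):
    ok = {"add", "multiply", "tuple"}
    if bitcast_is_elementwise:
        ok.add("bitcast")
    return all(op in ok for op in user_opcodes)

# add.56926.1.clone.1 is used by the root tuple and a bitcast:
users = ["tuple", "bitcast"]
print(all_users_elementwise(users, bitcast_is_elementwise=False))  # False -> outlined
print(all_users_elementwise(users, bitcast_is_elementwise=True))   # True  -> stays fused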

@pifon2a
Contributor

pifon2a commented Jan 31, 2025

@jreiffers Thank you for the reproducer. Trying.

@pifon2a
Contributor

pifon2a commented Jan 31, 2025

The quick hack did not work. I'll try to migrate us to the indexing maps next week.

@btaba
Author

btaba commented Jan 31, 2025

Thanks @jreiffers and @pifon2a for taking a look, really appreciate it!

@kevinzakka

+1, thank you!

@pifon2a
Contributor

pifon2a commented Feb 14, 2025

So, I have a new version of the computation partitioner that relies entirely on indexing maps. That did not help; it even outlines the same number of functions. Disabling the inliner helps, and compilation is fast. I will check what's happening there.

@pifon2a
Contributor

pifon2a commented Feb 18, 2025

Ok, the issue was within the inliner interface itself (XlaInlinerInterface::isLegalToInline). I will upload the fix today after checking that I did not regress anything.

copybara-service bot pushed a commit to openxla/xla that referenced this issue Feb 18, 2025
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this issue Feb 18, 2025
@pifon2a
Contributor

pifon2a commented Feb 18, 2025

The fix makes compilation of one of the JAX tests slow. Everything else got better... Looking into it.

copybara-service bot pushed a commit to openxla/xla that referenced this issue Feb 21, 2025
At the moment we inline a callee that calls one of the functions called by the caller. This PR adjusts the logic to inline only such callees that call a subset of the caller's functions. That way we can be sure that after inlining the caller calls the same set of functions, excluding the inlined one.

Background: jax-ml/jax#26162 contains an example of a MoF fusion that takes forever to compile.

The [indexing-based partitioner](openxla/xla@44bc816) in combination with this change fixes the issue.

PiperOrigin-RevId: 729079659
copybara-service bot pushed a commit to openxla/xla that referenced this issue Feb 24, 2025
Inline only if there is more than one call to the callee in the caller.

Background: jax-ml/jax#26162 contains an example of a MoF fusion that takes forever to compile.

The [indexing-based partitioner](openxla/xla@44bc816) in combination with this change fixes the issue.

PiperOrigin-RevId: 730436982
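
For reference, a minimal Python sketch (illustrative only, not the actual XlaInlinerInterface::isLegalToInline code) of the two legality rules described in the commit messages above; call graphs are modeled as a mapping from a function name to the list of functions it calls:

def is_legal_to_inline(call_graph, caller, callee):
    # Rule 1: the callee may only call functions the caller already calls,
    # so inlining never introduces new callees into the caller.
    if not set(call_graph[callee]) <= set(call_graph[caller]):
        return False
    # Rule 2: inline only if the caller contains more than one call site
    # of the callee.
    return call_graph[caller].count(callee) > 1

graph = {"caller": ["f", "f", "g"], "f": ["g"], "g": []}
print(is_legal_to_inline(graph, "caller", "f"))  # True: subset of callees, 2 call sites
print(is_legal_to_inline(graph, "caller", "g"))  # False: only one call site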
@pifon2a
Contributor

pifon2a commented Feb 24, 2025

The indexing-map-based partitioner and the tweaked inliner fixed the issue. Let me know if you still have problems with this.

saisindhuri91 added a commit to linux-on-ibm-z/tensorflow that referenced this issue Feb 26, 2025