Move getConsumerPosAlignedToProducerCA to MaxPosCalculator #3766

naoyam · 2025-01-28T03:04:57Z

Moved getConsumerPosAlignedToProducerCA to MaxPosCalculator to reuse an IdModel built for the entire fusion.

No output should be affected, but there are two main reasons of the change, one is efficiency and another is functionality.

Previously, each call to TensorView::updateMaxProducerPosition created a (small) IdModel, which is inefficient since in common cases updateMaxProducerPosition is called from routines that already holds an IdModel through MaxPosCalculator. Moving the function to MaxPosCalculator can reuse the same IdModel.
A local IdModel that is created by just looking at a single expression may miss exact mappings previously registered through Fusion::registerExactMapping. For example:

t0: [i0]
t1: [i1, i2]

t2 = broadcast(t0, {false, true});
// t1: [i3, b4]

t3 = t1 + t2
// [i5, i7]

t4 = t3 + 1
// [i7, i8]

In this case, there are exact groups of {i0, i1, i3, i5, i7}, {i2, i7, i8}and {b4}. We could set the loop domain of t2 look like t4 by scheduleLoopDomainsBy({t2}, t4->getLoopDomain()), which would clone i8 and set the loop domain of t2 as {i3, i9}, where i9 is a clone of i8 and is registered as exact mapped with i8. This should all be fine as long as we looked at the whole fusion, but if an IdModel was built only for t3 = t1 + t2, the loop domain of t2, {i3, i9}, wouldn't be grouped together with the loop domain of t3, {i5, i7}. It would discover the mapping of i3 and i5, but it wouldn't know anything between i9 and i7. The registered exact mapping wouldn't help because it's registered with a pair of i8 and i9. It is obvious that i8 is mapped with i7, only when the IdModel analysis also looks at the t4 definition. When an IdModel is created only for t3 = t1 + t2, the registered mapping would do nothing. This problem can be avoided by using a whole-fusion IdModel.

github-actions · 2025-01-28T03:05:41Z

PR Reviewer Guide 🔍

(Review updated until commit `612ab0f`)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪

🧪 No relevant tests

⚡ Recommended focus areas for review

Possible Performance Issue

The getConsumerPosAlignedToProducerCA function has been moved to MaxPosCalculator to reuse an IdModel built for the entire fusion. However, the performance impact of this change is not explicitly stated. It would be beneficial to include a comment or a note explaining the expected performance gain from this change.

// Try to find the aligned position on consumer's domain corresponding to a
//  position of producer domain. No checking on actual
//  producer-consumer relationship.
int64_t MaxPosCalculator::getConsumerPosAlignedToProducerCA(
    TensorView* consumer,
    TensorView* producer,
    int64_t producer_pos) {
  // Locate consumer's position that aligns with
  //  the producer's position. We need broadcast axes forwarded so we
  //  need to replay PasC as CasP will not forward braodcast dims. For example
  //  if we have:
  // T2[ iS22{( 3 * 1 )} ] ca_pos( 1 ) = broadcast( T1[ iS1{3} ] ca_pos( 1 )
  // produce_pos( 1) ) CasP will have the mapping iS1{3} -> iS2{3} and PasC will
  // have the mapping iS22{( 3 * 1 )} <- iS1{3} We need the latter. Refer to
  // NVFuserTest.FusionComplexBCast1_CUDA

  int64_t consumer_pos = consumer->nDims();

  const bool may_need_forwarding =
      ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(producer) &&
      ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(consumer);

  if (may_need_forwarding) {
    auto disjoint_sets = BestEffortReplay::replayPasC(
                             producer,
                             consumer,
                             -1,
                             PairwiseLogicalDomainMap(producer, consumer))
                             .getIterDomainEquivalence();

    // Find the innermost position of consumer that has
    //  been mapped within the producer ca axis.

    while (consumer_pos > 0) {
      auto consumer_id = consumer->axis(consumer_pos - 1);
      const auto& p_dom = producer->getLoopDomain();
      if (std::any_of(
              p_dom.begin(),
              p_dom.begin() + producer_pos,
              [&consumer_id, &disjoint_sets](IterDomain* p_id) {
                return disjoint_sets.permissiveAreMapped(consumer_id, p_id);
              })) {
        break;
      }
      consumer_pos--;
    }
  } else {
    while (consumer_pos > 0) {
      auto consumer_id = consumer->axis(consumer_pos - 1);
      const auto& p_dom = producer->getLoopDomain();
      if (std::any_of(
              p_dom.begin(),
              p_dom.begin() + producer_pos,
              [&](IterDomain* p_id) {
                return inliningGraph().disjointValSets().strictAreMapped(
                    consumer_id, p_id);
              })) {
        break;
      }
      consumer_pos--;
    }
  }

  return consumer_pos;
}

Potential Bug

In the updateMaxProducerPosition function, the calc pointer is not checked for null before being used. This could lead to a null pointer dereference if calc is null. It would be beneficial to add a null check before using the calc pointer.

void TensorView::updateMaxProducerPosition(MaxPosCalculator* calc) {
  std::unique_ptr<MaxPosCalculator> calc_owner;
  if (calc == nullptr) {
    calc_owner = std::make_unique<MaxPosCalculator>();
    calc = calc_owner.get();
  }

  for (auto producer : ir_utils::producerTvsOf(this)) {
    max_producer_pos_ = std::max(
        max_producer_pos_,
        calc->getConsumerPosAlignedToProducerCA(
            this, producer, producer->getComputePosition(this)));
  }

  maybe_max_producer_pos_ = max_producer_pos_;

  // When a producer may be computed with this tensor, i.e., it isn't
  // yet resolved, reflect that in maybe_max_producer_pos_. If all
  // producers are already resolved, i.e., after the initial
  // resolveComputeWith in lowering, this should be just equal to
  // max_producer_pos_.
  for (auto producer : ir_utils::producerTvsOf(this)) {
    if (producer->hasComputeWith() && !producer->hasResolvedComputeWith()) {
      maybe_max_producer_pos_ = std::max(
          maybe_max_producer_pos_,
          calc->getConsumerPosAlignedToProducerCA(
              this, producer, producer->getComputeWithPosition()));
    }
  }
}

naoyam · 2025-01-28T03:12:48Z

!test --diff

naoyam · 2025-01-28T03:44:01Z

csrc/tensor_view.cpp

-// Try to find the aligned position on consumer's domain corresponding to a
-//  position of producer domain. No checking on actual
-//  producer-consumer relationship.
-int64_t getConsumerPosAlignedToProducerCA(


This function is moved to MaxPosCalculator, except for one difference noted below.

naoyam · 2025-01-28T03:46:35Z

csrc/tensor_view.cpp

-    IdModel id_model({consumer->definition()}, {}, false);
-    id_model.buildBroadcastGraph();
-    const auto& inlining_graph = id_model.idGraph(IdMappingMode::BROADCAST);


This part is removed when moved to MaxPosCalculator. We need an IdModel to determine valid inlining positions. Previously, it created a per-tensor IdModel here. Since it's now moved to MaxPosCalculator, it doesn't need to create a new IdModel as long as the same instance of MaxPosCalculator is reused.

xwang233 · 2025-01-28T19:16:32Z

Suggested title

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator for efficiency and functionality

Description

Moved getConsumerPosAlignedToProducerCA to MaxPosCalculator
Updated TensorView::updateMaxProducerPosition to accept MaxPosCalculator
Improved efficiency by reusing IdModel
Ensured exact mappings are considered across the entire fusion

Changes walkthrough 📝

Relevant files

Enhancement

inlining.cpp Added `getConsumerPosAlignedToProducerCA` to `MaxPosCalculator` csrc/scheduler/tools/inlining.cpp Added `getConsumerPosAlignedToProducerCA` method to `MaxPosCalculator`	+66/-0
tensor_view.cpp Updated `updateMaxProducerPosition` to use `MaxPosCalculator` csrc/tensor_view.cpp Updated `updateMaxProducerPosition` to accept `MaxPosCalculator` Removed redundant `getConsumerPosAlignedToProducerCA` function	+8/-76
interface_nodes.h Modified `updateMaxProducerPosition` signature csrc/ir/interface_nodes.h Modified `updateMaxProducerPosition` to accept `MaxPosCalculator` parameter	+1/-1
inlining.h Added `getConsumerPosAlignedToProducerCA` declaration csrc/scheduler/tools/inlining.h Added `getConsumerPosAlignedToProducerCA` declaration to `MaxPosCalculator`	+5/-0

naoyam · 2025-01-29T04:24:52Z

!test --diff

naoyam · 2025-02-04T16:47:29Z

!test --diff

github-actions · 2025-02-04T16:48:06Z

Description

Moved getConsumerPosAlignedToProducerCA to MaxPosCalculator
Updated TensorView::updateMaxProducerPosition to use MaxPosCalculator
Added MaxPosCalculator* parameter to TensorView::updateMaxProducerPosition

Changes walkthrough 📝

Relevant files

Enhancement

inlining.cpp Added `getConsumerPosAlignedToProducerCA` to `MaxPosCalculator` csrc/scheduler/tools/inlining.cpp Added `getConsumerPosAlignedToProducerCA` method to `MaxPosCalculator`	+66/-0
tensor_view.cpp Updated `updateMaxProducerPosition` to use `MaxPosCalculator` csrc/tensor_view.cpp Updated `TensorView::updateMaxProducerPosition` to use `MaxPosCalculator` Added `MaxPosCalculator*` parameter to `TensorView::updateMaxProducerPosition`	+8/-76
interface_nodes.h Updated `updateMaxProducerPosition` signature csrc/ir/interface_nodes.h Updated `updateMaxProducerPosition` method signature to include `MaxPosCalculator*` parameter	+1/-1
inlining.h Added `getConsumerPosAlignedToProducerCA` declaration csrc/scheduler/tools/inlining.h Added `getConsumerPosAlignedToProducerCA` method declaration to `MaxPosCalculator`	+5/-0

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests

⚡ Recommended focus areas for review

Performance Impact

The PR claims efficiency improvements by reusing an IdModel built for the entire fusion. However, the performance impact of this change should be quantitatively evaluated and compared against the previous implementation.

// Try to find the aligned position on consumer's domain corresponding to a
//  position of producer domain. No checking on actual
//  producer-consumer relationship.
int64_t MaxPosCalculator::getConsumerPosAlignedToProducerCA(
    TensorView* consumer,
    TensorView* producer,
    int64_t producer_pos) {
  // Locate consumer's position that aligns with
  //  the producer's position. We need broadcast axes forwarded so we
  //  need to replay PasC as CasP will not forward braodcast dims. For example
  //  if we have:
  // T2[ iS22{( 3 * 1 )} ] ca_pos( 1 ) = broadcast( T1[ iS1{3} ] ca_pos( 1 )
  // produce_pos( 1) ) CasP will have the mapping iS1{3} -> iS2{3} and PasC will
  // have the mapping iS22{( 3 * 1 )} <- iS1{3} We need the latter. Refer to
  // NVFuserTest.FusionComplexBCast1_CUDA

  int64_t consumer_pos = consumer->nDims();

  const bool may_need_forwarding =
      ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(producer) &&
      ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(consumer);

  if (may_need_forwarding) {
    auto disjoint_sets = BestEffortReplay::replayPasC(
                             producer,
                             consumer,
                             -1,
                             PairwiseLogicalDomainMap(producer, consumer))
                             .getIterDomainEquivalence();

    // Find the innermost position of consumer that has
    //  been mapped within the producer ca axis.

    while (consumer_pos > 0) {
      auto consumer_id = consumer->axis(consumer_pos - 1);
      const auto& p_dom = producer->getLoopDomain();
      if (std::any_of(
              p_dom.begin(),
              p_dom.begin() + producer_pos,
              [&consumer_id, &disjoint_sets](IterDomain* p_id) {
                return disjoint_sets.permissiveAreMapped(consumer_id, p_id);
              })) {
        break;
      }
      consumer_pos--;
    }
  } else {
    while (consumer_pos > 0) {
      auto consumer_id = consumer->axis(consumer_pos - 1);
      const auto& p_dom = producer->getLoopDomain();
      if (std::any_of(
              p_dom.begin(),
              p_dom.begin() + producer_pos,
              [&](IterDomain* p_id) {
                return inliningGraph().disjointValSets().strictAreMapped(
                    consumer_id, p_id);
              })) {
        break;
      }
      consumer_pos--;
    }
  }

  return consumer_pos;

Test Coverage

The PR modifies the updateMaxProducerPosition method to accept a MaxPosCalculator pointer. Ensure that new tests are added to cover scenarios where this new parameter is used, especially when calc is nullptr.

void TensorView::updateMaxProducerPosition(MaxPosCalculator* calc) {
  std::unique_ptr<MaxPosCalculator> calc_owner;
  if (calc == nullptr) {
    calc_owner = std::make_unique<MaxPosCalculator>();
    calc = calc_owner.get();
  }

  for (auto producer : ir_utils::producerTvsOf(this)) {
    max_producer_pos_ = std::max(
        max_producer_pos_,
        calc->getConsumerPosAlignedToProducerCA(
            this, producer, producer->getComputePosition(this)));
  }

  maybe_max_producer_pos_ = max_producer_pos_;

Code Duplication

The logic for determining the aligned position on the consumer's domain corresponding to a position of the producer domain is now duplicated in MaxPosCalculator::getConsumerPosAlignedToProducerCA and TensorView::updateMaxProducerPosition. Consider refactoring to avoid duplication.

void TensorView::updateMaxProducerPosition(MaxPosCalculator* calc) {
  std::unique_ptr<MaxPosCalculator> calc_owner;
  if (calc == nullptr) {
    calc_owner = std::make_unique<MaxPosCalculator>();
    calc = calc_owner.get();
  }

  for (auto producer : ir_utils::producerTvsOf(this)) {
    max_producer_pos_ = std::max(
        max_producer_pos_,
        calc->getConsumerPosAlignedToProducerCA(
            this, producer, producer->getComputePosition(this)));
  }

  maybe_max_producer_pos_ = max_producer_pos_;

Move updateMaxProducerPos to MaxPosCalculator

c757253

naoyam commented Jan 28, 2025

View reviewed changes

naoyam marked this pull request as ready for review January 28, 2025 04:09

naoyam requested a review from zasdfgbnm January 28, 2025 04:09

zasdfgbnm approved these changes Jan 28, 2025

View reviewed changes

Merge branch 'main' into move_update_max_pos_to_max_pos_calc

612ab0f

Merge branch 'main' into move_update_max_pos_to_max_pos_calc

34c4124

naoyam merged commit a1baafa into main Feb 4, 2025
58 of 60 checks passed

naoyam deleted the move_update_max_pos_to_max_pos_calc branch February 4, 2025 20:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator #3766

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator #3766

naoyam commented Jan 28, 2025 •

edited

Loading

github-actions bot commented Jan 28, 2025 •

edited

Loading

naoyam commented Jan 28, 2025

naoyam Jan 28, 2025

naoyam Jan 28, 2025

xwang233 commented Jan 28, 2025

naoyam commented Jan 29, 2025

naoyam commented Feb 4, 2025

github-actions bot commented Feb 4, 2025

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator #3766

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator #3766

Conversation

naoyam commented Jan 28, 2025 • edited Loading

github-actions bot commented Jan 28, 2025 • edited Loading

PR Reviewer Guide 🔍

(Review updated until commit 612ab0f)

naoyam commented Jan 28, 2025

naoyam Jan 28, 2025

Choose a reason for hiding this comment

naoyam Jan 28, 2025

Choose a reason for hiding this comment

xwang233 commented Jan 28, 2025

Suggested title

Description

Changes walkthrough 📝

naoyam commented Jan 29, 2025

naoyam commented Feb 4, 2025

github-actions bot commented Feb 4, 2025

Description

Changes walkthrough 📝

PR Reviewer Guide 🔍

naoyam commented Jan 28, 2025 •

edited

Loading

github-actions bot commented Jan 28, 2025 •

edited

Loading

(Review updated until commit `612ab0f`)