Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator #3766

Merged
merged 3 commits into from
Feb 4, 2025

Conversation

naoyam
Copy link
Collaborator

@naoyam naoyam commented Jan 28, 2025

Moved getConsumerPosAlignedToProducerCA to MaxPosCalculator to reuse an IdModel built for the entire fusion.

No output should be affected, but there are two main reasons of the change, one is efficiency and another is functionality.

  • Previously, each call to TensorView::updateMaxProducerPosition created a (small) IdModel, which is inefficient since in common cases updateMaxProducerPosition is called from routines that already holds an IdModel through MaxPosCalculator. Moving the function to MaxPosCalculator can reuse the same IdModel.
  • A local IdModel that is created by just looking at a single expression may miss exact mappings previously registered through Fusion::registerExactMapping. For example:
t0: [i0]
t1: [i1, i2]

t2 = broadcast(t0, {false, true});
// t1: [i3, b4]

t3 = t1 + t2
// [i5, i7]

t4 = t3 + 1
// [i7, i8]

In this case, there are exact groups of {i0, i1, i3, i5, i7}, {i2, i7, i8}and {b4}. We could set the loop domain of t2 look like t4 by scheduleLoopDomainsBy({t2}, t4->getLoopDomain()), which would clone i8 and set the loop domain of t2 as {i3, i9}, where i9 is a clone of i8 and is registered as exact mapped with i8. This should all be fine as long as we looked at the whole fusion, but if an IdModel was built only for t3 = t1 + t2, the loop domain of t2, {i3, i9}, wouldn't be grouped together with the loop domain of t3, {i5, i7}. It would discover the mapping of i3 and i5, but it wouldn't know anything between i9 and i7. The registered exact mapping wouldn't help because it's registered with a pair of i8 and i9. It is obvious that i8 is mapped with i7, only when the IdModel analysis also looks at the t4 definition. When an IdModel is created only for t3 = t1 + t2, the registered mapping would do nothing. This problem can be avoided by using a whole-fusion IdModel.

Copy link

github-actions bot commented Jan 28, 2025

PR Reviewer Guide 🔍

(Review updated until commit 612ab0f)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
⚡ Recommended focus areas for review

Possible Performance Issue

The getConsumerPosAlignedToProducerCA function has been moved to MaxPosCalculator to reuse an IdModel built for the entire fusion. However, the performance impact of this change is not explicitly stated. It would be beneficial to include a comment or a note explaining the expected performance gain from this change.

// Try to find the aligned position on consumer's domain corresponding to a
//  position of producer domain. No checking on actual
//  producer-consumer relationship.
int64_t MaxPosCalculator::getConsumerPosAlignedToProducerCA(
    TensorView* consumer,
    TensorView* producer,
    int64_t producer_pos) {
  // Locate consumer's position that aligns with
  //  the producer's position. We need broadcast axes forwarded so we
  //  need to replay PasC as CasP will not forward braodcast dims. For example
  //  if we have:
  // T2[ iS22{( 3 * 1 )} ] ca_pos( 1 ) = broadcast( T1[ iS1{3} ] ca_pos( 1 )
  // produce_pos( 1) ) CasP will have the mapping iS1{3} -> iS2{3} and PasC will
  // have the mapping iS22{( 3 * 1 )} <- iS1{3} We need the latter. Refer to
  // NVFuserTest.FusionComplexBCast1_CUDA

  int64_t consumer_pos = consumer->nDims();

  const bool may_need_forwarding =
      ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(producer) &&
      ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(consumer);

  if (may_need_forwarding) {
    auto disjoint_sets = BestEffortReplay::replayPasC(
                             producer,
                             consumer,
                             -1,
                             PairwiseLogicalDomainMap(producer, consumer))
                             .getIterDomainEquivalence();

    // Find the innermost position of consumer that has
    //  been mapped within the producer ca axis.

    while (consumer_pos > 0) {
      auto consumer_id = consumer->axis(consumer_pos - 1);
      const auto& p_dom = producer->getLoopDomain();
      if (std::any_of(
              p_dom.begin(),
              p_dom.begin() + producer_pos,
              [&consumer_id, &disjoint_sets](IterDomain* p_id) {
                return disjoint_sets.permissiveAreMapped(consumer_id, p_id);
              })) {
        break;
      }
      consumer_pos--;
    }
  } else {
    while (consumer_pos > 0) {
      auto consumer_id = consumer->axis(consumer_pos - 1);
      const auto& p_dom = producer->getLoopDomain();
      if (std::any_of(
              p_dom.begin(),
              p_dom.begin() + producer_pos,
              [&](IterDomain* p_id) {
                return inliningGraph().disjointValSets().strictAreMapped(
                    consumer_id, p_id);
              })) {
        break;
      }
      consumer_pos--;
    }
  }

  return consumer_pos;
}
Potential Bug

In the updateMaxProducerPosition function, the calc pointer is not checked for null before being used. This could lead to a null pointer dereference if calc is null. It would be beneficial to add a null check before using the calc pointer.

void TensorView::updateMaxProducerPosition(MaxPosCalculator* calc) {
  std::unique_ptr<MaxPosCalculator> calc_owner;
  if (calc == nullptr) {
    calc_owner = std::make_unique<MaxPosCalculator>();
    calc = calc_owner.get();
  }

  for (auto producer : ir_utils::producerTvsOf(this)) {
    max_producer_pos_ = std::max(
        max_producer_pos_,
        calc->getConsumerPosAlignedToProducerCA(
            this, producer, producer->getComputePosition(this)));
  }

  maybe_max_producer_pos_ = max_producer_pos_;

  // When a producer may be computed with this tensor, i.e., it isn't
  // yet resolved, reflect that in maybe_max_producer_pos_. If all
  // producers are already resolved, i.e., after the initial
  // resolveComputeWith in lowering, this should be just equal to
  // max_producer_pos_.
  for (auto producer : ir_utils::producerTvsOf(this)) {
    if (producer->hasComputeWith() && !producer->hasResolvedComputeWith()) {
      maybe_max_producer_pos_ = std::max(
          maybe_max_producer_pos_,
          calc->getConsumerPosAlignedToProducerCA(
              this, producer, producer->getComputeWithPosition()));
    }
  }
}

@naoyam
Copy link
Collaborator Author

naoyam commented Jan 28, 2025

!test --diff

// Try to find the aligned position on consumer's domain corresponding to a
// position of producer domain. No checking on actual
// producer-consumer relationship.
int64_t getConsumerPosAlignedToProducerCA(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is moved to MaxPosCalculator, except for one difference noted below.

Comment on lines -249 to -254
IdModel id_model({consumer->definition()}, {}, false);
id_model.buildBroadcastGraph();
const auto& inlining_graph = id_model.idGraph(IdMappingMode::BROADCAST);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is removed when moved to MaxPosCalculator. We need an IdModel to determine valid inlining positions. Previously, it created a per-tensor IdModel here. Since it's now moved to MaxPosCalculator, it doesn't need to create a new IdModel as long as the same instance of MaxPosCalculator is reused.

@naoyam naoyam marked this pull request as ready for review January 28, 2025 04:09
@naoyam naoyam requested a review from zasdfgbnm January 28, 2025 04:09
@xwang233
Copy link
Collaborator

Suggested title

Move getConsumerPosAlignedToProducerCA to MaxPosCalculator for efficiency and functionality


Description

  • Moved getConsumerPosAlignedToProducerCA to MaxPosCalculator

  • Updated TensorView::updateMaxProducerPosition to accept MaxPosCalculator

  • Improved efficiency by reusing IdModel

  • Ensured exact mappings are considered across the entire fusion


Changes walkthrough 📝

Relevant files
Enhancement
inlining.cpp
Added `getConsumerPosAlignedToProducerCA` to `MaxPosCalculator`

csrc/scheduler/tools/inlining.cpp

  • Added getConsumerPosAlignedToProducerCA method to MaxPosCalculator
  • +66/-0   
    tensor_view.cpp
    Updated `updateMaxProducerPosition` to use `MaxPosCalculator`

    csrc/tensor_view.cpp

  • Updated updateMaxProducerPosition to accept MaxPosCalculator
  • Removed redundant getConsumerPosAlignedToProducerCA function
  • +8/-76   
    interface_nodes.h
    Modified `updateMaxProducerPosition` signature                     

    csrc/ir/interface_nodes.h

  • Modified updateMaxProducerPosition to accept MaxPosCalculator
    parameter
  • +1/-1     
    inlining.h
    Added `getConsumerPosAlignedToProducerCA` declaration       

    csrc/scheduler/tools/inlining.h

  • Added getConsumerPosAlignedToProducerCA declaration to
    MaxPosCalculator
  • +5/-0     

    @naoyam
    Copy link
    Collaborator Author

    naoyam commented Jan 29, 2025

    !test --diff

    @naoyam
    Copy link
    Collaborator Author

    naoyam commented Feb 4, 2025

    !test --diff

    Copy link

    github-actions bot commented Feb 4, 2025

    Description

    • Moved getConsumerPosAlignedToProducerCA to MaxPosCalculator

    • Updated TensorView::updateMaxProducerPosition to use MaxPosCalculator

    • Added MaxPosCalculator* parameter to TensorView::updateMaxProducerPosition


    Changes walkthrough 📝

    Relevant files
    Enhancement
    inlining.cpp
    Added `getConsumerPosAlignedToProducerCA` to `MaxPosCalculator`

    csrc/scheduler/tools/inlining.cpp

  • Added getConsumerPosAlignedToProducerCA method to MaxPosCalculator
  • +66/-0   
    tensor_view.cpp
    Updated `updateMaxProducerPosition` to use `MaxPosCalculator`

    csrc/tensor_view.cpp

  • Updated TensorView::updateMaxProducerPosition to use MaxPosCalculator
  • Added MaxPosCalculator* parameter to
    TensorView::updateMaxProducerPosition
  • +8/-76   
    interface_nodes.h
    Updated `updateMaxProducerPosition` signature                       

    csrc/ir/interface_nodes.h

  • Updated updateMaxProducerPosition method signature to include
    MaxPosCalculator* parameter
  • +1/-1     
    inlining.h
    Added `getConsumerPosAlignedToProducerCA` declaration       

    csrc/scheduler/tools/inlining.h

  • Added getConsumerPosAlignedToProducerCA method declaration to
    MaxPosCalculator
  • +5/-0     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Performance Impact

    The PR claims efficiency improvements by reusing an IdModel built for the entire fusion. However, the performance impact of this change should be quantitatively evaluated and compared against the previous implementation.

    // Try to find the aligned position on consumer's domain corresponding to a
    //  position of producer domain. No checking on actual
    //  producer-consumer relationship.
    int64_t MaxPosCalculator::getConsumerPosAlignedToProducerCA(
        TensorView* consumer,
        TensorView* producer,
        int64_t producer_pos) {
      // Locate consumer's position that aligns with
      //  the producer's position. We need broadcast axes forwarded so we
      //  need to replay PasC as CasP will not forward braodcast dims. For example
      //  if we have:
      // T2[ iS22{( 3 * 1 )} ] ca_pos( 1 ) = broadcast( T1[ iS1{3} ] ca_pos( 1 )
      // produce_pos( 1) ) CasP will have the mapping iS1{3} -> iS2{3} and PasC will
      // have the mapping iS22{( 3 * 1 )} <- iS1{3} We need the latter. Refer to
      // NVFuserTest.FusionComplexBCast1_CUDA
    
      int64_t consumer_pos = consumer->nDims();
    
      const bool may_need_forwarding =
          ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(producer) &&
          ir_utils::isLoopDomainFullyDerivedFromLogicalDomain(consumer);
    
      if (may_need_forwarding) {
        auto disjoint_sets = BestEffortReplay::replayPasC(
                                 producer,
                                 consumer,
                                 -1,
                                 PairwiseLogicalDomainMap(producer, consumer))
                                 .getIterDomainEquivalence();
    
        // Find the innermost position of consumer that has
        //  been mapped within the producer ca axis.
    
        while (consumer_pos > 0) {
          auto consumer_id = consumer->axis(consumer_pos - 1);
          const auto& p_dom = producer->getLoopDomain();
          if (std::any_of(
                  p_dom.begin(),
                  p_dom.begin() + producer_pos,
                  [&consumer_id, &disjoint_sets](IterDomain* p_id) {
                    return disjoint_sets.permissiveAreMapped(consumer_id, p_id);
                  })) {
            break;
          }
          consumer_pos--;
        }
      } else {
        while (consumer_pos > 0) {
          auto consumer_id = consumer->axis(consumer_pos - 1);
          const auto& p_dom = producer->getLoopDomain();
          if (std::any_of(
                  p_dom.begin(),
                  p_dom.begin() + producer_pos,
                  [&](IterDomain* p_id) {
                    return inliningGraph().disjointValSets().strictAreMapped(
                        consumer_id, p_id);
                  })) {
            break;
          }
          consumer_pos--;
        }
      }
    
      return consumer_pos;
    Test Coverage

    The PR modifies the updateMaxProducerPosition method to accept a MaxPosCalculator pointer. Ensure that new tests are added to cover scenarios where this new parameter is used, especially when calc is nullptr.

    void TensorView::updateMaxProducerPosition(MaxPosCalculator* calc) {
      std::unique_ptr<MaxPosCalculator> calc_owner;
      if (calc == nullptr) {
        calc_owner = std::make_unique<MaxPosCalculator>();
        calc = calc_owner.get();
      }
    
      for (auto producer : ir_utils::producerTvsOf(this)) {
        max_producer_pos_ = std::max(
            max_producer_pos_,
            calc->getConsumerPosAlignedToProducerCA(
                this, producer, producer->getComputePosition(this)));
      }
    
      maybe_max_producer_pos_ = max_producer_pos_;
    Code Duplication

    The logic for determining the aligned position on the consumer's domain corresponding to a position of the producer domain is now duplicated in MaxPosCalculator::getConsumerPosAlignedToProducerCA and TensorView::updateMaxProducerPosition. Consider refactoring to avoid duplication.

    void TensorView::updateMaxProducerPosition(MaxPosCalculator* calc) {
      std::unique_ptr<MaxPosCalculator> calc_owner;
      if (calc == nullptr) {
        calc_owner = std::make_unique<MaxPosCalculator>();
        calc = calc_owner.get();
      }
    
      for (auto producer : ir_utils::producerTvsOf(this)) {
        max_producer_pos_ = std::max(
            max_producer_pos_,
            calc->getConsumerPosAlignedToProducerCA(
                this, producer, producer->getComputePosition(this)));
      }
    
      maybe_max_producer_pos_ = max_producer_pos_;

    @naoyam naoyam merged commit a1baafa into main Feb 4, 2025
    58 of 60 checks passed
    @naoyam naoyam deleted the move_update_max_pos_to_max_pos_calc branch February 4, 2025 20:02
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    None yet
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants