Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Scheduling propagation through broadcast IDs #3763

Draft
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

naoyam
Copy link
Collaborator

@naoyam naoyam commented Jan 27, 2025

No description provided.

Copy link

github-actions bot commented Jan 27, 2025

PR Reviewer Guide 🔍

(Review updated until commit bf6412e)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 No relevant tests
⚡ Recommended focus areas for review

Possible Issue

The loop promotion analysis may not work correctly for cyclic graphs.

bool isLoopGraphUniform(const IdModel& id_model) {
  for (const auto tv : id_model.tvs()) {
    if (tv->isFusionInput()) {
      continue;
    }
    for (const auto loop_id : tv->getLoopDomain()) {
      const auto& loop_group =
          id_model.idGraph(IdMappingMode::LOOP).toGroup(loop_id);
      const auto all_exact_groups =
          id_model.idGraph(IdMappingMode::EXACT).toGroups(*loop_group);
      if (all_exact_groups.size() > 1) {
        return false;
      }
    }
  }

  return true;
}
Possible Issue

The ValGraphVisitor class may not handle cyclic graphs correctly.

//
// [i0, 1]
// merge
// [i0*1]
// map i0 and i0*1
// ValGroups: {{i0, i0*1}, {1}}
//
// Then, {i0, i0*1} and {1} would be visited first, then the merge
// expr group would be visited. {i0, i0*1} is also an output group
// of the merge but since it's already in the visited set, it would
// not be visited again.
//
// Similarly, when there are five groups as shown belwo:
//
//   i0 -> i1  ->  i2 -> i3
//          ^       |
//          |- i4 <-+
//
//  (Edges: i0->i1, i1->i2, i2->i3, i2->i4, i4->i1)
//
// is_val_ready for i1 would ignore the incoming edge from i4. The
// traversal order would look like:
//
// i0->i1, i1->i2, i2->i3, i2->i4
//
// See also IdModelTest.ValGraphStmtSort3 for a concrete
// example. See IdModelTest.LoopPromotionWithCyclicGraph for some
// use cases of this traversal for the loop promotion analysis with
// cyclic graphs.
auto is_val_ready = [&](const ValGroup& val_group) -> bool {
  if (starting_groups.has(val_group)) {
    return true;
  }
  const ExprGroups& unique_defs = graph().getDefinitions(val_group);
  return std::all_of(
      unique_defs.begin(), unique_defs.end(), [&](ExprGroup expr_group) {
        if (expr_group->empty() || visited_exprs.has(expr_group)) {
          return true;
Possible Issue

The LoopDomainScheduler class may not handle cyclic graphs correctly.

 // Create the loop domain of a given tensor as specified by the
 // reference. The new loop domain is connected to the existing IDs of
 // the tensor by replaying exprs found in the ValGraph.
 void schedule(TensorView* tv) const;

 void scheduleByUpdatingLoopDomainOnly(TensorView* tv) const;

private:
 ValGraph& graph() const {
   if (update_loop_domain_only_) {
     id_model_->maybeBuildGraph(IdMappingMode::BROADCAST);
     return id_model_->idGraph(IdMappingMode::BROADCAST);
   } else {
     return id_model_->idGraph(IdMappingMode::EXACT);
   }
 }

 ValGraphBFS::ExprPath getReplayPath(TensorView* tv) const;

 // Replay an ExprGroup with given lists of input and output
 // groups. NOte that inputs and outputs are based on a given
 // direction. If it's Backward, the given inputs are used as the
 // outputs of the expr.
 Expr* replay(
     const ExprGroup& expr_g,
     Direction dir,
     const ValGroups& input_groups,
     const ValGroups& output_groups,
     const std::unordered_map<ValGroup, IterDomain*>& group_to_id) const {
   std::vector<IterDomain*> inputs;
   std::vector<IterDomain*> outputs;
   std::transform(
       input_groups.begin(),
       input_groups.end(),
       std::back_inserter(inputs),
       [&](const ValGroup& input_g) -> IterDomain* {
         return group_to_id.at(input_g);
       });
   std::transform(
       output_groups.begin(),
       output_groups.end(),
       std::back_inserter(outputs),
       [&](const ValGroup& output_g) -> IterDomain* {
         return group_to_id.at(output_g);
       });
   Expr* replayed_expr = LoopDomainSchedulerReplayTransform::replayAs(
       dir == Direction::Forward ? inputs : outputs,
       dir == Direction::Forward ? outputs : inputs,
       expr_g->front());
   return replayed_expr;
 }

private:
 std::vector<IterDomain*> ref_loop_dom_;
 // If true, uses the current loop domain as the starting domain and
 // updates it to make it look like the given reference loop domain
 bool update_loop_domain_only_ = false;

liqiangxl and others added 8 commits January 27, 2025 08:54
**Background**
To avoid running Hopper matmul tests on Blackwell GPUs (there are
unsupported ptx), `HopperBase` was changed from `Hopper & Newer` to
`Hopper Only` at
[#3754](https://github.com/NVIDIA/Fuser/pull/3754/files#diff-fb98e47b0b389ef77407ef12badf08c42718a0e81dfce9ac5a6eca06feaca1c4R616-R617)
.

**Changes in this PR**
To still keep testing `TMA` on both Hopper and Blackwell GPUs, in this
PR, the parent class of `TMATest` is changed from `HopperBase` to
`TMABase`. `TMABase` allows running tests on `Hopper & Newer`
Just found some unintuitive traversal with the permissive BFS. For
example, when there's a graph like:

```
a -> b, c (e.g., split)
```

When traversing from just `{b}`, the normal BFS won't allow any move
since the backward traversal requires both `b` and `c`. The permissive
BFS, on the other hand, allows to visit `a` since it allows traversal
whenever there's at least one node already visited.

That's all I had in mind when I added the permissive BFS, but it turned
out that it also visits `c` as well. The first move is `b, c -> a` while
allowing the missing `c`, and the second move is `a -> b, c` since `a`
is now visited and `c` is not yet visited. This doesn't seem to make
sense since the reason `a` is visited is because we take the backward
traversal of the edge. That in turn allows the forward traversal doesn't
seem to be the right thing to do.

While I'm not aware of any particular impact due to this behavior, this
PR prevents such traversal by checking if any of Val nodes is already
marked as a previous node of an Expr node.
@naoyam naoyam force-pushed the schedule_loop_domains_broadcast branch from a8d3a41 to 34a57d5 Compare January 28, 2025 04:28
Copy link

github-actions bot commented Feb 1, 2025

Description

  • Enabled IdModel by default in lower2device.cpp.

  • Updated validation.cpp to use IdMappingMode::ALMOSTEXACT and removed an unnecessary error check.

  • Refactored indexing.cpp to use a new method for dumping graphviz dot files.

  • Enhanced indexing_traversal.cpp to handle cases where IDs are not found in producer or consumer tensors.

  • Modified loop_promotion.cpp to improve loop graph uniformity checks and added methods for handling input groups.

  • Updated resize.cpp to include a new scheduling path and added debug prints.

  • Added new methods and logic in inlining.cpp for calculating consumer positions aligned to producers.

  • Enhanced loop_domain_scheduler.cpp to support scheduling by updating loop domains only.

  • Refactored tensor_view.cpp to update max producer positions with an optional calculator.

  • Added new BFS traversal methods in bfs.h for strict dependence.

  • Updated id_model.h to include a method for retrieving TV expressions.

  • Enhanced loop_promotion.h with methods for computing covered groups and handling input groups.

  • Updated interface_nodes.h to make updateMaxProducerPosition accept an optional calculator.

  • Added a new method in inlining.h for calculating consumer positions aligned to producers.

  • Enhanced loop_domain_scheduler.h to support scheduling by a given path.

  • Added a method in val_graph.h for dumping graphviz dot files.

  • Updated val_graph_visitor.h to support additional starting groups for traversal.


Changes walkthrough 📝

Relevant files
Enhancement
17 files
lower2device.cpp
Enable IdModel by default                                                               
+1/-1     
validation.cpp
Update IdMappingMode and remove error check                           
+2/-2     
indexing.cpp
Refactor graphviz dot file dumping                                             
+1/-4     
indexing_traversal.cpp
Handle missing IDs in producer/consumer tensors                   
+62/-25 
loop_promotion.cpp
Improve loop graph uniformity checks and input group handling
+112/-46
resize.cpp
Add new scheduling path and debug prints                                 
+95/-13 
inlining.cpp
Add method for calculating consumer positions aligned to producers
+66/-0   
loop_domain_scheduler.cpp
Support scheduling by updating loop domains only                 
+186/-11
tensor_view.cpp
Update max producer positions with optional calculator     
+8/-76   
bfs.h
Add BFS traversal methods for strict dependence                   
+58/-0   
id_model.h
Include method for retrieving TV expressions                         
+4/-0     
loop_promotion.h
Add methods for computing covered groups and handling input groups
+39/-0   
interface_nodes.h
Make updateMaxProducerPosition accept optional calculator
+1/-1     
inlining.h
Add method for calculating consumer positions aligned to producers
+5/-0     
loop_domain_scheduler.h
Support scheduling by given path                                                 
+6/-0     
val_graph.h
Add method for dumping graphviz dot files                               
+2/-0     
val_graph_visitor.h
Support additional starting groups for traversal                 
+20/-0   
Additional files
4 files
val_graph.cpp +9/-0     
val_graph_visitor.cpp +56/-20 
test_id_model.cpp +258/-72
test_rope.cpp +9/-6     

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Debug Code

The if (true || idModelOptions().buildIdModel()) condition seems to be a placeholder or debugging code. It should be reviewed to ensure it's not accidentally left in the final code.

if (true || idModelOptions().buildIdModel()) {
Code Duplication

The commented-out code block from lines 200-234 seems to be duplicated and moved to lines 206-235. This duplication should be resolved to avoid confusion and maintain clean code.

// Similarly, to_ids may not be IDs found in any of the producer and
// consumer tensors of this expr. For example, if it's an allocation
// ID, it may be a loop promotion ID.
ValGroups to_groups;
for (auto to_id : to_ids) {
  if (local_graph.hasGroup(to_id)) {
    to_groups.pushBack(local_graph.toGroup(to_id));
    continue;
  }
  // to_id is not found in the producer or consumer tensors of the
  // expr. Look for a mapped ID in the ID group of the global graph.
  bool found = false;
  const auto& global_group = graph.toGroup(to_id);
  for (const auto& vg : local_graph.disjointValSets().disjointSets()) {
    if (global_group->has(vg->front())) {
      to_groups.pushBack(vg);
      found = true;
      break;
    }
  }
  // NVF_ERROR(found, "Indexing path for resize not found: ",
  // to_id->toString());
  if (!found) {
    // If promoted, this to_id should be a loop ID. This WAR should
    // not be necessary when indexing a loop ID.
    return std::nullopt;
    ;
  }
}
Environment Variable Check

The use of getenv("NO_SKIP_BROADCAST") and getenv("NEW") without any default behavior or error handling if the environment variable is not set can lead to unexpected behavior. It's recommended to provide default values or handle cases where the environment variable is not set.

if (getenv("NO_SKIP_BROADCAST")) {
  if (registry_utils::hasNonUniqueBcast(fusion)) {
    scheduler_debug_utils::canScheduleRejectReason(
        schedulerType(),
        "Broadcasting dimension might be broadcasting to multiple sizes.");
    return false;
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants