[WIP] Scheduling propagation through broadcast IDs #3763

naoyam · 2025-01-27T16:49:20Z

No description provided.

…yclic_graph

github-actions · 2025-01-27T16:50:10Z

PR Reviewer Guide 🔍

(Review updated until commit `bf6412e`)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵

🧪 No relevant tests

⚡ Recommended focus areas for review

Possible Issue

The loop promotion analysis may not work correctly for cyclic graphs.

bool isLoopGraphUniform(const IdModel& id_model) {
  for (const auto tv : id_model.tvs()) {
    if (tv->isFusionInput()) {
      continue;
    }
    for (const auto loop_id : tv->getLoopDomain()) {
      const auto& loop_group =
          id_model.idGraph(IdMappingMode::LOOP).toGroup(loop_id);
      const auto all_exact_groups =
          id_model.idGraph(IdMappingMode::EXACT).toGroups(*loop_group);
      if (all_exact_groups.size() > 1) {
        return false;
      }
    }
  }

  return true;
}

Possible Issue

The ValGraphVisitor class may not handle cyclic graphs correctly.

//
// [i0, 1]
// merge
// [i0*1]
// map i0 and i0*1
// ValGroups: {{i0, i0*1}, {1}}
//
// Then, {i0, i0*1} and {1} would be visited first, then the merge
// expr group would be visited. {i0, i0*1} is also an output group
// of the merge but since it's already in the visited set, it would
// not be visited again.
//
// Similarly, when there are five groups as shown belwo:
//
//   i0 -> i1  ->  i2 -> i3
//          ^       |
//          |- i4 <-+
//
//  (Edges: i0->i1, i1->i2, i2->i3, i2->i4, i4->i1)
//
// is_val_ready for i1 would ignore the incoming edge from i4. The
// traversal order would look like:
//
// i0->i1, i1->i2, i2->i3, i2->i4
//
// See also IdModelTest.ValGraphStmtSort3 for a concrete
// example. See IdModelTest.LoopPromotionWithCyclicGraph for some
// use cases of this traversal for the loop promotion analysis with
// cyclic graphs.
auto is_val_ready = [&](const ValGroup& val_group) -> bool {
  if (starting_groups.has(val_group)) {
    return true;
  }
  const ExprGroups& unique_defs = graph().getDefinitions(val_group);
  return std::all_of(
      unique_defs.begin(), unique_defs.end(), [&](ExprGroup expr_group) {
        if (expr_group->empty() || visited_exprs.has(expr_group)) {
          return true;

Possible Issue

The LoopDomainScheduler class may not handle cyclic graphs correctly.

 // Create the loop domain of a given tensor as specified by the
 // reference. The new loop domain is connected to the existing IDs of
 // the tensor by replaying exprs found in the ValGraph.
 void schedule(TensorView* tv) const;

 void scheduleByUpdatingLoopDomainOnly(TensorView* tv) const;

private:
 ValGraph& graph() const {
   if (update_loop_domain_only_) {
     id_model_->maybeBuildGraph(IdMappingMode::BROADCAST);
     return id_model_->idGraph(IdMappingMode::BROADCAST);
   } else {
     return id_model_->idGraph(IdMappingMode::EXACT);
   }
 }

 ValGraphBFS::ExprPath getReplayPath(TensorView* tv) const;

 // Replay an ExprGroup with given lists of input and output
 // groups. NOte that inputs and outputs are based on a given
 // direction. If it's Backward, the given inputs are used as the
 // outputs of the expr.
 Expr* replay(
     const ExprGroup& expr_g,
     Direction dir,
     const ValGroups& input_groups,
     const ValGroups& output_groups,
     const std::unordered_map<ValGroup, IterDomain*>& group_to_id) const {
   std::vector<IterDomain*> inputs;
   std::vector<IterDomain*> outputs;
   std::transform(
       input_groups.begin(),
       input_groups.end(),
       std::back_inserter(inputs),
       [&](const ValGroup& input_g) -> IterDomain* {
         return group_to_id.at(input_g);
       });
   std::transform(
       output_groups.begin(),
       output_groups.end(),
       std::back_inserter(outputs),
       [&](const ValGroup& output_g) -> IterDomain* {
         return group_to_id.at(output_g);
       });
   Expr* replayed_expr = LoopDomainSchedulerReplayTransform::replayAs(
       dir == Direction::Forward ? inputs : outputs,
       dir == Direction::Forward ? outputs : inputs,
       expr_g->front());
   return replayed_expr;
 }

private:
 std::vector<IterDomain*> ref_loop_dom_;
 // If true, uses the current loop domain as the starting domain and
 // updates it to make it look like the given reference loop domain
 bool update_loop_domain_only_ = false;

**Background** To avoid running Hopper matmul tests on Blackwell GPUs (there are unsupported ptx), `HopperBase` was changed from `Hopper & Newer` to `Hopper Only` at [#3754](https://github.com/NVIDIA/Fuser/pull/3754/files#diff-fb98e47b0b389ef77407ef12badf08c42718a0e81dfce9ac5a6eca06feaca1c4R616-R617) . **Changes in this PR** To still keep testing `TMA` on both Hopper and Blackwell GPUs, in this PR, the parent class of `TMATest` is changed from `HopperBase` to `TMABase`. `TMABase` allows running tests on `Hopper & Newer`

Just found some unintuitive traversal with the permissive BFS. For example, when there's a graph like: ``` a -> b, c (e.g., split) ``` When traversing from just `{b}`, the normal BFS won't allow any move since the backward traversal requires both `b` and `c`. The permissive BFS, on the other hand, allows to visit `a` since it allows traversal whenever there's at least one node already visited. That's all I had in mind when I added the permissive BFS, but it turned out that it also visits `c` as well. The first move is `b, c -> a` while allowing the missing `c`, and the second move is `a -> b, c` since `a` is now visited and `c` is not yet visited. This doesn't seem to make sense since the reason `a` is visited is because we take the backward traversal of the edge. That in turn allows the forward traversal doesn't seem to be the right thing to do. While I'm not aware of any particular impact due to this behavior, this PR prevents such traversal by checking if any of Val nodes is already marked as a previous node of an Expr node.

…alc' into schedule_loop_domains_broadcast

github-actions · 2025-02-01T00:41:58Z

Description

Enabled IdModel by default in lower2device.cpp.
Updated validation.cpp to use IdMappingMode::ALMOSTEXACT and removed an unnecessary error check.
Refactored indexing.cpp to use a new method for dumping graphviz dot files.
Enhanced indexing_traversal.cpp to handle cases where IDs are not found in producer or consumer tensors.
Modified loop_promotion.cpp to improve loop graph uniformity checks and added methods for handling input groups.
Updated resize.cpp to include a new scheduling path and added debug prints.
Added new methods and logic in inlining.cpp for calculating consumer positions aligned to producers.
Enhanced loop_domain_scheduler.cpp to support scheduling by updating loop domains only.
Refactored tensor_view.cpp to update max producer positions with an optional calculator.
Added new BFS traversal methods in bfs.h for strict dependence.
Updated id_model.h to include a method for retrieving TV expressions.
Enhanced loop_promotion.h with methods for computing covered groups and handling input groups.
Updated interface_nodes.h to make updateMaxProducerPosition accept an optional calculator.
Added a new method in inlining.h for calculating consumer positions aligned to producers.
Enhanced loop_domain_scheduler.h to support scheduling by a given path.
Added a method in val_graph.h for dumping graphviz dot files.
Updated val_graph_visitor.h to support additional starting groups for traversal.

Changes walkthrough 📝

Relevant files

Enhancement

17 files

lower2device.cpp `Enable IdModel by default`	+1/-1
validation.cpp `Update IdMappingMode and remove error check`	+2/-2
indexing.cpp `Refactor graphviz dot file dumping`	+1/-4
indexing_traversal.cpp `Handle missing IDs in producer/consumer tensors`	+62/-25
loop_promotion.cpp `Improve loop graph uniformity checks and input group handling`	+112/-46
resize.cpp `Add new scheduling path and debug prints`	+95/-13
inlining.cpp `Add method for calculating consumer positions aligned to producers`	+66/-0
loop_domain_scheduler.cpp `Support scheduling by updating loop domains only`	+186/-11
tensor_view.cpp `Update max producer positions with optional calculator`	+8/-76
bfs.h `Add BFS traversal methods for strict dependence`	+58/-0
id_model.h `Include method for retrieving TV expressions`	+4/-0
loop_promotion.h `Add methods for computing covered groups and handling input groups`	+39/-0
interface_nodes.h `Make updateMaxProducerPosition accept optional calculator`	+1/-1
inlining.h `Add method for calculating consumer positions aligned to producers`	+5/-0
loop_domain_scheduler.h `Support scheduling by given path`	+6/-0
val_graph.h `Add method for dumping graphviz dot files`	+2/-0
val_graph_visitor.h `Support additional starting groups for traversal`	+20/-0

Additional files

4 files

val_graph.cpp	+9/-0
val_graph_visitor.cpp	+56/-20
test_id_model.cpp	+258/-72
test_rope.cpp	+9/-6

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Debug Code

The if (true || idModelOptions().buildIdModel()) condition seems to be a placeholder or debugging code. It should be reviewed to ensure it's not accidentally left in the final code.

if (true || idModelOptions().buildIdModel()) {

Code Duplication

The commented-out code block from lines 200-234 seems to be duplicated and moved to lines 206-235. This duplication should be resolved to avoid confusion and maintain clean code.

// Similarly, to_ids may not be IDs found in any of the producer and
// consumer tensors of this expr. For example, if it's an allocation
// ID, it may be a loop promotion ID.
ValGroups to_groups;
for (auto to_id : to_ids) {
  if (local_graph.hasGroup(to_id)) {
    to_groups.pushBack(local_graph.toGroup(to_id));
    continue;
  }
  // to_id is not found in the producer or consumer tensors of the
  // expr. Look for a mapped ID in the ID group of the global graph.
  bool found = false;
  const auto& global_group = graph.toGroup(to_id);
  for (const auto& vg : local_graph.disjointValSets().disjointSets()) {
    if (global_group->has(vg->front())) {
      to_groups.pushBack(vg);
      found = true;
      break;
    }
  }
  // NVF_ERROR(found, "Indexing path for resize not found: ",
  // to_id->toString());
  if (!found) {
    // If promoted, this to_id should be a loop ID. This WAR should
    // not be necessary when indexing a loop ID.
    return std::nullopt;
    ;
  }
}

Environment Variable Check

The use of getenv("NO_SKIP_BROADCAST") and getenv("NEW") without any default behavior or error handling if the environment variable is not set can lead to unexpected behavior. It's recommended to provide default values or handle cases where the environment variable is not set.

if (getenv("NO_SKIP_BROADCAST")) {
  if (registry_utils::hasNonUniqueBcast(fusion)) {
    scheduler_debug_utils::canScheduleRejectReason(
        schedulerType(),
        "Broadcasting dimension might be broadcasting to multiple sizes.");
    return false;
  }
}

naoyam added 29 commits January 22, 2025 23:43

setup

5f30214

extra sanity check

081e7c1

cleanup

1da0635

computeCoveredGroups

54957a2

WIP

f515c88

waring

d039e7d

cleanup

8970d10

Merge remote-tracking branch 'origin/main' into loop_promotion_with_c…

759b720

…yclic_graph

cleanup

a93cbb4

Merge remote-tracking branch 'origin/main' into loop_promotion_with_c…

ef93955

…yclic_graph

cleanup

f3acf2c

cleanup

8016604

fix; test cleanup

461c379

Simplify the shortcut

fd78e47

cleanup

922c107

cleanup

f2a045e

dont map broadcast

30742b1

refactoring

03f8ac1

fix

3711093

WIP

7b3267f

fix

dbd0635

cleanup

95244b8

enable idmodel

b71d6da

simplify

0dcdf8a

cleanup

bd04116

Merge remote-tracking branch 'origin/main' into loop_promotion_with_c…

d90ba71

…yclic_graph

cleanup

e2ca882

cleanup

e0b408e

cleanup

84273ac

liqiangxl and others added 8 commits January 27, 2025 08:54

WIP

4be8883

Create a concrete ID instead of broadcast

569f016

Move updateMaxProducerPos to MaxPosCalculator

c757253

resize indexing war extension

16e1599

test fix

4a87662

resize scheduler fix

34a57d5

naoyam force-pushed the schedule_loop_domains_broadcast branch from a8d3a41 to 34a57d5 Compare January 28, 2025 04:28

naoyam added 4 commits January 27, 2025 20:29

Merge branch 'main' into schedule_loop_domains_broadcast

bf6412e

Merge remote-tracking branch 'origin/move_update_max_pos_to_max_pos_c…

f2aec00

…alc' into schedule_loop_domains_broadcast

fix

7edac49

WIP

bfa8947

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Scheduling propagation through broadcast IDs #3763

[WIP] Scheduling propagation through broadcast IDs #3763

naoyam commented Jan 27, 2025

github-actions bot commented Jan 27, 2025 •

edited

Loading

github-actions bot commented Feb 1, 2025

[WIP] Scheduling propagation through broadcast IDs #3763

Are you sure you want to change the base?

[WIP] Scheduling propagation through broadcast IDs #3763

Conversation

naoyam commented Jan 27, 2025

github-actions bot commented Jan 27, 2025 • edited Loading

PR Reviewer Guide 🔍

(Review updated until commit bf6412e)

github-actions bot commented Feb 1, 2025

Description

Changes walkthrough 📝

PR Reviewer Guide 🔍

github-actions bot commented Jan 27, 2025 •

edited

Loading

(Review updated until commit `bf6412e`)