Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize goal_count (callgrind ir/call : 816 -> 126) #237

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

guicho271828
Copy link

No description provided.

Copy link
Member

@FlorianPommerening FlorianPommerening left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the pull request. I'm still unclear about where the inefficiency is coming from.

  • Is it that the state is packed (I think it shouldn't be)?
  • Or is creating the FactProxy objects an issue (should be inlined)?

We often recommend people new to the planner to look into the implementation of this heuristic for an example, so it would be good if we could keep this code as simple and straight-forward as possible. I see three main changes here (unpacking the state, precomputing a list of goals, and storing goal facts as pair<int, int> rather than using FactProxies). Did you look at the performance impact of these changes individually?

src/search/heuristics/goal_count_heuristic.cc Show resolved Hide resolved
Comment on lines +32 to +33
const int& var = it.first;
const int& val = it.second;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const int& var = it.first;
const int& val = it.second;
int var = it.first;
int val = it.second;

We'd generally copy the values here instead of taking a reference. Should be equally fast as creating a reference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now that we have C++20, this should work as well:

for (auto[var, val] : goals) {
...
}

@@ -5,6 +5,7 @@

namespace goal_count_heuristic {
class GoalCountHeuristic : public Heuristic {
std::vector<std::pair<int,int>> goals;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line could use a comment explaining that the goals are stored here for performance reasons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we run an experiment to have numbers on performance? callgrind data is useful, but usually not conclusive. The VM has its own issues when comparing to actual performance on hardware, and due to the vagaries of inlining, runtime can shift from one place to another in a way that makes it hard to draw conclusions from the profile data alone.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the code for implementing BFWS which uses goalcount. The difference was significant. I also observed similar bottlenecks from ***Proxy while implementing other parts of BFWS code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured the difference in the release build on ipc instances. I lost the data though

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line could use a comment explaining that the goals are stored here for performance reasons.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could re-run the experiment on our servers but it might take me some time to get around to it. It would help if you prepare a revision of the code that is in between this one and the main branch, where you only do the precomputation/storing of the goals, still using fact proxies in the heuristic computation. This way, we can separate the impact of this and the change avoiding the fact proxies.

@guicho271828
Copy link
Author

Here is the disassembly comparison (objdump -d -S -C) of two versions from release builds.
I definitely see many call jmp je instructions in the c7d6a4d87 (old) versions around the call into proxies. (54 vs 21).

goal_count_heuristic.cc.o-c7d6a4d87.txt
goal_count_heuristic.cc.o-eb48e6815.txt
dump.tar.gz (contains the binary and the script for getting these info)

@guicho271828
Copy link
Author

g++ --version
g++ (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3)

@FlorianPommerening
Copy link
Member

I can run experiments on our grid but it will take a while as we are currently having technical issues with it. I have it on my to-do list though.

@FlorianPommerening
Copy link
Member

By the way, we have https://issues.fast-downward.org/issue997 which might tackle the same problem on a more global level. Not sure it the reduction in complexity there is enough to allow inlining here, but it's worth a try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants