Fix tokio task leak #151

Open · wants to merge 5 commits into main
Conversation

@rodoufu rodoufu commented Feb 27, 2025

While running agent version v2.12.1, it crashed after some time with an out-of-memory error and had to be killed by the OS. The machine I was using has 48 GB of RAM.

This was happening on both mainnet and testnet.

I did not see anything particularly special about the resources it was using:
[screenshot: tokio_console_resources-pr]

But the number of tasks seems large:
[screenshot: tokio_console_threads-pr]

One can see 7,721 tasks, many of which have been idle for some time. Observing it over time, the number of tasks kept increasing until the binary was killed.
The idle tasks are created here https://github.com/pyth-network/pyth-agent/blob/main/src/agent/services/oracle.rs#L132, as can be seen in the image, where the subscriber is handling handle_price_account_update.
That line creates tokio tasks without keeping track of the JoinHandle; when tasks are created faster than they are .awaited, this leads to a leak.

After the proposed change I can see a much more comfortable number of tasks:
[screenshot: tokio-console_threads_after-pr]
The number of tasks is now stable at around 100 (114 in the attached image).
It is worth mentioning that I used 100 worker tasks to wait for the previously leaked ones to finish, so this number can be much smaller with a different configuration.

To reproduce the tokio-console output, one can follow the instructions in https://github.com/tokio-rs/console, which are simple:

  • Install tokio-console with cargo install --locked tokio-console
  • Add console-subscriber = "0.3.0" as a dependency
  • Add console_subscriber::init(); as the first line of the main function
  • Run the binary with RUSTFLAGS="--cfg tokio_unstable" cargo run --bin agent -- --config <config file path>
  • Run tokio-console to watch its data


rodoufu commented Mar 5, 2025

Everything seems to be green this time.

$ pre-commit --version && pre-commit run --all-files && echo $?
pre-commit 4.1.0

Trim Trailing Whitespace.................................................Passed
Fix End of Files.........................................................Passed
Check for added large files..............................................Passed
rustfmt..................................................................Passed
Integration Test Artifact Checksums......................................Passed

0


rodoufu commented Mar 5, 2025

I also have another PR open for this project, and I would love to hear your thoughts on it.

cc @aditya520 @ali-bahjati

@rodoufu rodoufu force-pushed the fixTokioTaskLeak branch from 208d34f to ff9eeba Compare March 5, 2025 14:10
Comment on lines +100 to +114
```rust
let receiver = Arc::new(tokio::sync::Mutex::new(receiver));
for _ in 0..number_of_workers {
    let receiver = receiver.clone();
    handles.push(tokio::spawn(async move {
        loop {
            let mut receiver = receiver.lock().await;
            if let Some(task) = receiver.recv().await {
                drop(receiver);
                if let Err(err) = task.await {
                    tracing::error!(%err, "error running price update");
                }
            }
        }
    }));
}
```
Contributor review:
This implementation basically limits the amount of tasks that can be run at the same time. I think a more straightforward way for doing this is to use a semaphore. You can create a semaphore permit before spawning a task and move it inside the task. When the task is finished, the permit will automatically be released, allowing more tasks to be spawned. It's simpler because it doesn't require an additional channel and doesn't require using join handles.

```rust
sender
    .send(tokio::spawn(async move {
        if let Err(err) =
            Oracle::handle_price_account_update(&*state, network, &pubkey, &account)
```
Contributor review:
I wonder why it's possible to accumulate a lot of long-lived tasks. Maybe we should add a timeout to Oracle::handle_price_account_update? Especially considering that now if tasks become stuck, handling price updates can halt completely.

```rust
        }
    }))
    .await
    .context("sending handle_price_account_update task to worker")?;
```
Contributor review:

I presume we were using spawn here to make sure subscriber() keeps running while we handle the updates. With this change, subscriber() will potentially wait at this .await point for previous tasks to complete. This can delay handling new updates. It can also interfere with the pubsub client, because we won't be fetching new updates while waiting for old tasks to complete. (It's probably still better than spawning an unlimited number of tasks.)
