Best practices for retries and error handling #244

hypeJunction · 2024-06-12T07:08:28Z

hypeJunction
Jun 12, 2024

We are migrating some of our older code to the newer version of the ws client APIs, and while looking at the older implementation, I started wondering whether all of the boilerplate we have around retries and redundancy still makes sense. Are there any best practices for graceful error handling, and how much of it is actually handled by the client? I wonder if you could shed some light on these topics, if you have time. I read through the examples for ws and deno, but didn't see anything specific that would answer my questions. I also ran into this reply #193 (comment), which briefly touches on the subject but doesn't fully answer it.

For context, we have a web app that relies on real-time data from NATS, and we want it to always be active. The ultimate question is when we should throw in the towel and tell the user that something has gone horribly wrong.

What does a closed connection mean in practice, if it was not requested explicitly? Would intermittent network issues cause a connection to close, or would these be handled by the client retrying the connection? Would it make sense to try to reconnect, or should we assume that something is off with the server?
What implications does a closed connection have for active subscriptions? Based on the above, is there a method to reestablish the connection in order to preserve the subscriptions that were attached to it?
What does an error actually mean for an active subscription? Does an error mean that the subscription has been permanently terminated? Would running the returned unsubscribe function actually do anything in this case? Does the behavior differ between the async generator and the callback function?
And, in the case of a JetStream subscription, what implications do connections have for consumers?

Answered by aricart

aricart · 2024-06-12T13:39:50Z

aricart
Jun 12, 2024
Maintainer

@hypeJunction it really depends on the context and how you view the connection.

For a service, you would want the connection to always retry and heal itself if there's a disconnect (network outages are expected); this is how NATS connections work in general. For a web application, possibly the same idea applies, but only while a "session" is active.

Now, a closed connection means that the only way to connect again is to call the connect() function. Any API usage on a closed connection will fail with an error. Note the difference: a disconnected state simply means that there's no connection right now, but the client is actively retrying as per its connect configuration.
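
The difference between disconnected and closed can be sketched as a small state reducer for the UI. This is illustrative only: `LinkState`, `LinkEvent`, `nextLinkState`, and `shouldGiveUp` are hypothetical names, and the event strings are stand-ins for whatever the client's status notifications and closed promise report in your version.

```typescript
// UI-facing view of the connection. Only "closed" is terminal.
type LinkState = "connected" | "reconnecting" | "closed";

// Hypothetical event names (assumptions, not the client's API):
// "disconnect" and "reconnect" stand in for status notifications,
// "closed" for the moment the connection's closed promise settles.
type LinkEvent = "disconnect" | "reconnect" | "closed";

function nextLinkState(current: LinkState, event: LinkEvent): LinkState {
  // Once closed, nothing short of a new connect() call brings it back.
  if (current === "closed") return "closed";
  switch (event) {
    case "disconnect":
      return "reconnecting"; // the client keeps retrying on its own
    case "reconnect":
      return "connected";
    case "closed":
      return "closed";
  }
}

// Only a closed connection warrants telling the user that something is wrong
// (or calling connect() again); a disconnect is expected to heal.
function shouldGiveUp(state: LinkState): boolean {
  return state === "closed";
}
```

With this shape, a "reconnecting" banner can be shown on a disconnect, while the "something went wrong" screen is reserved for the closed state.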

When a client disconnects, the server removes all of that client's interest (subscriptions). Remember that core NATS delivers messages "at most once": there's no guarantee of delivery, and you must be there when a message comes. However, the client knows which subscriptions it had active, so on a reconnect it re-subscribes, and you'll resume getting messages published to those subjects (of course, you'll miss the ones that were sent while you were gone).
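
To make the at-most-once behavior concrete, here is a toy in-memory model. It is not the NATS protocol or any client API; `ToyServer` and `ToyClient` are hypothetical classes that only mimic the interest bookkeeping described above.

```typescript
type Handler = (msg: string) => void;

// Toy stand-in for the server: it only knows about current interest.
class ToyServer {
  private interest = new Map<string, Handler>();

  addInterest(subject: string, handler: Handler): void {
    this.interest.set(subject, handler);
  }

  // What happens on the server side when a client disconnects.
  dropAllInterest(): void {
    this.interest.clear();
  }

  // "At most once": with no interest, the message is simply dropped.
  publish(subject: string, msg: string): boolean {
    const handler = this.interest.get(subject);
    if (handler === undefined) return false;
    handler(msg);
    return true;
  }
}

// Toy stand-in for the client: it remembers its own subscriptions.
class ToyClient {
  private server: ToyServer;
  private subs = new Map<string, Handler>();

  constructor(server: ToyServer) {
    this.server = server;
  }

  subscribe(subject: string, handler: Handler): void {
    this.subs.set(subject, handler);
    this.server.addInterest(subject, handler);
  }

  disconnect(): void {
    this.server.dropAllInterest();
  }

  // On reconnect the client replays the subscriptions it remembers.
  reconnect(): void {
    this.subs.forEach((handler, subject) => {
      this.server.addInterest(subject, handler);
    });
  }
}
```

In this model, publishing `a`, disconnecting, publishing `b`, reconnecting, and then publishing `c` leaves the handler with `a` and `c` only; `b` is gone for good.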

For request-reply operations, all pending requests will be rejected. You were disconnected, so you are very unlikely to get a response within the reconnect window. If you have requests that take a long time to resolve, you likely need a different approach, such as staging the response and checking later for the results.
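
One option for short requests is to retry a bounded number of times and then surface the failure. A minimal sketch, assuming the request is wrapped in a function you supply; `requestWithRetry`, `attempts`, and `delayMs` are hypothetical names and not part of the client.

```typescript
// Runs `send` up to `attempts` times, waiting `delayMs` between tries.
// `send` is whatever issues the request (for example a wrapper around the
// client's request call); it is injected so the helper stays client-agnostic.
async function requestWithRetry<T>(
  send: () => Promise<T>,
  attempts: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown = new Error("attempts must be at least 1");
  for (let i = 0; i < attempts; i++) {
    try {
      return await send();
    } catch (err) {
      lastError = err; // e.g. rejected because the connection dropped
      if (i < attempts - 1) {
        // Pause before the next try, but not after the last one.
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: rethrow the most recent error to the caller.
  throw lastError;
}
```

This is only reasonable for requests that are safe to repeat, since a request may have been processed even though its reply never arrived.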

An error for a subscription is terminal. These are typically permissions errors: you can subscribe to foo today and you may or may not have permission to do so; if not, you get an error. Similarly, your process could be active for weeks, an admin decides to change your permissions, and you then lose the ability to subscribe to that subject. By the time that error is received, the server has discarded your interest (subscription) in that subject. You can try to re-subscribe, but you may keep getting errors until your permissions are fixed. When there's such an issue and you are using an iterator, the iterator will stop, and the code will resume after your for await.
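
For the iterator form, the handling described above can be sketched generically. This assumes only that the subscription is an async iterable; whether a terminal error ends the loop quietly or makes `for await` throw may depend on the client version, so the sketch covers both. `drainSubscription` and `DrainResult` are hypothetical names.

```typescript
type DrainResult =
  | { outcome: "ended" }
  | { outcome: "failed"; error: unknown };

// Processes messages until the iterable stops, then reports why.
async function drainSubscription<T>(
  messages: AsyncIterable<T>,
  handle: (msg: T) => void,
): Promise<DrainResult> {
  try {
    for await (const msg of messages) {
      handle(msg);
    }
  } catch (error) {
    // The iteration (or the handler) threw, for example on a
    // permissions violation surfaced through the iterator.
    return { outcome: "failed", error };
  }
  // Execution resumes here once the iterator stops. At this point the
  // subscription is over and a new subscribe call is needed to listen again.
  return { outcome: "ended" };
}
```

The caller can then decide whether to re-subscribe (perhaps after a delay) or to report the problem, instead of assuming the loop runs forever.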

JetStream is a different matter: here you are talking to a server that holds messages for a consumer. If your consumer exists, you can resume processing messages once you reconnect. The consume APIs will retry, and fetch/next will work when you next call them while connected.

2 replies

aricart Jun 12, 2024
Maintainer

Not sure what you are referring to as old websocket client - the lib API hasn't changed...

hypeJunction Jun 12, 2024
Author

We are replacing the deprecated JetStream subscription APIs, as they started causing some issues in our Kubernetes cluster. I can consult my colleagues to clarify the issue, but my understanding is that the deprecated JetStream methods fail to identify the correct server replica to connect to.

hypeJunction · 2024-06-12T14:44:27Z

hypeJunction
Jun 12, 2024
Author

@aricart Thank you so much for clarifying this. It is starting to make sense, and I see the beauty of this design.

One last clarification: does a JetStream consumer subscription behave similarly to a NATS subscription when it comes to errors? Is there any scenario where you would want to explicitly resubscribe?

2 replies

aricart Jun 12, 2024
Maintainer

Ah, don't use subscribe/pull etc. from the js client; they are deprecated. See the link below and use the Consumer instead: https://github.com/nats-io/nats.ws/blob/main/README.md#jetstream

aricart Jun 12, 2024
Maintainer

here too: https://github.com/nats-io/nats.deno/blob/main/jetstream.md#retrieving-the-consumer

aricart · 2024-06-12T15:20:28Z

aricart
Jun 12, 2024
Maintainer

So it is possible for there to be an error when creating a subscription for JetStream, and that error will be surfaced to you. However, in the case of JetStream, the attitude of consume() is like that of a service: if the consumer was able to consume messages, it will keep trying (there are options that let you abort on JetStream-specific conditions, such as the consumer being gone). For JetStream, always create your consumer, then get it, and then use consume/fetch/next. See the jetstream.md file in the repo for an explanation of what the different strategies for reading messages mean.
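
The create, get, then consume sequence might look roughly like the following. To keep the sketch self-contained, the interfaces are local stand-ins that mirror only the steps named above; `JetStreamLike`, `addConsumer`, `getConsumer`, and `processStream` are hypothetical names, and the real calls, types, and options are the ones documented in jetstream.md for your client version.

```typescript
// Local stand-ins (assumptions) for the parts of the JetStream API used here.
interface AckableMessage {
  ack(): void;
}

interface ConsumerLike<M extends AckableMessage> {
  consume(): Promise<AsyncIterable<M>>;
}

interface JetStreamLike<M extends AckableMessage> {
  // Stand-in for the management call that creates the consumer.
  addConsumer(stream: string, durableName: string): Promise<void>;
  // Stand-in for looking up an existing consumer by name.
  getConsumer(stream: string, durableName: string): Promise<ConsumerLike<M>>;
}

// 1. create the consumer, 2. get it, 3. read with consume().
async function processStream<M extends AckableMessage>(
  js: JetStreamLike<M>,
  stream: string,
  durableName: string,
  handle: (msg: M) => void,
): Promise<void> {
  await js.addConsumer(stream, durableName);
  const consumer = await js.getConsumer(stream, durableName);
  const messages = await consumer.consume();
  for await (const msg of messages) {
    handle(msg);
    msg.ack(); // reached only if the handler returned without throwing
  }
}
```

Because the consumer lives on the server, messages published while the client was away are still there to be read once the loop is running again, unlike the core NATS case.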

1 reply

aricart Jun 12, 2024
Maintainer

https://github.com/nats-io/nats.ws/blob/main/README.md#jetstream
