source-postgres: Handle NULL confirmed_flush_lsn #2436

Merged (1 commit) on Feb 21, 2025

@willdonnelly (Member) commented Feb 20, 2025

Description:

In normal operation a replication slot always has a non-null confirmed_flush_lsn, but we saw the other day that it is actually possible to observe a null value for that field if the replication slot is stuck partway through creation, waiting for a long-running transaction to complete.

One major cause of replication slot recreation is invalidation of the old slot, and one major cause of invalidation is a long-running transaction forcing excessive WAL retention, so this is actually less rare than it seems. It will happen any time a long-running transaction causes slot invalidation and the user just hits "Backfill All" without killing the transaction (assuming it didn't end on its own, of course).

Since I would really like to make these queryReplicationSlotInfo checks fatal errors in the near future, this logic needs to be bulletproof, so we need to handle that situation.
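
For illustration only, here is a minimal Go sketch of the kind of null handling described above. It is not the code in this PR: the helper name parseConfirmedFlushLSN and its *string input are hypothetical, and it assumes the column has been read as nullable text in PostgreSQL's usual hexadecimal "hi/lo" LSN form.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConfirmedFlushLSN is a hypothetical helper, not the connector's API.
// It takes the confirmed_flush_lsn column as nullable text (nil means SQL NULL)
// and reports ok=false, with no error, when the value is NULL, so the caller
// can treat "slot still being created" separately from a malformed value.
func parseConfirmedFlushLSN(raw *string) (lsn uint64, ok bool, err error) {
	if raw == nil {
		// NULL: the slot exists but has no confirmed flush position yet.
		return 0, false, nil
	}

	// PostgreSQL prints an LSN as two hexadecimal numbers separated by a
	// slash: the upper 32 bits, then the lower 32 bits.
	hi, lo, found := strings.Cut(*raw, "/")
	if !found {
		return 0, false, fmt.Errorf("malformed LSN %q: missing '/'", *raw)
	}
	hiBits, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, false, fmt.Errorf("malformed LSN %q: %w", *raw, err)
	}
	loBits, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, false, fmt.Errorf("malformed LSN %q: %w", *raw, err)
	}
	return hiBits<<32 | loBits, true, nil
}
```

Under these assumptions, a caller would treat ok == false as "the slot is not ready yet" and report that distinctly, rather than dereferencing a missing value or comparing against a zero LSN.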



@willdonnelly added the change:unplanned label (Unplanned change, useful for things like doc updates) on Feb 20, 2025
@willdonnelly requested a review from a team on February 20, 2025 23:20
@williamhbaker (Member) left a comment

LGTM

@willdonnelly merged commit b38ecc1 into main on Feb 21, 2025
52 of 56 checks passed
@willdonnelly deleted the wgd/postgres-nil-confirmedlsn-20250220 branch on February 21, 2025 18:22