[Request for Comment] Should destinations soft-delete or remove CDC deletions? #31242
Replies: 12 comments 13 replies
-
Some of the reasons I'm on team "soft delete"
|
Beta Was this translation helpful? Give feedback.
-
+1 for 'soft delete' There are definitely use cases for soft delete so if only one option is possible, I'd definitely vote soft delete. Otherwise, it's something of a feature gap for certain use cases where auditing would be required. For instance, I've run into cases where I could not explain a change in reports from one day to the next without record of underlying data being deleted. Another reason to only soft delete is that on dimension tables, we lose ability to "left join" historical fact records back to their dimension table(s) if those dimension records later disappear by way of hard delete. I'm also a fan of a config option for 'hard_delete` (bool) - but I understand we want to minimize the matrix of options to support. |
Beta Was this translation helpful? Give feedback.
-
Are we thinking of an only soft or only hard solution? Soft delete makes sense to me but I wonder about cases such as privacy related data deletion for example. Not having a hard delete implies that users would need to consider an additional data removal process. |
Beta Was this translation helpful? Give feedback.
-
My thinking is that going with the "hard delete" option will keep things simple: |
Beta Was this translation helpful? Give feedback.
-
On seeing others' comments, I'm going to add another take on my prior response. While I still vote for adding the 'soft delete' option, it's worth noting that this may be considered a breaking change for users who today expect hard deletes (status quo) to continue. Specifically this point from OP text above:
So, I'll just put out there that I think this is the implementation I'd advocate for:
Regarding the type of breaking change: Important to note that users may not be reliably self-identify regarding whether they would be affected by changing to soft-delete or not. A query that works today with no problem may begin to produce incorrect results - and users will have 10s or 100s of tables being queried in different ways - by the data team themselves and by their internal customers. It's also worth noting that the 'fix' to the breaking change is simply to train users - and/or to update dbt models and other SQL queries to include the clause In summary, when reconsidering this as a potentially breaking change, I would suggest we not set soft-deletes as the new default behavior. Instead, I would vote to hold off until/unless we can add the behavior as configurable, so users are opting in consciously and when they have ability to prioritize the minor code changes which may be required downstream. |
Beta Was this translation helpful? Give feedback.
-
One thing that I don't see mentioned yet - if the data is hard deleted at the source, and for any reason, you need to reset the connection in airbyte and start with an initial sync + snapshot, the deleted data is not replicated. Therefore, wouldn't it be best to delete the data by default so that the final table is consistent after a reset? Opt in soft-delete timestamps are an option, but almost all of these audit use cases are better solved by using soft-deletes at the source, which will allow the state to be maintained through a reset and initial snapshot. I think any behavior that makes the final table dependent on the state of the connection needs to be opt in. |
Beta Was this translation helpful? Give feedback.
-
I am team Soft Delete, all the way. Just echoing what others have said... I see the benefit of Hard Delete: the source and destination match. It is simple. But... I would like a record of everything that happened in my data warehouse. loaded_at, updated_at, deleted_at help us reconcile. The manager says "How is it possible that Sales went down? Shouldn't it only go up?", we can prove the source was not immutable and someone deleted a row. In practice, I think I will always filter |
Beta Was this translation helpful? Give feedback.
-
Team soft-delete here as well, but best would be being able to configure wether keeping deleted (like letting airbyte handle a second pass on rows to hard delete the deleted row), by a setting in the destination |
Beta Was this translation helpful? Give feedback.
-
+1 for soft delete (as a feature) In my experience the data analysts that I work with are very comfortable when it comes to working with more normalised data, and data that "is there". Although from an engineering point I agree with hard delete, I think that most users of the data are not always equipped to understand the CDC process, and barely touch the raw data. A Data engineer should be the one solving this for the data users, but knowing how hard it is to find data engineers, I opt for the tool doing as much work as possible to get the data ready for consumption. Lastly, I think @aaronsteers has a good point. Make both soft- and hard-delete possible with soft-delete disabled by default. |
Beta Was this translation helpful? Give feedback.
-
@evantahler. I've encountered this issue previously and while a hard delete is best for the neatness of data in the destination. Soft deletes always help us track changes over time. Another option I may suggest would be configuring Soft deletes to send data to an audit_logs kind of table. This is something I've implemented in production with a bunch of on_delete triggers. This way, when something gets deleted from source.. The removal is replicated in destination, but the old record is logged In an similar table name maybe suffixed with _delete_audit_logs alongside a timestamp of when this occurred. |
Beta Was this translation helpful? Give feedback.
-
Adding a note about making "soft delete" or "hard remove" an option: In the fullness of time, I think this is where Airbyte will go, but in the near-term, this would be a rather big change. As every source represents it's
|
Beta Was this translation helpful? Give feedback.
-
Can I ask if any definitive resolution was reached? I have the need for the soft indicator to flow through so was wondering if this was on a roadmap? |
Beta Was this translation helpful? Give feedback.
-
Hello Airbyte Community!
Now that we've launched Destinations V2 (#26028) for Snowflake and Bigquery (other destinations coming soon), we are able to more easily make adjust how data is represented in your data warehouse's final tables, based on your feedback. The first area of investigation is how we represent deletions from CDC sources (e.g. from
source-postgres
). We are conducting a survey to learn what Airbyte users prefer.Today, if a row is deleted in your source, we remove it in the Destination's final table as part of the deduplication process - the row will not be present in the final table. However, an alternative exists in which we leave the row in the final table, but have a "Tombstone" or "Soft Delete" column present that is either null (the row exists in the source) or non-null (the row has been deleted from the source), e.g.:
null
In the example above, "Evan" (user 1) exists in the source, and "Edward" (user 2) has been deleted. In most cases (depending on the CDC source), the tombstone column will be
_ab_cdc_deleted_at
, a timestamp. You would gain the additional information about when the row was deleted as well. Should you want a view of your data in the destination which more closely resembles the source (e.g. current behavior), you can filter outWHERE _ab_cdc_deleted_at IS NULL
and either make a new table or view for your downstream analysis.Of note, we currently only remove deleted rows for CDC database sources. There are many API source which also provide a deletion/tombstone column (e.g. source-salesforce) whose records will remain in your final table. Switching CDC deletes to soft-deletes would homogenize how Airbyte works for all sources. It also has the benefit of speeding up the Typing and Deduping process.
So... which do you prefer? Respond with a 👍 to change soft-deleting (leaving the row in the final tables + tombstone column) or a 👎 to keep the existing behavior and remove deleted rows from the final table.
Beta Was this translation helpful? Give feedback.
All reactions