
estuary-cdk: expand capabilities for incremental json processing #2318

Merged · 1 commit into main · Feb 3, 2025

Conversation

@williamhbaker (Member) commented on Jan 31, 2025:

Description:

A prior implementation of incremental JSON processing was able to yield records from a JSON stream by prefix, discarding everything else from the stream. That suits cases where the entire dataset is represented in a single large response, but there are cases where the dataset is represented by multiple potentially very large responses, and each of those responses contains some extra information we need to proceed, such as a paging token. Parsing and processing those responses while keeping memory usage efficient within the connector can be a challenge.

This adds the ability to read a JSON stream incrementally and yield records by prefix, while also returning anything that's left at the end. This "remainder" is built up in memory as an object while the document is processed. As long as the remainder is relatively small, memory usage should be negligible.
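The splitting mechanism described above can be sketched with the standard library alone. Everything here is a hedged illustration, not the connector's actual code: `stream_with_remainder`, `_events`, and `_Builder` are hypothetical names, and a real connector would consume events from ijson's incremental parser rather than generating them from an already-parsed document.

```python
import json


class _Builder:
    """Minimal object builder in the spirit of ijson's ObjectBuilder:
    turns a stream of parse events back into Python values."""

    def __init__(self):
        self.value = None
        self._stack = []   # containers still being filled
        self._key = None   # pending map key

    def event(self, event, value):
        if event == "map_key":
            self._key = value
        elif event == "start_map":
            self._add({})
        elif event == "start_array":
            self._add([])
        elif event in ("end_map", "end_array"):
            self._stack.pop()
        else:  # scalar event: "string", "number", "boolean", "null"
            self._add(value)

    def _add(self, value):
        if self._stack:
            top = self._stack[-1]
            if isinstance(top, list):
                top.append(value)
            else:
                top[self._key] = value
        else:
            self.value = value
        if isinstance(value, (dict, list)):
            self._stack.append(value)


def _events(node, prefix=""):
    """Produce ijson-style (prefix, event, value) tuples from a parsed
    document. In a real connector these would come straight from ijson's
    streaming parser, so the full document is never materialized."""
    if isinstance(node, dict):
        yield (prefix, "start_map", None)
        for key, child in node.items():
            yield (prefix, "map_key", key)
            yield from _events(child, f"{prefix}.{key}" if prefix else key)
        yield (prefix, "end_map", None)
    elif isinstance(node, list):
        yield (prefix, "start_array", None)
        item_prefix = f"{prefix}.item" if prefix else "item"
        for child in node:
            yield from _events(child, item_prefix)
        yield (prefix, "end_array", None)
    else:
        kind = ("null" if node is None else
                "boolean" if isinstance(node, bool) else
                "number" if isinstance(node, (int, float)) else
                "string")
        yield (prefix, kind, node)


def stream_with_remainder(events, record_prefix):
    """Yield each complete object found at record_prefix; feed every other
    event into a 'remainder' builder and return its value when the stream
    ends (retrievable from StopIteration.value)."""
    remainder = _Builder()
    record = None
    for prefix, event, value in events:
        if record is not None:
            record.event(event, value)
            if not record._stack:  # the record's containers are all closed
                yield record.value
                record = None
        elif prefix == record_prefix and event in ("start_map", "start_array"):
            record = _Builder()
            record.event(event, value)
        else:
            remainder.event(event, value)
    return remainder.value


doc = json.loads('{"results": [{"id": 1}, {"id": 2}], "meta": {"next_page": "abc"}}')
gen = stream_with_remainder(_events(doc), "results.item")
records = []
try:
    while True:
        records.append(next(gen))
except StopIteration as stop:
    leftover = stop.value  # e.g. the paging token, minus the yielded records
```

Each record is yielded as soon as its closing event arrives, so peak memory tracks the largest single record plus the small remainder (here, the `meta` object holding the paging token) rather than the whole response.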

I had assumed CPU performance for this mechanism would be worse than the standard processing of a response in its entirety through a Pydantic model, but it turns out to be about 5-10% faster. I believe this is because the JSON parser used by ijson is faster than the one used by Pydantic. We're still building Pydantic models from the Python objects yielded by the ijson parser, but doing so using model_validate instead of model_validate_json.
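The `model_validate` point can be illustrated with a small sketch, assuming Pydantic v2 and a hypothetical `Ticket` model (not a model from this codebase):

```python
from pydantic import BaseModel


class Ticket(BaseModel):
    id: int
    subject: str


# ijson hands back plain Python objects it has already parsed, so
# model_validate only has to validate them, skipping Pydantic's own
# JSON parsing step.
from_obj = Ticket.model_validate({"id": 7, "subject": "printer on fire"})

# The traditional path parses and validates in a single call.
from_json = Ticket.model_validate_json('{"id": 7, "subject": "printer on fire"}')

assert from_obj == from_json
```

Both paths produce the same validated model; the difference is only which library does the byte-level JSON parsing.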

I suspect CPU performance could be improved even more by not using Pydantic models at all and instead using msgspec structs, but this does not seem necessary right now.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@williamhbaker williamhbaker force-pushed the wb/object-stream-remainder branch from 456a5c6 to b550c8b Compare January 31, 2025 20:38
@@ -125,31 +103,20 @@ async def request_lines(

    return

- async def request_object_stream(
+ async def request_stream(
@williamhbaker (Member, Author) commented:

Having this functionality tied to http felt a little cumbersome, and I guess you could end up with a byte stream in some other way, so I separated the JSON processing from the byte stream requesting.

@williamhbaker williamhbaker force-pushed the wb/object-stream-remainder branch from b550c8b to 6e8e31a Compare January 31, 2025 20:42
) -> tuple[str | None, bool]:
# Instead of using Pydantic's model_validate_json that uses json.loads internally,
# use json.JSONDecoder().raw_decode to reduce memory overhead when processing the response.
raw_response_bytes = await http.request(log, url, params=params)
@williamhbaker (Member, Author) commented:

This would previously require buffering the entire response's bytes and plucking out a couple values from it, then making the same request again to incrementally process. Now we will only need to make the request a single time while still being able to incrementally process it.

@williamhbaker williamhbaker marked this pull request as ready for review January 31, 2025 20:45
@williamhbaker (Member, Author) commented:

@Alex-Bair I think you have a test for zendesk-native memory usage so I'd be interested to know what that looks like with this change.

@Alex-Bair (Member) replied:

> @Alex-Bair I think you have a test for zendesk-native memory usage so I'd be interested to know what that looks like with this change.

The steady-state memory usage looked about the same to me, hovering between 13% and 17%. But with this change there were no spikes up to ~34%, since the connector doesn't need to make an additional request for pagination information.

@Alex-Bair (Member) left a comment:

LGTM! Really nice job!

@williamhbaker williamhbaker merged commit 142c259 into main Feb 3, 2025
73 of 84 checks passed
@williamhbaker williamhbaker deleted the wb/object-stream-remainder branch February 3, 2025 15:56