
feat: stream csv downloads #428

Merged
merged 1 commit into from
Feb 13, 2024

Conversation

muhammad-ammar
Contributor

@muhammad-ammar muhammad-ammar commented Feb 7, 2024

JIRA: https://2u-internal.atlassian.net/browse/ENT-8301

Dependency: openedx/frontend-app-admin-portal#1167

Merge checklist:

  • Any new requirements are in the right place (do not manually modify the requirements/*.txt files)
    • base.in if needed in production but edx-analytics-data-api doesn't install it
    • test-master.in if edx-analytics-data-api pins it, with a matching version
    • make upgrade && make requirements have been run to regenerate requirements
  • make static has been run to update webpack bundling if any static content was updated
  • ./manage.py makemigrations has been run
    • Check out the Database Migration Confluence page for helpful tips on creating migrations.
    • Note: This must be run if you modified any models.
      • It may or may not make a migration depending on exactly what you modified, but it should still be run.
    • This should be run either from a venv with all the edx-analytics-data-api requirements installed or, if you checked out edx-enterprise-data into the src directory used by edx-analytics-data-api, through an edx-analytics-data-api shell.
      • In the shell, the command is ./manage.py makemigrations.
  • Version bumped
  • Changelog record added
  • Translations updated (see docs/internationalization.rst; this is not blocking for merge at the moment)

Post merge:

  • Tag pushed and a new version released
    • Note: Assets will be added automatically. You just need to provide a tag (matching your version number), a title, and a description.
  • After versioned build finishes in Travis, verify version has been pushed to PyPI
    • Each step in the release build has a condition flag that checks whether the rest of the steps are done and, if so, deploys to PyPI.
      (In practice, about a minute after your build finishes you should see the new version on PyPI, on refresh.)
  • PR created in edx-analytics-data-api to upgrade dependencies (including edx-enterprise-data)
    • This must be done after the version is visible on PyPI, as make upgrade in edx-analytics-data-api will look for the latest version on PyPI.
    • Note: the edx-enterprise-data constraint in edx-analytics-data-api must also be bumped to the latest version on PyPI.

return self.get_paginated_response(serializer.data)

def data_gen(queryset):
    paginator = Paginator(queryset, per_page=10000)
Contributor

small suggestion: make the 10000 value come from settings, so we can tune it up or down without deployment.
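The reviewer's suggestion can be sketched as follows. `FakeSettings` stands in for `django.conf.settings` so the snippet is self-contained; the setting name `ENROLLMENTS_PAGE_SIZE` is the one the PR eventually adopts, and reading it with `getattr` and a default keeps the old behavior when the setting is absent.

```python
# Sketch of the suggestion: read the page size from settings with a fallback,
# so it can be tuned without a deployment. FakeSettings stands in for
# django.conf.settings to keep this snippet self-contained.
class FakeSettings:
    ENROLLMENTS_PAGE_SIZE = 10000

settings = FakeSettings()

def get_page_size(settings_obj, default=10000):
    # getattr() with a default mirrors how optional Django settings are read.
    return getattr(settings_obj, 'ENROLLMENTS_PAGE_SIZE', default)

print(get_page_size(settings))        # 10000
print(get_page_size(object(), 5000))  # setting missing -> falls back to 5000
```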

Comment on lines 152 to 155
page = self.paginate_queryset(queryset)
if page is not None:
    serializer = self.get_serializer(page, many=True)
    return self.get_paginated_response(serializer.data)
Contributor

Do we want this here? Should the block of code that conditionally renders the CSV come first?

def list(self, request, *args, **kwargs):
    """
    Override the list method to handle streaming CSV download.
    """
Contributor

One idea for rollout: introduce a feature flag, where if the flag is off, this method can probably just return super().list(...). And if it's on, it can do all the new stuff you've introduced.
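The flag-based rollout the reviewer describes can be sketched in plain Python; the flag name and handler names below are illustrative, not from the PR:

```python
# Minimal sketch of a feature-flag rollout: with the flag off, fall straight
# through to the existing behavior; with it on, use the new streaming path.
def list_enrollments(flags, streaming_impl, default_impl):
    if flags.get('STREAMING_CSV_ENABLED'):
        return streaming_impl()
    # Flag off (or absent): equivalent to returning super().list(...).
    return default_impl()

result = list_enrollments({}, lambda: 'streamed', lambda: 'paginated')
print(result)  # 'paginated' -- flag absent, old behavior preserved
```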

Contributor

Also, here's an example from SO that might give you an idea for how to structure this code a little differently: https://stackoverflow.com/a/65564367 It might help you simplify this a little bit.

Contributor Author

What do you think about enabling/disabling the old/new functionality based on a query param passed from admin-portal?

Contributor

Yeah, sure, that's a good idea too.



class EnrollmentsCSVRenderer(CSVStreamingRenderer):
    header = [
Contributor

You could probably do something like

header = [field.name for field in EnterpriseLearnerEnrollment._meta.get_fields()]

serializer = self.get_serializer(enrollments, many=True)
yield serializer.data

if self.request.query_params.get('data') == 'csv':
Contributor

Why is this change necessary? Does requesting ...enrollments.csv no longer work?

Contributor

You might be able to do something like this if you want to switch behavior based on whether .csv is being requested: https://www.django-rest-framework.org/api-guide/renderers/#varying-behavior-by-media-type

Contributor Author

Why is this change necessary? Does requesting ...enrollments.csv no longer work?

Unfortunately, yes. I really want to avoid this. I will give it another look.

Contributor Author

Requests like enrollments.csv will now work.

@muhammad-ammar
Contributor Author

@iloveagent57 Thanks a lot for the feedback. This work is in progress; I wanted to get your thoughts on the overall approach. I will address all the feedback.

One question: things are clear regarding streaming CSV, but how about chunking the DB reads?

@muhammad-ammar muhammad-ammar force-pushed the ammar/streaming-csv branch 2 times, most recently from 1faf678 to 410536c Compare February 9, 2024 05:24
Contributor

@iloveagent57 iloveagent57 left a comment


The code looks good! Have you been able to test this out on a fake/large data set?

Comment on lines 153 to 160
if self.request.query_params.get('streaming_csv_enabled') == 'true':
    if request.accepted_renderer.format == 'csv':
        return StreamingHttpResponse(
            EnrollmentsCSVRenderer().render(self._stream_serialized_data()),
            content_type='text/csv'
        )

return super().list(request, *args, **kwargs)

def _stream_serialized_data(self):
    """
    Stream the serialized data.
    """
    queryset = self.filter_queryset(self.get_queryset())
    serializer = self.get_serializer_class()
    paginator = Paginator(queryset, per_page=settings.ENROLLMENTS_PAGE_SIZE)
    for page_number in paginator.page_range:
        yield from serializer(paginator.page(page_number).object_list, many=True).data
Contributor

Awesome, nice change to clean this up.
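The chunked-read pattern in `_stream_serialized_data` can be illustrated without Django: a generator fetches one page at a time and yields rows lazily, so the full result set never sits in memory at once. The list slicing below stands in for `Paginator.page(n).object_list` over a queryset; everything here is a self-contained sketch, not the PR's actual code.

```python
def stream_rows(dataset, per_page=10000):
    # Each "page" is materialized only when the consumer asks for it,
    # mirroring Paginator.page(n).object_list over a queryset.
    for start in range(0, len(dataset), per_page):
        yield from dataset[start:start + per_page]

# A StreamingHttpResponse would consume this generator row by row;
# here we just collect it to show all rows come through in order.
rows = list(stream_rows(list(range(25)), per_page=10))
print(len(rows))  # 25
```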

@muhammad-ammar
Contributor Author

The code looks good! Have you been able to test this out on a fake/large data set?

Yes, I am using the dataset for https://portal.edx.org/national-university-singapore/admin/learners for local testing.

Results when reading chunks of 10,000 records from the DB:

With streaming

  • CSV Size: 76.8 MB
  • CSV Rows: 127500
  • Average: 33 seconds

Without streaming

  • CSV Size: 76.8 MB
  • CSV Rows: 127500
  • Average: 28 seconds

@muhammad-ammar muhammad-ammar force-pushed the ammar/streaming-csv branch 6 times, most recently from b3ac8f1 to db03208 Compare February 13, 2024 11:08
@muhammad-ammar muhammad-ammar merged commit bb2fa8b into master Feb 13, 2024
7 checks passed
@muhammad-ammar muhammad-ammar deleted the ammar/streaming-csv branch February 13, 2024 11:28