
feat: stream csv downloads #428

Merged
merged 1 commit into from
Feb 13, 2024

Conversation

muhammad-ammar
Contributor

@muhammad-ammar muhammad-ammar commented Feb 7, 2024

JIRA: https://2u-internal.atlassian.net/browse/ENT-8301

Dependency: openedx/frontend-app-admin-portal#1167

Merge checklist:

  • Any new requirements are in the right place (do not manually modify the requirements/*.txt files)
    • base.in if needed in production but edx-analytics-data-api doesn't install it
    • test-master.in if edx-analytics-data-api pins it, with a matching version
    • make upgrade && make requirements have been run to regenerate requirements
  • make static has been run to update webpack bundling if any static content was updated
  • ./manage.py makemigrations has been run
    • Check out the Database Migration Confluence page for helpful tips on creating migrations.
    • Note: This must be run if you modified any models.
      • It may or may not make a migration depending on exactly what you modified, but it should still be run.
    • This should be run either from a venv with all the edx-analytics-data-api requirements installed or, if you checked out edx-enterprise-data into the src directory used by edx-analytics-data-api, through an edx-analytics-data-api shell.
      • In the shell, the command is ./manage.py makemigrations.
  • Version bumped
  • Changelog record added
  • Translations updated (see docs/internationalization.rst; this is not blocking for merge at the moment)

Post merge:

  • Tag pushed and a new version released
    • Note: Assets will be added automatically. You just need to provide a tag (matching your version number), a title, and a description.
  • After versioned build finishes in Travis, verify version has been pushed to PyPI
    • Each step in the release build has a condition flag that checks whether the rest of the steps are done and, if so, deploys to PyPI.
      (In practice, about a minute after your build finishes you should see the new version on PyPI, on refresh.)
  • PR created in edx-analytics-data-api to upgrade dependencies (including edx-enterprise-data)
    • This must be done after the version is visible on PyPI, as make upgrade in edx-analytics-data-api will look for the latest version on PyPI.
    • Note: the edx-enterprise-data constraint in edx-analytics-data-api must also be bumped to the latest version on PyPI.

return self.get_paginated_response(serializer.data)

def data_gen(queryset):
    paginator = Paginator(queryset, per_page=10000)
Contributor

small suggestion: make the 10000 value come from settings, so we can tune it up or down without deployment.
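The reviewer's suggestion can be sketched as follows. `FakeSettings` stands in for `django.conf.settings` so the snippet is self-contained; the setting name `ENROLLMENTS_PAGE_SIZE` is the one the PR eventually adopts, and reading it with `getattr` and a default keeps the old behavior when the setting is absent.

```python
# Sketch of the suggestion: read the page size from settings with a fallback,
# so it can be tuned without a deployment. FakeSettings stands in for
# django.conf.settings to keep this snippet self-contained.
class FakeSettings:
    ENROLLMENTS_PAGE_SIZE = 10000

settings = FakeSettings()

def get_page_size(settings_obj, default=10000):
    # getattr() with a default mirrors how optional Django settings are read.
    return getattr(settings_obj, 'ENROLLMENTS_PAGE_SIZE', default)

print(get_page_size(settings))        # 10000
print(get_page_size(object(), 5000))  # setting missing -> falls back to 5000
```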

Comment on lines 152 to 155
page = self.paginate_queryset(queryset)
if page is not None:
    serializer = self.get_serializer(page, many=True)
    return self.get_paginated_response(serializer.data)
Contributor

Do we want this here? Should the block of code that conditionally renders the CSV come first?

def list(self, request, *args, **kwargs):
    """
    Override the list method to handle streaming CSV download.
    """
Contributor

One idea for rollout: introduce a feature flag, where if the flag is off, this method can probably just return super().list(...). And if it's on, it can do all the new stuff you've introduced.
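The flag-based rollout the reviewer describes can be sketched in plain Python; the flag name and handler names below are illustrative, not from the PR:

```python
# Minimal sketch of a feature-flag rollout: with the flag off, fall straight
# through to the existing behavior; with it on, use the new streaming path.
def list_enrollments(flags, streaming_impl, default_impl):
    if flags.get('STREAMING_CSV_ENABLED'):
        return streaming_impl()
    # Flag off (or absent): equivalent to returning super().list(...).
    return default_impl()

result = list_enrollments({}, lambda: 'streamed', lambda: 'paginated')
print(result)  # 'paginated' -- flag absent, old behavior preserved
```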

Contributor

Also, here's an example from SO that might give you an idea for how to structure this code a little differently: https://stackoverflow.com/a/65564367 It might help you simplify this a little bit.

Contributor Author

What do you think about enabling/disabling the old/new functionality based on a query param passed from admin-portal?

Contributor

Yeah, sure, that's a good idea too.



class EnrollmentsCSVRenderer(CSVStreamingRenderer):
    header = [
Contributor

You could probably do something like

header = [field.name for field in EnterpriseLearnerEnrollment._meta.get_fields()]

serializer = self.get_serializer(enrollments, many=True)
yield serializer.data

if self.request.query_params.get('data') == 'csv':
Contributor

Why is this change necessary? Does requesting ...enrollments.csv no longer work?

Contributor

You might be able to do something like this if you want to switch behavior based on whether .csv is being requested: https://www.django-rest-framework.org/api-guide/renderers/#varying-behavior-by-media-type

Contributor Author

Why is this change necessary? Does requesting ...enrollments.csv no longer work?

Unfortunately, yes. I really want to avoid this. I will give it another look.

Contributor Author

Requests like enrollments.csv will now work.

@muhammad-ammar
Contributor Author

@iloveagent57 Thanks a lot for the feedback. This work is in progress; I wanted to get your thoughts on the overall approach. I will address all the feedback.

One question: things are clear regarding streaming CSV, but how about chunking the DB reads?

@muhammad-ammar muhammad-ammar force-pushed the ammar/streaming-csv branch 2 times, most recently from 1faf678 to 410536c Compare February 9, 2024 05:24
Contributor

@iloveagent57 iloveagent57 left a comment


The code looks good! Have you been able to test this out on a fake/large data set?

Comment on lines 153 to 160
if self.request.query_params.get('streaming_csv_enabled') == 'true':
    if request.accepted_renderer.format == 'csv':
        return StreamingHttpResponse(
            EnrollmentsCSVRenderer().render(self._stream_serialized_data()),
            content_type='text/csv'
        )

return super().list(request, *args, **kwargs)

def _stream_serialized_data(self):
    """
    Stream the serialized data.
    """
    queryset = self.filter_queryset(self.get_queryset())
    serializer = self.get_serializer_class()
    paginator = Paginator(queryset, per_page=settings.ENROLLMENTS_PAGE_SIZE)
    for page_number in paginator.page_range:
        yield from serializer(paginator.page(page_number).object_list, many=True).data
Contributor

Awesome, nice change to clean this up.
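The chunked-read pattern in `_stream_serialized_data` can be illustrated without Django: a generator fetches one page at a time and yields rows lazily, so the full result set never sits in memory at once. The list slicing below stands in for `Paginator.page(n).object_list` over a queryset; everything here is a self-contained sketch, not the PR's actual code.

```python
def stream_rows(dataset, per_page=10000):
    # Each "page" is materialized only when the consumer asks for it,
    # mirroring Paginator.page(n).object_list over a queryset.
    for start in range(0, len(dataset), per_page):
        yield from dataset[start:start + per_page]

# A StreamingHttpResponse would consume this generator row by row;
# here we just collect it to show all rows come through in order.
rows = list(stream_rows(list(range(25)), per_page=10))
print(len(rows))  # 25
```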

@muhammad-ammar
Contributor Author

The code looks good! Have you been able to test this out on a fake/large data set?

Yes, I am using the dataset for https://portal.edx.org/national-university-singapore/admin/learners for local testing.

Results when reading chunks of 10,000 records from the DB:

With streaming

  • CSV Size: 76.8 MB
  • CSV Rows: 127500
  • Average: 33 seconds

Without streaming

  • CSV Size: 76.8 MB
  • CSV Rows: 127500
  • Average: 28 seconds

@muhammad-ammar muhammad-ammar force-pushed the ammar/streaming-csv branch 6 times, most recently from b3ac8f1 to db03208 Compare February 13, 2024 11:08
@muhammad-ammar muhammad-ammar merged commit bb2fa8b into master Feb 13, 2024
7 checks passed
@muhammad-ammar muhammad-ammar deleted the ammar/streaming-csv branch February 13, 2024 11:28