Home
From Wave, "This challenge will help us understand your core strengths as well as your thought process and allow us to streamline the interview accordingly. We are looking to see what you would consider production level code, so please reflect the practices that you believe are important to serve this purpose. Specifically, we are looking for simple, structured code with a clear separation of concerns."
An initial look at the challenge statement and the provided data highlighted some tight assumptions. I therefore opted not to test for data consistency or worry too much about the logical implementation, as it felt straightforward. Balancing over-engineering a take-home solution against demonstrating the knowledge I've gained through industry experience is always tricky, so I've decided to use this assessment to show how I think about production and scale, and to put myself in the shoes of a potential Dev @ Wave 🤞🏼 tasked with prototyping this system.
A look at Engineering @ Wave tells us that the company primarily uses Python with the Django REST framework for its APIs/backend. While I wanted to go with that option, given the active encouragement to explore and experiment with new technologies, I decided to use FastAPI (0.66.0). This lets me highlight my ability to pick up a new framework quickly while also brushing up on the Python dev ecosystem, which I've been a little out of touch with after working primarily in TS/JS codebases at my previous job 😅
FastAPI is the new(?!) kid on the block, claiming to be a high-performance, test-driven web framework for rapidly developing production-ready code by taking the best features from Flask/Django, mixing them with Pythonic async/await, and providing numerous niceties along the way.
I took some time to look into these performance claims, and though the framework falls behind in certain situations when compared to the likes of Node/Go, it still appears to be the fastest Python web framework.
Plus, given that my design and the challenge prompt rely on DB and I/O operations at scale (where this framework shines, as per the linked discussion comment), I've committed to using it for this challenge!
Since we assume Wave is still in its early stages, employee count would be the key factor in determining the number of rows in the CSV. In the sample, we have data from Feb, Nov, and Dec 2023 (unordered) for 4 employees. Over the course of these 3 months, it seems that employees do not work every single day:
| employee id | days_worked |
| --- | --- |
| 1 | 6 |
| 2 | 8 |
| 3 | 8 |
| 4 | 9 |
Therefore, we can extrapolate and make a safe assumption of 302 employees, each working an average of 7.75 days over a 3-month span, generating a CSV of roughly 2,341 rows. For an annual time-report sheet, that number becomes ~9,364 rows, which can be rounded up to 10k rows, resulting in a CSV of roughly 400 KB; that won't require any optimizations given 2021 cloud VM offerings and standards.
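A rough sketch of that back-of-envelope math is below; the ~40 bytes-per-row figure used for the size estimate is my own assumption.

```python
import math

# Back-of-envelope extrapolation from the 4-employee sample (assumed figures)
sample_days = 6 + 8 + 8 + 9                       # total days worked in the 3-month sample
avg_days = sample_days / 4                        # 7.75 days per employee per quarter

employees = 302
quarterly_rows = math.ceil(employees * avg_days)  # ~2,341 rows for 3 months
annual_rows = quarterly_rows * 4                  # ~9,364 rows for a year
rounded_rows = 10_000                             # round up for headroom

bytes_per_row = 40                                # assumed average CSV row width
size_kb = rounded_rows * bytes_per_row / 1024     # ~391 KB, i.e. roughly 400 KB
print(quarterly_rows, annual_rows, f"~{size_kb:.0f} KB")
```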
While I love the iterative approach to scalability as necessary @ Wave, I wanted to imagine a scenario a few years down the line where Wave matches the scale of Amazon (1,298,000 employees): this would net us a sizeable 10,059,500-row CSV, and that's just for 3 months' worth of data! As you can see here, 5 million records, zipped, is roughly 200 MB. This is when things start getting dicey.
Dask helps us solve numerous things in one go. Most notably, it lets us bulk process multiple CSVs in parallel under tight memory constraints. Path globbing over multiple, size-varying time reports, chunking, concatenating, processing, etc. all reduce to a one-line call, since its read_csv function handles these steps internally. That said, this is clearly overkill for this simple project, so I'm not pulling in any more packages for this assignment than necessary. A Dask dataframe wraps the ubiquitous Pandas DataFrame, so it would be a natural fit for performing the logical operations, generating the JSON response, and generally working with the data in memory, as opposed to the current solution, which offloads the computations/checks to PostgreSQL.
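As a rough illustration of what that one-liner could look like (the file pattern and column names are assumptions based on the sample data, not part of the current solution):

```python
# Hypothetical sketch: bulk-ingesting many time-report CSVs with Dask
import dask.dataframe as dd

# read_csv accepts a glob pattern and handles globbing, chunking and
# concatenation internally, keeping memory usage bounded
reports = dd.read_csv("time-reports/time-report-*.csv", parse_dates=["date"])

# Example aggregation: days worked per employee, evaluated lazily until .compute()
days_worked = reports.groupby("employee id")["date"].count().compute()
print(days_worked)
```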
- Leverage Pipenv for managing and separating dev/prod dependencies, akin to a Rust `cargo.toml` setup
- Leverage TDD, which is native to FastAPI, using PyTest (see the sketch after this list)
- Utilize an async ORM to manage DB interactions
- Leverage FastAPI auto-documentation features to adhere to OpenAPI specs with a Swagger UI to test out endpoints
- Leverage modern Python features (3.9+) such as asyncio and type declarations
- Dockerized API and DB for multiple environments
- CI/CD via GitHub Actions workflows that build, test, and deploy our Docker images
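To make the TDD point above concrete, here is a minimal, self-contained sketch of the style of endpoint and in-process test this setup enables (the endpoint name and model are illustrative, not the actual challenge API):

```python
# Minimal TDD sketch in the FastAPI style (illustrative names, not the real endpoints)
from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel

app = FastAPI()

class HealthCheck(BaseModel):
    status: str

@app.get("/ping", response_model=HealthCheck)
async def ping() -> HealthCheck:
    # The typed response model doubles as the OpenAPI/Swagger schema
    return HealthCheck(status="ok")

# PyTest-style test exercising the endpoint in-process, no running server needed
client = TestClient(app)

def test_ping():
    response = client.get("/ping")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}
```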
- https://github.com/aws-samples/csv-to-dynamodb/blob/master/CloudFormation/CSVToDynamo.template
- https://fastapi.tiangolo.com/
- https://tortoise-orm.readthedocs.io/en/latest/getting_started.html
- https://pydantic-docs.helpmanual.io/
- https://blog.miguelgrinberg.com/post/ignore-all-web-performance-benchmarks-including-this-one
- https://github.com/pyenv/pyenv/blob/master/COMMANDS.md
- https://medium.com/analytics-vidhya/optimized-ways-to-read-large-csvs-in-python-ab2b36a7914e
- https://developer.mozilla.org/en-US/docs/Learn/Server-side/Django/Deployment
- https://florimond.dev/en/posts/2019/08/introduction-to-asgi-async-python-web/
- https://stackoverflow.com/questions/67599119/fastapi-asynchronous-background-tasks-blocks-other-requests
- https://towardsdatascience.com/%EF%B8%8F-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-%EF%B8%8F-e93b485086c7
- https://stackoverflow.com/questions/62044541/change-pytest-working-directory-to-test-case-directory
- https://github.com/ahupp/python-magic
- https://towardsdatascience.com/writing-better-code-through-testing-f3150abec6ca
- https://siddharth1.medium.com/temp-files-and-tempfile-module-in-python-with-examples-570b4ee96a38