- Clone the VC repo (https://github.com/DataGlacier/VC.git)
- Create a new branch
- Check out the newly created branch
- Run add.py and provide your name and favorite sport as input
- Run the test script using the command below:
- pytest test/test.py -s
- Ignore warnings; if there are no errors, add, commit, and push your changes to the repo
- Create a pull request and assign it to a reviewer if you are working in a group; otherwise push the changes to your own repo and submit its URL
- If the reviewer approves, merge the changes to master (optional, as this is an individual assignment)
Two weeks will be provided to complete this use case. In Week 2 you will work on the code, while in Week 3 you will work on model building (if you are planning any) and the presentation. The presentation should use the template provided in the use case and should be professional (remember that your audience is non-technical and includes the leadership team). Upload your presentation and code to GitHub or any other code repo and share the URL with us. Deliverables of Week 2 are:
- EDA notebook
- Data Intake report
- EDA recommendation and hypothesis results
No additional details were added for this assignment.
Task:
- Select any toy data (simple data).
- Save the model
- Deploy the model with Flask (web app)
- Create a PDF document (with Name, Batch code, Submission date, Submitted to) containing a snapshot of each step of the deployment
- Upload the document to GitHub
- Submit the URL of the uploaded document.
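The save-and-deploy steps above can be sketched as follows. This is a minimal illustration, not a required solution: the toy data (y = 2x + 1), the file name `model.pkl`, the route names, and the helper names `train_and_save` and `create_app` are all assumptions made for the example, and scikit-learn with pickle is only one possible way to build and save a model.

```python
import pickle
from pathlib import Path

from flask import Flask, request
from sklearn.linear_model import LinearRegression

MODEL_PATH = Path("model.pkl")  # hypothetical file name


def train_and_save(path=MODEL_PATH):
    """Fit a toy linear model (y = 2x + 1) and pickle it to disk."""
    X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    y = [1.0, 3.0, 5.0, 7.0, 9.0]
    model = LinearRegression().fit(X, y)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def create_app(path=MODEL_PATH):
    """Load the saved model and expose it through a small web form."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    app = Flask(__name__)

    @app.route("/")
    def index():
        # Minimal HTML form; a real app would normally use a template.
        return (
            '<form action="/predict" method="post">'
            '<input name="x" type="number" step="any">'
            '<button type="submit">Predict</button></form>'
        )

    @app.route("/predict", methods=["POST"])
    def predict():
        x = float(request.form["x"])
        y_hat = float(model.predict([[x]])[0])
        return f"Prediction: {y_hat:.2f}"

    return app
```

To try it locally, call `train_and_save()` once and then `create_app().run()`, which starts Flask's development server. Each of these steps (training, saving, starting the server, submitting the form) is a natural place for a snapshot in the PDF.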
Task:
- Select any toy data (simple data); you are allowed to use the data set from Week 4
- Save the model (you are allowed to use the model from Week 4)
- Deploy the model on any cloud, e.g. Heroku, AWS, GCP, Azure (the deployment should be API-based as well as a web app)
- Create a PDF document (with Name, Batch code, Submission date, Submitted to) containing a snapshot of each step of the deployment
- Upload the document and code to GitHub
- Submit the URL of the uploaded document. Use the free credits (trial or student account) of AWS, GCP, or Azure to deploy the app.
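For the API-based part of the deployment, one possible shape is a JSON endpoint next to the web form. The sketch below is an assumption-laden example: the route `/api/predict`, the `features` key, and the helpers `create_api` and `get_port` are hypothetical names, and the model can be any object with a `predict` method. Reading the port from the `PORT` environment variable follows the convention used by platforms such as Heroku; check your provider's documentation for its own setup.

```python
import os

from flask import Flask, jsonify, request


def create_api(model):
    """Wrap any object exposing predict(list_of_rows) in a JSON API."""
    app = Flask(__name__)

    @app.route("/api/predict", methods=["POST"])
    def api_predict():
        payload = request.get_json(silent=True)
        if not payload or "features" not in payload:
            # Reject requests that are not JSON or lack the expected key.
            return jsonify(error="expected a JSON body with a 'features' list"), 400
        prediction = model.predict([payload["features"]])[0]
        return jsonify(prediction=float(prediction))

    return app


def get_port(default=5000):
    """Use the PORT environment variable when the platform provides one."""
    return int(os.environ.get("PORT", default))
```

A client would then POST a body such as `{"features": [1, 2, 3]}` to `/api/predict` and receive `{"prediction": ...}` back as JSON.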
Take any CSV/text file of 2+ GB of your choice. (You can do this assignment on Google Colab.)
- Read the file (present your approach to reading the file).
- Try different methods of reading the file, e.g. Dask, Modin, Ray, pandas, and present your findings in terms of computational efficiency.
- Perform basic validation on the data columns, e.g. remove special characters and white space from the column names.
- As you already know the schema, create a YAML file and write the column names in it. Also define the separator of the read and write files in the YAML file.
- Validate the number of columns and the column names of the ingested file against the YAML file.
- Write the file as a pipe-separated (|) text file in gz format.
Create a summary of the file:
- Total number of rows
- Total number of columns
- File size
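The ingestion steps above can be sketched with pandas and PyYAML. Everything named here is an assumption for illustration: the YAML keys (`inbound_delimiter`, `outbound_delimiter`, `columns`), the example column names, and the helpers `clean_columns`, `validate_columns`, and `ingest`. The file is read in chunks so that a 2+ GB input does not have to fit in memory; the same pipeline could be timed against Dask, Modin, or Ray readers for the efficiency comparison.

```python
import gzip
import os
import re

import pandas as pd
import yaml

# Hypothetical schema; in practice this text would live in its own .yaml file.
CONFIG_YAML = """
inbound_delimiter: ","
outbound_delimiter: "|"
columns:
  - id
  - full_name
  - score
"""


def clean_columns(columns):
    """Replace runs of special characters/white space with '_' and lower-case."""
    return [
        re.sub(r"[^0-9a-zA-Z]+", "_", str(col)).strip("_").lower()
        for col in columns
    ]


def validate_columns(columns, config):
    """Return a list of problems; an empty list means the header matches."""
    expected = clean_columns(config["columns"])
    problems = []
    if len(columns) != len(expected):
        problems.append(f"expected {len(expected)} columns, found {len(columns)}")
    missing = sorted(set(expected) - set(columns))
    extra = sorted(set(columns) - set(expected))
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    return problems


def ingest(src, dst, config, chunksize=100_000):
    """Read src in chunks, validate the header, write a gzipped text file."""
    rows, columns = 0, []
    reader = pd.read_csv(src, sep=config["inbound_delimiter"], chunksize=chunksize)
    with gzip.open(dst, "wt", encoding="utf-8", newline="") as out:
        for i, chunk in enumerate(reader):
            chunk.columns = clean_columns(chunk.columns)
            if i == 0:
                columns = list(chunk.columns)
                problems = validate_columns(columns, config)
                if problems:
                    raise ValueError("; ".join(problems))
            # Write the header only once, with the first chunk.
            chunk.to_csv(out, sep=config["outbound_delimiter"],
                         index=False, header=(i == 0))
            rows += len(chunk)
    return {"rows": rows, "columns": len(columns),
            "size_bytes": os.path.getsize(src)}


config = yaml.safe_load(CONFIG_YAML)
```

Calling `ingest("input.csv", "output.txt.gz", config)` (both paths are placeholders) returns a dictionary with the row count, column count, and source file size, which covers the summary items listed above.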
- Submit a PDF document containing the following details:
- Team members' details: Group Name (give your group a name), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
- Problem description
- Business understanding
- Project lifecycle along with deadline
- Data Intake report
- GitHub repo link
- Submit a PDF document containing the following details:
- Team members' details: Group Name (give your group a name), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
- Problem description
- Data understanding
- What type of data you have for analysis
- What the problems in the data are (number of NA values, outliers, skewness, etc.)
- What approaches you are applying to your data set to overcome problems such as NA values and outliers, and why
- GitHub repo link
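A quick way to gather the data-understanding facts listed above (missing values, outliers, skewness) is a per-column profile. The sketch below assumes pandas and uses the common 1.5 x IQR rule to flag outliers; that rule and the function name `profile` are choices made for this example, and other outlier definitions may suit your data better.

```python
import pandas as pd


def profile(df):
    """Summarize dtype, missing values, IQR outliers, and skew per column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Count values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] in each numeric column.
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_outliers_iqr": outliers,  # NaN for non-numeric columns
        "skew": numeric.skew(),      # NaN for non-numeric columns
    })
```

The resulting table can be pasted into the report as evidence for the problems you describe.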
Data Cleansing and Transformation
- Submit a PDF document and an ipynb notebook containing the following details:
- Team members' details: Group Name (give your group a name), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
- Problem description
- GitHub repo link
- Data cleansing and transformation done on the data.
- Try at least two techniques to clean the data (for NA values: mean/median/mode, a model-based approach, WOE, and so on; likewise, try different techniques to identify and handle outliers)
- For NLP, try different featurization techniques and also clean the data using regex and Python
- Each member should code and review peers' work. (Review comments should be present in the GitHub repo.)
- Each team member should work on a different data cleansing approach. Note: if one team member uses the mean to impute values, another member should experiment with a segmented approach or a model-based approach to impute the null values. You are allowed to merge each individual's code and work together to get a good result. Make sure each team member's code is placed at the provided URL (a single repository for the whole team).
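To make the cleansing options above concrete, here is a small sketch contrasting a global imputation with a segmented one, plus one outlier treatment and one regex-based text cleaner. The function names, the choice of the median, the 1.5 x IQR capping rule, and the specific regex patterns are assumptions for illustration; each team member would swap in their own approach.

```python
import re

import pandas as pd


def impute_global(df, column, strategy="median"):
    """Fill NA values with one statistic computed over the whole column."""
    fill = df[column].median() if strategy == "median" else df[column].mean()
    return df[column].fillna(fill)


def impute_by_segment(df, column, segment):
    """Fill NA values with the median of the row's own segment (group)."""
    seg_median = df.groupby(segment)[column].transform("median")
    # Fall back to the global median when a whole segment is missing.
    return df[column].fillna(seg_median).fillna(df[column].median())


def cap_outliers_iqr(series, k=1.5):
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def clean_text(text):
    """Strip HTML tags, URLs, and non-letters, then collapse white space."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()
```

Comparing the outputs of `impute_global` and `impute_by_segment` on the same column is one simple way to show in the report why a segmented approach may or may not be worth the extra complexity.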
- Submit a PDF document and an EDA ipynb file containing the following details:
- Team members' details: Group Name (give your group a name), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
- Problem description
- GitHub repo link
- EDA performed on the data
- Final Recommendation
- Team members' details: Group Name (give your group a name), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
- Problem description
- GitHub repo link
- EDA presentation for business users. The last slide should be dedicated to technical users and should contain the recommended models for this data set.
- Select your base model and then explore one model from each family. If it is a classification problem: one linear model, one ensemble model, one boosting model, and other models if you have time (such as stacking).
- Please make sure the selected model fits your business requirements. For example, if your business does not want a black-box model, select only models whose predictions can be explained.
- As this is a group assignment, upload each team member's code and the other deliverables to a single repo and share the URL of that repo.
- Interns on the Data Analysis project should submit a dashboard this week. You are allowed to merge each individual's code and work together to get a good result.
- Provide the link to your code and report.
- As it was a group assignment, go for a call with your team, discuss each member's solution, and select the solution that is best and meets the requirements.
- A PowerPoint presentation is a must. You are allowed to merge each individual's code and work together to get a good result.
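For the model-exploration step (a base model plus one model per family), a comparison along the following lines may help when the problem is classification. This is a sketch on synthetic data from scikit-learn's `make_classification`; the particular estimators, the accuracy metric, and the helper name `compare_models` are assumptions for the example, and your own data and business constraints (for instance, a need for explainable models) should drive the final choice.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5):
    """Return mean cross-validated accuracy for one model per family."""
    candidates = {
        "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
        "linear (logistic regression)": LogisticRegression(max_iter=1000),
        "ensemble (random forest)": RandomForestClassifier(
            n_estimators=50, random_state=0),
        "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in for the project data set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
scores = compare_models(X, y)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The baseline row gives a floor to compare against; a model that barely beats it is hard to justify to a business audience, whatever its family.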