Data_Glacier_Intership_2023

Week 1: Version Control

Clone the VC repo (https://github.com/DataGlacier/VC.git).
Create a new branch.
Check out the newly created branch.
Run add.py and provide your name and favorite sport as input.
Run the test script using the command below:
pytest test/test.py -s
Ignore any warnings. If there are no errors, add, commit, and push your changes to the repo.
Create a pull request and assign it to a reviewer if you are working in a group; otherwise, push the changes to your own repo and submit its URL.
If the reviewer approves, merge the changes to master (optional, as this is an individual assignment).
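
The steps above can be sketched as a small Python helper that lists the commands in order and runs them with subprocess. This is only an illustration: the function names, the branch name, and the commit message are hypothetical, and it assumes add.py sits at the root of the cloned VC directory, which the task description does not state.

```python
import subprocess

REPO_URL = "https://github.com/DataGlacier/VC.git"  # URL from the task description


def workflow_commands(branch, message):
    """Return the Week 1 steps as a list of argv lists, in order."""
    return [
        ["git", "clone", REPO_URL],
        ["git", "checkout", "-b", branch],  # create and check out the branch in one step
        ["python", "add.py"],               # assumed location; prompts for name and sport
        ["pytest", "test/test.py", "-s"],
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push", "-u", "origin", branch],
    ]


def run_workflow(branch, message, repo_dir="VC"):
    """Run each step; check=True stops at the first command that fails."""
    commands = workflow_commands(branch, message)
    subprocess.run(commands[0], check=True)            # clone creates ./VC
    for cmd in commands[1:]:
        subprocess.run(cmd, check=True, cwd=repo_dir)  # remaining steps run inside the clone
```

Calling run_workflow("week1-demo", "Add name and sport") would execute the whole sequence. The pull request itself is still created on GitHub.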

Week 2: G2M insight for Cab Investment firm

Two weeks are provided to complete this use case. In Week 2 you will work on the code, while in Week 3 you will work on model building (if you are planning any) and on the presentation. Use the presentation template provided in the use case and keep it professional (remember that your audience is the non-technical leadership team). Upload your presentation and code to GitHub or any other code repo and share the URL with us. Deliverables of Week 2 are:

EDA notebook
Data intake report
EDA recommendations and hypothesis results

Week 3: Presentation of Week 2 Use Case

No additional details were added for this assignment.

Week 4: Deployment on Flask

Task:

Select any toy data (simple data).
Save the model.
Deploy the model on Flask (web app).
Create a PDF document (with Name, Batch code, Submission date, Submitted to) containing a snapshot of each step of deployment.
Upload the document to GitHub.
Submit the URL of the uploaded document.
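
A minimal sketch of the save-and-serve steps, assuming a hand-rolled linear fit as the toy model so that no ML library is needed. The file location, the /predict route, and the JSON field names are hypothetical choices for illustration, not part of the assignment.

```python
import os
import pickle
import tempfile

from flask import Flask, jsonify, request

MODEL_PATH = os.path.join(tempfile.gettempdir(), "toy_model.pkl")  # hypothetical location


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on toy data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# Train on toy data and save the model.
with open(MODEL_PATH, "wb") as f:
    pickle.dump(fit_line([1, 2, 3, 4], [2, 4, 6, 8]), f)

# Load the saved model once, when the app starts.
with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Expects JSON such as {"x": 5} and returns {"prediction": ...}."""
    payload = request.get_json(silent=True) or {}
    if "x" not in payload:
        return jsonify(error="missing field 'x'"), 400
    x = float(payload["x"])
    return jsonify(prediction=model["slope"] * x + model["intercept"])
```

Saved as app.py, this can be started with Flask's development server (for example `flask --app app run` in recent Flask versions) and queried by POSTing JSON to /predict.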

Week 5: Cloud and API deployment

Task:

Select any toy data (simple data). You are allowed to use the data set from Week 4.
Save the model. You are allowed to use the model from Week 4.
Deploy the model on any cloud, e.g. Heroku, AWS, GCP, or Azure (the deployment should be API-based as well as a web app).
Create a PDF document (with Name, Batch code, Submission date, Submitted to) containing a snapshot of each step of deployment.
Upload the document and code to GitHub.
Submit the URL of the uploaded document. Use the free credits (trial or student account) of AWS, GCP, or Azure to deploy the app.

Week 6: File ingestion and schema validation

Take any CSV/text file of 2+ GB of your choice. (You can do this assignment on Google Colab.)
Read the file and present your approach to reading it.
Try different methods of reading the file, e.g. Dask, Modin, Ray, and pandas, and present your findings in terms of computational efficiency.
Perform basic validation on the data columns, e.g. remove special characters and white space from the column names.
As you already know the schema, create a YAML file and write the column names in it. Also define the separators of the read and write files in the YAML file.
Validate the number of columns and the column names of the ingested file against the YAML file.
Write the file as a pipe-separated (|) text file in gz format.
Create a summary of the file:

Total number of rows
Total number of columns
File size
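
The validation, write, and summary steps can be sketched with the standard library alone. This is a plain-Python baseline under stated assumptions, not a Dask, Modin, or Ray comparison: the function names are hypothetical, and the expected column list is passed in directly, whereas in the assignment it would be loaded from the YAML file (for example with PyYAML's `yaml.safe_load`).

```python
import csv
import gzip
import os
import re


def clean_column_name(name):
    """Trim white space, collapse special characters to '_', and lowercase."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def ingest(in_path, out_path, expected_columns, in_sep=",", out_sep="|"):
    """Validate the header against the schema, then stream the rows into a
    gzip-compressed, pipe-separated text file. Returns a summary dict."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src, delimiter=in_sep)
        header = [clean_column_name(c) for c in next(reader)]
        if len(header) != len(expected_columns):
            raise ValueError(
                f"expected {len(expected_columns)} columns, got {len(header)}"
            )
        if header != list(expected_columns):
            raise ValueError(f"column names differ from schema: {header}")
        rows = 0
        with gzip.open(out_path, "wt", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst, delimiter=out_sep)
            writer.writerow(header)
            for row in reader:  # one row at a time, so memory use stays flat
                writer.writerow(row)
                rows += 1
    return {
        "rows": rows,
        "columns": len(header),
        "size_bytes": os.path.getsize(in_path),
    }
```

Streaming row by row keeps memory use independent of file size, which matters for a 2+ GB input. The libraries named in the task are the place to look for faster, parallel reads.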

Week 7: Deliverables (Final Project)

Submit a PDF document containing the following details:
Team members' details: Group Name (give a name to your group), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
Problem description
Business understanding
Project lifecycle along with deadlines
Data intake report
GitHub repo link

Week 8: Deliverables (Final Project)

Submit a PDF document containing the following details:
Team members' details: Group Name (give a name to your group), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
Problem description
Data understanding
What type of data you have for analysis
What problems the data has (number of NA values, outliers, skew, etc.)
What approaches you are applying to your data set to overcome problems such as NA values and outliers, and why
GitHub repo link
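
As a rough illustration of the data-understanding checks listed above, the sketch below counts missing values and computes sample skewness for one numeric column using only the standard library. The function name is hypothetical, and treating both None and NaN as missing is an assumption about how NA values appear in your data.

```python
import math


def profile_column(values):
    """Return the NA count and sample skewness (Fisher-Pearson g1) of a column.

    None and NaN are both treated as missing.
    """
    observed = [v for v in values if v is not None and not math.isnan(v)]
    n = len(observed)
    if n == 0:
        raise ValueError("column has no observed values")
    mean = sum(observed) / n
    m2 = sum((v - mean) ** 2 for v in observed) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in observed) / n  # third central moment
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {"na_count": len(values) - n, "skewness": skew}
```

A skewness near zero suggests a roughly symmetric column, while a clearly positive value points to a long right tail.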

Week 9: Deliverables (Final Project)

Data Cleansing and Transformation

Submit a PDF document and an ipynb notebook containing the following details:
Team members' details: Group Name (give a name to your group), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
Problem description
GitHub repo link
Data cleansing and transformation done on the data
Try at least two techniques to clean the data (for NA values: mean/median/mode imputation, a model-based approach, or WOE; similarly, try different techniques to identify and handle outliers).
For NLP, try different featurization techniques and clean the data using regex and Python.
Each member should code and review their peers' work. (Review comments should be present in the GitHub repo.)
Each team member should work on a different data cleansing approach. Note: if one team member uses the mean to impute values, another member should experiment with a segmented approach or a model-based approach to impute the null values. You are allowed to merge each individual's code and work together to get a good result. Make sure the code of each team member is placed at the provided URL (a single repository for the whole team).
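
Two of the techniques mentioned above, simple imputation and outlier detection, can be sketched with the standard library. The function names are hypothetical, missing values are assumed to be represented as None, and the 1.5 x IQR rule is one common outlier heuristic swapped in for illustration, not a method the assignment prescribes.

```python
import statistics


def impute(values, strategy="mean"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return observed values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in observed if v < low or v > high]


column = [10.0, None, 11.0, 12.0, 10.0, None, 13.0, 12.0, 11.0, 100.0]
print(impute(column, "mean")[1])    # pulled upward by the extreme value
print(impute(column, "median")[1])  # robust to the extreme value
print(iqr_outliers(column))
```

Comparing the mean and median fills on the same column shows why team members are asked to try different approaches: a single extreme value shifts the mean-based fill far more than the median-based one.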

Week 10: Deliverables (Final Project)

Submit a PDF document and an EDA ipynb file containing the following details:
Team members' details: Group Name (give a name to your group), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
Problem description
GitHub repo link
EDA performed on the data
Final recommendation

Week 11: EDA Presentation and proposed modeling technique (Final Project)

Team members' details: Group Name (give a name to your group), Name, Email, Country, College/Company, Specialization (Data Science, NLP, Data Analyst)
Problem description
GitHub repo link
EDA presentation for business users. The last slide should be dedicated to technical users and should contain the recommended models for this data set.

Week 12: Model Selection and Model Building/Dashboard (Final Project)

Select your base model and then explore one model from each family. If it is a classification problem, use one linear model, one ensemble model, one boosting model, and other models if you have time (such as stacking).
Please make sure the selected model fits your business requirements. For example, if your business does not want a black-box model, select only models whose predictions can be explained.
As this is a group assignment, upload the code of each team member and the other deliverables to a single repo and share the URL of that repo.
Interns on the Data Analysis project should submit a dashboard this week. You are allowed to merge each individual's code and work together to get a good result.
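
For a classification problem, the "one model per family" comparison might look like the sketch below, assuming scikit-learn is available. The synthetic data set, the three specific estimators, and the use of 5-fold cross-validated accuracy are illustrative assumptions; your own data, metric, and business constraints should drive the actual choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One candidate per model family named in the task.
candidates = {
    "linear (logistic regression)": LogisticRegression(max_iter=1000),
    "ensemble (random forest)": RandomForestClassifier(n_estimators=50, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean CV accuracy {score:.3f}")
```

If explainability is a requirement, the linear model is usually the easiest of the three to justify to business users, even when a tree-based model scores slightly higher.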

Week 13: Final Project Report and Code

Provide the link to your code and report.
As this was a group assignment, go for a call with your team, discuss each member's solution, and select the solution that is best and meets the requirements.
A PowerPoint presentation is a must. You are allowed to merge each individual's code and work together to get a good result.

eceyy/Data_Glacier_Intership_2023
