Introduction to Web Scraping and Data Labelling

Web Scraping

  • Install the required dependencies
cd webscraping_and_data_labelling
pip install -r requirements.txt
  • Work through scraper.ipynb
  • Use what you learned in scraper.ipynb to complete scraper.py 😃
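
If you want to warm up before opening the notebook, here is a minimal sketch of the core scraping step: pulling link text and URLs out of an HTML page. It is not the contents of scraper.ipynb or scraper.py. It uses only the Python standard library, and the names `LinkCollector`, `extract_links`, and `SAMPLE_HTML` as well as the example.com URLs are made up for illustration. In a real scraper the HTML string would come from an HTTP response rather than a hard-coded sample.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs from every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag we are currently inside, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links such as "/topic/1" against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (text, url) tuples found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, standing in for a downloaded listing page.
SAMPLE_HTML = """
<ul>
  <li><a href="/topic/1">First thread</a></li>
  <li><a href="https://example.com/topic/2">Second <b>thread</b></a></li>
  <li><a>No link here</a></li>
</ul>
"""

if __name__ == "__main__":
    for text, url in extract_links(SAMPLE_HTML, "https://example.com"):
        print(text, url)
```

Anchors without an `href` are skipped, and nested tags such as `<b>` are flattened into the link text. Dedicated parsing libraries handle malformed HTML more gracefully, so treat this as a way to see the moving parts rather than a production scraper.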

Access the live presentation slides

Conclusion

  • By the end of this session, you should understand how datasets are curated and labelled for machine learning projects. Popular datasets created by web scraping include the Pushshift Reddit Dump and the AfriFashion Dataset.
  • Yes, dataset creation is a valid research area and contribution!

Other Tools & Projects

| S/N | Tool     | Awesome projects       |
|-----|----------|------------------------|
| 1   | Scrapy   | Nairaland              |
| 2   | Selenium | AI4D-Dataset-Challenge |
| 3   | Tweepy   | #TODO                  |

Project Ideas

  • Scrape a property listing website such as Nigeria Property Centre and create a dataset for a housing price regression problem.
  • Train a spaCy NER model on the corpus (find here); this tutorial can help. Then use the trained model to make predictions on the Pidgin corpus that you scraped. You can then use Label Studio to verify the predicted entities, as done in this tutorial.
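
For the second idea, the step between "model predicts entities" and "verify them in Label Studio" is converting predictions into a task file that Label Studio can import with pre-annotations. The sketch below shows one way to do that under some assumptions: the function name `to_label_studio_task` is hypothetical, the `from_name="label"` and `to_name="text"` values must match the names used in your own labelling config, and the sample sentence and entity spans are made up. With spaCy, the spans would typically come from `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`.

```python
import json


def to_label_studio_task(text, entities, model_version="spacy-ner-sketch",
                         from_name="label", to_name="text"):
    """Build one Label Studio task with NER pre-annotations.

    `entities` is an iterable of (start_char, end_char, label) tuples.
    `from_name` / `to_name` are assumptions: they must match the tag names
    in the labelling config of your Label Studio project.
    """
    results = []
    for start, end, label in entities:
        results.append({
            "from_name": from_name,
            "to_name": to_name,
            "type": "labels",
            "value": {
                "start": start,
                "end": end,
                "text": text[start:end],
                "labels": [label],
            },
        })
    return {
        "data": {"text": text},
        "predictions": [{"model_version": model_version, "result": results}],
    }


# Made-up example sentence and spans, standing in for real model output.
SAMPLE_TEXT = "Tunde dey go Lagos tomorrow"
SAMPLE_ENTITIES = [(0, 5, "PER"), (13, 18, "LOC")]

if __name__ == "__main__":
    tasks = [to_label_studio_task(SAMPLE_TEXT, SAMPLE_ENTITIES)]
    # Label Studio imports a JSON list of tasks; write it to a file to upload.
    print(json.dumps(tasks, indent=2, ensure_ascii=False))
```

Importing the resulting JSON file into a project shows each predicted span already highlighted, so the annotator only has to accept, fix, or delete it.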