- Install the required dependencies: run `cd webscraping_and_data_labelling`, then `pip install -r requirements.txt`
- Go through `scraper.ipynb`
- Use your learnings from `scraper.ipynb` to complete `scraper.py` 😃
- Data Labelling with Label Studio: Link to Slide
- By the end of this, you should understand how datasets are curated and labelled for machine learning projects. Some of the most popular datasets created by web scraping include the Pushshift Reddit Dump and the AfriFashion Dataset.
- Yes, dataset creation is a valid research area and contribution!
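
To make the scraping-to-dataset idea concrete, here is a minimal sketch of the general pattern: parse listing-style HTML into rows, then write the rows as CSV. It is not the contents of `scraper.py`. The markup in `SAMPLE_HTML`, the class names `listing`, `title` and `price`, and the helpers `ListingParser`, `parse_listings` and `to_csv` are all hypothetical, and it uses only the standard library rather than Scrapy or Selenium so that it runs without extra installs.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: real sites use different tags and class names.
SAMPLE_HTML = """
<div class="listing"><span class="title">3 bedroom flat</span><span class="price">1,500,000</span></div>
<div class="listing"><span class="title">2 bedroom bungalow</span><span class="price">900,000</span></div>
"""


class ListingParser(HTMLParser):
    """Collect the text of the title and price spans for each listing div."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per listing
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "listing":
            self.rows.append({})
        elif tag == "span" and cls in ("title", "price") and self.rows:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    """Return a list of {"title": str, "price": int} dicts."""
    parser = ListingParser()
    parser.feed(html)
    return [
        {
            "title": row.get("title", ""),
            # Strip thousands separators so the price is numeric.
            "price": int(row.get("price", "0").replace(",", "")),
        }
        for row in parser.rows
    ]


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_listings(SAMPLE_HTML)))
```

For a real site you would fetch the pages first, check the site's terms and `robots.txt`, and adapt the selectors to the actual markup.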
S/N | Tool | Awesome projects |
---|---|---|
1 | Scrapy | Nairaland |
2 | Selenium | AI4D-Dataset-Challenge |
3 | Tweepy | #TODO |
- Scrape a property listing website such as Nigeria Property Centre and create a dataset for a Housing Price Regression problem
- Train a spaCy NER model on the corpus (find here); this tutorial can be of help. Then use the trained model to make predictions on the Pidgin corpus that you scraped. You can then use Label Studio to verify the predicted entities, as done in this tutorial.
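
For the verification step in the last exercise, model predictions can be imported into Label Studio as pre-annotations. The sketch below converts character-offset entities into Label Studio's task JSON. The helper `to_label_studio_task`, the sample sentence, and the labels `PER` and `LOC` are made up for illustration. The `from_name="label"` and `to_name="text"` defaults assume a labeling config with `<Labels name="label" toName="text">` and `<Text name="text" value="$text"/>`, so change them to match your own project.

```python
import json


def to_label_studio_task(text, entities, model_version="spacy-ner-sketch",
                         from_name="label", to_name="text"):
    """Build one Label Studio task dict with pre-annotated entity spans.

    `entities` is an iterable of (start_char, end_char, label) tuples.
    With spaCy these can be collected from a processed doc as
    [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].
    """
    results = []
    for start, end, label in entities:
        results.append({
            "from_name": from_name,  # assumed name of the <Labels> tag
            "to_name": to_name,      # assumed name of the <Text> tag
            "type": "labels",
            "value": {
                "start": start,
                "end": end,
                "text": text[start:end],
                "labels": [label],
            },
        })
    return {
        "data": {"text": text},
        "predictions": [{"model_version": model_version, "result": results}],
    }


if __name__ == "__main__":
    # Made-up sentence and entity spans, standing in for real model output.
    task = to_label_studio_task(
        "Buhari don land for Lagos",
        [(0, 6, "PER"), (20, 25, "LOC")],
    )
    # A JSON list of tasks is one of the formats Label Studio can import.
    print(json.dumps([task], indent=2))
```

Importing the resulting JSON file shows the predicted spans already highlighted, so annotators only need to accept or correct them.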