Unit 1: Scraping the Web with Scrapy

This unit covers the basics of web scraping with a special focus on data extraction with Scrapy.

Topics

The anatomy of a Scrapy Spider
Building a simple spider
Web scraping with Scrapy & CSS

Sample Spiders

Spiders that save two pages from quotes.toscrape.com to disk:
- spider_1_quote.py: implements start_requests.
- spider_2_quotes.py: uses the start_urls attribute.
Spiders that scrape quotes.toscrape.com:
- spider_3_quotes.py: returns a list of dicts from the parse method.
- spider_4_quotes.py: generates dicts individually via yield.
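
The difference between the last two spiders is plain Python rather than anything Scrapy-specific: Scrapy iterates over whatever a callback hands back, so a list and a generator are both accepted. The sketch below shows the two styles side by side without importing Scrapy. `ROWS`, `parse_as_list` and `parse_as_generator` are hypothetical names, and the tuples stand in for what a CSS selector would match on a page; this is not the code of the sample spiders.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for the (text, author) pairs a selector would extract.
Row = Tuple[str, str]
ROWS: List[Row] = [
    ("Quote one", "Author A"),
    ("Quote two", "Author B"),
]


def parse_as_list(rows: List[Row]) -> List[Dict[str, str]]:
    """Build every item first, then return the whole list at once."""
    items = []
    for text, author in rows:
        items.append({"text": text, "author": author})
    return items


def parse_as_generator(rows: List[Row]) -> Iterator[Dict[str, str]]:
    """Hand back each item as soon as it is built."""
    for text, author in rows:
        yield {"text": text, "author": author}


list_items = parse_as_list(ROWS)        # all dicts exist in memory now
lazy_items = parse_as_generator(ROWS)   # no dict has been built yet
first = next(lazy_items)                # builds and returns only the first dict
```

Both functions produce the same items. The generator version avoids holding the full list in memory and lets the consumer start processing the first item before the rest are built.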

Hands-on

1. Books spider

Build a spider for books.toscrape.com that extracts title, rating, price, stock and category from the URLs listed in this file (it can be stored locally alongside your spider).
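
Two of these fields usually need a little cleanup after extraction. The helpers below are a minimal sketch, assuming the rating is encoded as a word in a class attribute such as `star-rating Three` and the price is a string such as `£51.77`; check both assumptions against the live page before relying on them. `rating_from_class` and `price_from_text` are hypothetical names, not part of Scrapy.

```python
import re
from typing import Optional

# Assumed encoding: the rating appears as an English word in the class attribute.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_class(class_attr: str) -> Optional[int]:
    """Return the numeric rating found in a class attribute, or None."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


def price_from_text(price_text: str) -> Optional[float]:
    """Pull the first decimal number out of a price string, ignoring the currency symbol."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None
```

Inside a spider, these would be applied to the strings returned by the selectors, for example the value of a `class` attribute and the text of the price element.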

Check out the spider once you're done.

2. Reddit spider

Build a spider that extracts title, link, username, user_url, score and time from each submission on the front page of Reddit's /r/programming and /r/python.

Check out the spider once you're done.

References

Scrapy Tutorial
Parsel (the extraction library behind Scrapy) documentation
The 30 CSS selectors you must memorize
What does the yield keyword do in Python?
