This unit covers the basics of web scraping with a special focus on data extraction with Scrapy.
- The anatomy of a Scrapy Spider
- Building a simple spider
- Web scraping with Scrapy & CSS
Check out the slides for this unit
- Spiders that save 2 pages from quotes.toscrape.com to disk:
  - `spider_1_quote.py`: implements `start_requests`.
  - `spider_2_quotes.py`: uses the `start_urls` attribute.
- Spiders that scrape quotes.toscrape.com:
  - `spider_3_quotes.py`: returns a list of dicts from the `parse` method.
  - `spider_4_quotes.py`: generates dicts one at a time via `yield`.
Build a spider for books.toscrape.com that extracts `title`, `rating`, `price`, `stock` and `category` from the URLs listed in this file (it can be stored locally alongside your spider).
Check out the spider once you're done.
Build a spider to extract `title`, `link`, `username`, `user_url`, `score` and `time` from each submission on the front page of Reddit's /r/programming and /r/python.