This project is a supermarket price scraper for collecting and comparing product prices from the "New World" supermarket chain in New Zealand. It includes a Python-based scraper that talks to the New World APIs backing the chain's frontend website, and a PostgreSQL database for storing the retrieved data.
- Scrapes store information and product data from the New World API.
- Stores data in a PostgreSQL database with structured tables for products, stores, prices, and categories.
- Handles authentication and token refresh for API requests.
- Supports multithreaded scraping for efficiency (see the sketch after this list).
- Logs activities and errors for debugging and monitoring.
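The multithreaded scraping is a fan-out over work items such as categories. Below is a minimal sketch of that pattern, assuming a hypothetical `fetch_category` helper and an illustrative worker count; the real scraper's function names and concurrency settings may differ.

```python
# Minimal sketch of multithreaded scraping with a thread pool.
# fetch_category is a hypothetical stand-in for the real per-category fetch.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_category(category_id):
    """Placeholder: the real scraper would call the New World API here."""
    ...

def scrape_categories(category_ids, max_workers=8):  # worker count is illustrative
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_category, c): c for c in category_ids}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                print(f"category {futures[future]} failed: {exc}")
    return results
```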
The repository contains three main files:

- `init_db.sql`: SQL file to initialize the database schema.
- `config.json`: Configuration file for database and API settings.
- `newworld_scraper.py`: The main Python script for the scraper.
- Python 3.7+
- PostgreSQL
Install the required Python packages using:
```bash
pip install -r requirements.txt
```
Here are the main libraries used:
- `requests` (for making API requests)
- `psycopg2` (for database connections)
- `tenacity` (for retrying API requests; see the sketch after this list)
- `tqdm` (for progress bars, optional)
- `colorlog` (for colored logs, optional)
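Of these, `tenacity` drives the retry-with-exponential-backoff behaviour described under error handling below. A minimal sketch of the pattern, with a placeholder URL and illustrative retry limits (not the scraper's actual settings):

```python
# Sketch: retrying a JSON GET with exponential backoff via tenacity.
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, min=1, max=30),
       stop=stop_after_attempt(5))
def get_json(url, headers=None):
    # raise_for_status() turns HTTP errors into exceptions tenacity can retry.
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()
```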
Update the `config.json` file with your database and API details:
```json
{
  "db": {
    "host": "127.0.0.1",
    "port": 5432,
    "dbname": "price_comparison",
    "user": "postgres",
    "password": "admin"
  },
  "newWorld": {
    "chainName": "New World",
    "baseUrl": "https://www.newworld.co.nz",
    "apiUrl": "https://api-prod.newworld.co.nz"
  }
}
```
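For illustration, the scraper's startup can be pictured as reading this file and opening a database connection. A sketch, assuming only the config keys shown above; the function name and error handling are illustrative:

```python
# Sketch: load config.json and connect to PostgreSQL with psycopg2.
import json
import psycopg2

def load_config(path="config.json"):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

config = load_config()
db = config["db"]
conn = psycopg2.connect(
    host=db["host"],
    port=db["port"],
    dbname=db["dbname"],
    user=db["user"],
    password=db["password"],
)
```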
- **Initialize the Database**

  Run the `init_db.sql` file to set up the database schema, using the `psql` command-line tool or a database GUI:

  ```bash
  psql -U postgres -d price_comparison -f init_db.sql
  ```

- **Run the Scraper**

  Execute the scraper script:

  ```bash
  python newworld_scraper.py
  ```

- **Logs**

  Logs are saved to `scraper.log`. Console output displays `WARNING` and higher levels; full logs (including `DEBUG`) are stored in the log file. A sketch of this logging setup follows this list.
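Here is a sketch of a logging setup matching that behaviour: full `DEBUG` output to `scraper.log`, `WARNING` and above on the console, with `colorlog` used only if it is installed. The logger name and formats are illustrative.

```python
# Sketch: file handler captures everything, console shows WARNING+.
import logging

logger = logging.getLogger("newworld_scraper")  # name is illustrative
logger.setLevel(logging.DEBUG)

# Full logs (including DEBUG) go to scraper.log.
file_handler = logging.FileHandler("scraper.log")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(file_handler)

# The console only displays WARNING and higher.
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)
try:
    import colorlog  # optional dependency
    console_handler.setFormatter(
        colorlog.ColoredFormatter("%(log_color)s%(levelname)s:%(reset)s %(message)s"))
except ImportError:
    pass  # plain console output without colors
logger.addHandler(console_handler)
```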
At a high level, the scraper proceeds as follows:

- **Load Configuration**: Reads API and database configuration from `config.json`.
- **Initialize Database**: Ensures the required tables are created.
- **Authenticate**: Retrieves an API token for authorized requests.
- **Scrape Stores and Products**:
  - Fetches store details.
  - Iterates through categories to scrape product data.
- **Store Data**: Saves product, pricing, and category information into the PostgreSQL database (see the sketch after this list).
- **Error Handling**: Retries transient errors using exponential backoff.
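The Store Data step maps naturally onto parameterized inserts. A sketch, assuming hypothetical `products` and `prices` tables with column names chosen for illustration; the authoritative schema lives in `init_db.sql`:

```python
# Sketch: upsert a product, then record a price observation for a store.
# Table and column names are assumptions, not the actual schema.
def save_product_price(conn, store_id, product_id, name, category_id, price):
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO products (product_id, name, category_id)
            VALUES (%s, %s, %s)
            ON CONFLICT (product_id) DO UPDATE SET name = EXCLUDED.name
            """,
            (product_id, name, category_id),
        )
        cur.execute(
            "INSERT INTO prices (store_id, product_id, price, scraped_at) "
            "VALUES (%s, %s, %s, NOW())",
            (store_id, product_id, price),
        )
    conn.commit()
```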
Feel free to contribute by submitting a pull request or reporting issues. Ensure your code follows Python best practices and is well-documented.