A CamelCamelCamel-like application that can be configured to support any (?) website. I wrote this application to play with distributed systems. The application is composed of different processes:
- The Rails server
- Crono handling background jobs
- PostgreSQL to store data
- Rendertron rendering pages for scraping
- Elasticsearch to store historic data (so you can visualize it using Kibana)
- An SMTP server to send notifications when a price goes above or below a given threshold
Right now the application is not designed to scale! The idea is to evolve the current design by introducing multiple asynchronous background jobs that handle scraping more efficiently.
The application can be configured using the following environment variables:
- `BARGAIN_DB_NAME`: name of the database
- `BARGAIN_DB_USER`: database user
- `BARGAIN_DB_PASSWORD`: database password
- `BARGAIN_DB_HOST`: database host (default `localhost`)
- `BARGAIN_DB_PORT`: database port (default `5432`)
- `RENDERTRON_URL`: URL of Rendertron (default `http://localhost:8080`)
- `ELASTICSEARCH_URL`: URL of Elasticsearch (default `http://localhost:9200`)
- `ELASTICSEARCH_INDEX`: Elasticsearch index (default `prices-%Y%m`). It supports `strftime` syntax to insert time-dependent values.
- `ELASTICSEARCH_DOCTYPE`: Elasticsearch document type (default `price`)
- `SMTP_HOST`: SMTP host (default `localhost`)
- `SMTP_PORT`: SMTP port (default `25`)
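For local development it can be convenient to export the variables in the shell before starting the server; the values below are placeholders, not defaults shipped with the project:

```sh
# Hypothetical local configuration: adjust every value to your environment
export BARGAIN_DB_NAME=bargain
export BARGAIN_DB_USER=bargain
export BARGAIN_DB_PASSWORD=secret
export BARGAIN_DB_HOST=localhost
export BARGAIN_DB_PORT=5432
export RENDERTRON_URL=http://localhost:8080
export ELASTICSEARCH_URL=http://localhost:9200
export SMTP_HOST=localhost
export SMTP_PORT=25
```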
APIs expose CRUD operations using standard HTTP verbs.
Endpoint: `http://<host>:<port>/scrapers`
It supports standard CRUD operations plus an endpoint to test a scraper:
`http://<host>:<port>/scrapers/<id>/test?url=<item_url>`
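For example, assuming a scraper with id 1 already exists, a test call could look like this (the id and product URL are placeholders):

```sh
# Hypothetical request: replace the scraper id and the url parameter with real values
curl 'http://localhost:3000/scrapers/1/test?url=https://www.amazon.it/dp/1492034029'
```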
Scraper structure:
```json
{
  "name": "Amazon",
  "hosts": [
    { "host": "www.amazon.it" },
    { "host": "www.amazon.com" }
  ],
  "rules": [
    { "rule_type": "css", "rule_args": "#priceblock_ourprice" },
    { "rule_type": "text" },
    { "rule_type": "sub", "rule_args": "/EUR\\s+([0-9,]+)/\\1/" },
    { "rule_type": "sub", "rule_args": "/,/./" }
  ]
}
```
`hosts` are the hosts to which the scraper applies. `rules` are the operations applied to the resource retrieved from a URL in order to extract the price (an example of how they work together is sketched below). Supported rule types are:
- `css`: retrieves a DOM node using a CSS selector
- `xpath`: retrieves a DOM node using an XPath expression
- `text`: extracts the text of the node
- `attr`: extracts an attribute of the node
- `sub`: substitutes a pattern
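Going by the sample Amazon scraper above, the rules act as a pipeline; the intermediate HTML and values in this trace are made up for illustration:

```sh
# Hypothetical walkthrough of the sample Amazon rules (intermediate values are illustrative):
#   css  "#priceblock_ourprice"   -> <span id="priceblock_ourprice">EUR 12,99</span>
#   text                          -> "EUR 12,99"
#   sub  "/EUR\s+([0-9,]+)/\1/"   -> "12,99"
#   sub  "/,/./"                  -> "12.99"
```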
Sample request:
```sh
curl -X POST -H 'Content-type:application/json' -d @amazon.json localhost:3000/scrapers
```
The items to watch.
Endpoint: `http://<host>:<port>/items`
It supports basic CRUD operations plus an endpoint to retrieve the item price:
`http://<host>:<port>/items/<id>/price`
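For example, assuming an item with id 1 has been created, its current price could be fetched with (the id is a placeholder):

```sh
# Hypothetical request: replace the item id with a real one
curl http://localhost:3000/items/1/price
```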
Item structure:
```json
{
  "name": "Building Microservices",
  "url": "https://www.amazon.it/Building-Microservices-Designing-Fine-grained-Systems/dp/1492034029/ref=sr_1_2?ie=UTF8&qid=1552553769&sr=8-2&keywords=microservices",
  "interval": 30,
  "notifications": [
    { "notification_type": "email", "notification_args": "john.doe@example.com", "threshold": 30 }
  ]
}
```
- `interval`: the interval in minutes between price checks.
- `notifications`: the notifications to send when the price reaches the specified threshold.
Sample request:
```sh
curl -X POST -H 'Content-type:application/json' -d @amazon_item.json localhost:3000/items
```
Perform database migrations:
```sh
docker run -it --rm -e RAILS_ENV=production -e BARGAIN_DB_NAME=<db-name> -e BARGAIN_DB_USER=<db-user> -e BARGAIN_DB_PASSWORD=<db-password> -e BARGAIN_DB_HOST=<db-host> -e BARGAIN_DB_PORT=<db-port> -e SECRET_KEY_BASE=<secret-key> --entrypoint migrate.sh lorenzobenvenuti/bargain
```
Run the application:
```sh
docker run -d --name bargain -p 3000:3000 -e RAILS_ENV=production -e BARGAIN_DB_NAME=<db-name> -e BARGAIN_DB_USER=<db-user> -e BARGAIN_DB_PASSWORD=<db-password> -e BARGAIN_DB_HOST=<db-host> -e BARGAIN_DB_PORT=<db-port> -e RENDERTRON_URL=<rendertron-url> -e ELASTICSEARCH_URL=<elasticsearch-url> -e SMTP_HOST=<smtp-host> -e SECRET_KEY_BASE=<secret-key> lorenzobenvenuti/bargain
```
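Once the container is up, a quick smoke test (assuming the default port mapping above) is to hit the scrapers endpoint:

```sh
# Should answer with the list of configured scrapers (likely an empty JSON array on a fresh install)
curl http://localhost:3000/scrapers
```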
- Introduce rules to retrieve and parse JSON
- Implement a queue system so the application can cope with a large number of items
- Application frontend
- Spread scrapers on multiple queues to avoid throttling
- Use Patron or Typhoeus to improve Elasticsearch client performance