
###Scraper###

####Stand-Alone Aggregated Output####

The barkingowl scraper can be used without the AMQP message bus by importing it directly. The simplest way to use the scraper is to give it a URL, wait for it to gather all of the documents, and have it return them as a list of dictionaries:

import datetime

from barking_owl import Scraper

# first, we need to make a Scraper object that will do the work for us
s = Scraper()

# next we need to define a website that we want to scrape
url_data = {
    'target_url': 'http://timduffy.me/',
    'doc_types': [
        'application/pdf',
    ],
    'title': "Tim Duffy's Website",
    'description': "Tim Duffy's Personal Website.",
    'max_link_level': 2,
    'creation_datetime': str(datetime.datetime.now()),
    'allowed_domains': [
    ],
    'sleep_time': 0,
}

# set the url data within the scraper
s.set_url_data(url_data)

# dispatch the scraper on the target URL.  Note that this function is blocking.
data = s.start()

print data['documents']

# output:
[
    {
        'url': 'http://timduffy.me/Resume-TimDuffy-20130813.pdf',
        'tag_text': 'Resume',
        'page_title': 'TimDuffy.Me',
        'page_url': 'http://timduffy.me/',
    }
]

####Stand-Alone Continuous Output####

Another way to use the scraper is to have it call a callback function each time it finds a document. This can be useful if sites take a long time to scrape, or if you are worried about losing the connection to the site during the scraping session. It also makes it easy to queue up documents for further processing.

import datetime

from barking_owl import Scraper
from mylib import save_doc

def doc_callback(_data, document):
    save_doc(
        url = document['url'],
        tag_text = document['tag_text'],
        page_title = document['page_title'],
        page_url = document['page_url'],
    )

# first, we need to make a Scraper object that will do the work for us
s = Scraper()

# next we need to define a website that we want to scrape
url_data = {
    'target_url': 'http://timduffy.me/',
    'doc_types': [
        'application/pdf',
    ],
    'title': "Tim Duffy's Website",
    'description': "Tim Duffy's Personal Website.",
    'max_link_level': 2,
    'creation_datetime': str(datetime.datetime.now()),
    'allowed_domains': [
    ],
}

# set the url data within the scraper
s.set_url_data( url_data )

# we'll set a callback function that will be called each time the scrape finds a document
s.set_callbacks(
    found_doc_callback = doc_callback
)

# tell the scrape to find all of the documents on the website of the defined type
data = s.start()

docs = data['documents']

# each time a document is found, the local doc_callback() function will be called.  This function
# will be passed the URL info about the document, as well as the scraper's _data contents.

####Messaged####

The other way to use the scraper is with the AMQP message bus. The scraper is linked to the message bus via the ScraperWrapper class. The wrapper class spins up the scraper in its own thread and then listens for AMQP messages on the main thread. The scraper callbacks allow data to be passed between the scraper and the message bus.

import uuid

import barking_owl
from scraperwrapper import ScraperWrapper

scraper_uid = str(uuid.uuid4())

# first create our scraper wrapper which will connect to the local machine ('localhost'), on the
# AMQP exchange called 'barkingowl', and broadcast its availability to the message bus every
# 5 seconds.
scraper_wrapper = ScraperWrapper(
    address = 'localhost',
    exchange = "barkingowl",
    broadcast_interval = 5,
    url_parameters = None,
    uid = scraper_uid,
    DEBUG = True,
)

-- or --

# additionally, you can pass a pika URLParameters object to the ScraperWrapper instead of setting
# the address explicitly.  Note that if the url_parameters field is a non-None value, then it
# will be used when configuring pika.  Note 2: barking_owl.URLParameters is simply an
# import of pika.URLParameters.
scraper_wrapper = ScraperWrapper(
    address = None,
    exchange = "barkingowl",
    broadcast_interval = 5,
    url_parameters = barking_owl.URLParameters(
        'amqp://guest:guest@rabbit-server1:5672/%2F?backpressure_detection=t'
    ),
    uid = scraper_uid,
    DEBUG=True,
)

# next, all we have to do is start the scraper wrapper instance to start it listening on the bus and
# responding to commands.
scraper_wrapper.start()

# to send a URL to the scraper, do the following

import datetime
import uuid

from barking_owl import BusAccess

# create an instance of BusAccess to talk to the message bus    
bus_access = BusAccess(
    uid = str(uuid.uuid4()),
    address = 'localhost',
    exchange = 'barkingowl',
    url_parameters = None, # can use this to connect to RabbitMQ with a URI
    DEBUG = True,
)

# define a URL packet to send
url = {
    'target_url': "http://timduffy.me/",
    'doc_types': [
        'application/pdf',
    ],
    'title': "TimDuffy.Me",
    'description': "Tim Duffy's Personal Website",
    'max_link_level': 1,
    'creation_datetime': str(datetime.datetime.now()),
    'allowed_domains': [
    ],
}

# send the URL packet to the scraper (note the scraper_uid is used)
bus_access.send_message({
    'command': 'url_dispatch',
    'destination_id': scraper_uid,
    'message': url,
})

# you can send the 'global_shutdown' command to any part of the barkingowl ecosystem to stop a process
# from running.  Note that this process may take a few seconds to complete.
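
# for example, a minimal sketch of shutting a scraper down over the bus, assuming the
# 'global_shutdown' command uses the same send_message() payload structure as the
# 'url_dispatch' example above (the 'destination_id' and 'message' values here are assumptions)
bus_access.send_message({
    'command': 'global_shutdown',
    'destination_id': scraper_uid,
    'message': {},
})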

####Scraper Tracking Configuration####

The scraper has some configuration options for how the list of 'seen' URLs is kept. They can be kept in a Python list of dicts, or, via SQLAlchemy, in a SQLite database or a more powerful SQL server such as MariaDB.

# Configure the scraper to use a python list of dicts
s = Scraper(
    check_type = 'dict',
    check_type_uri = None,
)

# Configure the scraper to use a local SQLite database
s = Scraper(
    check_type = 'sql',
    check_type_uri = 'sqlite:///barkingowl-scraper.sqlite',
)
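
For a server-backed database such as MariaDB, the same 'sql' check type can be pointed at a standard SQLAlchemy connection URI. The sketch below is an assumption based on the SQLite example above; the host, database name, and credentials are placeholders, and a MySQL-compatible driver such as PyMySQL must be installed.

# Configure the scraper to keep 'seen' URLs in a MariaDB/MySQL server via SQLAlchemy
# (hypothetical host, database, and credentials)
s = Scraper(
    check_type = 'sql',
    check_type_uri = 'mysql+pymysql://user:password@db-server/barkingowl',
)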