- Feature: New option `:use-http-headers-from-content` that can be set to `false` to disable charset detection based on the HTML response body. (See the sketch after this list.)
- Fix: Uncaught exceptions thrown by enhancers (like the DB one) are now propagated to the top level and handled gracefully.
- Feature: New function `cached-document` for accessing a previous (cached) version of a page downloaded while in update mode.
- Documentation: New example illustrating the use of a Redis cache backend. (Thanks to Patrick van de Glind, Carlo Sciolla, Alvin Francis Dumalus, and Oskar Gewalli for contributing!)
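A minimal sketch of passing the new option, assuming the keyword-argument style that `scrape` uses for its other options; the seed and the `:index` processor are placeholders:

```clojure
(require '[skyscraper.core :as core])

;; Sketch only: rely on the Content-Type header and skip charset sniffing
;; of the response body. The seed and the :index processor are placeholders.
(core/scrape [{:url "https://example.com" :processor :index}]
             :use-http-headers-from-content false)
```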
This release corrects the issue in 0.3.3 that caused its pom.xml to not include dependencies, but is otherwise the same.
- Feature: To facilitate debugging, processors can now set the `:skyscraper/description` key on contexts. These descriptions will be logged when downloading, instead of the URL, and won’t be propagated to child contexts. (See the sketch after this list.)
- Fix: Skyscraper now properly closes the cache when using `scrape!` and one of the processors throws an exception.
- Fix: Skyscraper no longer complains when the server returns a quoted charset in the `Content-Type` header.
- Fix: `:skyscraper.traverse/priority` is no longer propagated to child contexts.
- Infra: Skyscraper’s dependencies are now managed with cli-tools, with Kaocha being used for testing.
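An unofficial sketch of a processor using `:skyscraper/description`; `extract-author-links` and the `:author-page` processor are invented for illustration:

```clojure
(require '[skyscraper.core :refer [defprocessor]])

(defn extract-author-links
  "Hypothetical helper: pull {:name ... :url ...} maps out of the parsed page."
  [document]
  [])

;; Sketch only: each child context carries a human-readable description, which
;; gets logged during download instead of the URL and is not inherited by
;; that context's own children.
(defprocessor :author-list
  :process-fn (fn [document context]
                (for [{:keys [name url]} (extract-author-links document)]
                  {:author name
                   :url url
                   :processor :author-page
                   :skyscraper/description (str "author page for " name)})))
```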
- Fix: Skyscraper no longer throws exceptions when using `processed-cache` and some of the processors don’t have `:cache-template`.
- Fix: Skyscraper no longer throws exceptions when the server returns multiple `Content-Type` headers.
- Fix: Processed cache no longer garbles non-ASCII strings on macOS.
- Backwards-incompatible API changes:
  - `parse-fn` is now expected to take three arguments, the third being the context. The aim of this change is to support cases where the HTML is known to be malformed and needs context-aware preprocessing before parsing. Built-in parse fns have been updated to take the additional argument. (See the sketch after this list.)
  - Cache backends are now expected to implement `java.io.Closeable` in addition to `CacheBackend`. Built-in backends have been updated to include no-op `close` methods.
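A sketch of a parse function under the new three-argument contract. The order of the first two arguments, the `:legacy-page?` flag, and the use of Jsoup (available via the optional reaver engine) are assumptions made for illustration, not part of the official API:

```clojure
(require '[clojure.string :as str])

;; Sketch only: a parse-fn honouring the new three-argument contract. The order
;; of the first two arguments (headers, then body) is an assumption based on
;; the built-in parse fns; :legacy-page? is a hypothetical context flag.
(defn forgiving-parse-fn
  [headers body context]
  (let [html (if (bytes? body)
               (String. ^bytes body "UTF-8")          ; assume the raw body may arrive as bytes
               (str body))
        html (if (:legacy-page? context)
               (str/replace html "<center>" "<div>")  ; naive context-aware cleanup
               html)]
    (org.jsoup.Jsoup/parse html)))
```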
- Optimization: Skyscraper no longer generates indexes for columns marked with `:skyscraper.db/key-columns` when creating the DB from scratch. There is also a new option, `:ignore-db-keys`, to force this at all times. (See the sketch after this list.)
- Skyscraper now retries downloads upon encountering a timeout.
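A hedged sketch of forcing this behaviour from `scrape!`; `:db-file` is assumed here to be the option pointing Skyscraper at the SQLite output, and the seed and `:index` processor are placeholders:

```clojure
(require '[skyscraper.core :as core])

;; Sketch only: write results to SQLite but never create indexes for key
;; columns. :db-file is assumed to be the option naming the SQLite file.
(core/scrape! [{:url "https://example.com" :processor :index}]
              :db-file "output.sqlite"
              :ignore-db-keys true)
```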
- Bug fixes:
  - Fixed `dev/scrape` misbehaving when redefining processors while scraping is suspended.
  - Fixed `scrape` mishandling errors with `:download-mode` set to `:sync`.
  - Fixed an off-by-one bug in handling `:retries`.
  - Retry counts are now correctly reset on successful download.
- Skyscraper has been rewritten from scratch to be asynchronous and multithreaded, based on core.async. See Scraping modes for details.
- Skyscraper now supports saving the scrape results to a SQLite database.
- In addition to the classic `scrape` function that returns a lazy sequence of nodes, there is an alternative, non-lazy, imperative interface (`scrape!`) that treats producing new results as side effects.
- reaver (using JSoup) is now available as an optional underlying HTML parsing engine, as an alternative to Enlive.
- `:parse-fn` and `:http-options` can now be provided either per-page or globally. (Thanks to Alexander Solovyov for the suggestion.)
- All options are now optional, including a sane default for `process-fn`.
- Backwards-incompatible API changes:
  - The `skyscraper` namespace has been renamed to `skyscraper.core`.
  - Processors are now named by keywords. `defprocessor` now takes a keyword name, and registers a function in the global registry instead of defining it. This means that it’s no longer possible to call one processor from another: if you need that, define `process-fn` as a named function. (A sketch follows this list.)
  - The context values corresponding to `:processor` keys are now expected to be keywords.
  - `scrape` no longer guarantees the order in which the site will be scraped. In particular, two different invocations of `scrape` are not guaranteed to return the scraped data in the same order. If you need that guarantee, set `parallelism` and `max-connections` to 1.
  - The cache interface has been overhauled. Caching now works by storing binary blobs (rather than strings), along with metadata (e.g., HTTP headers). Caches created by Skyscraper 0.1 or 0.2 cannot be reused for 0.3.
  - Error handling has been reworked.
  - `get-cache-keys` has been removed. If you want the same effect, include `:cache-key` in the desired contexts.
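A sketch of the keyword-based style described above; the processor names, cache templates, and `parse-item-rows` helper are invented for illustration:

```clojure
(require '[skyscraper.core :refer [defprocessor]])

(defn parse-item-rows
  "Hypothetical shared helper; with keyword-named processors, shared logic
  lives in ordinary named functions like this one."
  [document]
  [])

;; Processors now register under keywords, and contexts refer to them via
;; {:processor :item-page}.
(defprocessor :item-list
  :cache-template "items/index"
  :process-fn (fn [document context]
                (for [row (parse-item-rows document)]
                  (assoc row :processor :item-page))))

(defprocessor :item-page
  :cache-template "items/:id"
  :process-fn (fn [document context]
                {:title "..."}))   ; extract whatever fields you need here
```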
- New feature: Custom parse functions.
- New feature: Customizable error handling strategies.
- Bugfix: `:only` no longer barfs on keys not appearing in the seed.
- Skyscraper now uses Timbre for logging.
- New cache backend: `MemoryCache`.
- `download` now supports arbitrarily many retries.
- A situation where a context has a processor but no URL now triggers a warning instead of throwing an exception.
- New function: `get-cache-keys`.
- `scrape` and friends can now accept a keyword as the first argument.
- Cache keys are now accessible from within processors (under the `:cache-key` key in the context).
- New `scrape` options: `:only` and `:postprocess`.
- `scrape-csv` now accepts an `:all-keys` argument and has been rewritten using a helper function, `save-dataset-to-csv`.
- Skyscraper now supports pluggable cache backends.
- The caching mechanism has been completely overhauled and Skyscraper no longer creates temporary files when the HTML cache is disabled.
- Support for capturing scraping results to CSV via `scrape-csv`.
- Support for updating existing scrapes: new processor flag `:updatable`; `scrape` now has an `:update` option. (See the sketch after this list.)
- New `scrape` option: `:retries`.
- Fixed a bug whereby scraping huge datasets would result in an `OutOfMemoryError`. (`scrape` no longer holds onto the head of the lazy seq it produces.)
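A sketch combining the update and retry options, shown in the current `skyscraper.core` style rather than the API exactly as it was in this release; the seed and `:index` processor are placeholders:

```clojure
(require '[skyscraper.core :as core])

;; Sketch only: update an existing scrape, re-fetching pages whose processors
;; are flagged :updatable, and retry failed downloads up to 5 times.
(core/scrape [{:url "https://example.com" :processor :index}]
             :update true
             :retries 5)
```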
- A processor can now return one context only. (Thanks to Bryan Maass.)
- The `processed-cache` option to `scrape` now works as advertised.
- New `scrape` option: `:html-cache`. (Thanks to ayato-p.)
- Namespaced keywords are now resolved correctly to processors. (Thanks to ayato-p.)
- New official `defprocessor` clauses: `:url-fn` and `:cache-key-fn`.
  - Note: these clauses existed in previous versions but were undocumented.
- All contexts except the root ones are now guaranteed to contain the `:url` key.
- Processors (`process-fn` functions) can now access the current context.
- Skyscraper now uses clj-http to issue HTTP GET requests.
- Skyscraper can now auto-detect page encoding thanks to clj-http’s `decode-body-headers` feature.
- `scrape` now supports an `:http-options` argument to override HTTP options (e.g., timeouts).
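A sketch of overriding HTTP options with clj-http timeout settings, shown in the current `skyscraper.core` style; the exact timeout keys available depend on your clj-http version, and the seed and `:index` processor are placeholders:

```clojure
(require '[skyscraper.core :as core])

;; Sketch only: pass clj-http options (timeouts in milliseconds) that apply to
;; every request made during the scrape.
(core/scrape [{:url "https://example.com" :processor :index}]
             :http-options {:socket-timeout 10000
                            :connection-timeout 10000})
```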
- Skyscraper’s output is now fully lazy (i.e., guaranteed to be non-chunking).
- Fixed a bug where relative URLs were incorrectly resolved in certain circumstances.
- First public release.