Introduction

Htcrawl is a Node.js module for recursively crawling a single-page application (SPA) using JavaScript.

Class: Crawler

The following is a typical example of using Htcrawl to crawl a page:

const htcrawl = require("htcrawl");

// Get an instance of the Crawler class
htcrawl.launch(targetUrl, options).then(crawler => {

  // Print out the URL of ajax calls
  crawler.on("xhr", e => {
    console.log("XHR to " + e.params.request.url);
  });

  // Start crawling, then close the browser when the crawl is finished
  crawler.start().then(() => crawler.browser().close());
});

htcrawl.launch(targetUrl, [options])

  • targetUrl <string>
  • options <Object>
    • referer <string> Sets the referer.
    • userAgent <string> Sets the user-agent.
    • setCookies <Array<Object>>
      • name <string> (required)
      • value <string> (required)
      • url <string>
      • domain <string>
      • path <string>
      • expires <number> Unix time in seconds.
      • httpOnly <boolean>
      • secure <boolean>
    • proxy <string> Sets proxy server. (protocol://host:port)
    • httpAuth <string> Sets http authentication credentials. (username:password)
    • loadWithPost <boolean> Whether to load page with POST method.
    • postData <string> Sets the data to be sent via POST.
    • headlessChrome <boolean> Whether to run chrome in headless mode.
    • openChromeDevtools <boolean> Whether to open chrome devtools. It implies headlessChrome=false.
    • extraHeaders <Object> Sets additional http headers.
    • maximumRecursion <number> Sets the limit of DOM recursion. Defaults to 15.
    • maximumAjaxChain <number> Sets the maximum number of chained ajax requests. Defaults to 30.
    • triggerEvents <boolean> Whether to trigger events. Defaults to true.
    • fillValues <boolean> Whether to fill input values. Defaults to true.
    • maxExecTime <number> Maximum execution time in milliseconds. Defaults to 300000.
    • overrideTimeoutFunctions <boolean> Whether to override timeout functions. Defaults to true.
    • randomSeed <string> Seed to generate random values to fill input values.
    • exceptionOnRedirect <boolean> Whether to throw an exception on redirect. Defaults to false.
    • navigationTimeout <number> Sets the navigation timeout. Defaults to 10000.
    • bypassCSP <boolean> Whether to bypass CSP settings. Defaults to true.
    • skipDuplicateContent <boolean> Use heuristic content deduplication. Defaults to true.
    • windowSize <int[]> Width and height of the browser's window.
    • showUI <boolean> Whether to show the UI as a devtools panel. It implies 'openChromeDevtools=true'.
    • customUI <Object> Configure the custom UI. It implies 'showUI=true'. See Custom UI section.
    • overridePostMessage <boolean> Whether to intercept window.postMessage. Defaults to true.
    • includeAllOrigins <boolean> Whether to crawl frames of other origins (non same-origin).
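
As a rough illustration of how an options object might be assembled, the sketch below checks the formats stated in the list above (protocol://host:port for proxy, username:password for httpAuth, and the required name and value of each cookie). The helper name checkLaunchOptions and its validation rules are assumptions made for this example; they are not part of Htcrawl.

```javascript
// Hypothetical helper (not part of Htcrawl): checks a few documented
// constraints and returns a copy of the options object.
function checkLaunchOptions(options = {}) {
  const checked = { ...options };

  // proxy is documented as protocol://host:port
  if (checked.proxy !== undefined &&
      !/^[a-z][a-z0-9+.-]*:\/\/[^\s/]+:\d+$/i.test(checked.proxy)) {
    throw new Error("proxy must look like protocol://host:port");
  }

  // httpAuth is documented as username:password
  if (checked.httpAuth !== undefined && checked.httpAuth.indexOf(":") < 1) {
    throw new Error("httpAuth must look like username:password");
  }

  // Each cookie requires a name and a value
  for (const cookie of checked.setCookies || []) {
    if (typeof cookie.name !== "string" || typeof cookie.value !== "string") {
      throw new Error("every cookie needs a string name and value");
    }
  }

  return checked;
}

const options = checkLaunchOptions({
  proxy: "http://127.0.0.1:8080",
  httpAuth: "admin:secret",
  setCookies: [{ name: "session", value: "abc123", httpOnly: true }],
  maximumRecursion: 5,
});
console.log(options.maximumRecursion); // 5
```

The returned object can then be passed as the options argument of launch().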

crawler.load()

Loads targetUrl. Resolves when the page is loaded and ready for crawling.
Returns: <Promise<Crawler>>

crawler.start()

Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>

Example:

const crawler = await htcrawl.launch("https://fcvl.net");
await crawler.start();

crawler.stop()

Requests the crawl to stop. It makes start() resolve "immediately".
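
One possible use of stop() is to cut a crawl short from outside the crawler. The sketch below is a minimal example under that assumption: the helper name crawlWithDeadline is hypothetical, and it relies only on start() and stop() as described above. The maxExecTime option covers a similar need from the configuration side.

```javascript
// Hypothetical helper: crawl for at most `ms` milliseconds.
async function crawlWithDeadline(crawler, ms) {
  let stopped = false;
  const timer = setTimeout(() => {
    stopped = true;
    crawler.stop(); // requests the crawl to stop, so start() resolves early
  }, ms);
  try {
    await crawler.start();
  } finally {
    clearTimeout(timer); // the crawl ended on its own or was stopped
  }
  return stopped; // true when the deadline cut the crawl short
}
```

The function resolves to true when the timer fired before the crawl finished, and to false otherwise.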

crawler.navigate(url)

Navigates to url. Resolves when the navigation is completed.
Returns: <Promise>

crawler.reload()

Reloads the current page. Resolves when the page is loaded.
Returns: <Promise>

crawler.clickToNavigate(selector, timeout, untilSelector)

Clicks on selector and waits for timeout milliseconds for the navigation to be started. Resolves when the navigation is completed.
If untilSelector is provided, the navigation is considered completed when the provided selector exists on the page.
Returns: <Promise>

crawler.waitForRequestsCompletion()

Waits for XHR, JSONP, and fetch requests to complete. Resolves when all requests have completed.
Returns: <Promise>

crawler.browser()

Returns Puppeteer's Browser instance.

crawler.page()

Returns Puppeteer's Page instance.
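
The methods above can be combined to drive a page before crawling it. The following is a minimal sketch of a scripted login, assuming a form with the placeholder selectors #username, #password and #login-button and a post-login element #dashboard; none of these come from Htcrawl, and the function name login is hypothetical. It uses Puppeteer's page.type(selector, text) to fill the fields.

```javascript
// Hypothetical login flow built from the methods documented above.
// All selectors are placeholders for the target application's own markup.
async function login(crawler, user, password) {
  await crawler.load();                 // page is loaded, crawling not started
  const page = crawler.page();          // Puppeteer's Page instance
  await page.type("#username", user);   // Puppeteer: page.type(selector, text)
  await page.type("#password", password);

  // Click the button, wait 5000 ms for the navigation to start, and
  // consider it completed once "#dashboard" exists on the page.
  await crawler.clickToNavigate("#login-button", 5000, "#dashboard");

  // Let pending XHR, JSONP and fetch requests finish.
  await crawler.waitForRequestsCompletion();
}
```

Taking the crawler as a parameter keeps the flow independent of how the instance was launched.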

crawler.newPage(url)

Creates a new browser page (a new tab). If url is provided, the new page navigates to that URL when load() or start() is called.

crawler.newDetachedPage(url)

Creates a new browser page (a new tab) that is detached from the crawler. If url is provided, the new page navigates to that URL.
It is intended to be used in non-headless mode to perform logins or similar actions.
Returns the page instance.

Example:

const page = await crawler.newDetachedPage("login-page");
// Start crawling when the user closes the page
page.on("close", async () =>{
  await crawler.start();
})

crawler.sendToUI(message)

Sends message to the UI (the browser's extension).

crawler.postMessage(destination, message, targetOrigin, transfer)

Calls window.postMessage() without triggering the corresponding event. Useful if a registered event handler cancels postMessage calls.
The first argument is the CSS selector of any element within the receiving window/iframe. For example, html corresponds to window, and inframe/iframe ; html corresponds to the first iframe.

Example:

crawler.on("postmessage", async (event, crawler) => {
  if(event.params.destination != "html"){
    await crawler.postMessage("inframe/#frm ; html", "Overridden message", "*");
    // Discard the original message
    return false;
  }
})

crawler.on(event, function)

Registers an event handler.

  • event <string> Event name
  • function <function(Object, Crawler)> A function that will be called with two arguments:
    • eventObject <Object> Object containing the event name and parameters
      • name <string> Event name
      • params <Object> Event parameters
    • crawler <Object> Crawler instance.
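
As an illustration of the handler signature, the sketch below registers one handler per event name and records the request parameter of each event. The helper name collectRequests is hypothetical; the default event names are the ones that, according to the Events section, carry a request parameter.

```javascript
// Hypothetical helper: records the Request object of request-like events.
function collectRequests(
  crawler,
  eventNames = ["xhr", "fetch", "jsonp", "formsubmit", "navigation"]
) {
  const seen = [];
  for (const name of eventNames) {
    crawler.on(name, event => {
      seen.push({ event: event.name, request: event.params.request });
      // Nothing is returned: returning false would cancel cancellable events.
    });
  }
  return seen; // filled in while the crawl runs
}
```

The returned array can be inspected once start() has resolved.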

crawler.removeEvent(event)

Removes an event handler.

  • event <string> Event name

Events

The following events are emitted during crawling. Some events can be cancelled by returning false.

start

Emitted when Htcrawl starts.
Cancellable: False
Parameters: None

pageInitialized

Emitted when the page is initialized and all requests are completed.
Cancellable: False
Parameters: None

xhr

Emitted before sending an ajax request.
Cancellable: True
Parameters:

  • request <Object> Instance of Request class

Example:

  crawler.on("xhr", e => {
    console.log("XHR to " + e.params.request.url);
  });
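
Because xhr is cancellable, a handler can also block requests by returning false. The following is a minimal sketch, assuming that request.url holds an absolute URL; the factory name onlyHost is hypothetical.

```javascript
// Hypothetical handler factory: cancels requests that leave the given host.
function onlyHost(allowedHost) {
  return event => {
    let host;
    try {
      host = new URL(event.params.request.url).hostname;
    } catch (err) {
      return; // not an absolute URL: leave the request alone
    }
    if (host !== allowedHost) {
      return false; // returning false cancels the request
    }
  };
}

// Usage, given a crawler instance:
// crawler.on("xhr", onlyHost("example.com"));
```

The same handler could be registered for other cancellable events that carry a request parameter, such as fetch.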

xhrcompleted

Emitted when an ajax request is completed.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class
  • response <string> Response text

fetch

Emitted before sending a fetch request.
Cancellable: True
Parameters:

  • request <Object> Instance of Request class
  • response <string> Response text

fetchcompleted

Emitted when a fetch request is completed.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class
  • timedout <boolean> Whether the request is timed out

jsonp

Emitted before sending a jsonp request.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class

jsonpcompleted

Emitted when a jsonp request is completed.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class
  • scriptElement <string> Css selector of the added script element
  • timedout <boolean> Whether the request is timed out

websocket

Emitted before opening a websocket connection.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class

websocketmessage

Emitted before sending a websocket request.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class
  • message <string> Websocket message

websocketsend

Emitted before sending a message to a websocket.
Cancellable: True
Parameters:

  • request <Object> Instance of Request class
  • message <string> Websocket message

formsubmit

Emitted before submitting a form.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class
  • form <string> Css selector of the form element.

fillinput

Emitted before filling an input element.
Cancellable: True
Parameters:

  • element <string> Css selector of the input element

Example:

// Set a custom value on the input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
  await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
  return false;
});

newdom

Emitted when new DOM content is added to the page.
If false is returned the new element won't be crawled.
Triggered only while crawling.
Cancellable: True
Parameters:

  • rootNode <string> Css selector of the root element
  • trigger <string> Css selector of the element that triggered the DOM modification

Example:

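// Find links within the newly added content
crawler.on("newdom", async (e, crawler) => {
  const selector = e.params.rootNode + " a";
  // $$eval runs in the page context, so return the hrefs to Node before logging
  const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href));
  for(const href of hrefs)
    console.log(href);
});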

navigation

Emitted when the browser tries to navigate outside the current page.
Cancellable: False
Parameters:

  • request <Object> Instance of Request class

domcontentloaded

Emitted when the DOM is loaded for the first time (on page load). This event must be registered before load().
Cancellable: False
Parameters: None

redirect

Emitted when a redirect is requested.
Cancellable: True
Parameters:

  • url <string> Redirect URL

earlydetach

Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:

  • element <string> Css selector of the detached element

triggerevent

Emitted before triggering an event. This event is available only after start()
Cancellable: True
Parameters:

  • element <string> Css selector of the element
  • event <string> Event name

eventtriggered

Emitted after an event has been triggered. This event is available only after start()
Cancellable: False
Parameters:

  • element <string> Css selector of the element
  • event <string> Event name

crawlelement

Emitted when the crawler starts crawling a new element.
Cancellable: False
Parameters:

  • element <string> Css selector of the element
  • event <string> Event name

postmessage

Emitted when window.postMessage is called.
Cancellable: True
Parameters:

  • destination <string> Css selector of the destination of the message
  • message <Object> Message
  • targetOrigin <string> targetOrigin
  • transfer <Object> transfer

Object: Request

Object used to hold information about a request.

  • type <string> Type of request. It can be: link, xhr, fetch, websocket, jsonp, form, redirect
  • method <string> Http Method
  • url <string> URL
  • data <string> Request body (usually POST data)
  • trigger <string> Css selector of the HTML element that triggered the request
  • extra_headers <Object> Extra HTTP headers
  • timestamp <Number> Timestamp of the request
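
When requests are collected from several events, the same request often shows up more than once. A minimal sketch of deduplication using only the fields listed above follows; the helper names requestKey and uniqueRequests are hypothetical, and the fallback to GET for a missing method is an assumption of this example.

```javascript
// Hypothetical helper: builds a comparison key from documented Request fields.
function requestKey(request) {
  const method = (request.method || "GET").toUpperCase(); // assumed fallback
  return [request.type, method, request.url, request.data || ""].join(" ");
}

// Keeps the first occurrence of every distinct request.
function uniqueRequests(requests) {
  const byKey = new Map();
  for (const request of requests) {
    const key = requestKey(request);
    if (!byKey.has(key)) {
      byKey.set(key, request);
    }
  }
  return [...byKey.values()];
}
```

The trigger and timestamp fields are left out of the key on purpose, so that the same request fired by different elements counts once.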

Object: Custom UI

Object used to configure the custom UI (the interface with the browser's extension). The browser's extension can be generated with npx htcrawl lib scaffold <dir>.

  • extensionPath <string> The path to the extension's folder
  • UIMethods <Function> A function that is evaluated in the page's context that is used to set up the methods that are invoked from the browser's extension. It takes the UI object as parameter.
  • events <object> Object containing the events that are triggered from the methods defined in 'UIMethods'

Object: UI

Object that can be extended with custom methods. It resides in the context of the page.
By default, it contains two properties:

  • dispatch <Function> Dispatch a message to the crawler (node-side)

  • utils <object> Some utilities to interact with the page:

    • getElementSelector <Function> Returns the CSS selector of the given element
    • createElement <Function> Creates a new element in the page that is excluded from crawling. It takes the following arguments:
      • name: <string> The type of element to create (e.g., 'div', 'span')
      • style: <object> CSS styles to apply
      • parent: <HTMLElement> Parent element to append to. If omitted, the element is appended to 'body'. If null the element is not attached to the DOM.
    • selectElement <Function> Enables the user to select an element on the webpage by moving the cursor over it. It visually highlights the currently hovered element with an overlay and returns a promise that resolves to an Object containing the element and its selector.

Example

const customUI = {
    extensionPath: __dirname + '/chrome-extension',
    UIMethods: UI => {  // Evaluated in the context of the page
        UI.start = () => {
            UI.dispatch("start")
        }
    },
    events: {  // Events triggered by 'UI.dispatch()' from the page context
        start: async e => {
            await crawler.start();
            // Send a message to the browser extension
            crawler.sendToUI("DONE")
        },
    }
};

In the extension's ui-panel.js file:

document.getElementById('start').onclick = () => {
  pageEval("UI.start()");
};

onCrawlerMessage( message => {
  document.getElementById("console").innerText += message + "\n";
});

Selectors

Htcrawl implements all the selectors available in Puppeteer.
It also defines a custom selector to allow the selection of elements inside iframes. The iframe selector is invoked with inframe/ followed by the selector for the iframe, the 3-char separator ' ; ' and the selector of the element(s).
For example, if we have:

<body>
  <!-- index.html -->
  <iframe src="iframe.html"></iframe>
</body>
<body>
  <!-- iframe.html -->
  <button id="btn">Hi</button>
</body>

To select the button inside the iframe, the selector inframe/body > iframe ; #btn can be used.
Example:

crawler.page().$('inframe/body > iframe ; #btn');
crawler.page().$$('inframe/body > iframe ; button');
crawler.page().waitForSelector('inframe/body > iframe ; #btn');
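
Since the iframe selector is plain string concatenation, a small helper can keep the inframe/ prefix and the ' ; ' separator in one place. This is a minimal sketch; the function name inFrame is hypothetical and not part of Htcrawl.

```javascript
// Hypothetical helper: builds an iframe selector from its two parts.
function inFrame(frameSelector, elementSelector) {
  return "inframe/" + frameSelector + " ; " + elementSelector;
}

console.log(inFrame("body > iframe", "#btn")); // inframe/body > iframe ; #btn

// Usage, given a crawler instance:
// crawler.page().$(inFrame("body > iframe", "#btn"));
```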