Htcrawl is a Node.js module for recursively crawling a single page application (SPA) using JavaScript.
The following is a typical example of using Htcrawl to crawl a page:
const htcrawl = require("htcrawl");

// Get instance of Crawler class
htcrawl.launch(targetUrl, options).then(crawler => {
    // Print out the url of ajax calls
    crawler.on("xhr", e => {
        console.log("XHR to " + e.params.request.url);
    });
    // Start crawling, then close the browser when done
    crawler.start().then(() => crawler.browser().close());
});
- targetUrl <string>
- options <Object>
  - referer <string> Sets the referer.
  - userAgent <string> Sets the user-agent.
  - setCookies <Array<Object>>
    - name <string> (required)
    - value <string> (required)
    - url <string>
    - domain <string>
    - path <string>
    - expires <number> Unix time in seconds.
    - httpOnly <boolean>
    - secure <boolean>
  - proxy <string> Sets proxy server. (protocol://host:port)
  - httpAuth <string> Sets http authentication credentials. (username:password)
  - loadWithPost <boolean> Whether to load page with POST method.
  - postData <string> Sets the data to be sent via POST.
  - headlessChrome <boolean> Whether to run chrome in headless mode.
  - openChromeDevtools <boolean> Whether to open chrome devtools. It implies headlessChrome=false.
  - extraHeaders <Object> Sets additional http headers.
  - maximumRecursion <number> Sets the limit of DOM recursion. Defaults to 15.
  - maximumAjaxChain <number> Sets the maximum number of chained ajax requests. Defaults to 30.
  - triggerEvents <boolean> Whether to trigger events. Defaults to true.
  - fillValues <boolean> Whether to fill input values. Defaults to true.
  - maxExecTime <number> Maximum execution time in milliseconds. Defaults to 300000.
  - overrideTimeoutFunctions <boolean> Whether to override timeout functions. Defaults to true.
  - randomSeed <string> Seed to generate random values to fill input values.
  - exceptionOnRedirect <boolean> Whether to throw an exception on redirect. Defaults to false.
  - navigationTimeout <number> Sets the navigation timeout. Defaults to 10000.
  - bypassCSP <boolean> Whether to bypass CSP settings. Defaults to true.
  - skipDuplicateContent <boolean> Use heuristic content deduplication. Defaults to true.
  - windowSize <int[]> Width and height of the browser's window.
  - showUI <boolean> Show the UI as devtools panel. It implies 'openChromeDevtools=true'.
  - customUI <Object> Configure the custom UI. It implies 'showUI=true'. See Custom UI section.
  - overridePostMessage <boolean> Whether to intercept window.postMessage. Defaults to true.
  - includeAllOrigins <boolean> Whether to crawl frames of other origins (non same-origin).
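As a rough sketch of how these options fit together, the snippet below builds an options object using only option names from the list above. The `buildOptions` helper, the cookie values, the header and the `example.com` URL are illustrative assumptions, not part of Htcrawl.

```javascript
// Hypothetical helper (not part of Htcrawl): builds an options object
// using option names documented above. All values are illustrative.
function buildOptions(sessionId) {
    return {
        headlessChrome: true,
        maximumRecursion: 5,        // lower than the documented default of 15
        maxExecTime: 60000,         // one minute instead of the default 300000 ms
        extraHeaders: { "X-Crawl": "1" },
        setCookies: [
            // name and value are required; the remaining fields are optional
            { name: "session", value: sessionId, domain: "example.com", path: "/", secure: true }
        ]
    };
}

const options = buildOptions("abc123");
console.log(JSON.stringify(options, null, 2));

// Usage (needs the htcrawl package installed, so it is left commented out):
// const htcrawl = require("htcrawl");
// htcrawl.launch("https://example.com", options).then(crawler => crawler.start());
```

Options that are omitted keep the defaults listed above.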
Loads targetUrl. Resolves when the page is loaded and ready for crawling.
Returns: <Promise<Crawler>>
Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>
Example:
const crawler = await htcrawl.launch("https://fcvl.net");
await crawler.start();
Requests the crawl to stop. It causes start() to resolve "immediately".
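One possible use is to enforce a time budget on a crawl. The sketch below assumes the stop request described above is exposed as `crawler.stop()`; that name, the `crawlWithDeadline` helper and the budget value are assumptions made for illustration.

```javascript
// Sketch: stop a crawl once a time budget has elapsed.
// Assumes the crawler object exposes start() and stop() (stop() is an assumed name).
async function crawlWithDeadline(crawler, budgetMs) {
    // Ask the crawler to stop when the budget runs out
    const timer = setTimeout(() => crawler.stop(), budgetMs);
    try {
        // Resolves when crawling finishes or shortly after stop() is requested
        await crawler.start();
    } finally {
        // Avoid a stray stop() call if the crawl finished early
        clearTimeout(timer);
    }
}

// Usage (inside an async function, with a crawler returned by htcrawl.launch):
// await crawlWithDeadline(crawler, 30000);
```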
Navigates to url. Resolves when the navigation is completed.
Returns: <Promise>
Reloads the current page. Resolves when the page is loaded.
Returns: <Promise>
Clicks on selector and waits for timeout milliseconds for the navigation to be started. Resolves when the navigation is completed.
If untilSelector is provided, the navigation is considered completed when the provided selector exists on the page.
Returns: <Promise>
Waits for XHR, JSONP, fetch requests to be completed. Resolves when all requests are performed.
Returns: <Promise>
Returns Puppeteer's Browser instance.
Returns Puppeteer's Page instance.
Creates a new browser's page (a new tab). If url is provided, the new page will navigate to that URL when load() or start() are called.
Creates a new browser's page (a new tab) that is detached from the crawler. If url is provided, the new page will navigate to that URL.
It's intended to be used in non-headless mode to perform logins or similar actions.
Returns the page instance.
Example:
const page = await crawler.newDetachedPage("login-page");
// Start crawling when the user closes the page
page.on("close", async () => {
    await crawler.start();
});
Sends a message to the UI (the browser's extension).
Calls window.postMessage() without triggering the corresponding event. Useful if there is an event registered that cancels postMessage calls.
The first argument is the CSS selector of any element within the receiving window/iframe. For example, html corresponds to window, and inframe/iframe ; html corresponds to the first iframe.
Example:
crawler.on("postmessage", async (event, crawler) => {
    if(event.params.destination != "html"){
        await crawler.postMessage("inframe/#frm ; html", "Overridden message", "*");
        // Discard original message
        return false;
    }
});
Registers an event handler.
- event <string> Event name
- function <function(Object, Crawler)> A function that will be called with two arguments:
  - eventObject <Object> Object containing the event name and parameters
    - name <string> Event name
    - params <Object> Event parameters
  - crawler <Object> Crawler instance.
Removes an event handler.
- event <string> Event name
The following events are emitted during crawling. Some events can be cancelled by returning false.
Emitted when Htcrawl starts.
Cancellable: False
Parameters: None
Emitted when the page is initialized and all requests are completed.
Cancellable: False
Parameters: None
Emitted before sending an ajax request.
Cancellable: True
Parameters:
- request <Object> Instance of Request class
Example:
crawler.on("xhr", e => {
    console.log("XHR to " + e.params.request.url);
});
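Because this event is cancellable, a handler can also block requests by returning false. The sketch below builds a handler that cancels requests to hosts other than the crawled one. `makeSameHostFilter` is a hypothetical helper and `example.com` a placeholder. It also assumes that a handler which does not return false lets the request proceed.

```javascript
// Hypothetical helper (not part of Htcrawl): returns an "xhr" handler
// that cancels requests whose host differs from the target's host.
function makeSameHostFilter(targetUrl) {
    const allowedHost = new URL(targetUrl).host;
    return function onXhr(e) {
        // Resolve the request URL against targetUrl in case it is relative
        const host = new URL(e.params.request.url, targetUrl).host;
        if (host !== allowedHost) {
            return false; // cancel the request
        }
        // No return value: the request is assumed to proceed normally
    };
}

const onXhr = makeSameHostFilter("https://example.com/app");

// Usage (with a crawler returned by htcrawl.launch):
// crawler.on("xhr", onXhr);
```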
Emitted when an ajax request is completed.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
- response <string> Response text
Emitted before sending a fetch request.
Cancellable: True
Parameters:
- request <Object> Instance of Request class
- response <string> Response text
Emitted when a fetch request is completed.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
- timedout <boolean> Whether the request has timed out
Emitted before sending a jsonp request.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
Emitted when a jsonp request is completed.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
- scriptElement <string> CSS selector of the added script element
- timedout <boolean> Whether the request has timed out
Emitted before opening a websocket connection.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
Emitted before sending a websocket request.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
- message <string> Websocket message
Emitted before sending a message to a websocket.
Cancellable: True
Parameters:
- request <Object> Instance of Request class
- message <string> Websocket message
Emitted before submitting a form.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
- form <string> CSS selector of the form element.
Emitted before filling an input element.
Cancellable: True
Parameters:
- element <string> CSS selector of the input element
Example:
// Set a custom value to input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
    await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
    return false;
});
Emitted when new DOM content is added to the page.
If false is returned, the new element won't be crawled.
Triggered only while crawling.
Cancellable: True
Parameters:
- rootNode <string> CSS selector of the root element
- trigger <string> CSS selector of the element that triggered the DOM modification
Example:
// Find links within the newly added content
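crawler.on("newdom", async (e, crawler) => {
    const selector = e.params.rootNode + " a";
    // $$eval runs in the page context, so return the hrefs and log them node-side
    const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href));
    for(const href of hrefs)
        console.log(href);
});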
Emitted when the browser tries to navigate outside the current page.
Cancellable: False
Parameters:
- request <Object> Instance of Request class
Emitted when the DOM is loaded for the first time (on page load). This event must be registered before load() is called.
Cancellable: False
Parameters: None
Emitted when a redirect is requested.
Cancellable: True
Parameters:
- url <string> Redirect URL
Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:
- element <string> CSS selector of the detached element
Emitted before triggering an event. This event is available only after start().
Cancellable: True
Parameters:
- element <string> CSS selector of the element
- event <string> Event name
Emitted after an event has been triggered. This event is available only after start().
Cancellable: False
Parameters:
- element <string> CSS selector of the element
- event <string> Event name
Emitted when the crawler starts crawling a new element.
Cancellable: False
Parameters:
- element <string> CSS selector of the element
- event <string> Event name
Emitted when window.postMessage is called.
Cancellable: True
Parameters:
- destination <string> CSS selector of the destination of the message
- message <Object> Message
- targetOrigin <string> targetOrigin
- transfer <Object> transfer
Object used to hold information about a request.
- type <string> Type of request. It can be: link, xhr, fetch, websocket, jsonp, form, redirect
- method <string> HTTP method
- url <string> URL
- data <string> Request body (usually POST data)
- trigger <string> CSS selector of the HTML element that triggered the request
- extra_headers <Object> Extra HTTP headers
- timestamp <Number> Timestamp of the request
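As an illustration of working with these fields, the sketch below counts collected requests by their `type`. The `countByType` helper and the sample data are hypothetical. In a real crawl the objects would come from event parameters such as `e.params.request`.

```javascript
// Hypothetical helper (not part of Htcrawl): counts Request-like objects by type.
function countByType(requests) {
    const counts = {};
    for (const req of requests) {
        counts[req.type] = (counts[req.type] || 0) + 1;
    }
    return counts;
}

// Stand-in data shaped like the Request object described above
const collected = [
    { type: "xhr", method: "GET", url: "https://example.com/api/items" },
    { type: "xhr", method: "POST", url: "https://example.com/api/items" },
    { type: "form", method: "POST", url: "https://example.com/login" }
];

const counts = countByType(collected);
console.log(counts);

// Usage (with a crawler returned by htcrawl.launch):
// const requests = [];
// crawler.on("xhr", e => { requests.push(e.params.request); });
```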
Object used to configure the custom UI (the interface with the browser's extension). The browser's extension can be generated with npx htcrawl lib scaffold <dir>.
- extensionPath <string> The path to the extension's folder
- UIMethods <Function> A function that is evaluated in the page's context that is used to set up the methods that are invoked from the browser's extension. It takes the UI object as parameter.
- events <Object> Object containing the events that are triggered from the methods defined in 'UIMethods'
Object that can be extended with custom methods. It resides in the context of the page.
By default, it contains two properties:
- dispatch <Function> Dispatch a message to the crawler (node-side)
- utils <Object> Some utilities to interact with the page:
  - getElementSelector <Function> Returns the CSS selector of the given element
  - createElement <Function> Creates a new element in the page that is excluded from crawling. It takes the following arguments:
    - name <string> The type of element to create (e.g., 'div', 'span')
    - style <Object> CSS styles to apply
    - parent <HTMLElement> Parent element to append to. If omitted, the element is appended to 'body'. If null, the element is not attached to the DOM.
  - selectElement <Function> Enables the user to select an element on the webpage by moving the cursor over it. It visually highlights the currently hovered element with an overlay and returns a promise that resolves to an Object containing the element and its selector.
const customUI = {
    extensionPath: __dirname + '/chrome-extension',
    UIMethods: UI => { // Evaluated in the context of the page
        UI.start = () => {
            UI.dispatch("start");
        };
    },
    events: { // Events triggered by 'UI.dispatch()' from the page context
        start: async e => {
            await crawler.start();
            // Send a message to the browser extension
            crawler.sendToUI("DONE");
        },
    }
};
In the extension's ui-panel.js file:
document.getElementById('start').onclick = () => {
    pageEval("UI.start()");
};
onCrawlerMessage(message => {
    document.getElementById("console").innerText += message + "\n";
});
Htcrawl implements all the selectors available in Puppeteer.
It also defines a custom selector to allow the selection of elements inside iframes. The iframe selector is invoked with inframe/ followed by the selector for the iframe, the 3-char separator ' ; ' and the selector of the element(s).
For example, if we have:
<body>
    <!-- index.html -->
    <iframe src="iframe.html"></iframe>
</body>
<body>
    <!-- iframe.html -->
    <button id=btn>Hi</button>
</body>
To select the button in the iframe, the inframe/body > iframe ; #btn selector can be used.
Example:
crawler.page().$('inframe/body > iframe ; #btn');
crawler.page().$$('inframe/body > iframe ; button');
crawler.page().waitForSelector('inframe/body > iframe ; #btn');
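Since the selector is a plain string, it can be composed programmatically. The small helper below is a hypothetical convenience, not part of Htcrawl. It joins the two parts with the `inframe/` prefix and the ' ; ' separator described above.

```javascript
// Hypothetical helper (not part of Htcrawl): composes an inframe/ selector.
function inframe(iframeSelector, elementSelector) {
    // "inframe/" + iframe selector + 3-char separator " ; " + element selector
    return "inframe/" + iframeSelector + " ; " + elementSelector;
}

const btnSelector = inframe("body > iframe", "#btn");
console.log(btnSelector); // inframe/body > iframe ; #btn

// Usage (with a crawler returned by htcrawl.launch):
// crawler.page().$(btnSelector);
```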