Merge pull request #6 from fcavallarin/developer
Improved navigation and bugfix
Showing 5 changed files with 451 additions and 85 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,294 @@ | ||
# Introduction | ||
|
||
Htcrawl is a Node.js module for recursively crawling a single page application (SPA) using JavaScript. | ||
|
||
# Class: Crawler | ||
|
||
The following is a typical example of using Htcrawl to crawl a page: | ||
|
||
|
||
```js | ||
// Get instance of Crawler class | ||
htcap.launch(targetUrl, options).then(crawler => { | ||
|
||
// Print out the url of ajax calls | ||
crawler.on("xhr", e => { | ||
console.log("XHR to " + e.params.request.url); | ||
}); | ||
|
||
// Start crawling! | ||
crawler.start().then( () => crawler.browser().close()); | ||
}); | ||
``` | ||
|
||
|
||
## htcap.launch(targetUrl, [options]) | ||
- `targetUrl` <string> | ||
- `options` <Object> | ||
  - `referer` <string> Sets the referer. | ||
  - `userAgent` <string> Sets the user-agent. | ||
  - `setCookies` <Array<Object>> | ||
    - `name` <string> (required) | ||
    - `value` <string> (required) | ||
    - `url` <string> | ||
    - `domain` <string> | ||
    - `path` <string> | ||
    - `expires` <number> Unix time in seconds. | ||
    - `httpOnly` <boolean> | ||
    - `secure` <boolean> | ||
  - `proxy` <string> Sets proxy server. (protocol://host:port) | ||
  - `httpAuth` <string> Sets http authentication credentials. (username:password) | ||
  - `loadWithPost` <boolean> Whether to load page with POST method. | ||
  - `postData` <string> Sets the data to be sent via POST. | ||
  - `headlessChrome` <boolean> Whether to run chrome in headless mode. | ||
  - `openChromeDevtoos` <boolean> Whether to open chrome devtools. | ||
  - `extraHeaders` <Object> Sets additional http headers. | ||
  - `maximumRecursion` <number> Sets the limit of DOM recursion. Defaults to 15. | ||
  - `maximumAjaxChain` <number> Sets the maximum number of chained ajax requests. Defaults to 30. | ||
  - `triggerEvents` <boolean> Whether to trigger events. Defaults to true. | ||
  - `fillValues` <boolean> Whether to fill input values. Defaults to true. | ||
  - `maxExecTime` <number> Maximum execution time in milliseconds. Defaults to 300000. | ||
  - `overrideTimeoutFunctions` <boolean> Whether to override timeout functions. Defaults to true. | ||
  - `randomSeed` <string> Seed to generate random values to fill input values. | ||
  - `exceptionOnRedirect` <boolean> Whether to throw an exception on redirect. Defaults to false. | ||
  - `navigationTimeout` <number> Sets the navigation timeout. Defaults to 10000. | ||
  - `bypassCSP` <boolean> Whether to bypass CSP settings. Defaults to true. | ||
|
||
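The option list above can be turned into a small options builder. The sketch below is only an illustration: `buildLaunchOptions` and `DOCUMENTED_DEFAULTS` are hypothetical names that are not part of Htcrawl, and the defaults are simply copied from the list above.

```javascript
// Defaults copied from the option list above; only options that document
// a default value are included here.
const DOCUMENTED_DEFAULTS = Object.freeze({
  maximumRecursion: 15,
  maximumAjaxChain: 30,
  triggerEvents: true,
  fillValues: true,
  maxExecTime: 300000,
  overrideTimeoutFunctions: true,
  exceptionOnRedirect: false,
  navigationTimeout: 10000,
  bypassCSP: true,
});

// Hypothetical helper (not part of Htcrawl): merges caller overrides on top
// of the documented defaults and rejects cookies that lack the two fields
// marked as required.
function buildLaunchOptions(overrides = {}) {
  const options = { ...DOCUMENTED_DEFAULTS, ...overrides };
  for (const cookie of options.setCookies || []) {
    if (typeof cookie.name !== "string" || typeof cookie.value !== "string") {
      throw new TypeError("each cookie needs string `name` and `value` fields");
    }
  }
  return options;
}

const options = buildLaunchOptions({
  headlessChrome: true,
  maxExecTime: 60000,
  setCookies: [{ name: "session", value: "abc123", domain: "example.com" }],
});
```

The resulting object could then be passed as the second argument of `htcap.launch(targetUrl, options)`.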
|
||
## crawler.load() | ||
Loads `targetUrl`. Resolves when the page has finished loading. | ||
Returns: <Promise<Crawler>> | ||
|
||
## crawler.start() | ||
Loads `targetUrl` and starts crawling. Resolves when the crawling is finished. | ||
Returns: <Promise<Crawler>> | ||
|
||
## crawler.stop() | ||
Requests that the crawl stop, which makes `start()` resolve "immediately". | ||
|
||
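One possible use of `stop()` is to cap a crawl after a fixed number of ajax requests. The sketch below assumes that calling `stop()` from inside an event handler is acceptable; `stopAfterXhrLimit` is a hypothetical helper, not part of Htcrawl.

```javascript
// Hypothetical helper (not part of Htcrawl): asks the crawler to stop once
// `limit` "xhr" events have been observed. `crawler` only needs the on()
// and stop() methods described in this document.
function stopAfterXhrLimit(crawler, limit) {
  let seen = 0;
  crawler.on("xhr", () => {
    seen += 1;
    // Returning nothing leaves the request itself untouched; only the
    // crawl as a whole is asked to stop.
    if (seen === limit) crawler.stop();
  });
  // Expose the counter so callers can report how many requests were seen.
  return () => seen;
}
```

A caller would invoke `stopAfterXhrLimit(crawler, 100)` before `crawler.start()`.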
|
||
## crawler.navigate(url) | ||
Navigates to `url`. Resolves when the page is loaded. | ||
Returns: <Promise> | ||
|
||
## crawler.reload() | ||
Reloads the current page. Resolves when the page is loaded. | ||
Returns: <Promise> | ||
|
||
## crawler.clickToNavigate(selector, timeout) | ||
Clicks on the element matching `selector` and waits `timeout` milliseconds for the navigation to start. Resolves when the page is loaded. | ||
Returns: <Promise> | ||
|
||
## crawler.waitForRequestsCompletion() | ||
Waits for pending XHR, JSONP, and fetch requests to complete. Resolves when all requests have finished. | ||
Returns: <Promise> | ||
|
||
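The navigation helpers above can be chained. The following sketch assumes a crawler whose page is already loaded; `openSection` is a hypothetical helper, and what happens when no navigation starts within the timeout is not covered by this document. `page().url()` is Puppeteer's own method for reading the current URL.

```javascript
// Hypothetical helper (not part of Htcrawl): follows a link by clicking it,
// waits for background requests to settle, then reports the new URL.
async function openSection(crawler, linkSelector, timeoutMs = 500) {
  // Click and wait for the resulting navigation to finish loading.
  await crawler.clickToNavigate(linkSelector, timeoutMs);
  // Give XHR, JSONP and fetch requests fired by the new page time to end.
  await crawler.waitForRequestsCompletion();
  // Puppeteer's Page#url() returns the URL of the current page.
  return crawler.page().url();
}
```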
## crawler.browser() | ||
Returns Puppeteer's Browser instance. | ||
|
||
## crawler.page() | ||
Returns Puppeteer's Page instance. | ||
|
||
## crawler.newPage(url) | ||
Creates a new browser page (a new tab). If `url` is provided, the new page will navigate to that URL when `load()` or `start()` is called. | ||
|
||
## crawler.on(event, function) | ||
- `event` <string> Event name | ||
- `function` <function(Object, Crawler)> A function that will be called with two arguments: | ||
  - `eventObject` <Object> Object containing the event name and parameters | ||
    - `name` <string> Event name | ||
    - `params` <Object> Event parameters | ||
  - `crawler` <Object> Crawler instance. | ||
|
||
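The handler signature above lends itself to a single logging helper shared by several events. This is a sketch under the assumption that every listed event carries a `request` parameter with a `url` field; `logRequests` is a hypothetical name, not part of Htcrawl.

```javascript
// Hypothetical helper (not part of Htcrawl): registers one handler per
// event name and writes "<event name> <request url>" to `sink`.
function logRequests(crawler, eventNames, sink = console.log) {
  for (const eventName of eventNames) {
    crawler.on(eventName, (eventObject) => {
      // `name` and `params` are the two documented fields of the event object.
      sink(`${eventObject.name} ${eventObject.params.request.url}`);
    });
  }
}
```

For example, `logRequests(crawler, ["xhr", "fetch", "jsonp"])` would print one line per outgoing request.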
|
||
## Events | ||
The following events are emitted during crawling. Some events can be cancelled by returning false. | ||
|
||
### start | ||
Emitted when Htcrawl starts. | ||
Cancellable: False | ||
Parameters: None | ||
|
||
|
||
### pageInitialized | ||
Emitted when the page is initialized and all requests are completed. | ||
Cancellable: False | ||
Parameters: None | ||
|
||
### xhr | ||
Emitted before sending an ajax request. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
|
||
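Because `xhr` is cancellable, a handler can veto requests by returning false. The sketch below blocks requests to a list of hosts; `blockXhrToHosts` is a hypothetical helper, and it assumes `request.url` is usually an absolute URL (URLs that cannot be parsed are let through).

```javascript
// Hypothetical helper (not part of Htcrawl): cancels ajax requests whose
// host is in `blockedHosts` by returning false from the "xhr" handler.
function blockXhrToHosts(crawler, blockedHosts) {
  const blocked = new Set(blockedHosts);
  crawler.on("xhr", (e) => {
    let host;
    try {
      host = new URL(e.params.request.url).hostname;
    } catch {
      // Not an absolute URL: do not interfere with the request.
      return undefined;
    }
    // Returning false cancels the request; undefined lets it proceed.
    return blocked.has(host) ? false : undefined;
  });
}
```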
### xhrcompleted | ||
Emitted when an ajax request is completed. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
- `response` <string> Response text | ||
- `timedout` <boolean> Whether the request timed out | ||
|
||
### fetch | ||
Emitted before sending a fetch request. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
|
||
|
||
### fetchcompleted | ||
Emitted when a fetch request is completed. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
- `timedout` <boolean> Whether the request timed out | ||
|
||
### jsonp | ||
Emitted before sending a jsonp request. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
|
||
### jsonpcompleted | ||
Emitted when a jsonp request is completed. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
- `scriptElement` <string> CSS selector of the added script element | ||
- `timedout` <boolean> Whether the request timed out | ||
|
||
### websocket | ||
Emitted before opening a websocket connection. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
|
||
### websocketmessage | ||
Emitted when a message is received from a websocket. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
- `message` <string> Websocket message | ||
|
||
### websocketsend | ||
Emitted before sending a message to a websocket. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
- `message` <string> Websocket message | ||
|
||
### formsubmit | ||
Emitted before submitting a form. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
- `form` <string> CSS selector of the form element. | ||
|
||
### fillinput | ||
Emitted before filling an input element. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `element` <string> CSS selector of the input element | ||
|
||
Example: | ||
|
||
```js | ||
// Set a custom value to input field and prevent auto-filling | ||
crawler.on("fillinput", async (e, crawler) => { | ||
await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value"); | ||
return false; | ||
}); | ||
``` | ||
|
||
|
||
### newdom | ||
Emitted when new DOM content is added to the page. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `rootNode` <string> CSS selector of the root element | ||
- `trigger` <string> CSS selector of the element that triggered the DOM modification | ||
|
||
Example: | ||
|
||
```js | ||
// Find links within the newly added content | ||
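crawler.on("newdom", async (e, crawler) => { | ||
const selector = e.params.rootNode + " a"; | ||
// $$eval runs in the page, so return the hrefs and log them from Node | ||
const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href)); | ||
for(let href of hrefs) | ||
console.log(href); | ||
}); | ||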
``` | ||
|
||
### navigation | ||
Emitted when the browser tries to navigate outside the current page. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `request` <Object> Instance of Request class | ||
|
||
|
||
### domcontentloaded | ||
Emitted when the DOM is loaded for the first time (on page load). The handler for this event must be registered before calling `load()`. | ||
Cancellable: False | ||
Parameters: None | ||
|
||
### redirect | ||
Emitted when a redirect is requested. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `url` <string> Redirect URL | ||
|
||
### earlydetach | ||
Emitted when an element is detached before it has been analyzed. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `node` <string> CSS selector of the detached element | ||
|
||
### triggerevent | ||
Emitted before triggering an event. This event is available only after `start()`. | ||
Cancellable: True | ||
Parameters: | ||
|
||
- `node` <string> CSS selector of the element | ||
- `event` <string> Event name | ||
|
||
### eventtriggered | ||
Emitted after an event has been triggered. This event is available only after `start()`. | ||
Cancellable: False | ||
Parameters: | ||
|
||
- `node` <string> CSS selector of the element | ||
- `event` <string> Event name | ||
|
||
|
||
# Object: Request | ||
Object used to hold information about a request. | ||
|
||
- `type` <string> Type of request. It can be: link, xhr, fetch, websocket, jsonp, form, redirect | ||
- `method` <string> HTTP method | ||
- `url` <string> URL | ||
- `data` <string> Request body (usually POST data) | ||
- `trigger` <string> CSS selector of the HTML element that triggered the request | ||
- `extra_headers` <Object> Extra HTTP headers | ||
|
||
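The fields of a Request object are enough to deduplicate requests collected during a crawl. This is a minimal sketch on plain objects shaped like the list above; `requestKey` and `collectUnique` are hypothetical helpers, not part of Htcrawl.

```javascript
// Hypothetical helper (not part of Htcrawl): builds a string key from the
// documented Request fields so that equivalent requests compare equal.
function requestKey(request) {
  return [request.type, request.method, request.url, request.data || ""].join(" ");
}

// Keeps the first occurrence of each distinct request, preserving order.
function collectUnique(requests) {
  const seen = new Map();
  for (const request of requests) {
    const key = requestKey(request);
    if (!seen.has(key)) seen.set(key, request);
  }
  return [...seen.values()];
}
```

Handlers for `xhr`, `fetch` or `formsubmit` could push `e.params.request` into an array and pass it to `collectUnique` once the crawl has finished.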
|
||
|
||
|
||
|
||
|