Merge pull request #6 from fcavallarin/developer
Improved navigation and bugfix
fcavallarin authored Feb 13, 2023
2 parents 21bc5e0 + 91e5edb commit 1d65b18
Showing 5 changed files with 451 additions and 85 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -37,4 +37,4 @@ This program is free software; you can redistribute it and/or modify it under th

## ABOUT

Written by Filippo Cavallarin. This project is son of Htcap (https://github.com/fcavallarin/htcap | https://htcap.org).
Written by Filippo Cavallarin. This project is son of Htcap (https://github.com/fcavallarin/htcap).
294 changes: 294 additions & 0 deletions docs/API.md
@@ -0,0 +1,294 @@
# Introduction

Htcrawl is a Node.js module for recursively crawling a single page application (SPA) using JavaScript.

# Class: Crawler

The following is a typical example of using Htcrawl to crawl a page:


```js
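// Assumption: the module is installed as "htcrawl"; targetUrl and options are placeholders
const htcap = require("htcrawl");
const targetUrl = "https://example.test/";
const options = { headlessChrome: true };

// Get instance of Crawler class
htcap.launch(targetUrl, options).then(crawler => {

    // Print out the url of ajax calls
    crawler.on("xhr", e => {
        console.log("XHR to " + e.params.request.url);
    });

    // Start crawling, then close the browser returned by crawler.browser()
    crawler.start().then(() => crawler.browser().close());
});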
```


## htcap.launch(targetUrl, [options])
- `targetUrl` <string>
- `options` <Object>
  - `referer` <string> Sets the referer.
  - `userAgent` <string> Sets the user-agent.
  - `setCookies` <Array<Object>>
    - `name` <string> (required)
    - `value` <string> (required)
    - `url` <string>
    - `domain` <string>
    - `path` <string>
    - `expires` <number> Unix time in seconds.
    - `httpOnly` <boolean>
    - `secure` <boolean>
  - `proxy` <string> Sets the proxy server. (protocol://host:port)
  - `httpAuth` <string> Sets HTTP authentication credentials. (username:password)
  - `loadWithPost` <boolean> Whether to load the page with the POST method.
  - `postData` <string> Sets the data to be sent via POST.
  - `headlessChrome` <boolean> Whether to run Chrome in headless mode.
  - `openChromeDevtoos` <boolean> Whether to open Chrome DevTools.
  - `extraHeaders` <Object> Sets additional HTTP headers.
  - `maximumRecursion` <number> Sets the limit of DOM recursion. Defaults to 15.
  - `maximumAjaxChain` <number> Sets the maximum number of chained ajax requests. Defaults to 30.
  - `triggerEvents` <boolean> Whether to trigger events. Defaults to true.
  - `fillValues` <boolean> Whether to fill input values. Defaults to true.
  - `maxExecTime` <number> Maximum execution time in milliseconds. Defaults to 300000.
  - `overrideTimeoutFunctions` <boolean> Whether to override timeout functions. Defaults to true.
  - `randomSeed` <string> Seed used to generate the random values that fill input fields.
  - `exceptionOnRedirect` <boolean> Whether to throw an exception on redirect. Defaults to false.
  - `navigationTimeout` <number> Sets the navigation timeout. Defaults to 10000.
  - `bypassCSP` <boolean> Whether to bypass CSP settings. Defaults to true.
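
As an illustration of how these options combine, the following sketch builds an `options` object that could be passed to `htcap.launch()`. Only the option names come from the list above; every value (the cookie, the header, the limits) is a made-up placeholder.

```js
// Hypothetical values throughout; only the option names are taken from the list above.
const options = {
    userAgent: "ExampleCrawler/1.0",
    referer: "https://example.test/",
    setCookies: [
        // name and value are required, the remaining fields are optional
        { name: "session", value: "abc123", domain: "example.test", path: "/", httpOnly: true, secure: true }
    ],
    extraHeaders: { "X-Example": "1" },
    headlessChrome: true,
    maximumRecursion: 5,   // lower than the default of 15
    maxExecTime: 60000     // milliseconds, lower than the default of 300000
};
```

The object is then passed as the second argument, as in `htcap.launch(targetUrl, options)`.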


## crawler.load()
Loads `targetUrl` without starting the crawl. Resolves when the page is loaded.
Returns: <Promise<Crawler>>

## crawler.start()
Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>

## crawler.stop()
Requests the crawl to stop, which causes `start()` to resolve "immediately".
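
One way `start()` and `stop()` might be combined is to bound a crawl with a deadline. The helper below is a sketch, not part of Htcrawl: `crawlWithDeadline` is a hypothetical name, and it relies only on the documented behaviour that `stop()` makes `start()` resolve.

```js
// Hypothetical helper: stop the crawl if it is still running after `ms` milliseconds.
// `crawler` is any object exposing the start() and stop() methods described above.
async function crawlWithDeadline(crawler, ms) {
    const timer = setTimeout(() => crawler.stop(), ms);
    try {
        // Resolves when the crawl finishes or when stop() is called
        return await crawler.start();
    } finally {
        // Avoid calling stop() on a crawl that has already finished
        clearTimeout(timer);
    }
}
```

It could be called as `await crawlWithDeadline(crawler, 30000)` on a crawler obtained from `htcap.launch()`.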


## crawler.navigate(url)
Navigates to `url`. Resolves when the page is loaded.
Returns: <Promise>

## crawler.reload()
Reloads the current page. Resolves when the page is loaded.
Returns: <Promise>

## crawler.clickToNavigate(selector, timeout)
Clicks on `selector` and waits up to `timeout` milliseconds for the navigation to start. Resolves when the page is loaded.
Returns: <Promise>

## crawler.waitForRequestsCompletion()
Waits for pending XHR, JSONP, and fetch requests to complete. Resolves when all requests have completed.
Returns: <Promise>
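
The navigation methods above can be chained to drive a page by hand. The following sketch assumes `crawler` has already been loaded; `clickAndSettle` is a hypothetical helper name, and the selector and timeout are supplied by the caller.

```js
// Hypothetical helper: click an element that triggers a navigation, wait for the
// page's pending requests to finish, then report where the browser ended up.
async function clickAndSettle(crawler, selector, timeout) {
    await crawler.clickToNavigate(selector, timeout);
    await crawler.waitForRequestsCompletion();
    // crawler.page() returns Puppeteer's Page, whose url() gives the current URL
    return crawler.page().url();
}
```

For example, `await clickAndSettle(crawler, "#next", 500)` would click the element matching `#next` and return the URL of the page that was loaded.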

## crawler.browser()
Returns Puppeteer's Browser instance.

## crawler.page()
Returns Puppeteer's Page instance.

## crawler.newPage(url)
Creates a new browser page (a new tab). If `url` is provided, the new page will navigate to that URL when `load()` or `start()` is called.

## crawler.on(event, function)
- `event` <string> Event name
- `function` <function(Object, Crawler)] A function that will be called with two arguments:
- `eventObject` <Object> Object containing event name parameters
- `name` <string> Event name
- `params` <Object> Event parameters
- `crawler` <Object> Crawler instance.
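
To show the shape of the handler arguments, here is a small sketch that registers the same handler for several events and records what it sees. `collectRequests` is a hypothetical helper, and it assumes every event it is given carries a `request` parameter, as the `xhr`, `fetch`, and `jsonp` events do.

```js
// Hypothetical helper: record the event name and request URL for each listed event.
function collectRequests(crawler, eventNames) {
    const seen = [];
    for (const eventName of eventNames) {
        crawler.on(eventName, (e) => {
            // e.name is the event name, e.params holds the event parameters.
            // Nothing is returned, so the event is never cancelled.
            seen.push({ event: e.name, url: e.params.request.url });
        });
    }
    return seen;
}
```

Calling `const seen = collectRequests(crawler, ["xhr", "fetch", "jsonp"])` before `crawler.start()` would leave the collected entries in `seen` once the crawl resolves.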


## Events
The following events are emitted during crawling. Some events can be cancelled by returning false.

### start
Emitted when Htcrawl starts.
Cancellable: False
Parameters: None


### pageInitialized
Emitted when the page is initialized and all requests are completed.
Cancellable: False
Parameters: None

### xhr
Emitted before sending an ajax request.
Cancellable: True
Parameters:

- `request` <Object> Instance of Request class
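
Because `xhr` is cancellable, a handler can veto individual requests by returning false. The sketch below cancels any request whose method is not in a small allow-list; the handler name and the choice of "safe" methods are assumptions made for illustration.

```js
// Assumption: only these methods are considered non-destructive for this crawl.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// Hypothetical handler: returning false cancels the request, anything else lets it through.
function onlySafeMethods(e) {
    const method = String(e.params.request.method).toUpperCase();
    return SAFE_METHODS.has(method) ? undefined : false;
}
```

It would be registered with `crawler.on("xhr", onlySafeMethods)`; the same function could be reused for the `fetch` event, which carries the same `request` parameter.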

### xhrcompleted
Emitted when an ajax request is completed.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class
- `response` <string> Response text
- `timedout` <boolean> Whether the request timed out

### fetch
Emitted before sending a fetch request.
Cancellable: True
Parameters:

- `request` <Object> Instance of Request class


### fetchcompleted
Emitted when a fetch request is completed.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class
- `timedout` <boolean> Whether the request timed out

### jsonp
Emitted before sending a jsonp request.
Cancellable: True
Parameters:

- `request` <Object> Instance of Request class

### jsonpcompleted
Emitted when a jsonp request is completed.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class
- `scriptElement` <string> CSS selector of the added script element
- `timedout` <boolean> Whether the request timed out

### websocket
Emitted before opening a websocket connection.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class

### websocketmessage
Emitted when a message is received from a websocket.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class
- `message` <string> Websocket message

### websocketsend
Emitted before sending a message to a websocket.
Cancellable: True
Parameters:

- `request` <Object> Instance of Request class
- `message` <string> Websocket message

### formsubmit
Emitted before submitting a form.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class
- `form` <string> CSS selector of the form element.

### fillinput
Emitted before filling an input element.
Cancellable: True
Parameters:

- `element` <string> CSS selector of the input element

Example:

```js
// Set a custom value to input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
    await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
    return false;
});
```


### newdom
Emitted when new DOM content is added to the page.
Cancellable: False
Parameters:

- `rootNode` <string> CSS selector of the root element
- `trigger` <string> CSS selector of the element that triggered the DOM modification

Example:

```js
// Find links within the newly added content
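crawler.on("newdom", async (e, crawler) => {
    const selector = e.params.rootNode + " a";
    // $$eval runs in the page context, so return the hrefs and log them from Node
    const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href));
    for (const href of hrefs)
        console.log(href);
});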
```

### navigation
Emitted when the browser tries to navigate outside the current page.
Cancellable: False
Parameters:

- `request` <Object> Instance of Request class


### domcontentloaded
Emitted when the DOM is loaded for the first time (on page load). This event must be registered before `load()`.
Cancellable: False
Parameters: None

### redirect
Emitted when a redirect is requested.
Cancellable: True
Parameters:

- `url` <string> Redirect URL
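
As a sketch of how a cancellable `redirect` handler might be used, the factory below builds a handler that cancels redirects leaving the origin of a given base URL. `sameOriginOnly` is a hypothetical name, and resolving relative redirect targets against the base URL is an assumption about what `url` may contain.

```js
// Hypothetical factory: returns a handler that cancels redirects to other origins.
function sameOriginOnly(baseUrl) {
    const allowedOrigin = new URL(baseUrl).origin;
    return (e) => {
        // Resolve relative targets against baseUrl before comparing origins
        const target = new URL(e.params.url, baseUrl);
        return target.origin === allowedOrigin ? undefined : false;
    };
}
```

It would be registered with `crawler.on("redirect", sameOriginOnly(targetUrl))`.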

### earlydetach
Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:

- `node` <string> CSS selector of the detached element

### triggerevent
Emitted before triggering an event. This event is available only after `start()`.
Cancellable: True
Parameters:

- `node` <string> CSS selector of the element
- `event` <string> Event name

### eventtriggered
Emitted after an event has been triggered. This event is available only after `start()`.
Cancellable: False
Parameters:

- `node` <string> CSS selector of the element
- `event` <string> Event name


# Object: Request
Object used to hold information about a request.

- `type` <string> Type of request. It can be: link, xhr, fetch, websocket, jsonp, form, redirect
- `method` <string> HTTP method
- `url` <string> URL
- `data` <string> Request body (usually POST data)
- `trigger` <string> CSS selector of the HTML element that triggered the request
- `extra_headers` <Object> Extra HTTP headers
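
Since the same endpoint is often requested many times during a crawl, collected `Request` objects usually need de-duplicating. The helpers below are a sketch that works on plain objects with the fields listed above; `requestKey` and `uniqueRequests` are hypothetical names, and treating `type`, `method`, `url`, and `data` as the identity of a request is an assumption.

```js
// Hypothetical helper: build a string that identifies a request.
function requestKey(req) {
    return [req.type, req.method, req.url, req.data || ""].join(" ");
}

// Hypothetical helper: keep the first occurrence of each distinct request.
function uniqueRequests(requests) {
    const byKey = new Map();
    for (const req of requests) {
        const key = requestKey(req);
        if (!byKey.has(key)) byKey.set(key, req);
    }
    return [...byKey.values()];
}
```

Whether `trigger` or `extra_headers` should also be part of the key depends on how the results are used.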





