Introduction
Htcrawl is a Node.js module for recursively crawling a single-page application (SPA) using JavaScript.
Class: Crawler
The following is a typical example of using Htcrawl to crawl a page:
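// Assumes htcap holds the loaded Htcrawl module and that targetUrl and options are defined
// Get an instance of the Crawler class
htcap.launch(targetUrl, options).then(crawler => {
    // Print out the URL of each ajax call
    crawler.on("xhr", e => {
        console.log("XHR to " + e.params.request.url);
    });
    // Start crawling, then close the browser when the crawl is finished
    crawler.start().then(() => crawler.browser().close());
});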
htcap.launch(targetUrl, [options])
targetUrl
<string>
options
<Object>
    referer
    <string> Sets the referer.
    userAgent
    <string> Sets the user-agent.
    setCookies
    <Array<Object>>
        name
        <string> (required)
        value
        <string> (required)
        url
        <string>
        domain
        <string>
        path
        <string>
        expires
        <number> Unix time in seconds.
        httpOnly
        <boolean>
        secure
        <boolean>
    proxy
    <string> Sets the proxy server (protocol://host:port).
    httpAuth
    <string> Sets HTTP authentication credentials (username:password).
    loadWithPost
    <boolean> Whether to load the page with the POST method.
    postData
    <string> Sets the data to be sent via POST.
    headlessChrome
    <boolean> Whether to run Chrome in headless mode.
    openChromeDevtoos
    <boolean> Whether to open Chrome DevTools.
    extraHeaders
    <Object> Sets additional HTTP headers.
    maximumRecursion
    <number> Sets the limit of DOM recursion. Defaults to 15.
    maximumAjaxChain
    <number> Sets the maximum number of chained ajax requests. Defaults to 30.
    triggerEvents
    <boolean> Whether to trigger events. Defaults to true.
    fillValues
    <boolean> Whether to fill input values. Defaults to true.
    maxExecTime
    <number> Maximum execution time in milliseconds. Defaults to 300000.
    overrideTimeoutFunctions
    <boolean> Whether to override timeout functions. Defaults to true.
    randomSeed
    <string> Seed used to generate the random values that fill inputs.
    exceptionOnRedirect
    <boolean> Whether to throw an exception on redirect. Defaults to false.
    navigationTimeout
    <number> Sets the navigation timeout. Defaults to 10000.
    bypassCSP
    <boolean> Whether to bypass CSP settings. Defaults to true.
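As a rough illustration of how these options fit together, the sketch below builds an options object whose setCookies array is derived from a plain cookie string. The helper name cookiesFromHeader, the example.test domain, and all values are hypothetical and not part of Htcrawl; only the option names come from the list above.

```javascript
// Hypothetical helper (not part of Htcrawl): turns "a=1; b=2" into the
// array-of-objects shape described for the setCookies option.
function cookiesFromHeader(header, domain) {
    return header
        .split(";")
        .map(pair => pair.trim())
        .filter(pair => pair.includes("="))
        .map(pair => {
            const idx = pair.indexOf("=");
            return {
                name: pair.slice(0, idx),
                value: pair.slice(idx + 1),
                domain: domain
            };
        });
}

// Option names are taken from the list above; the values are illustrative.
const options = {
    headlessChrome: true,
    maximumRecursion: 5,
    navigationTimeout: 20000,
    setCookies: cookiesFromHeader("session=abc123; theme=dark", "example.test")
};

console.log(JSON.stringify(options.setCookies, null, 2));
```

An object like this would then be passed as the second argument of htcap.launch(targetUrl, options).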
crawler.load()
Loads targetUrl without starting the crawl. Resolves when the page is loaded.
Returns: <Promise<Crawler>>
crawler.start()
Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>
crawler.stop()
Requests the crawling to stop. This makes start() resolve "immediately".
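One possible use of stop() is to cap a crawl after a fixed number of events. The sketch below assumes a hypothetical helper named stopAfter and uses a stand-in object instead of a real Crawler so that it runs without a browser. Only stop() and on() come from this reference.

```javascript
// Hypothetical helper: returns an event handler that calls crawler.stop()
// once it has been invoked `limit` times.
function stopAfter(limit) {
    let seen = 0;
    return (e, crawler) => {
        seen += 1;
        if (seen >= limit) {
            crawler.stop();
        }
        // Nothing is returned, so cancellable events are not cancelled.
    };
}

// Intended wiring, assuming `crawler` comes from htcap.launch():
//   crawler.on("xhr", stopAfter(50));

// Stand-in crawler used only to demonstrate the helper without a browser.
const fakeCrawler = { stopped: false, stop() { this.stopped = true; } };
const handler = stopAfter(2);
handler({ name: "xhr", params: {} }, fakeCrawler);
console.log(fakeCrawler.stopped); // false
handler({ name: "xhr", params: {} }, fakeCrawler);
console.log(fakeCrawler.stopped); // true
```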
crawler.navigate(url)
Navigates to url. Resolves when the page is loaded.
Returns: <Promise>
crawler.reload()
Reloads the current page. Resolves when the page is loaded.
Returns: <Promise>
crawler.clickToNavigate(selector, timeout)
Clicks on selector and waits timeout milliseconds for the navigation to start. Resolves when the page is loaded.
Returns: <Promise>
crawler.waitForRequestsCompletion()
Waits for XHR, JSONP, and fetch requests to complete. Resolves when all requests have completed.
Returns: <Promise>
crawler.browser()
Returns Puppeteer's Browser instance.
crawler.page()
Returns Puppeteer's Page instance.
crawler.newPage(url)
Creates a new browser page (a new tab). If url is provided, the new page will navigate to that URL when load() or start() is called.
crawler.on(event, function)
event
<string> Event name
function
<function(Object, Crawler)> A function that will be called with two arguments:
    eventObject
    <Object> Object containing the event name and parameters
        name
        <string> Event name
        params
        <Object> Event parameters
    crawler
    <Object> Crawler instance.
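To make the handler signature concrete, here is a minimal sketch that registers one handler per event name and records the name and params fields of each event object. The helper recordEvents and the stand-in crawler are hypothetical. The sketch assumes only that handlers receive the (eventObject, crawler) pair described above.

```javascript
// Hypothetical helper: registers one handler per event name and stores
// a short summary of each event object it receives.
function recordEvents(crawler, eventNames) {
    const log = [];
    for (const name of eventNames) {
        crawler.on(name, (e, crawlerInstance) => {
            const request = e.params && e.params.request;
            log.push({ event: e.name, url: request ? request.url : null });
            // Returning nothing leaves cancellable events untouched.
        });
    }
    return log;
}

// Stand-in for a Crawler (assumption: only on() is needed here).
const handlers = {};
const fakeCrawler = {
    on(name, fn) { handlers[name] = fn; }
};

const log = recordEvents(fakeCrawler, ["xhr", "fetch"]);
handlers.xhr({ name: "xhr", params: { request: { url: "/api/items" } } }, fakeCrawler);
handlers.fetch({ name: "fetch", params: { request: { url: "/api/user" } } }, fakeCrawler);
console.log(log);
```

With a real crawler, the same helper would be called before start() so that the handlers are in place when the first events fire.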
Events
The following events are emitted during crawling. Some events can be cancelled by returning false.
start
Emitted when Htcrawl starts.
Cancellable: False
Parameters: None
pageInitialized
Emitted when the page is initialized and all requests are completed.
Cancellable: False
Parameters: None
xhr
Emitted before sending an ajax request.
Cancellable: True
Parameters:
request
<Object> Instance of Request class
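Because this event is cancellable, a handler can keep the crawler from sending requests to hosts that are out of scope. The sketch below is one way to do that. The helper onlyHost and the example.test host are hypothetical, and resolving request.url against a base URL is an assumption made so that both relative and absolute URLs are handled.

```javascript
// Hypothetical helper: builds a handler that returns false (cancelling the
// request) when the request URL points outside the allowed host.
function onlyHost(allowedHost, baseUrl) {
    return (e) => {
        // Assumption: request.url may be relative, so resolve it against baseUrl.
        const target = new URL(e.params.request.url, baseUrl);
        if (target.host !== allowedHost) {
            return false;
        }
    };
}

// Intended wiring, assuming `crawler` comes from htcap.launch():
//   crawler.on("xhr", onlyHost("example.test", "https://example.test/"));
const handler = onlyHost("example.test", "https://example.test/");
console.log(handler({ params: { request: { url: "/api/items" } } }));               // undefined
console.log(handler({ params: { request: { url: "https://cdn.other.test/x" } } })); // false
```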
xhrcompleted
Emitted when an ajax request is completed.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
response
<string> Response text
timedout
<boolean> Whether the request is timed out
fetch
Emitted before sending a fetch request.
Cancellable: True
Parameters:
request
<Object> Instance of Request class
fetchcompleted
Emitted when a fetch request is completed.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
timedout
<boolean> Whether the request is timed out
jsonp
Emitted before sending a jsonp request.
Cancellable: True
Parameters:
request
<Object> Instance of Request class
jsonpcompleted
Emitted when a jsonp request is completed.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
scriptElement
<string> CSS selector of the added script element
timedout
<boolean> Whether the request is timed out
websocket
Emitted before opening a websocket connection.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
websocketmessage
Emitted when a message is received from a websocket.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
message
<string> Websocket message
websocketsend
Emitted before sending a message to a websocket.
Cancellable: True
Parameters:
request
<Object> Instance of Request class
message
<string> Websocket message
formsubmit
Emitted before submitting a form.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
form
<string> CSS selector of the form element.
fillinput
Emitted before filling an input element.
Cancellable: True
Parameters:
element
<string> CSS selector of the input element
Example:
// Set a custom value on the input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
    await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
    // Returning false cancels the default fill
    return false;
});
newdom
Emitted when new DOM content is added to the page.
Cancellable: False
Parameters:
rootNode
<string> CSS selector of the root element
trigger
<string> CSS selector of the element that triggered the DOM modification
Example:
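// Find links within the newly added content
crawler.on("newdom", async (e, crawler) => {
    const selector = e.params.rootNode + " a";
    // $$eval runs inside the page, so return the hrefs and log them from Node
    const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href));
    for (const href of hrefs)
        console.log(href);
});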
navigation
Emitted when the browser tries to navigate outside the current page.
Cancellable: False
Parameters:
request
<Object> Instance of Request class
domcontentloaded
Emitted when the DOM is loaded for the first time (on page load). This event must be registered before load() is called.
Cancellable: False
Parameters: None
redirect
Emitted when a redirect is requested.
Cancellable: True
Parameters:
url
<string> Redirect URL
earlydetach
Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:
node
<string> CSS selector of the detached element
triggerevent
Emitted before triggering an event. This event is available only after start() is called.
Cancellable: True
Parameters:
node
<string> CSS selector of the element
event
<string> Event name
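Since triggerevent is cancellable, a handler can keep the crawler from firing certain DOM events. The following sketch shows one approach. The helper skipEvents and the chosen event names are hypothetical, and the handler relies only on the event parameter listed above.

```javascript
// Hypothetical helper: returns a handler that cancels the trigger
// (by returning false) for any event name in the skip list.
function skipEvents(names) {
    const skip = new Set(names);
    return (e) => {
        if (skip.has(e.params.event)) {
            return false;
        }
    };
}

// Intended wiring, assuming `crawler` comes from htcap.launch():
//   crawler.on("triggerevent", skipEvents(["mouseover", "mouseout"]));
const handler = skipEvents(["mouseover", "mouseout"]);
console.log(handler({ params: { node: "#menu", event: "mouseover" } })); // false
console.log(handler({ params: { node: "#menu", event: "click" } }));     // undefined
```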
eventtriggered
Emitted after an event has been triggered. This event is available only after start() is called.
Cancellable: False
Parameters:
node
<string> CSS selector of the element
event
<string> Event name
Object: Request
Object used to hold information about a request.
type
<string> Type of request. It can be: link, xhr, fetch, websocket, jsonp, form, redirect
method
<string> HTTP method
url
<string> URL
data
<string> Request body (usually POST data)
trigger
<string> CSS selector of the HTML element that triggered the request
extra_headers
<Object> Extra HTTP headers
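A crawl often reports the same request more than once, so it can be useful to deduplicate collected Request objects. The sketch below does this with the fields listed above. The helpers requestKey and uniqueRequests are hypothetical, and the sample objects are plain stand-ins rather than real Request instances.

```javascript
// Hypothetical helper: builds a string key from the Request fields listed above.
function requestKey(request) {
    return [request.type, request.method, request.url, request.data || ""].join(" ");
}

// Hypothetical helper: keeps the first occurrence of each distinct request.
function uniqueRequests(requests) {
    const seen = new Set();
    return requests.filter(request => {
        const key = requestKey(request);
        if (seen.has(key)) {
            return false;
        }
        seen.add(key);
        return true;
    });
}

// Plain objects standing in for Request instances.
const sample = [
    { type: "xhr", method: "GET", url: "/api/items", data: null },
    { type: "xhr", method: "GET", url: "/api/items", data: null },
    { type: "form", method: "POST", url: "/login", data: "user=a" }
];
console.log(uniqueRequests(sample).length); // 2
```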