Introduction
Htcrawl is a Node.js module for recursively crawling a single page application (SPA) using JavaScript.
Class: Crawler
The following is a typical example of using Htcrawl to crawl a page:
// Get an instance of the Crawler class
htcap.launch(targetUrl, options).then(crawler => {
    // Print the URL of each ajax call
    crawler.on("xhr", e => {
        console.log("XHR to " + e.params.request.url);
    });
    // Start crawling, then close the browser
    crawler.start().then(() => crawler.browser().close());
});
htcap.launch(targetUrl, [options])
targetUrl<string>
options<Object>
    referer<string> Sets the referer.
    userAgent<string> Sets the user-agent.
    setCookies<Array<Object>>
        name<string> (required)
        value<string> (required)
        url<string>
        domain<string>
        path<string>
        expires<number> Unix time in seconds.
        httpOnly<boolean>
        secure<boolean>
    proxy<string> Sets the proxy server. (protocol://host:port)
    httpAuth<string> Sets the HTTP authentication credentials. (username:password)
    loadWithPost<boolean> Whether to load the page with the POST method.
    postData<string> Sets the data to be sent via POST.
    headlessChrome<boolean> Whether to run Chrome in headless mode.
    openChromeDevtoos<boolean> Whether to open Chrome DevTools.
    extraHeaders<Object> Sets additional HTTP headers.
    maximumRecursion<number> Sets the limit of DOM recursion. Defaults to 15.
    maximumAjaxChain<number> Sets the maximum number of chained ajax requests. Defaults to 30.
    triggerEvents<boolean> Whether to trigger events. Defaults to true.
    fillValues<boolean> Whether to fill input values. Defaults to true.
    maxExecTime<number> Maximum execution time in milliseconds. Defaults to 300000.
    overrideTimeoutFunctions<boolean> Whether to override timeout functions. Defaults to true.
    randomSeed<string> Seed used to generate the random values that fill inputs.
    exceptionOnRedirect<boolean> Whether to throw an exception on redirect. Defaults to false.
    navigationTimeout<number> Sets the navigation timeout. Defaults to 10000.
    bypassCSP<boolean> Whether to bypass CSP settings. Defaults to true.
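As a rough illustration, an options object combining several of the settings listed above might be assembled as in the sketch below. buildOptions is a hypothetical helper, not part of Htcrawl, and the cookie, header and timing values are placeholders.

```javascript
// Hypothetical helper (not part of Htcrawl): merges caller overrides onto a
// base set of options. Only option names documented above are used, and all
// values are placeholders.
function buildOptions(overrides = {}) {
  const base = {
    headlessChrome: true,
    maxExecTime: 60000, // one minute instead of the documented default of 300000
    navigationTimeout: 10000,
    setCookies: [
      { name: "session", value: "placeholder", domain: "example.test", path: "/" },
    ],
    extraHeaders: { "X-Example": "1" },
  };
  return { ...base, ...overrides };
}

const options = buildOptions({ fillValues: false });
console.log(options);
// The result would then be passed as the second argument:
// htcap.launch(targetUrl, options)
```

Keeping the base values in one place makes it easy to vary a single setting per run without repeating the whole object.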
crawler.load()
Loads targetUrl without starting the crawl. Resolves when the page is loaded.
Returns: <Promise<Crawler>>
crawler.start()
Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>
crawler.stop()
Requests that the crawl stop. This causes start() to resolve "immediately".
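One possible use of stop() is to end a crawl once enough traffic has been observed. The sketch below assumes that calling stop() from inside an event handler is acceptable; stopAfterRequests is a hypothetical helper and the limit is arbitrary.

```javascript
// Hypothetical helper (not part of Htcrawl): asks the crawler to stop once
// `limit` xhr events have been seen.
function stopAfterRequests(crawler, limit) {
  let seen = 0;
  crawler.on("xhr", () => {
    seen += 1;
    if (seen === limit) {
      crawler.stop();
    }
    // Nothing is returned, so the request itself is not cancelled.
  });
}

// Possible usage, with a crawler obtained from htcap.launch():
//   stopAfterRequests(crawler, 50);
//   crawler.start().then(() => crawler.browser().close());
```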
crawler.navigate(url)
Navigates to url. Resolves when the page is loaded.
Returns: <Promise>
crawler.reload()
Reloads the current page. Resolves when the page is loaded.
Returns: <Promise>
crawler.clickToNavigate(selector, timeout)
Clicks on selector and waits up to timeout milliseconds for the navigation to start. Resolves when the page is loaded.
Returns: <Promise>
crawler.waitForRequestsCompletion()
Waits for pending XHR, JSONP, and fetch requests. Resolves when all requests have completed.
Returns: <Promise>
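The navigation methods above can be chained, since each returns a promise. The following is a minimal sketch of one such sequence; openSection is a hypothetical helper, and the selector and timeout are placeholders.

```javascript
// Hypothetical flow (not part of Htcrawl): load the target page, click an
// element that triggers a navigation, then wait for background requests.
async function openSection(crawler, linkSelector, timeoutMs = 500) {
  await crawler.load();
  await crawler.clickToNavigate(linkSelector, timeoutMs);
  await crawler.waitForRequestsCompletion();
  return crawler;
}

// Possible usage, with a crawler obtained from htcap.launch():
//   openSection(crawler, "#settings-link", 1000)
//     .then(c => c.browser().close());
```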
crawler.browser()
Returns Puppeteer's Browser instance.
crawler.page()
Returns Puppeteer's Page instance.
crawler.newPage(url)
Creates a new browser page (a new tab). If url is provided, the new page will navigate to that URL when load() or start() is called.
crawler.on(event, function)
event<string> Event name
function<function(Object, Crawler)> A function that will be called with two arguments:
    eventObject<Object> Object containing the event name and parameters
        name<string> Event name
        params<Object> Event parameters
    crawler<Object> Crawler instance.
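To show how the two handler arguments are typically used, the sketch below registers one handler for several request events and records what it sees. collectRequests is a hypothetical helper; it relies only on the name and params fields described above.

```javascript
// Hypothetical helper (not part of Htcrawl): records the event name and
// request URL for every xhr, fetch and jsonp event.
function collectRequests(crawler) {
  const collected = [];
  for (const eventName of ["xhr", "fetch", "jsonp"]) {
    crawler.on(eventName, (e) => {
      collected.push({ event: e.name, url: e.params.request.url });
      // Nothing is returned, so the request proceeds as usual.
    });
  }
  return collected;
}

// Possible usage, with a crawler obtained from htcap.launch():
//   const seen = collectRequests(crawler);
//   crawler.start().then(() => console.log(seen));
```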
Events
The following events are emitted during crawling. Some events can be cancelled by returning false.
start
Emitted when Htcrawl starts.
Cancellable: False
Parameters: None
pageInitialized
Emitted when the page is initialized and all requests are completed.
Cancellable: False
Parameters: None
xhr
Emitted before sending an ajax request.
Cancellable: True
Parameters:
request<Object> Instance of Request class
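Because xhr is cancellable, a handler can return false to block a request. The sketch below keeps ajax traffic within a set of allowed hosts; makeScopeFilter is a hypothetical helper, the host names are placeholders, and it is an assumption that request.url is usually an absolute URL.

```javascript
// Hypothetical helper (not part of Htcrawl): builds a handler that returns
// false (cancel) for requests whose host is not in allowedHosts.
function makeScopeFilter(allowedHosts) {
  const allowed = new Set(allowedHosts);
  return (e) => {
    let host;
    try {
      host = new URL(e.params.request.url).hostname;
    } catch {
      return undefined; // not an absolute URL: leave the request alone
    }
    return allowed.has(host) ? undefined : false;
  };
}

// Possible usage, with a crawler obtained from htcap.launch():
//   crawler.on("xhr", makeScopeFilter(["example.test"]));
```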
xhrcompleted
Emitted when an ajax request is completed.
Cancellable: False
Parameters:
request<Object> Instance of Request class
response<string> Response text
timedout<boolean> Whether the request is timed out
fetch
Emitted before sending a fetch request.
Cancellable: True
Parameters:
request<Object> Instance of Request class
fetchcompleted
Emitted when a fetch request is completed.
Cancellable: False
Parameters:
request<Object> Instance of Request class
timedout<boolean> Whether the request is timed out
jsonp
Emitted before sending a jsonp request.
Cancellable: True
Parameters:
request<Object> Instance of Request class
jsonpcompleted
Emitted when a jsonp request is completed.
Cancellable: False
Parameters:
request<Object> Instance of Request class
scriptElement<string> CSS selector of the added script element
timedout<boolean> Whether the request is timed out
websocket
Emitted before opening a websocket connection.
Cancellable: False
Parameters:
request<Object> Instance of Request class
websocketmessage
Emitted when a message is received from a websocket.
Cancellable: False
Parameters:
request<Object> Instance of Request class
message<string> Websocket message
websocketsend
Emitted before sending a message to a websocket.
Cancellable: True
Parameters:
request<Object> Instance of Request class
message<string> Websocket message
formsubmit
Emitted before submitting a form.
Cancellable: False
Parameters:
request<Object> Instance of Request class
form<string> CSS selector of the form element.
fillinput
Emitted before filling an input element.
Cancellable: True
Parameters:
element<string> CSS selector of the input element
Example:
// Set a custom value to input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
    await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
    return false;
});
newdom
Emitted when new DOM content is added to the page.
Cancellable: False
Parameters:
rootNode<string> CSS selector of the root element
trigger<string> CSS selector of the element that triggered the DOM modification
Example:
// Find links within the newly added content
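crawler.on("newdom", async (e, crawler) => {
    const selector = e.params.rootNode + " a";
    // $$eval runs inside the page, so return the hrefs and log them from Node
    const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href));
    for (const href of hrefs)
        console.log(href);
});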
navigation
Emitted when the browser tries to navigate outside the current page.
Cancellable: False
Parameters:
request<Object> Instance of Request class
domcontentloaded
Emitted when the DOM is loaded for the first time (on page load). This event must be registered before calling load().
Cancellable: False
Parameters: None
redirect
Emitted when a redirect is requested.
Cancellable: True
Parameters:
url<string> Redirect URL
earlydetach
Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:
node<string> CSS selector of the detached element
triggerevent
Emitted before triggering an event. This event is available only after start().
Cancellable: True
Parameters:
node<string> CSS selector of the element
event<string> Event name
eventtriggered
Emitted after an event has been triggered. This event is available only after start().
Cancellable: False
Parameters:
node<string> CSS selector of the element
event<string> Event name
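Since triggerevent is cancellable, it can be used to skip certain event types during a crawl. The sketch below is one way to do that; makeEventFilter is a hypothetical helper, and it is an assumption that the event parameter carries ordinary DOM event names such as "mouseover".

```javascript
// Hypothetical helper (not part of Htcrawl): builds a triggerevent handler
// that returns false (cancel) for the listed event names.
function makeEventFilter(ignoredEvents) {
  const ignored = new Set(ignoredEvents);
  return (e) => (ignored.has(e.params.event) ? false : undefined);
}

// Possible usage, with a crawler obtained from htcap.launch():
//   crawler.on("triggerevent", makeEventFilter(["mouseover"]));
//   crawler.on("eventtriggered", e => {
//     console.log(e.params.event + " triggered on " + e.params.node);
//   });
```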
Object: Request
Object used to hold information about a request.
type<string> Type of request. It can be: link, xhr, fetch, websocket, jsonp, form, redirect
method<string> HTTP method
url<string> URL
data<string> Request body (usually POST data)
trigger<string> CSS selector of the HTML element that triggered the request
extra_headers<Object> Extra HTTP headers
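For logging, the fields of a Request can be folded into a single line. The sketch below is one possible format; describeRequest is a hypothetical helper, and falling back to GET when method is empty is an assumption made only for display.

```javascript
// Hypothetical helper (not part of Htcrawl): renders a Request-like object
// as a one-line summary using the fields documented above.
function describeRequest(request) {
  const method = (request.method || "GET").toUpperCase(); // GET fallback is an assumption
  const body = request.data ? " body=" + request.data.length + " chars" : "";
  return "[" + request.type + "] " + method + " " + request.url + body;
}

// Possible usage, with a crawler obtained from htcap.launch():
//   crawler.on("xhr", e => console.log(describeRequest(e.params.request)));
```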