Introduction

Htcrawl is a Node.js module for recursively crawling a single page application (SPA) using JavaScript.

Class: Crawler

The following is a typical example of using Htcrawl to crawl a page:

// Get instance of Crawler class
htcap.launch(targetUrl, options).then(crawler => {

  // Print out the url of ajax calls
  crawler.on("xhr", e => {
    console.log("XHR to " + e.params.request.url);
  });

  // Start crawling!
  crawler.start().then(() => crawler.browser().close());
});

htcap.launch(targetUrl, [options])

crawler.start()

Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>
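
Because `crawler.start()` returns a promise, the same flow can also be written with async/await. The following is a minimal sketch rather than reference usage: `crawlAndCollect` is a hypothetical helper, and the launcher object (`htcap` in the example above) is passed in as an argument so that the sketch makes no assumption about how the module is imported. It relies only on calls described on this page (`launch`, `on`, `start`, `browser()`).

```javascript
// Hypothetical helper: crawls targetUrl and returns the URLs of the ajax calls seen.
// `htcap` is the object exposing launch(); it is injected rather than required here.
async function crawlAndCollect(htcap, targetUrl, options = {}) {
  const crawler = await htcap.launch(targetUrl, options);
  const xhrUrls = [];

  // Record the url of each ajax call; the handler never returns false,
  // so no request is cancelled
  crawler.on("xhr", e => {
    xhrUrls.push(e.params.request.url);
  });

  try {
    // Resolves when the crawling is finished
    await crawler.start();
  } finally {
    // Close the browser even if the crawl fails
    await crawler.browser().close();
  }
  return xhrUrls;
}
```

The `try`/`finally` is a design choice of this sketch: it keeps a failed crawl from leaving a browser process running.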

crawler.browser()

Returns Puppeteer's Browser instance.

crawler.page()

Returns Puppeteer's Page instance.
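
Since the returned object is a regular Puppeteer Page, its usual methods are available. As a small sketch, the hypothetical helper `describePage` below reads the current URL and title through Puppeteer's `page.url()` and `page.title()`. It assumes `crawler` was obtained from `htcap.launch()` and that a page has already been loaded.

```javascript
// Hypothetical helper: reads basic facts about the page being crawled.
async function describePage(crawler) {
  const page = crawler.page();
  return {
    url: page.url(),           // Puppeteer: page.url() returns a string
    title: await page.title()  // Puppeteer: page.title() returns a Promise<string>
  };
}

// Usage, assuming `crawler` comes from htcap.launch():
// describePage(crawler).then(info => console.log(info.url, info.title));
```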

crawler.on(event, function)
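
Judging from the examples on this page, `crawler.on` registers `function` as the handler for the named event. The sketch below uses it to count how often each event fires. `countEvents` is a hypothetical helper, not part of Htcrawl, and it deliberately ignores the event parameters, so it does not depend on their shape.

```javascript
// Hypothetical helper: registers one counting handler per event name.
function countEvents(crawler, eventNames) {
  const counts = {};
  for (const name of eventNames) {
    counts[name] = 0;
    crawler.on(name, () => {
      counts[name] += 1; // never returns false, so nothing is cancelled
    });
  }
  return counts;
}

// Usage, assuming `crawler` comes from htcap.launch():
// const counts = countEvents(crawler, ["xhr", "fetch", "jsonp"]);
// crawler.start().then(() => console.log(counts));
```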

Events

The following events are emitted during crawling. Some events can be cancelled by returning false.
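
To illustrate cancelling, the sketch below builds an `xhr` handler that returns false for requests leaving a given origin. `blockOutsideOrigin` and `allowedOrigin` are hypothetical names. The only things taken from this page are the `xhr` event, the `e.params.request.url` field used in the crawling example, and the return-false convention. Whether the URL can be relative is not stated here, so the sketch resolves it against the allowed origin to be safe.

```javascript
// Hypothetical helper: returns a handler that cancels requests to other origins.
function blockOutsideOrigin(allowedOrigin) {
  // Normalise the input, e.g. "https://example.test/" -> "https://example.test"
  const allowed = new URL(allowedOrigin).origin;
  return e => {
    // Resolve relative URLs against the allowed origin
    const url = new URL(e.params.request.url, allowed);
    if (url.origin !== allowed) {
      return false; // cancel the event
    }
    // no return value: the event is not cancelled
  };
}

// Usage, assuming `crawler` comes from htcap.launch():
// crawler.on("xhr", blockOutsideOrigin("https://example.test"));
```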

start

Emitted when Htcrawl starts.
Cancellable: False
Parameters: None

xhr

Emitted before sending an ajax request.
Cancellable: True
Parameters:

xhrcompleted

Emitted when an ajax request is completed.
Cancellable: False
Parameters:

fetch

Emitted before sending a fetch request.
Cancellable: True
Parameters:

fetchcompleted

Emitted when a fetch request is completed.
Cancellable: False
Parameters:

jsonp

Emitted before sending a jsonp request.
Cancellable: True
Parameters:

jsonpcompleted

Emitted when a jsonp request is completed.
Cancellable: False
Parameters:

websocket

Emitted before opening a websocket connection.
Cancellable: False
Parameters:

websocketmessage

Emitted when a message is received from a websocket.
Cancellable: False
Parameters:

websocketsend

Emitted before sending a message to a websocket.
Cancellable: True
Parameters:

formsubmit

Emitted before submitting a form.
Cancellable: False
Parameters:

fillinput

Emitted before filling an input element.
Cancellable: True
Parameters:

Example:

// Set a custom value to input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
  await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
  return false;
});

newdom

Emitted when new DOM content is added to the page.
Cancellable: False
Parameters:

Example:

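// Find links within the newly added content
crawler.on("newdom", async (e, crawler) => {
  const selector = e.params.rootNode + " a";
  // The callback runs inside the page, so return the hrefs and log them in Node
  const hrefs = await crawler.page().$$eval(selector, links => links.map(link => link.href));
  for (const href of hrefs)
    console.log(href);
});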

Emitted when the browser tries to navigate outside the current page.
Cancellable: True
Parameters:

domcontentloaded

Emitted when the DOM is loaded for the first time (on page load).
Cancellable: False
Parameters: None

redirect

Emitted when a redirect is requested.
Cancellable: True
Parameters:

earlydetach

Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:

triggerevent

Emitted before triggering an event.
Cancellable: True
Parameters:

eventtriggered

Emitted after an event has been triggered.
Cancellable: False
Parameters:

Object: Request

Object used to hold information about a request.