Introduction

Htcrawl is a Node.js module for recursively crawling a single page application (SPA) using JavaScript.

Class: Crawler

The following is a typical example of using Htcrawl to crawl a page:

// Get instance of Crawler class
htcap.launch(targetUrl, options).then(crawler => {

  // Print out the url of ajax calls
  crawler.on("xhr", e => {
    console.log("XHR to " + e.params.request.url);
  });

  // Start crawling!
  crawler.start().then(() => crawler.browser().close());
});

htcap.launch(targetUrl, [options])

crawler.load()

Loads targetUrl. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>

crawler.start()

Loads targetUrl and starts crawling. Resolves when the crawling is finished.
Returns: <Promise<Crawler>>
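
As a rough sketch of how start() might be combined with an event handler, the helper below collects the URL of every XHR seen during a crawl and closes the browser afterwards. The name collectXhrUrls is hypothetical and not part of Htcrawl; the sketch assumes crawler is an already launched Crawler instance and relies only on on(), start(), browser() and e.params.request.url as described on this page.

```javascript
// collectXhrUrls is a hypothetical helper, not part of Htcrawl.
async function collectXhrUrls(crawler) {
  // A Set keeps each URL once, in the order it was first seen.
  const urls = new Set();

  // Record the URL of every ajax call; returning nothing leaves it uncancelled.
  crawler.on("xhr", e => {
    urls.add(e.params.request.url);
  });

  try {
    // Resolves when the crawling is finished.
    await crawler.start();
  } finally {
    // Close Puppeteer's Browser even if the crawl failed.
    await crawler.browser().close();
  }

  return [...urls];
}
```

Closing the browser in a finally block is a design choice of this sketch, so that a failed crawl does not leave a browser process behind.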

crawler.stop()

Requests that the crawling stop, which makes start() resolve "immediately".
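
One possible use of stop() is to end a crawl once a request budget has been reached. The sketch below is illustrative only: stopAfterXhrCount and maxRequests are hypothetical names, and it assumes that calling stop() from inside an xhr handler is acceptable.

```javascript
// stopAfterXhrCount is a hypothetical helper, not part of Htcrawl.
function stopAfterXhrCount(crawler, maxRequests) {
  let seen = 0;

  crawler.on("xhr", () => {
    seen += 1;
    // Ask the crawler to stop exactly once, when the budget is reached.
    if (seen === maxRequests) {
      crawler.stop();
    }
    // No return value: the request itself is not cancelled.
  });
}
```

After stopAfterXhrCount(crawler, 50), the promise returned by crawler.start() would be expected to resolve shortly after the fiftieth ajax request.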

crawler.navigate(url)

Navigates to url. Resolves when the page is loaded.
Returns: <Promise>

crawler.reload()

Reloads the current page. Resolves when the page is loaded.
Returns: <Promise>

crawler.clickToNavigate(selector, timeout)

Clicks on selector and waits timeout milliseconds for the navigation to start. Resolves when the page is loaded.
Returns: <Promise>

crawler.waitForRequestsCompletion()

Waits for XHR, JSONP, and fetch requests to complete. Resolves when all requests have been performed.
Returns: <Promise>
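
The navigation methods above can be chained to drive a page by hand instead of calling start(). The following is a minimal sketch under stated assumptions: clickAndSettle is a hypothetical helper name, crawler is an already launched Crawler instance, and url() is the method of Puppeteer's Page that returns the current URL.

```javascript
// clickAndSettle is a hypothetical helper, not part of Htcrawl.
async function clickAndSettle(crawler, selector, timeout) {
  // Load targetUrl first.
  await crawler.load();

  // Click the element and wait for the resulting page to load.
  await crawler.clickToNavigate(selector, timeout);

  // Let pending XHR, JSONP and fetch requests finish.
  await crawler.waitForRequestsCompletion();

  // Report where the browser ended up.
  return crawler.page().url();
}
```

It might be called as clickAndSettle(crawler, "#login", 500), where the selector and the timeout are placeholder values.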

crawler.browser()

Returns Puppeteer's Browser instance.

crawler.page()

Returns Puppeteer's Page instance.

crawler.newPage(url)

Creates a new browser page (a new tab). If url is provided, the new page will navigate to that URL when load() or start() is called.

crawler.on(event, function)

Events

The following events are emitted during crawling. Some events can be cancelled by returning false.
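
As a minimal sketch of cancellation, the helper below registers an xhr handler that returns false for URLs matching a pattern and lets every other request through. The names cancelXhrMatching and pattern are hypothetical and not part of Htcrawl; only crawler.on(), the xhr event and e.params.request.url are taken from this page.

```javascript
// cancelXhrMatching is a hypothetical helper, not part of Htcrawl.
// pattern should be a RegExp without the "g" flag, since test() is stateful with it.
function cancelXhrMatching(crawler, pattern) {
  crawler.on("xhr", e => {
    const url = e.params.request.url;
    if (pattern.test(url)) {
      console.log("Cancelled XHR to " + url);
      // Returning false cancels a cancellable event.
      return false;
    }
    // Falling through returns undefined, so the request proceeds.
  });
}
```

For example, cancelXhrMatching(crawler, /\/logout/) would be one way to keep the crawler from logging itself out through an ajax call.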

start

Emitted when Htcrawl starts.
Cancellable: False
Parameters: None

pageInitialized

Emitted when the page is initialized and all requests are completed.
Cancellable: False
Parameters: None

xhr

Emitted before sending an ajax request.
Cancellable: True
Parameters:

xhrcompleted

Emitted when an ajax request is completed.
Cancellable: False
Parameters:

fetch

Emitted before sending a fetch request.
Cancellable: True
Parameters:

fetchcompleted

Emitted when a fetch request is completed.
Cancellable: False
Parameters:

jsonp

Emitted before sending a jsonp request.
Cancellable: True
Parameters:

jsonpcompleted

Emitted when a jsonp request is completed.
Cancellable: False
Parameters:

websocket

Emitted before opening a websocket connection.
Cancellable: False
Parameters:

websocketmessage

Emitted before sending a websocket request.
Cancellable: False
Parameters:

websocketsend

Emitted before sending a message to a websocket.
Cancellable: True
Parameters:

formsubmit

Emitted before submitting a form.
Cancellable: False
Parameters:

fillinput

Emitted before filling an input element.
Cancellable: True
Parameters:

Example:

// Set a custom value to input field and prevent auto-filling
crawler.on("fillinput", async (e, crawler) => {
  await crawler.page().$eval(e.params.element, input => input.value = "My Custom Value");
  return false;
});

newdom

Emitted when new DOM content is added to the page.
Cancellable: False
Parameters:

Example:

// Find links within the newly added content
crawler.on("newdom", (e, crawler) => {
  const selector = e.params.rootNode + " a";
  crawler.page().$$eval(selector, links => {
    for(let link of links)
      console.log(link);
  });
});

navigation

Emitted when the browser tries to navigate outside the current page.
Cancellable: False
Parameters:

domcontentloaded

Emitted when the DOM is loaded for the first time (on page load). This event must be registered before load().
Cancellable: False
Parameters: None
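
Because this event has to be registered before load(), the order of the two calls matters. The sketch below shows one way to keep that order explicit; loadWithDomReadyHook and onDomReady are hypothetical names, and crawler is assumed to be an already launched Crawler instance.

```javascript
// loadWithDomReadyHook is a hypothetical helper, not part of Htcrawl.
async function loadWithDomReadyHook(crawler, onDomReady) {
  // Register the handler first so it is in place when the page loads.
  crawler.on("domcontentloaded", onDomReady);

  // Only then load targetUrl.
  await crawler.load();

  return crawler;
}
```

It might be called as loadWithDomReadyHook(crawler, () => console.log("DOM ready")).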

redirect

Emitted when a redirect is requested.
Cancellable: True
Parameters:

earlydetach

Emitted when an element is detached before it has been analyzed.
Cancellable: False
Parameters:

triggerevent

Emitted before triggering an event. This event is available only after start().
Cancellable: True
Parameters:

eventtriggered

Emitted after an event has been triggered. This event is available only after start().
Cancellable: False
Parameters:

Object: Request

Object used to hold information about a request.