Web Crawling with Node.js and cheerio

If you are a web developer and want to get started with web crawling without any experience in a language like Python, it is a good idea to use Node.js and build on your existing knowledge of JavaScript. To start web crawling you don’t need anything more than an installed Node environment and access to a shell. If you don’t have Node installed yet, you can download the installer here: https://nodejs.org/en/download/

After you have installed Node, create a new directory. Open your terminal and change your current working directory to the directory you just created. Now execute the command npm init, which creates a new Node project by generating a package.json file. Just fill in the requested data (all fields are optional, so you can simply press ENTER). For our first crawling project we need to install cheerio for parsing and working with HTML data, and axios for making HTTP requests to the site:

npm install axios cheerio
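
After the install finishes, npm records both packages in the dependencies section of your package.json, roughly like this (the exact version numbers depend on when you run the install):

{
  "dependencies": {
    "axios": "^1.6.0",
    "cheerio": "^1.0.0"
  }
}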

Create a new file called crawler.js; this is where our web crawler will live. We start by importing the dependencies and declaring an empty class.

const axios = require('axios')
const cheerio = require('cheerio')

class WikipediaPageCrawler {}

In the first step, we have to load the HTML data from a Wikipedia page. To do this, we request a URL with axios and store the loaded HTML in a class attribute. But we also have to think about how to get the URL into the class. That’s why we use a constructor method, so we can create an object from the class with const crawler = new WikipediaPageCrawler(url).

const axios = require('axios')
const cheerio = require('cheerio')

class WikipediaPageCrawler {
  constructor(url) {
    this.url = url
  }

  async loadHTMLFromURL() {
    const response = await axios.get(this.url)
    return response.data
  }

  async loadHtmlData() {
    const responseHtml = await this.loadHTMLFromURL()
    this.parsedHtml = cheerio.load(responseHtml)
  }
}
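
A quick aside: axios.get rejects its promise on network failures and non-2xx responses, so in a real crawler you may want to guard the request. A minimal sketch of how loadHTMLFromURL could handle that (the wrapped error message is just an example):

async loadHTMLFromURL() {
  try {
    const response = await axios.get(this.url)
    return response.data
  } catch (error) {
    // Re-throw with context about which URL failed
    throw new Error(`Could not load ${this.url}: ${error.message}`)
  }
}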

Retrieving the page name

The function cheerio.load() creates a new object that can be queried and manipulated in jQuery style. So let’s add a new method to get the name of the Wikipedia page. The main heading of every Wikipedia page has the identifier firstHeading. We can use this identifier to select the HTML element just like we would in jQuery:

this.parsedHtml('#firstHeading')
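
On a Wikipedia article this selector matches the page’s main <h1> element, which looks roughly like this (markup simplified):

<h1 id="firstHeading" class="firstHeading">Web crawler</h1>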

To get the text, we just need to call the text() method on the selection, and that’s it.

class WikipediaPageCrawler {
  constructor(url) {
    this.url = url
  }

  async loadHTMLFromURL() {
    const response = await axios.get(this.url)
    return response.data
  }

  async loadHtmlData() {
    const responseHtml = await this.loadHTMLFromURL()
    this.parsedHtml = cheerio.load(responseHtml)
  }

  getPageHeader() {
    return this.parsedHtml('#firstHeading').text()
  }
}

Getting the TOC from a Wikipedia page

In the next step we want to get the table of contents. We can select all top-level items of the TOC with the selector .toclevel-1. But then we have to iterate over every matched element and transform the result into a list of all entries:

this.parsedHtml('.toclevel-1')
  .map((idx, element) => {
    return this.parsedHtml(element).text()
  })
  .get()
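
The map() callback receives the index and the raw element, and the final get() converts cheerio’s internal collection into a plain JavaScript array of strings.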

This would work, but every entry would also contain the text of its child entries, separated by \n characters. Fortunately there is an easy way to get rid of them: split the text at each newline into a list and keep only the first element, which is the parent entry:

element.split('\n')[0]
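
For illustration, here is what the split does to a hypothetical TOC entry (the headings are made up, not taken from a real article):

// Hypothetical entry text; child entries are separated by newlines
const entry = '1 History\n1.1 Early crawlers\n1.2 Modern crawlers'
console.log(entry.split('\n')[0]) // prints '1 History'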

If we add all this as class methods to our crawler, the result would look like this:

class WikipediaPageCrawler {
  constructor(url) {
    this.url = url
  }

  // Fetch the raw HTML of the page via axios
  async loadHTMLFromURL() {
    const response = await axios.get(this.url)
    return response.data
  }

  // Parse the fetched HTML and keep the cheerio object on the instance
  async loadHtmlData() {
    const responseHtml = await this.loadHTMLFromURL()
    this.parsedHtml = cheerio.load(responseHtml)
  }

  // Text of the main page heading
  getPageHeader() {
    return this.parsedHtml('#firstHeading').text()
  }

  // Drop the child entries that follow the first newline
  getOnlyParentTextFromTOC(element) {
    return element.split('\n')[0]
  }

  // All top-level TOC entries as an array of strings
  getPageTOC() {
    return this.parsedHtml('.toclevel-1')
      .map((idx, element) => {
        return this.getOnlyParentTextFromTOC(this.parsedHtml(element).text())
      })
      .get()
  }
}

Using the Crawler

Because we are using async/await, we cannot simply call the object methods at the top level. We have to wrap the calls in another async function. Example:

async function crawling() {
  const crawler = new WikipediaPageCrawler(
    'https://en.wikipedia.org/wiki/Web_crawler'
  )
  await crawler.loadHtmlData()
  console.log(crawler.getPageHeader())
  console.log(crawler.getPageTOC())
}

crawling()
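
Save the file and run node crawler.js from your project directory. For the URL above, the first console.log should print the page title, Web crawler, followed by an array with the TOC entries. Since crawling() is an async function and returns a promise, you can also attach a catch handler to surface any request or parsing errors:

crawling().catch(console.error)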