How to scrape data from a website with JavaScript

Use Node.js and Puppeteer to easily create a reusable tool for crawling, collecting, and scraping data

JavaScript
Node

September 14th, 2020

Introduction

The process of collecting information from a website (or websites) is often referred to as either web scraping or web crawling. Web scraping is the process of scanning a webpage/website and extracting information out of it, whereas web crawling is the process of iteratively finding and fetching web links starting from a URL or list of URLs. While there are differences between the two, you might have heard the two words used interchangeably. Although this article will be a guide on how to scrape information, the lessons learned here can very easily be used for the purposes of 'crawling'.

Hopefully I don't need to spend much time talking about why we would look to scrape data from an online resource, but quite simply, if there is data you want to collect from an online resource, scraping is how we would go about it. And if you would prefer to avoid the rigour of going through each page of a website manually, we now have tools that can automate the process.

I'll also take a moment to add that web scraping sits in a legal grey area. You will be erring on the side of legal if you are collecting data for personal use and it is data that is otherwise freely available. Scraping data that is not otherwise freely available is where things enter murky water. Many websites will also have policies relating to how data can be used, so please bear those policies in mind. With all of that out of the way, let's get into it.

For the purposes of demonstration, I will be scraping my own website and will be downloading a copy of the scraped data. In doing so, we will:

  1. Set up an environment that allows us to watch the automation if we choose to (the alternative is to run this in what is known as a 'headless' browser - more on that later);
  2. Automate the visit to my website;
  3. Traverse the DOM;
  4. Collect pieces of data;
  5. Download pieces of data;
  6. Learn how to handle asynchronous requests;
  7. And my favourite bit: end up with a complete project that we can reuse whenever we want to scrape data.

In order to do all of this, we will be making use of two things: Node.js and Puppeteer. Chances are you have already heard of Node.js before, so we won't go into what that is, but just know that we will be using one built-in Node.js module: FS (File System).

Let's briefly explain what Puppeteer is.

Puppeteer

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Most things that you can do manually in the browser can be done using Puppeteer. The Puppeteer website provides a bunch of examples, such as taking screenshots and generating PDFs of webpages, automating form submission, testing UI, and so on. One thing they don't expressly mention is the concept of data scraping, likely due to the potential legal issues mentioned earlier. But as it states, anything you can do manually in a browser can be done with Puppeteer. Automating those things means that you can do it way, way faster than any human ever could.

This is going to be your new favourite website: https://pptr.dev/. Once you're finished with this article, I'd recommend bookmarking this link as you will want to refer to their API if you plan to do any super advanced things.

Installation

If you don't already have Node installed, go to https://nodejs.org/en/download/ and install the relevant version for your computer. That will also install something called npm, which is a package manager that allows us to install third-party packages (such as Puppeteer). We will then create a directory and create a package.json by typing npm init inside of the directory. Note: I actually use yarn instead of npm, so feel free to use yarn if that's what you prefer. From here on, we are going to assume that you have a basic understanding of package managers such as npm/yarn and have an understanding of Node environments. Next, go ahead and install Puppeteer by running npm i puppeteer or yarn add puppeteer.

Directory Structure

Okay, so after running npm init/yarn init and installing Puppeteer, we currently have a directory made up of a node_modules folder, a package.json and a package-lock.json. Now we want to create our app with some separation of concerns in mind. So to begin with, we'll create a file in the root of our directory called main.js. main.js will be the file that we execute whenever we want to run our app. In our root, we will then create a folder called api. This api folder will include most of the code our application will be using. Inside of this api folder we will create three files: interface.js, system.js, and utils.js. interface.js will contain any Puppeteer-specific code (things such as opening the browser, navigating to a page, etc.), system.js will contain any Node-specific code (such as saving data to disk, opening files, etc.), and utils.js will contain any reusable bits of JavaScript code that we might create along the way.
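If it helps to visualise, the directory should now look something like this (node_modules contents omitted):

.
├── api
│   ├── interface.js
│   ├── system.js
│   └── utils.js
├── main.js
├── node_modules
├── package.json
└── package-lock.json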

Note: In the end, we didn't make use of utils.js in this tutorial so feel free to remove it if you think your own project will make use of it.

Basic Commands

Okay, now because a lot of the code we will be writing depends on network requests, waiting for responses, etc., most of our Puppeteer code ends up being asynchronous. Because of this, it is common practice to wrap all of your executing code inside of an async IIFE. If you're unsure what an IIFE is, it's basically a function that executes immediately after its creation. For more info, here's an article I wrote about IIFEs. To make our IIFE asynchronous, we just add the async keyword to the beginning of it, like so:

(async () => {})();

Right, so we've set up our async IIFE, but so far we have nothing to run in there. Let's fix that by enabling our ability to open a browser with Puppeteer. Let's open api/interface.js and begin by creating an object called interface. We will also want to export this object. Therefore, our initial boilerplate code inside of api/interface.js will look like this:

const interface = {};

module.exports = interface;

As we are going to be using Puppeteer, we'll need to import it. Therefore, we'll require() it at the top of our file by writing const puppeteer = require("puppeteer");. Inside of our interface object, we will create a function called async init(). As mentioned earlier, a lot of our code is going to be asynchronous. Because we want to open a browser, that may take a few seconds, and we will also want to save some information into variables off the back of this. Therefore, we'll need to make this function asynchronous so that our variables get the responses assigned to them. There are two pieces of data that will come from our init() function that we are going to want to store in variables inside of our interface object. Because of this, let's go ahead and create two key:value pairings inside of our interface object, like so:

const interface = {
  browser: null,
  page: null,
};

module.exports = interface;

Now that we have those set up, let's write a try/catch block inside of our init() function. For the catch part, we'll simply console.log out our error. If you'd like to handle this another way, by all means go ahead - the important bits here are what we will be putting inside of the try part. We will first set this.browser to await puppeteer.launch(). As you may expect, this simply launches a browser. The launch() function can accept an optional object where you can pass in many different options. We will leave it as is for the moment but we will return to this in a little while. Next we will set this.page to await this.browser.newPage(). As you may imagine, this will open a tab in the puppeteer browser. So far, this gives us the following code:

const puppeteer = require("puppeteer");

const interface = {
  browser: null,
  page: null,

  async init() {
    try {
      this.browser = await puppeteer.launch();
      this.page = await this.browser.newPage();
    } catch (err) {
      console.log(err);
    }
  },
};
module.exports = interface;

We're also going to add two more functions into our interface object. The first is a visitPage() function which we will use to navigate to certain pages. You will see below that it accepts a url param, which will be the full URL that we want to visit. The second is a close() function which will kill the browser session. These two functions look like this:

async visitPage(url) {
  await this.page.goto(url);
},

async close() {
  await this.browser.close();
},

Now before we try to run any code, let's add some arguments into the puppeteer.launch() function that sits inside of our init() function. As mentioned before, launch() accepts an object as its argument. So let's write the following: puppeteer.launch({ headless: false }). This will mean that when we do try to run our code, a browser will open and we will be able to see what is happening. This is great for debugging purposes as it allows us to see what is going on in front of our very eyes. As an aside, the default option here is headless: true, and I would strongly advise that you keep it set to true if you plan to run anything in production, as your code will use less memory and will run faster - some environments, such as a cloud function, will also have to be headless. Anyway, this gives us this.browser = await puppeteer.launch({ headless: false }). There's also an args: [] key which takes an array as its value. Here we can add certain things such as use of proxy IPs, incognito mode, etc. Finally, there's a slowMo key that we can pass in to our object, which we can use to slow down the speed of our Puppeteer interactions. There are many other options available, but these are the ones that I wanted to introduce to you so far. So this is what our init() function looks like for now (use of incognito and slowMo have been commented out but left in to provide a visual aid):

async init() {
  try {
    this.browser = await puppeteer.launch({
      args: [
        // "--incognito",
      ],
      headless: false,
      // slowMo: 250,
    });
    this.page = await this.browser.newPage();
  } catch (err) {
    console.log(err);
  }
},

There's one other line of code we are going to add, which is await this.page.setViewport({ width: 1279, height: 768 });. This isn't necessary, but I wanted to include the option of setting the viewport so that when you watch what is going on, the browser width and height look a bit more normal. Feel free to adjust the width and height to be whatever you want them to be (I've set mine based on the screen size for a 13" Macbook Pro). You'll notice in the code block below that this setViewport function sits below the this.page assignment. This is important because you have to set this.page before you can set its viewport.

So now if we put everything together, this is how our interface.js file looks:

const puppeteer = require("puppeteer");

const interface = {
  browser: null,
  page: null,
  async init() {
    try {
      this.browser = await puppeteer.launch({
        args: [
          // `--proxy-server=http=${randProxy}`,
          // "--incognito",
        ],
        headless: false,
        // slowMo: 250,
      });
      this.page = await this.browser.newPage();
      await this.page.setViewport({ width: 1279, height: 768 });
    } catch (err) {
      console.log(err);
    }
  },
  async visitPage(url) {
    await this.page.goto(url);
  },
  async close() {
    await this.browser.close();
  },
};

module.exports = interface;

Now, let's move back to our main.js file in the root of our directory and put some of the code we have just written to use. Add the following code so that your main.js file now looks like this:

const interface = require("./api/interface");

(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
})();

Now go to your command line, navigate to the directory for your project and type node main.js. Providing everything has worked okay, your application will proceed to load up a browser and navigate to sunilsandhu.com (or any other website if you happened to put something else in). Pretty neat! Now during the process of writing this piece, I actually encountered an error while trying to execute this code. The error said something along the lines of Error: Could not find browser revision 782078. Run "PUPPETEER_PRODUCT=firefox npm install" or "PUPPETEER_PRODUCT=firefox yarn install" to download a supported Firefox browser binary. This seemed quite strange to me, as I was not trying to use Firefox and had not encountered this issue when using the same code for a previous project. It turns out that when installing Puppeteer, it hadn't downloaded a local version of Chrome to use from within the node_modules folder. I'm not entirely sure what caused this issue (it may have been because I was hotspotting off of my phone at the time), but I managed to fix it by simply copying over the missing files from another project I had that was using the same version of Puppeteer. If you encounter a similar issue, please let me know - I'd be curious to hear more.

Advanced Commands

Okay, so we've managed to navigate to a page, but how do we gather data from the page? This bit may look a bit confusing, so be ready to pay attention! We're going to create two functions here, one that mimics document.querySelectorAll and another that mimics document.querySelector. The difference here is that our functions will return whichever attribute (or attributes) you ask for from the elements matched by the selector. Both functions actually use querySelector/querySelectorAll under the hood, and if you have used them before, you might wonder why I am asking you to pay attention. The reason is that retrieving attributes from them is not the same as it is when you're traversing the DOM in a browser. Before we talk about how the code works, let's take a look at what our final function looks like:

async querySelectorAllAttributes(selector, attribute) {
  try {
    return await this.page.$$eval(
      selector,
      (elements, attribute) => {
        return elements.map((element) => element[attribute]);
      },
      attribute
    );
  } catch (error) {
    console.log(error);
  }
},

So, we're writing another async function and we'll wrap the contents inside of a try/catch block. To begin with, we will await and return the value from an $$eval function which we have available for execution on our this.page value. Therefore, we're running return await this.page.$$eval(). $$eval is just a wrapper around document.querySelectorAll.

There's also an $eval function available (note that this one only has 1 dollar sign), which is the equivalent for using document.querySelector.

The $eval and $$eval functions accept two main parameters (plus any extra arguments you want to forward to the callback - more on that in a moment). The first is the selector we want to run the query against. So for example, if I want to find div elements, the selector would be 'div'. The second is a function which retrieves specific attributes from the result of the query selection. You will see that we are passing two parameters into this function: the first, elements, is basically just the entire result of the previous query selection; the second is an optional value that we have decided to pass in, this being attribute.

We then map over our query selection and pick out the specific attribute that we passed in as the parameter. You'll also notice that after the closing curly brace we pass in attribute again. This is necessary because when we use $$eval and $eval, the callback is executed in a different environment (the browser) to where the initial code was executed (in Node). When this happens, it loses context. However, we can fix this by passing the value in again at the end, where it gets forwarded to the callback. This is simply a quirk specific to Puppeteer that we have to account for.

As for our function that returns just one attribute, the only difference is that we use $eval and simply return the attribute value rather than mapping over an array of values - there's a sketch of it just below. Okay, so we are now in a position where we are able to query elements and retrieve values. This puts us in a great position to now be able to collect data.
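For reference, here is a minimal sketch of what that single-attribute version could look like. I'm calling it querySelectorAttribute here, but the name (and the exact error handling) is up to you - it sits inside the interface object alongside the others:

async querySelectorAttribute(selector, attribute) {
  try {
    // $eval runs document.querySelector in the browser context
    // and returns the requested attribute from the first match
    return await this.page.$eval(
      selector,
      (element, attribute) => element[attribute],
      attribute
    );
  } catch (error) {
    console.log(error);
  }
},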

So let's go back into our main.js file. I've decided that I would like to collect all of the links from my website. Therefore, I'll use the querySelectorAllAttributes function and will pass in two parameters: "a" for the selector in order to get all of the <a> tags, then "href" for the attribute in order to get the link from each <a> tag. Let's see how that code looks:

const interface = require("./api/interface");
(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
  let links = await interface.querySelectorAllAttributes("a", "href");
  console.log(links);
})();

Let's run node main.js again. If you already have it running from before, type cmd+c/ctrl+c and hit enter to kill the previous session. In the console you should be able to see a list of links retrieved from the website. Tip: What if you wanted to then go and visit each link? Well you could simply write a loop function that takes each value and passes it in to our visitPage function. It might look something like this:

for await (const link of links) {
  await interface.visitPage(link);
}

Saving data

Great, so we are able to visit pages and collect data. Let's take a look at how we can save this data. Note: There are, of course, many options here when it comes to saving data, such as saving to a database. We are, however, going to look at how we would use Node.js to save data locally to our hard drive. If this isn't of interest to you, you can probably skip this section and swap it out for whatever approach you'd prefer to take.

Let's switch gears and go into our empty system.js file. We're just going to create one function. This function will take three parameters, but we are going to make two of them optional. Let's take a look at what our system.js file looks like, then we will review the code:

const fs = require("fs");
const system = {
  async saveFile(data, filePath = Date.now(), fileType = "json") {
    fs.writeFile(`${filePath}.${fileType}`, JSON.stringify(data), function (err) {
      if (err) return console.log(err);
    });
  },
};

module.exports = system;

So the first thing you will notice is that we are requiring an fs module at the top. This is a Node.js-specific module that is available to you as long as you have Node installed on your device. The saveFile function takes the data we want to save, an optional file path (which defaults to a timestamp) and an optional file type (which defaults to json); it then stringifies the data and writes it to disk. We then have our system object, which we export at the bottom - the same process we followed for the interface.js file earlier.
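To tie everything together, here's roughly how main.js could look once we pull the system module in and save the links we collected - the "links" filename is just an example I've picked, and if you leave that argument out the file name will default to a timestamp:

const interface = require("./api/interface");
const system = require("./api/system");

(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
  // Collect every href from the <a> tags on the page
  let links = await interface.querySelectorAllAttributes("a", "href");
  // Write the collected links to links.json in the project root
  await system.saveFile(links, "links");
  // Kill the browser session now that we're done
  await interface.close();
})();

One thing to be aware of: because saveFile uses the callback version of fs.writeFile, awaiting it doesn't actually wait for the write to finish - the file is written in the background and Node will keep the process alive until it's done.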

Conclusion

And there we have it! We have created a new project from scratch that allows you to automate the collection of data from a website. We have gone through each of the steps involved, from initial installation of packages, right up to downloading and saving collected data. You now have a project that allows you to point at any website and collect and download all of the links from it.

Hopefully the methods we have outlined provide you with enough knowledge to be able to adapt the code accordingly (eg, if you want to gather a different HTML tag besides <a> tags).

What will you be using this newfound information for? I'd love to hear, so be sure to reach out to me over Twitter to let me know :)

GitHub

For anyone who is interested in checking out the code used in this article, I have put together a small package called Scrawly that can be found on GitHub. Here's the link: https://github.com/sunil-sandhu/scrawly