Introduction
The process of collecting information from a website (or websites) is often referred to as either web scraping or web crawling. Web scraping is the process of scanning a webpage/website and extracting information out of it, whereas web crawling is the process of iteratively finding and fetching web links starting from a URL or list of URLs. While there are differences between the two, you might have heard the two words used interchangeably. Although this article will be a guide on how to scrape information, the lessons learned here can very easily be used for the purposes of 'crawling'.
Hopefully I don't need to spend much time talking about why we would look to scrape data from an online resource, but quite simply, if there is data you want to collect from an online resource, scraping is how we would go about it. And if you would prefer to avoid the rigour of going through each page of a website manually, we now have tools that can automate the process.
I'll also take a moment to add that the process of web scraping is a legal grey area. You will be erring on the side of legal if you are collecting data for personal use and it is data that is otherwise freely available. Scraping data that is not otherwise freely available is where things enter murky waters. Many websites will also have policies relating to how data can be used, so please bear those policies in mind. With all of that out of the way, let's get into it.
For the purposes of demonstration, I will be scraping my own website and will be downloading a copy of the scraped data. In doing so, we will:
- Set up an environment that allows us to watch the automation if we choose to (the alternative is to run this in what is known as a 'headless' browser - more on that later);
- Automate the visit to my website;
- Traverse the DOM;
- Collect pieces of data;
- Download pieces of data;
- Learn how to handle asynchronous requests;
- And my favourite bit: end up with a complete project that we can reuse whenever we want to scrape data.
In order to do all of this, we will be making use of two things: Node.js and Puppeteer. Chances are you have already heard of Node.js before, so we won't go into what it is, but just know that we will be using one built-in Node.js module: fs (File System).
Let's briefly explain what Puppeteer is.
Puppeteer
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Most things that you can do manually in the browser can be done using Puppeteer. The Puppeteer website provides a bunch of examples, such as taking screenshots and generating PDFs of webpages, automating form submission, testing UI, and so on. One thing they don't expressly mention is the concept of data scraping, likely due to the potential legal issues mentioned earlier. But as it states, anything you can do manually in a browser can be done with Puppeteer. Automating those things means that you can do it way, way faster than any human ever could.
This is going to be your new favourite website: https://pptr.dev/. Once you're finished with this article, I'd recommend bookmarking this link as you will want to refer to their API if you plan to do any super advanced things.
Installation
If you don't already have Node installed, go to https://nodejs.org/en/download/ and install the relevant version for your computer. That will also install something called npm, which is a package manager that allows us to install third-party packages (such as Puppeteer). We will then create a directory and create a package.json by typing npm init inside of the directory.
Note: I actually use yarn instead of npm, so feel free to use yarn if that's what you prefer. From here on, we are going to assume that you have a basic understanding of package managers such as npm/yarn and an understanding of Node environments. Next, go ahead and install Puppeteer by running npm i puppeteer or yarn add puppeteer.
Directory Structure
Okay, so after running npm init/yarn init and installing puppeteer, we currently have a
directory made up of a node_modules folder, a package.json and a package-lock.json. Now
we want to try and create our app with some separation of concerns in mind. So to begin with, we'll
create a file in the root of our directory called main.js. main.js will be the file that we
execute whenever we want to run our app. In our root, we will then create a folder called api.
This api folder will include most of the code our application will be using. Inside of this api
folder we will create three files: interface.js, system.js, and utils.js.
interface.js will contain any puppeteer-specific code (so things such as opening the browser,
navigating to a page etc), system.js will include any node-specific code (such as saving data to
disk, opening files etc), utils.js will include any reusable bits of JavaScript code that we
might create along the way.
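For reference, here's roughly how the directory should be laid out at this point (node_modules contents omitted):
api/
  interface.js
  system.js
  utils.js
node_modules/
main.js
package.json
package-lock.json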
Note: In the end, we didn't make use of utils.js in this tutorial, so feel free to remove it if you don't think your own project will make use of it.
Basic Commands
Okay, now because a lot of the code we will be writing depends on network requests, waiting for responses etc, we tend to write a lot of our Puppeteer code asynchronously. Because of this, it is common practice to wrap all of your executing code inside of an async IIFE. If you're unsure what an IIFE is, it's basically a function that executes immediately after its creation. For more info, here's an article I wrote about IIFEs.
To make our IIFE asynchronous, we just add the async keyword to the beginning of it, like so:
(async () => {})();
Right, so we've set up our async IIFE, but so far we have nothing to run in there. Let's fix that by enabling our ability to open a browser with Puppeteer. Let's open api/interface.js and begin by creating an object called interface. We will also want to export this object. Therefore, our initial boilerplate code inside of api/interface.js will look like this:
const interface = {};
module.exports = interface;
As we are going to be using Puppeteer, we'll need to import it. Therefore, we'll require() it at the top of our file by writing const puppeteer = require("puppeteer");
Inside of our interface object, we will create a function called async init(). As mentioned earlier, a lot of our code is going to be asynchronous. Now because we want to open a browser, that may take a few seconds. We will also want to save some information into variables off the back of this. Therefore, we'll need to make this asynchronous so that our variables get the responses assigned to them. There are two pieces of data that will come from our init() function that we are going to want to store into variables inside of our interface object. Because of this, let's go ahead and create two key:value pairings inside of our interface object, like so:
const interface = {
  browser: null,
  page: null,
};
module.exports = interface;
Now that we have those set up, let's write a try/catch block inside of our init() function. For the catch part, we'll simply console.log out our error. If you'd like to handle this another way, by all means go ahead - the important bits here are what we will be putting inside of the try part. We will first set this.browser to await puppeteer.launch(). As you may expect, this simply launches a browser. The launch() function can accept an optional object where you can pass in many different options. We will leave it as is for the moment but we will return to this in a little while. Next we will set this.page to await this.browser.newPage(). As you may imagine, this will open a tab in the puppeteer browser. So far, this gives us the following code:
const puppeteer = require("puppeteer");
const interface = {
  browser: null,
  page: null,
  async init() {
    try {
      this.browser = await puppeteer.launch();
      this.page = await this.browser.newPage();
    } catch (err) {
      console.log(err);
    }
  },
};
module.exports = interface;
We're also going to add two more functions into our interface object. The first is a visitPage() function which we will use to navigate to certain pages. You will see below that it accepts a url param, which will basically be the full URL that we want to visit. The second is a close() function which will basically kill the browser session. These two functions look like this:
async visitPage(url) {
  await this.page.goto(url);
},
async close() {
  await this.browser.close();
},
Now before we try to run any code, let's add some arguments into the puppeteer.launch() function that sits inside of our init() function. As mentioned before, launch() accepts an object as its argument. So let's write the following: puppeteer.launch({ headless: false }). This will mean that when we do try to run our code, a browser will open and we will be able to see what is happening. This is great for debugging purposes as it allows us to see what is going on in front of our very eyes. As an aside, the default option here is headless: true, and I would strongly advise that you keep this option set to true if you plan to run anything in production, as your code will use less memory and will run faster - some environments, such as cloud functions, will also have to be headless. Anyway, this gives us this.browser = await puppeteer.launch({ headless: false }).
There's also an args: [] key which takes an array as its value. Here we can add certain things such as the use of proxy IPs, incognito mode etc. Finally, there's a slowMo key that we can pass into our object to slow down the speed of our Puppeteer interactions. There are many other options available, but these are the ones that I wanted to introduce to you so far. So this is what our init() function looks like for now (use of incognito and slowMo have been commented out but left in to provide a visual aid):
async init() {
  try {
    this.browser = await puppeteer.launch({
      args: [
        // "--incognito",
      ],
      headless: false,
      // slowMo: 250,
    });
    this.page = await this.browser.newPage();
  } catch (err) {
    console.log(err);
  }
},
There's one other line of code we are going to add, which is await this.page.setViewport({ width: 1279, height: 768 });. This isn't necessary, but I wanted to include the option of setting the viewport so that when you watch what is going on, the browser width and height will seem a bit more normal. Feel free to adjust the width and height to be whatever you want them to be (I've set mine based on the screen size for a 13" Macbook Pro). You'll notice in the code block below that this setViewport call sits below the this.page assignment. This is important because you have to set this.page before you can set its viewport.
So now if we put everything together, this is how our interface.js file looks:
const puppeteer = require("puppeteer");
const interface = {
  browser: null,
  page: null,
  async init() {
    try {
      this.browser = await puppeteer.launch({
        args: [
          // `--proxy-server=http=${randProxy}`,
          // "--incognito",
        ],
        headless: false,
        // slowMo: 250,
      });
      this.page = await this.browser.newPage();
      await this.page.setViewport({ width: 1279, height: 768 });
    } catch (err) {
      console.log(err);
    }
  },
  async visitPage(url) {
    await this.page.goto(url);
  },
  async close() {
    await this.browser.close();
  },
};
module.exports = interface;
Now, let's move back to our main.js file in the root of our directory and use some of the code we have just written. Add the following code so that your main.js file now looks like this:
const interface = require("./api/interface");
(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
})();
Now go to your command line, navigate to the directory for your project and type node main.js. Providing everything has worked okay, your application will proceed to load up a browser and navigate to sunilsandhu.com (or any other website if you happened to put something else in). Pretty neat! Now during the process of writing this piece, I actually encountered an error while trying to execute this code. The error said something along the lines of
Error: Could not find browser revision 782078. Run "PUPPETEER_PRODUCT=firefox npm install" or "PUPPETEER_PRODUCT=firefox yarn install" to download a supported Firefox browser binary.
This seemed quite strange to me as I was not trying to use Firefox and had not encountered this
issue when using the same code for a previous project. It turns out that when installing puppeteer,
it hadn't downloaded a local version of Chrome to use from within the node_modules folder. I'm
not entirely sure what caused this issue (it may have been because I was hotspotting off of my phone
at the time), but managed to fix the issue by simply copying over the missing files from another
project I had that was using the same version of Puppeteer. If you encounter a similar issue, please
let me know and I'd be curious to hear more.
Advanced Commands
Okay, so we've managed to navigate to a page, but how do we gather data from the page? This bit may look a bit confusing, so be ready to pay attention! We're going to create two functions here, one that mimics document.querySelectorAll and another that mimics document.querySelector. The difference here is that our functions will return whatever attribute/attributes from the selector you were looking for. Both functions actually use querySelector/querySelectorAll under the hood, and if you have used them before, you might wonder why I am asking you to pay attention. The reason is that the retrieval of attributes from them is not the same as it is when you're traversing the DOM in a browser. Before we talk about how the code works, let's take a look at what our final function looks like:
async querySelectorAllAttributes(selector, attribute) {
  try {
    return await this.page.$$eval(
      selector,
      (elements, attribute) => {
        return elements.map((element) => element[attribute]);
      },
      attribute
    );
  } catch (error) {
    console.log(error);
  }
},
So, we're writing another async function and we'll wrap the contents inside of a try/catch block. To begin with, we will await and return the value from an $$eval function which we have available for execution on our this.page value. Therefore, we're running return await this.page.$$eval(). $$eval is just a wrapper around document.querySelectorAll. There's also an $eval function available (note that this one only has 1 dollar sign), which is the equivalent of using document.querySelector.
The $eval and $$eval functions accept two parameters. The first is the selector we want to run it against. So for example, if I want to find div elements, the selector would be 'div'. The second is a function which retrieves specific attributes from the result of the query selection. You will see that we are passing in two parameters to this function. The first, elements, is basically just the entire result from the previous query selection. The second is an optional value that we have decided to pass in, this being attribute.
We then map over our query selection and find the specific attribute that we passed in as the parameter. You'll also notice that after the curly brace, we pass in attribute again, which is necessary because when we use $$eval and $eval, the callback executes in a different environment (the browser) to where the initial code was executed (in Node). When this happens, it loses context. However, we can fix this by passing it in at the end. This is simply a quirk specific to Puppeteer that we have to account for.
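To illustrate that quirk, here's a rough sketch of what would not work if we relied on the closure alone - the callback is serialized and evaluated inside the page, where the Node-side attribute variable doesn't exist:
// This version would fail in the browser context, because `attribute`
// is never passed through to the page:
// return await this.page.$$eval(selector, (elements) =>
//   elements.map((element) => element[attribute])
// );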
With regard to our function that simply returns one attribute, the difference between the code is that we simply return the attribute value rather than mapping over an array of values. Okay, so we are now in a position where we are able to query elements and retrieve values. This puts us in a great position to now be able to collect data.
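A minimal sketch of that single-attribute version might look like this (I've called it querySelectorAttribute here, but name it whatever you like); it sits alongside querySelectorAllAttributes inside our interface object:
async querySelectorAttribute(selector, attribute) {
  try {
    // $eval is the single-element counterpart of $$eval, so we return
    // the one attribute value directly instead of mapping over an array.
    return await this.page.$eval(
      selector,
      (element, attribute) => element[attribute],
      attribute
    );
  } catch (error) {
    console.log(error);
  }
},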
So let's go back into our main.js file. I've decided that I would like to collect all of the links from my website. Therefore, I'll use the querySelectorAllAttributes function and will pass in two parameters: "a" for the selector in order to get all of the <a> tags, then "href" for the attribute in order to get the link from each <a> tag. Let's see how that code looks:
const interface = require("./api/interface");
(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
  let links = await interface.querySelectorAllAttributes("a", "href");
  console.log(links);
})();
Let's run node main.js again. If you already have it running from before, type cmd+c/ctrl+c and hit enter to kill the previous session. In the console you should be able to see a list of links retrieved from the website. Tip: What if you wanted to then go and visit each link? Well, you could simply write a loop that takes each value and passes it in to our visitPage function. It might look something like this:
for await (const link of links) {
  await interface.visitPage(link);
}
Saving data
Great, so we are able to visit pages and collect data. Let's take a look at how we can save this data. Note: There are, of course, many options here when it comes to saving data, such as saving to a database. We are, however, going to look at how we would use Node.js to save data locally to our hard drive. If this isn't of interest to you, you can probably skip this section and swap it out for whatever approach you'd prefer to take.
Let's switch gears and go into our empty system.js file. We're just going to create one function. This function will take three parameters, but we are going to make two of them optional. Let's take a look at what our system.js file looks like, then we will review the code:
const fs = require("fs");
const system = {
  async saveFile(data, filePath = Date.now(), fileType = "json") {
    fs.writeFile(`${filePath}.${fileType}`, JSON.stringify(data), function (err) {
      if (err) return console.log(err);
    });
  },
};
module.exports = system;
So the first thing you will notice is that we are requiring the fs module at the top. This is a Node.js-specific module that is available to you as long as you have Node installed on your device. We then have our system object and we are exporting it at the bottom - this is the same process we followed for the interface.js file earlier.
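To tie this together with the code in main.js, we can require system.js, save the collected links, and close the browser once we're done. Something along these lines should work (the "links" file name is just an example - the data will be written to links.json in the project root):
const interface = require("./api/interface");
const system = require("./api/system");
(async () => {
  await interface.init();
  await interface.visitPage("https://sunilsandhu.com");
  let links = await interface.querySelectorAllAttributes("a", "href");
  // Save the collected links to links.json, then close the browser session.
  await system.saveFile(links, "links");
  await interface.close();
})();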
Conclusion
And there we have it! We have created a new project from scratch that allows you to automate the collection of data from a website. We have gone through each of the steps involved, from the initial installation of packages right up to downloading and saving the collected data. You now have a project that allows you to input any website and collect and download all of the links from it.
Hopefully the methods we have outlined provide you with enough knowledge to be able to adapt the code accordingly (e.g., if you want to gather a different HTML tag besides <a> tags).
What will you be using this newfound information for? I'd love to hear, so be sure to reach out to me over Twitter to let me know :)
GitHub
For anyone who is interested in checking out the code used in this article, I have put together a small package called Scrawly that can be found on GitHub. Here's the link: https://github.com/sunil-sandhu/scrawly