Today we are going to learn how to do web scraping with NodeJS and a few npm packages. We will fetch each page from its URL with a GET request, pull out the data we need, and store it in a CSV file.

The codebase is available at Node-WEbScrap

Tools and prerequisites:

  • NodeJS
  • NPM packages

    1. request-promise - Makes HTTP requests to the source URI and returns the response body (it needs the request package installed alongside it)
    2. cheerio - Loads and parses HTML markup so we can query it with selectors
    3. json2csv - Converts JSON data to CSV format
  • Basic knowledge of JavaScript

Let's get started with the project

  • Create a NodeJS project

   $ mkdir node-webscrap
   $ cd node-webscrap
   $ npm init
   $ npm install request-promise request cheerio json2csv
  • Create an index.js file in the root directory of your project

   $ touch index.js
  • Import the required modules in index.js

    const request = require("request-promise")
    const cheerio = require("cheerio")
    const fs = require("fs")
    const json2csv = require("json2csv").Parser;
  • Next, create an array of movie page URLs. I have used Rotten Tomatoes for the movie review pages

   const movies = [
     "https://www.rottentomatoes.com/m/the_last_full_measure",
     "https://www.rottentomatoes.com/m/stray_dolls"
   ];
  • Now create a function that fetches each page, extracts the fields, and writes the CSV file

   const dataRepresent = async () => {
     const rottenTomatoData = [];

     // Fetch the pages one at a time: each await completes before the next request starts
     for (const movie of movies) {
       const response = await request({
         uri: movie,
         headers: {
           "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
           // Only advertise encodings that the gzip option below can decode
           "accept-encoding": "gzip, deflate",
           "accept-language": "en-US,en;q=0.9,es;q=0.8"
         },
         gzip: true, // decompress gzip/deflate responses automatically
       });

       // Parse the HTML and read each field with a jQuery-style selector.
       // .text() returns an empty string when nothing matches, so no extra guard is needed.
       const $ = cheerio.load(response);
       const title = $("h1[class='mop-ratings-wrap__title mop-ratings-wrap__title--top']").text().trim();
       // Note: no space after "#" in an id selector
       const tomatoMeter = $("#tomato_meter_link > .mop-ratings-wrap__percentage").text().trim();
       const audMeter = $(".audience-score > .mop-ratings-wrap__score > .articleLink > .mop-ratings-wrap__percentage").text().trim();
       const summary = $(".mop-ratings-wrap__text").text().trim();

       rottenTomatoData.push({
         title,
         tomatoMeter,
         audMeter,
         summary,
       });
     }

     // Convert the collected records to CSV and write them to disk
     const j2cp = new json2csv();
     const csv = j2cp.parse(rottenTomatoData);
     fs.writeFileSync("./rottenTomatoes.csv", csv, "utf-8");
   };
  • Call the function at the end of the index.js file

    dataRepresent();
  • Run index.js from the command line. You should see the file "rottenTomatoes.csv" generated in the project's root directory

   $ node index.js

So here we loop over the movies array and await each request in turn, which means the pages are fetched one after another. The request-promise npm module takes the uri, the headers, and options such as gzip, and resolves with the raw HTML. Using cheerio we then parse that HTML and pick out the fields we need with jQuery-style selectors.
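
A note on the loop: because each request is awaited inside for...of, the pages are fetched one at a time. The sketch below contrasts that with starting every request at once through Promise.all. It is only an illustration: fakeFetch is a hypothetical stand-in for the real HTTP call, and the URLs are placeholders, so it runs without any network access or npm packages.

```javascript
// Hypothetical stand-in for a network call: resolves with a fake HTML string after a delay.
const fakeFetch = (url, delayMs) =>
  new Promise((resolve) => setTimeout(() => resolve(`<html>${url}</html>`), delayMs));

// Sequential: each request starts only after the previous one has finished.
const fetchSequentially = async (urls) => {
  const pages = [];
  for (const url of urls) {
    pages.push(await fakeFetch(url, 20));
  }
  return pages;
};

// Concurrent: all requests start at once; Promise.all returns results in input order.
const fetchConcurrently = (urls) =>
  Promise.all(urls.map((url) => fakeFetch(url, 20)));

const urls = ["/m/first", "/m/second"]; // placeholder paths, not real pages

fetchSequentially(urls).then((pages) => console.log("sequential:", pages.length));
fetchConcurrently(urls).then((pages) => console.log("concurrent:", pages.length));
```

For a short list like ours, fetching one page at a time is simple and gentle on the site being scraped. The concurrent version finishes sooner on long lists, but it sends all requests at the same moment.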

Then we push each record into the "rottenTomatoData" array and, once the loop has finished, convert the array to CSV with json2csv and write it to a file named "rottenTomatoes.csv" using the fs module that NodeJS provides out of the box.
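
To get a feel for what ends up in the CSV file, here is a small hand-written sketch of the JSON-to-CSV step. It is not how json2csv is implemented, and its output may differ from the library's in details. The toCsv and quote helpers and the sample row are made up for illustration. It assumes every record is a flat object with the same keys, as in our array.

```javascript
// Quote a single value: wrap it in double quotes and double any embedded quotes.
const quote = (value) => `"${String(value).replace(/"/g, '""')}"`;

// Build CSV text from an array of flat objects, using the first object's keys as the header row.
const toCsv = (rows) => {
  if (rows.length === 0) return "";
  const fields = Object.keys(rows[0]);
  const header = fields.map(quote).join(",");
  const lines = rows.map((row) =>
    fields.map((field) => quote(row[field] === undefined ? "" : row[field])).join(",")
  );
  return [header, ...lines].join("\n");
};

// Made-up sample record in the same shape as the scraped data.
const sample = [
  {
    title: "Movie A",
    tomatoMeter: "60%",
    audMeter: "95%",
    summary: 'A "sample" summary, with a comma',
  },
];

console.log(toCsv(sample));
```

Quoting every value is what keeps commas and quotes inside a summary from breaking the columns, which is the main reason to use a library rather than joining strings with commas.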

So that's it for the day. I will be back with more learnings to share with you.

Thanks for reading! Please share this with other folks and keep learning!

This post is also available on DEV.