Crawling the Internet with Puppeteer

Node.js has been around for 8 years, and many crawling techniques have been developed along its releases: from combining the request and jsdom modules, to PhantomJS and higher-level APIs like Horseman. For the past 2 years, a bright new solution has been available.

Puppeteer.

As stated in its README.md:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

And this library is definitely the most powerful one around for crawling. The peak.

So enough words, let's talk code.

Hello HTML

This example is about connecting to a website, filling a text input and submitting the form:

const puppeteer = require('puppeteer')

async function hello() {
  const browser = await puppeteer.launch({
    headless: false
  })

  const page = await browser.newPage()
  await page.setViewport({ width: 1280, height: 1024 })

  await page.goto('https://www.societe.com/', { waitUntil: 'networkidle2' })
  
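  // Fill the form
  await page.type('#input_search', 'keymetrics sas')
  // Submit the form. Start waiting for the navigation before pressing Enter,
  // otherwise a fast navigation can finish before we begin listening for it
  await Promise.all([
    page.waitForNavigation(),
    page.keyboard.press('Enter')
  ])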

  await browser.close()
}

hello()
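
Once the results page has loaded, you will usually want to pull some data out of it. Here is a minimal sketch, assuming the data you want sits in plain links. The `extractLinks` and `toLinkRecords` helpers are hypothetical names, and the selector you pass in depends entirely on the markup of the site you crawl. It relies on `page.waitForSelector` and `page.$$eval`, the latter running the callback inside the browser and returning serializable data:

```javascript
// Runs inside the browser: turns anchor elements into plain, serializable records.
// It must not reference variables from the Node.js scope.
function toLinkRecords(anchors) {
  return anchors.map(a => ({ text: a.textContent.trim(), href: a.href }))
}

// Hypothetical helper: `selector` is an assumption and depends on the target site.
async function extractLinks(page, selector) {
  // Wait until at least one matching element is present in the DOM
  await page.waitForSelector(selector)
  // Collect all matching elements and convert them in the page context
  return page.$$eval(selector, toLinkRecords)
}
```

Inside hello(), after the navigation, something like `const links = await extractLinks(page, 'a')` would return every link on the page. Narrow the selector down to the results list you actually care about.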

Make it stealth

puppeteer-extra makes it easy to add nifty plugins, for example to make your crawl stealthier and to block the annoying ads that can disrupt it.

npm install puppeteer-extra
npm install puppeteer-extra-plugin-stealth
npm install puppeteer-extra-plugin-adblocker

puppeteer-extra is a drop-in replacement for the puppeteer module:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker')

puppeteer.use(AdblockerPlugin())
puppeteer.use(StealthPlugin())

async function hello() {
  const browser = await puppeteer.launch({
    headless: false
  })
  // [...] then your code
}

hello()

Keep your Session, avoid Captcha

To avoid logging in each time you run your crawling script, you can save and restore the session cookies. Here is a pretty good trick:

Login & Save Cookies

// [...]
const fs = require('fs').promises

async function loginAndSaveSession(browser) {
  const page = await browser.newPage()
  await page.setViewport({ width: 1280, height: 1024 })

  const link = 'https://order.cdiscount.com/Account/LoginLight.html';
  await page.goto(link, {waitUntil: 'networkidle2'});

  await page.type('#CustomerLogin_CustomerLoginFormData_Email', 'xxx@xxx.io')
  await page.type('#CustomerLogin_CustomerLoginFormData_Password', 'yYxy{3#')
  await page.click('.btGreen')

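  // Give the login time to complete. page.waitForTimeout() has been removed
  // from recent Puppeteer releases, so a plain timer is used instead
  await new Promise(resolve => setTimeout(resolve, 4000))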
  
  // Here is the session magic 
  const client = await page.target().createCDPSession();
  const all_browser_cookies = (await client.send('Network.getAllCookies')).cookies
  await fs.writeFile('./cookies.json', JSON.stringify(all_browser_cookies, null, 2));
}

Load Cookies

Now at your next launch:

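// Reuses `fs` (require('fs').promises) from the previous snippet
async function run(browser) {
  const page = await browser.newPage()
  await page.setViewport({ width: 1280, height: 1024 })

  // Load previous cookies
  const cookiesString = await fs.readFile('./cookies.json', 'utf8')
  const cookies = JSON.parse(cookiesString)
  await page.setCookie(...cookies)

  // Now you should be logged into your desired website, without having to log in again
  // Here, the same URL as the one used in loginAndSaveSession
  const link = 'https://order.cdiscount.com/Account/LoginLight.html'
  await page.goto(link, {waitUntil: 'networkidle2'})
}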
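
Saved cookies do not last forever, and restoring expired ones only delays the moment the site asks you to log in again. Here is a minimal sketch for filtering them before restoring, assuming the cookies have the shape returned by `Network.getAllCookies` (`expires` in seconds since the UNIX epoch, `session: true` for session cookies). `keepValidCookies` is a hypothetical helper name:

```javascript
// Assumption: each cookie follows the shape returned by Network.getAllCookies,
// where `expires` is in seconds since the UNIX epoch and session cookies
// are flagged with `session: true`.
function keepValidCookies(cookies, nowSeconds = Date.now() / 1000) {
  return cookies.filter(cookie =>
    // Session cookies have no real expiry date, so keep them
    cookie.session === true ||
    // Keep persistent cookies that have not expired yet
    cookie.expires > nowSeconds
  )
}
```

In run(), you could then call `page.setCookie(...keepValidCookies(cookies))`, and fall back to loginAndSaveSession when the filtered list comes back empty.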

Helpers

This was a pretty light article about a few crawling techniques with Puppeteer.

Have a good crawl.