There are some minor code improvements and one major improvement that can be applied here.
Minor improvements: Use fewer puppeteer functions
The minor improvements boil down to using as few puppeteer functions as possible. Most of the puppeteer functions you use send data from the Node.js environment to the browser environment via a WebSocket. While each call only takes a few milliseconds, these milliseconds add up in the long run. For more information on this, you can check out this question about the difference between using page.evaluate and using multiple puppeteer functions.
This means that, to optimize your code, you can for example use querySelector inside the page instead of running item.$eval multiple times. Another optimization is to directly use the result of page.waitForSelector: the function returns the node when it appears, so you do not need to query it again via page.$ afterwards.
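As a sketch, both ideas look like this. Note that the function name and the selectors below (extractItems, .item-wrapper, .item, .id, .title) are placeholders for illustration, not your actual ones:

```javascript
// Sketch only: assumes a puppeteer `page` is passed in; the selectors
// are placeholders standing in for your real ones.
async function extractItems(page) {
    // Use the node returned by waitForSelector directly instead of
    // querying it again with page.$ afterwards.
    const wrapper = await page.waitForSelector('.item-wrapper');

    // A single $$eval crosses the Node <-> browser boundary once and runs
    // plain querySelector inside the page, instead of paying one
    // item.$eval round trip per field.
    return wrapper.$$eval('.item', items => items.map(item => ({
        id: item.querySelector('.id').innerText.trim(),
        title: item.querySelector('.title').innerText.trim(),
    })));
}
```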
These are only minor improvements, which might slightly improve the crawling speed.
Major improvement: Use a puppeteer pool
Right now, you are using one browser with a single page to crawl multiple URLs. You can improve the speed of your script by using a pool of puppeteer resources to crawl multiple URLs in parallel. puppeteer-cluster allows you to do exactly that (disclaimer: I'm the author). The library takes a task and applies it to a number of URLs in parallel.
The number of parallel instances you can use depends on your CPU, memory, and throughput. The more you can use, the better your crawling speed will be.
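To illustrate the pooling idea itself, here is a minimal sketch in plain Node.js (no puppeteer involved) of a pool that runs at most a fixed number of tasks at a time. puppeteer-cluster does the same thing for you, but with real browser pages as the pooled resources:

```javascript
// Minimal concurrency limiter: runs the given async tasks with at most
// `limit` of them in flight at once. This only illustrates the pooling
// principle; it is not how puppeteer-cluster is implemented internally.
async function runPool(tasks, limit) {
    const results = [];
    let next = 0;
    // Each worker repeatedly claims the next unclaimed task index and
    // awaits it, so at most `limit` tasks run concurrently.
    async function worker() {
        while (next < tasks.length) {
            const i = next++;
            results[i] = await tasks[i]();
        }
    }
    await Promise.all(Array.from({ length: limit }, worker));
    return results;
}
```

Raising the limit only helps while your CPU, memory, and network can keep the extra workers busy, which is why the right maxConcurrency for the cluster depends on your machine.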
Code Sample
Below is a minimal example, adapting your code to extract the same data. The code first sets up a cluster with one browser and four pages. After that, a task function is defined which will be executed for each of the queued objects.
After this, one page instance of the cluster is used to extract the IDs and URLs from the initial page. The function given to cluster.queue extracts the IDs and URLs from the page and calls cluster.queue with objects of the form { id: ..., url: ... }. For each of the queued objects, the cluster.task function is executed, which then extracts the title and prints it out next to the passed ID.
const { Cluster } = require('puppeteer-cluster');

// Setup your cluster with 4 pages
const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 4,
});

// Define the task for the pages (go to the URL, and extract the title)
await cluster.task(async ({ page, data: { id, url } }) => {
    await page.goto(url);
    const itemDetailsPage = await page.waitForSelector(domObjects.itemPageWrapper);
    const title = await itemDetailsPage.$eval('.page-header__title', el => el.innerText);
    console.log(id, title);
});

// Use one page of the cluster to extract the links (ignoring the task function above)
cluster.queue(async ({ page }) => {
    await page.goto(url); // url is given from outside

    // Extract the links and ids from the initial page
    const itemData = await page.$$eval(domObjects.itemPageLink, items => items.map(item => ({
        id: item.querySelector('td:last-of-type').innerText.split(',').map(part => part.trim()),
        url: item.querySelector('td:first-of-type a').href,
    })));

    // Queue the data: { id: ..., url: ... } to start the process
    itemData.forEach(data => cluster.queue(data));
});

// Wait until all queued tasks are finished, then close the cluster
await cluster.idle();
await cluster.close();