Build a web crawler with Queues and Browser Rendering
This tutorial explains how to build and deploy a web crawler with Queues, Browser Rendering, and Puppeteer.
Puppeteer is a high-level library used to automate interactions with Chrome/Chromium browsers. On each submitted page, the crawler will find the number of links to cloudflare.com and take a screenshot of the site, saving results to Workers KV.
You can use Puppeteer to request all images on a page, save the colors used on a site, and more.
- Sign up for a Cloudflare account.
- Install npm.
- Install Node.js.

Node.js version manager
Use a Node version manager like Volta or nvm to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.
Additionally, you will need access to Queues.
Queues is currently in Public Beta. It is included in the monthly subscription cost of your Workers Paid plan, with charges based on operations against your queues. Refer to Pricing for more details.
Before you can use Queues, you must enable it via the Cloudflare dashboard. You need a Workers Paid plan to enable Queues.
To enable Queues:
- Log in to the Cloudflare dashboard.
- Go to Workers & Pages > Queues.
- Select Enable Queues Beta.
To get started, create a Worker application using the create-cloudflare CLI. Open a terminal window and run the following command:

$ npm create cloudflare@latest queues-web-crawler

Or, if you use yarn:

$ yarn create cloudflare@latest queues-web-crawler
For setup, select the following options:
- For What would you like to start with?, choose Hello World example.
- For Which template would you like to use?, choose Hello World Worker.
- For Which language do you want to use?, choose TypeScript.
- For Do you want to use git for version control?, choose Yes.
- For Do you want to deploy your application?, choose No (we will be making some changes before deploying).
Then, move into your newly created directory:
$ cd queues-web-crawler
Next, you need to create two KV namespaces to store the crawler's results. This can be done through the Cloudflare dashboard or the Wrangler CLI. For this tutorial, we will use the Wrangler CLI.
$ npx wrangler kv namespace create crawler_links
$ npx wrangler kv namespace create crawler_screenshots
After running the wrangler kv namespace create subcommand, you will get the following output.
🌀 Creating namespace with title "web-crawler-crawler-links"
✨ Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_links"
id = "<GENERATED_NAMESPACE_ID>"

🌀 Creating namespace with title "web-crawler-crawler-screenshots"
✨ Success!
Add the following to your configuration file in your kv_namespaces array:
[[kv_namespaces]]
binding = "crawler_screenshots"
id = "<GENERATED_NAMESPACE_ID>"
Then, in your wrangler.toml file, add the following with the values generated in the terminal:

kv_namespaces = [
  { binding = "CRAWLER_SCREENSHOTS_KV", id = "<GENERATED_NAMESPACE_ID>" },
  { binding = "CRAWLER_LINKS_KV", id = "<GENERATED_NAMESPACE_ID>" }
]
Now, you need to set up your Worker for Browser Rendering.
In your current directory, install Cloudflare’s fork of Puppeteer and also robots-parser:
$ npm install @cloudflare/puppeteer --save-dev
$ npm install robots-parser
Then, add a Browser Rendering binding to your wrangler.toml file. This binding gives the Worker access to a headless Chromium instance that you will control with Puppeteer.
browser = { binding = "CRAWLER_BROWSER" }
Now, we need to set up the Queue.
$ npx wrangler queues create queues-web-crawler
Creating queue queues-web-crawler.
Created queue queues-web-crawler.
Then, in your wrangler.toml file, add the following:
[[queues.consumers]]
queue = "queues-web-crawler"
max_batch_timeout = 60

[[queues.producers]]
queue = "queues-web-crawler"
binding = "CRAWLER_QUEUE"
Adding a max_batch_timeout of 60 seconds to the consumer queue is important because Browser Rendering has a limit of two new browsers per minute per account. This timeout waits up to a minute before collecting queue messages into a batch, so the Worker remains under the browser invocation limit.
Your final wrangler.toml file should look similar to the one below.
#:schema node_modules/wrangler/config-schema.json
name = "web-crawler"
main = "src/index.ts"
compatibility_date = "2024-07-25"
compatibility_flags = ["nodejs_compat"]

kv_namespaces = [
  { binding = "CRAWLER_SCREENSHOTS_KV", id = "<GENERATED_NAMESPACE_ID>" },
  { binding = "CRAWLER_LINKS_KV", id = "<GENERATED_NAMESPACE_ID>" }
]

browser = { binding = "CRAWLER_BROWSER" }

[[queues.consumers]]
queue = "queues-web-crawler"
max_batch_timeout = 60

[[queues.producers]]
queue = "queues-web-crawler"
binding = "CRAWLER_QUEUE"
Add the bindings to the environment interface in src/index.ts, so TypeScript correctly types the bindings. Type the queue as Queue<any>. The following step will show you how to change this type.
import { BrowserWorker } from "@cloudflare/puppeteer";
export interface Env {
  CRAWLER_QUEUE: Queue<any>;
  CRAWLER_SCREENSHOTS_KV: KVNamespace;
  CRAWLER_LINKS_KV: KVNamespace;
  CRAWLER_BROWSER: BrowserWorker;
}
Add a fetch() handler to the Worker to submit links to crawl.
type Message = {
  url: string;
};

export interface Env {
  CRAWLER_QUEUE: Queue<Message>;
  // ... etc.
}

export default {
  async fetch(req, env): Promise<Response> {
    await env.CRAWLER_QUEUE.send({ url: await req.text() });
    return new Response("Success!");
  },
} satisfies ExportedHandler<Env>;
This handler accepts requests to any subpath and forwards the request's body to be crawled. It expects the request body to contain only a URL. In production, you should check that the request is a POST request and contains a well-formed URL in its body; this has been omitted here for simplicity.
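For reference, a minimal version of those checks could look like the sketch below. This is not part of the tutorial's final script; it only illustrates rejecting non-POST requests and malformed request bodies before a message is sent to the queue.

// A sketch of the validation described above (not part of the final script):
// reject anything that is not a POST request with a well-formed URL in the body.
export default {
  async fetch(req, env): Promise<Response> {
    if (req.method !== "POST") {
      return new Response("Expected a POST request", { status: 405 });
    }
    const body = (await req.text()).trim();
    try {
      new URL(body); // throws if the body is not a well-formed URL
    } catch {
      return new Response("Body must be a valid URL", { status: 400 });
    }
    await env.CRAWLER_QUEUE.send({ url: body });
    return new Response("Success!");
  },
} satisfies ExportedHandler<Env>;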
Add a queue() handler to the Worker to process the links you send.
import puppeteer from "@cloudflare/puppeteer";
import robotsParser from "robots-parser";

async queue(batch: MessageBatch<Message>, env: Env): Promise<void> {
  let browser: puppeteer.Browser | null = null;
  try {
    browser = await puppeteer.launch(env.CRAWLER_BROWSER);
  } catch {
    batch.retryAll();
    return;
  }

  for (const message of batch.messages) {
    const { url } = message.body;

    let isAllowed = true;
    try {
      const robotsTextPath = new URL(url).origin + "/robots.txt";
      const response = await fetch(robotsTextPath);

      const robots = robotsParser(robotsTextPath, await response.text());
      isAllowed = robots.isAllowed(url) ?? true; // respect robots.txt!
    } catch {}

    if (!isAllowed) {
      message.ack();
      continue;
    }

    // TODO: crawl!
    message.ack();
  }

  await browser.close();
},
This is a skeleton for the crawler. It launches the Puppeteer browser and iterates through the Queue's received messages. It fetches the site's robots.txt and uses robots-parser to check that the site allows crawling. If crawling is not allowed, the message is ack'ed, removing it from the Queue. If crawling is allowed, you can continue to crawl the site.
The puppeteer.launch() call is wrapped in a try...catch so that the whole batch can be retried if the browser launch fails. The launch may fail if you go over the limit on the number of browsers per account.
type Result = {
  numCloudflareLinks: number;
  screenshot: ArrayBuffer;
};

const crawlPage = async (url: string): Promise<Result> => {
  const page = await (browser as puppeteer.Browser).newPage();

  await page.goto(url, {
    waitUntil: "load",
  });

  const numCloudflareLinks = await page.$$eval("a", (links) => {
    links = links.filter((link) => {
      try {
        return new URL(link.href).hostname.includes("cloudflare.com");
      } catch {
        return false;
      }
    });
    return links.length;
  });

  await page.setViewport({
    width: 1920,
    height: 1080,
    deviceScaleFactor: 1,
  });

  return {
    numCloudflareLinks,
    screenshot: ((await page.screenshot({ fullPage: true })) as Buffer)
      .buffer,
  };
};
This helper function opens a new page in Puppeteer and navigates to the provided URL. numCloudflareLinks uses Puppeteer's $$eval (equivalent to document.querySelectorAll) to find the number of links to a cloudflare.com page. Checking whether the link's href points to a cloudflare.com page is wrapped in a try...catch to handle cases where hrefs may not be URLs.
Then, the function sets the browser viewport size and takes a screenshot of the full page. The screenshot is returned as a Buffer so it can be converted to an ArrayBuffer and written to KV.
To enable recursively crawling links, add a snippet after checking the number of Cloudflare links that sends the page's links from the queue consumer back to the queue itself. Recursing too deep, as is possible with crawling, will cause a "Durable Object Subrequest depth limit exceeded." error. If one occurs, it is caught, but the links are not retried.
// const numCloudflareLinks = await page.$$eval("a", (links) => { ...
await page.$$eval("a", async (links) => {
  const urls: MessageSendRequest<Message>[] = links.map((link) => {
    return {
      body: {
        url: link.href,
      },
    };
  });
  try {
    await env.CRAWLER_QUEUE.sendBatch(urls);
  } catch {} // do nothing, likely hit subrequest limit
});
// await page.setViewport({ ...
Then, in the queue handler, call crawlPage on the URL.
// in the `queue` handler:
// ...
if (!isAllowed) {
  message.ack();
  continue;
}

try {
  const { numCloudflareLinks, screenshot } = await crawlPage(url);
  const timestamp = new Date().getTime();
  const resultKey = `${encodeURIComponent(url)}-${timestamp}`;
  await env.CRAWLER_LINKS_KV.put(
    resultKey,
    numCloudflareLinks.toString(),
    { metadata: { date: timestamp } },
  );
  await env.CRAWLER_SCREENSHOTS_KV.put(resultKey, screenshot, {
    metadata: { date: timestamp },
  });
  message.ack();
} catch {
  message.retry();
}
// ...
This snippet saves the results from crawlPage into the appropriate KV namespaces. If an unexpected error occurs, the message is retried and the URL is sent back to the queue.
Saving the timestamp of the crawl in KV helps you avoid crawling too frequently.
Add a snippet before the robots.txt check that looks in KV for a crawl of the same page within the last hour. It lists all KV keys beginning with the same URL (previous crawls of the page) and checks whether any of them were created within the last hour. If so, the message is ack'ed and not retried.
type KeyMetadata = {
  date: number;
};

// in the `queue` handler:
// ...
for (const message of batch.messages) {
  const { url } = message.body;
  const timestamp = new Date().getTime();

  const sameUrlCrawls = await env.CRAWLER_LINKS_KV.list({
    prefix: `${encodeURIComponent(url)}`,
  });

  let shouldSkip = false;
  for (const key of sameUrlCrawls.keys) {
    if (timestamp - (key.metadata as KeyMetadata)?.date < 60 * 60 * 1000) {
      // if crawled in last hour, skip
      message.ack();
      shouldSkip = true;
      break;
    }
  }
  if (shouldSkip) {
    continue;
  }

  let isAllowed = true;
  // ...
The final script is included below.
import puppeteer, { BrowserWorker } from "@cloudflare/puppeteer";
import robotsParser from "robots-parser";

type Message = {
  url: string;
};

export interface Env {
  CRAWLER_QUEUE: Queue<Message>;
  CRAWLER_SCREENSHOTS_KV: KVNamespace;
  CRAWLER_LINKS_KV: KVNamespace;
  CRAWLER_BROWSER: BrowserWorker;
}

type Result = {
  numCloudflareLinks: number;
  screenshot: ArrayBuffer;
};

type KeyMetadata = {
  date: number;
};

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // util endpoint for testing purposes
    await env.CRAWLER_QUEUE.send({ url: await req.text() });
    return new Response("Success!");
  },
  async queue(batch: MessageBatch<Message>, env: Env): Promise<void> {
    const crawlPage = async (url: string): Promise<Result> => {
      const page = await (browser as puppeteer.Browser).newPage();

      await page.goto(url, {
        waitUntil: "load",
      });

      const numCloudflareLinks = await page.$$eval("a", (links) => {
        links = links.filter((link) => {
          try {
            return new URL(link.href).hostname.includes("cloudflare.com");
          } catch {
            return false;
          }
        });
        return links.length;
      });

      // to crawl recursively - uncomment this!
      /*await page.$$eval("a", async (links) => {
        const urls: MessageSendRequest<Message>[] = links.map((link) => {
          return {
            body: {
              url: link.href,
            },
          };
        });
        try {
          await env.CRAWLER_QUEUE.sendBatch(urls);
        } catch {} // do nothing, might've hit subrequest limit
      });*/

      await page.setViewport({
        width: 1920,
        height: 1080,
        deviceScaleFactor: 1,
      });

      return {
        numCloudflareLinks,
        screenshot: ((await page.screenshot({ fullPage: true })) as Buffer)
          .buffer,
      };
    };

    let browser: puppeteer.Browser | null = null;
    try {
      browser = await puppeteer.launch(env.CRAWLER_BROWSER);
    } catch {
      batch.retryAll();
      return;
    }

    for (const message of batch.messages) {
      const { url } = message.body;
      const timestamp = new Date().getTime();
      const resultKey = `${encodeURIComponent(url)}-${timestamp}`;

      const sameUrlCrawls = await env.CRAWLER_LINKS_KV.list({
        prefix: `${encodeURIComponent(url)}`,
      });

      let shouldSkip = false;
      for (const key of sameUrlCrawls.keys) {
        if (timestamp - (key.metadata as KeyMetadata)?.date < 60 * 60 * 1000) {
          // if crawled in last hour, skip
          message.ack();
          shouldSkip = true;
          break;
        }
      }
      if (shouldSkip) {
        continue;
      }

      let isAllowed = true;
      try {
        const robotsTextPath = new URL(url).origin + "/robots.txt";
        const response = await fetch(robotsTextPath);

        const robots = robotsParser(robotsTextPath, await response.text());
        isAllowed = robots.isAllowed(url) ?? true; // respect robots.txt!
      } catch {}

      if (!isAllowed) {
        message.ack();
        continue;
      }

      try {
        const { numCloudflareLinks, screenshot } = await crawlPage(url);
        await env.CRAWLER_LINKS_KV.put(
          resultKey,
          numCloudflareLinks.toString(),
          { metadata: { date: timestamp } },
        );
        await env.CRAWLER_SCREENSHOTS_KV.put(resultKey, screenshot, {
          metadata: { date: timestamp },
        });
        message.ack();
      } catch {
        message.retry();
      }
    }

    await browser.close();
  },
};
To deploy your Worker, run the following command:
$ npx wrangler deploy
You have successfully created a Worker which can submit URLs to a queue for crawling and save results to Workers KV.
To test your Worker, you could use the following cURL request to take a screenshot of this documentation page.
curl <YOUR_WORKER_URL> \
  -H "Content-Type: application/json" \
  -d 'https://developers.cloudflare.com/queues/tutorials/web-crawler-with-browser-rendering/'
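Once the queue consumer has processed the message, you can confirm that results were written to KV. Assuming your Wrangler version supports listing keys by binding name, a command along these lines should show the stored crawl keys:

$ npx wrangler kv key list --binding CRAWLER_LINKS_KV

Each key is the encoded URL followed by the crawl timestamp, matching the resultKey format used in the Worker.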
Refer to the GitHub repository for the complete tutorial, including a front end deployed with Pages to submit URLs and view crawler results.