Scrapy

scrapy-playwright adds browser rendering to Scrapy. rayobrowse gives it a stealth-fingerprinted browser instead of the default detectable Chromium.

  1. Install dependencies

    pip install scrapy scrapy-playwright

    You don’t need to run playwright install, because the browser runs inside the rayobrowse container.

  2. Configure settings.py

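    # Connect to the rayobrowse container over CDP instead of launching a local browser.
    PLAYWRIGHT_CDP_URL = "ws://localhost:9222/connect?headless=true&os=windows"

    # Route both HTTP and HTTPS requests through scrapy-playwright.
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }

    # scrapy-playwright requires the asyncio-based Twisted reactor.
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

    # Ignore Scrapy's request headers so the browser sends its own.
    PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
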
  3. Write your spider

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/js/"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    # Render the page in the browser and expose the Page object.
                    meta={"playwright": True, "playwright_include_page": True},
                )

        async def parse(self, response):
            # Close the page once the response is in hand so it is not leaked.
            page = response.meta["playwright_page"]
            await page.close()

            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

  4. Run

    scrapy crawl quotes -o quotes.json
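
Because the browser runs in the rayobrowse container rather than locally, a crawl fails to connect if the container is not up. As a minimal sketch, a preflight check along these lines could confirm that something is listening at the host and port from PLAYWRIGHT_CDP_URL before you run the spider. The function name `cdp_endpoint_reachable` is hypothetical and not part of rayobrowse or scrapy-playwright, and the check only tests TCP reachability, not that the endpoint speaks CDP.

```python
import socket
from urllib.parse import urlsplit


def cdp_endpoint_reachable(cdp_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: it does not perform a WebSocket or CDP handshake.
    """
    parts = urlsplit(cdp_url)
    # Fall back to the scheme's default port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "wss" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (assumes the settings.py value from step 2):
# cdp_endpoint_reachable("ws://localhost:9222/connect?headless=true&os=windows")
```

The query string is ignored here; only the host and port matter for the reachability test.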
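
To route browser traffic through a proxy, add a proxy parameter to the CDP URL in settings.py:

    PLAYWRIGHT_CDP_URL = (
        "ws://localhost:9222/connect"
        "?headless=true"
        "&os=windows"
        "&proxy=http://user:pass@proxy.example.com:8080"
    )

To connect to a rayobrowse instance on another server, point the URL at that host and pass an api_key parameter:

    PLAYWRIGHT_CDP_URL = (
        "ws://your-server.example.com/connect"
        "?headless=true"
        "&os=windows"
        "&api_key=your-secret-key"
    )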
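
The URL variants above differ only in their query parameters, so they can be assembled from one place. The sketch below assumes a hypothetical helper named `build_cdp_url`, which is not part of rayobrowse or scrapy-playwright. It leaves `:`, `/` and `@` unescaped so the result matches the literal strings shown above. Whether rayobrowse also accepts percent-encoded values is not stated here, so the helper avoids encoding them.

```python
from urllib.parse import urlencode


def build_cdp_url(base: str = "ws://localhost:9222/connect", **params: str) -> str:
    """Append query parameters to the /connect endpoint (hypothetical helper)."""
    # safe=":/@" keeps a proxy URL readable and identical to the hand-written form.
    query = urlencode(params, safe=":/@")
    return f"{base}?{query}" if query else base


# In settings.py this could replace the hand-written string:
PLAYWRIGHT_CDP_URL = build_cdp_url(
    headless="true",
    os="windows",
    proxy="http://user:pass@proxy.example.com:8080",
)
```

Keyword arguments keep their order, so the parameters appear in the URL in the order they are passed.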

A ready-to-run example project is available at integrations/scrapy/ in the GitHub repository.