Scrapy

scrapy-playwright adds browser rendering to Scrapy. rayobrowse supplies it with a stealth-fingerprinted browser in place of the default, easily detected Chromium.

Setup

Install dependencies
```
pip install scrapy scrapy-playwright httpx
```
You don't need to run playwright install, because the browser runs inside the rayobrowse container rather than on your machine.

Configure settings.py

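```python
import httpx

# Request a browser from the local rayobrowse endpoint and read back its CDP URL.
# This runs once, when Scrapy imports settings.py.
_resp = httpx.get(
    "http://localhost:9222/connect",
    params={"headless": "true", "os": "windows"},
    timeout=120,
)
_resp.raise_for_status()
PLAYWRIGHT_CDP_URL = _resp.text.strip()

# Route both HTTP and HTTPS downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Ignore Scrapy's request headers so that only the browser's own headers are sent.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
```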
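Because the connect request runs when settings.py is imported, a malformed response body would otherwise only surface later as a Playwright connection error. The following is a minimal sketch of a fail-fast check, assuming the endpoint returns a single URL as plain text. `validate_cdp_url` is a hypothetical helper, not part of rayobrowse or scrapy-playwright; the accepted schemes reflect what Playwright documents for `connect_over_cdp` (a WebSocket or HTTP URL).

```python
from urllib.parse import urlsplit

# Schemes that Playwright's connect_over_cdp documents for its endpoint URL.
_ALLOWED_SCHEMES = {"ws", "wss", "http", "https"}


def validate_cdp_url(raw: str) -> str:
    """Return the stripped URL, or raise ValueError if it does not look usable.

    Hypothetical helper for illustration only.
    """
    url = raw.strip()
    parts = urlsplit(url)
    # Require a known scheme and a host part; anything else is likely an error page.
    if parts.scheme not in _ALLOWED_SCHEMES or not parts.netloc:
        raise ValueError(f"unexpected CDP endpoint: {url!r}")
    return url
```

In settings.py this could be used as `PLAYWRIGHT_CDP_URL = validate_cdp_url(_resp.text)`, so a bad response stops the crawl before any request is scheduled.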

Write your spider

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                # Render with Playwright and expose the page object to the callback.
                meta={"playwright": True, "playwright_include_page": True},
            )

    async def parse(self, response):
        # The response already holds the rendered HTML, so release the page first.
        page = response.meta["playwright_page"]
        await page.close()

        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```
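The spider closes the page only in `parse`, which is not called when a request fails. The scrapy-playwright documentation recommends also closing the page from an errback when `playwright_include_page` is set. Below is a sketch of that cleanup logic, written against a plain `meta` dict so it runs without Scrapy installed. `close_playwright_page` and `FakePage` are hypothetical names used only for this illustration.

```python
import asyncio


async def close_playwright_page(meta: dict) -> bool:
    """Close the Playwright page stored in request meta, if there is one.

    Returns True when a page was found and closed, False otherwise.
    """
    page = meta.get("playwright_page")
    if page is None:
        return False
    await page.close()
    return True


class FakePage:
    """Stand-in for Playwright's Page, used only to exercise the helper."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


if __name__ == "__main__":
    page = FakePage()
    print(asyncio.run(close_playwright_page({"playwright_page": page})))  # True
    print(page.closed)  # True
```

In the spider, this would be wired up by passing `errback=self.errback` to `scrapy.Request` and defining `async def errback(self, failure)` that awaits `close_playwright_page(failure.request.meta)`.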

Run
```
scrapy crawl quotes -o quotes.json
```

Using a proxy

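```python
import httpx

# In settings.py: pass the upstream proxy as an extra query parameter.
resp = httpx.get(
    "http://localhost:9222/connect",
    params={
        "headless": "true",
        "os": "windows",
        # Placeholder credentials and host; substitute your own proxy URL.
        "proxy": "http://user:pass@proxy.example.com:8080",
    },
    timeout=120,
)
resp.raise_for_status()
PLAYWRIGHT_CDP_URL = resp.text.strip()
```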
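Proxy passwords often contain characters such as `@` or `:` that make the proxy URL ambiguous unless they are percent-encoded. Here is a small sketch of building the `proxy` value with the standard library. `build_proxy_url` is a hypothetical helper, and it assumes rayobrowse decodes percent-encoded credentials the way standard URL parsers do, which is worth verifying against your deployment.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str, port: int,
                    scheme: str = "http") -> str:
    """Assemble scheme://user:password@host:port with escaped credentials.

    Hypothetical helper for illustration only.
    """
    # safe="" forces reserved characters such as "@", ":" and "/" to be escaped.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"{scheme}://{safe_user}:{safe_password}@{host}:{port}"


if __name__ == "__main__":
    # Placeholder host and credentials.
    print(build_proxy_url("user", "p@ss:word", "proxy.example.com", 8080))
    # http://user:p%40ss%3Aword@proxy.example.com:8080
```

The result can be passed directly as the `proxy` entry in `params`; httpx handles the additional query-string encoding itself.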

Remote or cloud endpoint

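```python
import httpx

# In settings.py: same request as the local setup, but against the hosted
# endpoint and authenticated with an API key header.
resp = httpx.get(
    "https://cloud.rayobrowse.com/connect",
    params={"headless": "true", "os": "windows"},
    headers={"x-api-key": "your-secret-key"},  # replace with your key
    timeout=120,
)
resp.raise_for_status()
PLAYWRIGHT_CDP_URL = resp.text.strip()
```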
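Since settings.py is usually committed to version control, it may be preferable to keep the API key out of the file. The sketch below reads it from an environment variable instead. `api_key_headers` is a hypothetical helper, and the variable name `RAYOBROWSE_API_KEY` is an assumed convention, not one defined by rayobrowse.

```python
import os


def api_key_headers(env=None, var: str = "RAYOBROWSE_API_KEY") -> dict:
    """Build the x-api-key header from an environment variable.

    The variable name is a hypothetical convention chosen for this example.
    """
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"set {var} before running the crawl")
    return {"x-api-key": key}
```

The request would then use `headers=api_key_headers()` in place of the hard-coded dictionary.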

A ready-to-run example project is available at integrations/scrapy/ in the GitHub repository.