Scrapy
scrapy-playwright adds browser rendering to Scrapy. rayobrowse provides it with a stealth-fingerprinted browser instead of the default detectable Chromium.
1. Install dependencies

   ```bash
   pip install scrapy scrapy-playwright httpx
   ```

   You don’t need `playwright install` since the browser runs inside the rayobrowse container.
2. Configure

   settings.py:

   ```python
   import httpx

   # Ask rayobrowse for a browser; the response body is used as the CDP URL.
   _resp = httpx.get(
       "http://localhost:9222/connect",
       params={"headless": "true", "os": "windows"},
       timeout=120,
   )
   _resp.raise_for_status()
   PLAYWRIGHT_CDP_URL = _resp.text.strip()

   # Route HTTP and HTTPS downloads through scrapy-playwright.
   DOWNLOAD_HANDLERS = {
       "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
       "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
   }
   TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
   # Leave request headers as the browser sends them.
   PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
   ```
3. Write your spider

   ```python
   import scrapy


   class QuotesSpider(scrapy.Spider):
       name = "quotes"
       start_urls = ["https://quotes.toscrape.com/js/"]

       def start_requests(self):
           for url in self.start_urls:
               yield scrapy.Request(
                   url,
                   # Render through Playwright and pass the page object to the callback.
                   meta={"playwright": True, "playwright_include_page": True},
               )

       async def parse(self, response):
           # The response already holds the rendered HTML, so close the page first.
           page = response.meta["playwright_page"]
           await page.close()

           for quote in response.css("div.quote"):
               yield {
                   "text": quote.css("span.text::text").get(),
                   "author": quote.css("small.author::text").get(),
               }
   ```
4. Run

   ```bash
   scrapy crawl quotes -o quotes.json
   ```
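The `/connect` call in the Configure step runs once, when Scrapy imports `settings.py`, so a failure there stops the crawl before it starts. The following is a minimal sketch of a retry wrapper, assuming the endpoint may be briefly unavailable (for example while the container is still starting); that is an assumption, not documented rayobrowse behavior. `fetch_cdp_url`, its parameters, and the URL scheme check are hypothetical helpers and not part of rayobrowse, Scrapy, or scrapy-playwright.

```python
import time
from typing import Callable
from urllib.parse import urlsplit


def fetch_cdp_url(
    fetch: Callable[[], str],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it returns a plausible CDP URL or attempts run out.

    `fetch` is any zero-argument callable that returns the response body,
    e.g. a function wrapping the httpx.get(...) call from settings.py.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            url = fetch().strip()
            # Assumption: a usable endpoint is a WebSocket or HTTP(S) URL.
            if urlsplit(url).scheme not in {"ws", "wss", "http", "https"}:
                raise ValueError(f"unexpected CDP URL: {url!r}")
            return url
        except Exception as exc:
            last_error = exc
            # Back off exponentially, but do not sleep after the final attempt.
            if attempt < attempts - 1:
                sleep(delay * 2 ** attempt)

    raise RuntimeError(f"no CDP URL after {attempts} attempts") from last_error
```

In `settings.py`, `fetch` could be a small function that performs the `httpx.get(...)` request, calls `raise_for_status()`, and returns `.text`; the result of `fetch_cdp_url(...)` would then be assigned to `PLAYWRIGHT_CDP_URL`.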
Using a proxy
```python
resp = httpx.get(
    "http://localhost:9222/connect",
    params={
        "headless": "true",
        "os": "windows",
    },
    timeout=120,
)
resp.raise_for_status()
PLAYWRIGHT_CDP_URL = resp.text.strip()
```

Remote or cloud endpoint

```python
resp = httpx.get(
    "https://cloud.rayobrowse.com/connect",
    params={"headless": "true", "os": "windows"},
    headers={"x-api-key": "your-secret-key"},
    timeout=120,
)
resp.raise_for_status()
PLAYWRIGHT_CDP_URL = resp.text.strip()
```

A ready-to-run example project is available at integrations/scrapy/ in the GitHub repository.
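The local and remote snippets differ only in the base URL and the `x-api-key` header, so one way to avoid hard-coding the key in `settings.py` is to read both from the environment. This is a sketch under stated assumptions: the variable names `RAYOBROWSE_ENDPOINT` and `RAYOBROWSE_API_KEY` and the helper `connect_request` are hypothetical and are not defined by rayobrowse.

```python
import os
from typing import Mapping


def connect_request(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for httpx.get() from environment variables."""
    # Hypothetical variable name; falls back to the local container address.
    base = env.get("RAYOBROWSE_ENDPOINT", "http://localhost:9222").rstrip("/")
    kwargs = {
        "url": f"{base}/connect",
        "params": {"headless": "true", "os": "windows"},
        "timeout": 120,
    }
    # Hypothetical variable name; the header is only sent when a key is set.
    api_key = env.get("RAYOBROWSE_API_KEY")
    if api_key:
        kwargs["headers"] = {"x-api-key": api_key}
    return kwargs
```

With this in place, `settings.py` could call `httpx.get(**connect_request())`, and the same file would work against the local container or the cloud endpoint depending on the environment.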