Scrapy
scrapy-playwright adds browser rendering to Scrapy. rayobrowse gives it a stealth-fingerprinted browser instead of the default detectable Chromium.
Install dependencies

    pip install scrapy scrapy-playwright

You don’t need to run playwright install, because the browser runs inside the rayobrowse container.
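Configure

settings.py

    # Connect to the already-running rayobrowse browser over CDP
    # instead of launching a local one.
    PLAYWRIGHT_CDP_URL = "ws://localhost:9222/connect?headless=true&os=windows"

    # Route HTTP and HTTPS downloads through scrapy-playwright.
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }

    # scrapy-playwright requires the asyncio-based Twisted reactor.
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

    # Keep the browser's own request headers rather than overriding them with Scrapy's.
    PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None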
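Before starting a crawl, it can help to confirm that something is listening at the host and port named in PLAYWRIGHT_CDP_URL. The sketch below is one possible way to do that with only the standard library. cdp_endpoint_reachable is a hypothetical helper name, not part of Scrapy, scrapy-playwright, or rayobrowse, and a successful TCP connection only shows that the port is open, not that the browser endpoint is healthy.

```python
import socket
from urllib.parse import urlsplit

# Default ports for WebSocket URLs that omit one.
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def cdp_endpoint_reachable(cdp_url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: it checks only that the port accepts connections,
    not that a browser answers on it.
    """
    parts = urlsplit(cdp_url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Calling cdp_endpoint_reachable("ws://localhost:9222/connect?headless=true&os=windows") returns False when nothing is listening on port 9222, which usually means the rayobrowse container is not running or is mapped to a different port.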
Write your spider

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/js/"]

        def start_requests(self):
            for url in self.start_urls:
                # "playwright": True routes the request through the browser;
                # "playwright_include_page" exposes the Page object in the callback.
                yield scrapy.Request(
                    url,
                    meta={"playwright": True, "playwright_include_page": True},
                )

        async def parse(self, response):
            # An included page must be closed explicitly, or it stays open in the browser.
            page = response.meta["playwright_page"]
            await page.close()

            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
Run

    scrapy crawl quotes -o quotes.json
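Once the crawl finishes, a quick sanity check on quotes.json can show whether pages were really rendered in the browser: the quotes on the /js/ page are built by JavaScript, so an empty result often suggests that requests bypassed Playwright. The following is a minimal sketch using only the standard library; load_items and incomplete_items are hypothetical helper names, and the required fields are simply the two keys the spider above yields.

```python
import json
from pathlib import Path

# The two keys yielded by QuotesSpider.
REQUIRED_FIELDS = ("text", "author")


def load_items(path):
    """Load the JSON array written by `scrapy crawl quotes -o quotes.json`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def incomplete_items(items):
    """Return the items where any required field is missing or empty."""
    return [
        item
        for item in items
        if not all(item.get(field) for field in REQUIRED_FIELDS)
    ]
```

A typical check would be to call load_items("quotes.json"), confirm the list is not empty, and then confirm that incomplete_items returns an empty list.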
Using a proxy

    PLAYWRIGHT_CDP_URL = (
        "ws://localhost:9222/connect"
        "?headless=true"
        "&os=windows"
        "&proxy=http://user:pass@proxy.example.com:8080"
    )

Remote mode

    PLAYWRIGHT_CDP_URL = (
        "ws://your-server.example.com/connect"
        "?headless=true"
        "&os=windows"
        "&api_key=your-secret-key"
    )

A ready-to-run example project is available at integrations/scrapy/ in the GitHub repository.
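The proxy and remote-mode URLs above are assembled by hand, which gets fragile once a proxy password or API key contains characters such as & or #. One possible alternative is to build the query string with urllib.parse.urlencode, as sketched below. build_cdp_url and the RAYOBROWSE_API_KEY environment variable are hypothetical names introduced for illustration, and the sketch assumes the server decodes percent-encoded query parameters in the standard way, which this page does not confirm.

```python
import os
from urllib.parse import urlencode


def build_cdp_url(base, **params):
    """Append percent-encoded query parameters to a CDP base URL.

    Hypothetical helper. Parameters whose value is None are skipped,
    so optional settings can be passed through unconditionally.
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}?{query}" if query else base


# Same parameters as the remote-mode example, with the key read from the
# environment (assumed variable name) instead of hard-coded in settings.py.
PLAYWRIGHT_CDP_URL = build_cdp_url(
    "ws://your-server.example.com/connect",
    headless="true",
    os="windows",
    api_key=os.environ.get("RAYOBROWSE_API_KEY"),
)
```

Reading the key from the environment keeps the secret out of version control. If the server turns out to expect the raw, unencoded proxy URL shown above, the hand-written string remains the safer choice.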