Scrapeless offers flexible and feature-rich data acquisition services with extensive parameter customization and multi-format export support. These capabilities let LangChain integrate and leverage external data more effectively. The core functional modules include:
DeepSerp
- Google Search: Enables comprehensive extraction of Google SERP data across all result types.
  - Supports selection of localized Google domains (e.g., `google.com`, `google.ad`) to retrieve region-specific search results.
  - Supports pagination for retrieving results beyond the first page.
  - Supports a search result filtering toggle to control whether duplicate or similar content is excluded.
- Google Trends: Retrieves keyword trend data from Google, including popularity over time, regional interest, and related searches.
  - Supports multi-keyword comparison.
  - Supports multiple data types: `interest_over_time`, `interest_by_region`, `related_queries`, and `related_topics`.
  - Allows filtering by specific Google properties (Web, YouTube, News, Shopping) for source-specific trend analysis.
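To make the combinations concrete, here is an illustrative sketch of a trends query using the options above. The key names are assumptions chosen for exposition, not a confirmed Scrapeless request schema:

```python
# Illustrative sketch only: the key names below are assumptions,
# not a confirmed Scrapeless/LangChain parameter schema.
trends_request = {
    "keywords": ["langchain", "llamaindex"],  # multi-keyword comparison
    "data_type": "interest_over_time",        # one of the four documented data types
    "google_property": "youtube",             # Web / YouTube / News / Shopping
}

# The four data types documented above:
ALLOWED_DATA_TYPES = {
    "interest_over_time", "interest_by_region",
    "related_queries", "related_topics",
}
assert trends_request["data_type"] in ALLOWED_DATA_TYPES
```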
- Universal Scraping: Designed for modern, JavaScript-heavy websites, allowing dynamic content extraction.
  - Global premium proxy support for bypassing geo-restrictions and improving reliability.
- Crawl: Recursively crawls a website and its linked pages to extract site-wide content.
  - Supports configurable crawl depth and scoped URL targeting.
- Scrape: Extracts content from a single webpage with high precision.
  - Supports “main content only” extraction to exclude ads, footers, and other non-essential elements.
  - Allows batch scraping of multiple standalone URLs.
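To illustrate how the crawl and scrape options above might fit together, here is a sketch of two request payloads. The key names are hypothetical, invented for this example, and are not a confirmed Scrapeless schema:

```python
# Hypothetical payloads illustrating the documented options; the key names
# are assumptions for exposition, not a confirmed Scrapeless API.
crawl_request = {
    "url": "https://example.com/docs",
    "max_depth": 2,                          # configurable crawl depth
    "scope": "https://example.com/docs/*",   # scoped URL targeting
}

batch_scrape_request = {
    "urls": [                                # batch scraping of standalone URLs
        "https://example.com/pricing",
        "https://example.com/about",
    ],
    "main_content_only": True,               # exclude ads, footers, etc.
}
assert crawl_request["max_depth"] > 0
```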
Overview
Integration details
| Class | Package | Serializable | JS support | Version |
|---|---|---|---|---|
| `ScrapelessUniversalScrapingTool` | `langchain-scrapeless` | ✅ | ❌ |  |
Tool features
| Native async | Returns artifact | Return data |
|---|---|---|
| ✅ | ✅ | html, markdown, links, metadata, structured content |
Setup
The integration lives in the `langchain-scrapeless` package.
```python
!pip install langchain-scrapeless
```
Credentials
You’ll need a Scrapeless API key to use this tool. You can set it as an environment variable:
Instantiation
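Before instantiating the tool, make sure the API key from the Credentials step is available in your environment. A minimal sketch follows; the variable name `SCRAPELESS_API_KEY` is assumed from the package's conventions, so adjust it if your setup differs:

```python
import os

# Provide the Scrapeless API key via an environment variable.
# SCRAPELESS_API_KEY is the assumed variable name; replace the placeholder
# with your real key, or export it in your shell instead.
if not os.environ.get("SCRAPELESS_API_KEY"):
    os.environ["SCRAPELESS_API_KEY"] = "your-api-key"
```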
Here we show how to instantiate an instance of the Scrapeless Universal Scraping Tool. This tool allows you to scrape any website using a headless browser with JavaScript rendering capabilities, customizable output types, and geo-specific proxy support. The tool accepts the following parameters during instantiation:
- `url` (required, str): The URL of the website to scrape.
- `headless` (optional, bool): Whether to use a headless browser. Default is True.
- `js_render` (optional, bool): Whether to enable JavaScript rendering. Default is True.
- `js_wait_until` (optional, str): Defines when to consider the JavaScript-rendered page ready. Default is `'domcontentloaded'`. Options include:
  - `load`: Wait until the page is fully loaded.
  - `domcontentloaded`: Wait until the DOM is fully loaded.
  - `networkidle0`: Wait until the network is idle.
  - `networkidle2`: Wait until the network is idle for 2 seconds.
- `outputs` (optional, str): The specific type of data to extract from the page. Options include: `phone_numbers`, `headings`, `images`, `audios`, `videos`, `links`, `menus`, `hashtags`, `emails`, `metadata`, `tables`, `favicon`.
- `response_type` (optional, str): Defines the format of the response. Default is `'html'`. Options include:
  - `html`: Return the raw HTML of the page.
  - `plaintext`: Return the plain text content.
  - `markdown`: Return a Markdown version of the page.
  - `png`: Return a PNG screenshot.
  - `jpeg`: Return a JPEG screenshot.
- `response_image_full_page` (optional, bool): Whether to capture and return a full-page image when using screenshot output (`png` or `jpeg`). Default is False.
- `selector` (optional, str): A specific CSS selector to scope scraping within a part of the page. Default is `None`.
- `proxy_country` (optional, str): Two-letter country code for geo-specific proxy access (e.g., `'us'`, `'gb'`, `'de'`, `'jp'`). Default is `'ANY'`.
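The parameter list above can be mirrored in a plain dictionary of keyword arguments. The sketch below encodes the documented defaults without importing the package, so the exact constructor signature remains an assumption to verify against the package's reference:

```python
# Keyword arguments mirroring the documented parameters and defaults.
# This is a sketch, not the package's confirmed constructor signature.
params = {
    "url": "https://example.com",            # required
    "headless": True,                        # default
    "js_render": True,                       # default
    "js_wait_until": "domcontentloaded",     # default; or load / networkidle0 / networkidle2
    "response_type": "markdown",             # overriding the 'html' default
    "response_image_full_page": False,       # only meaningful for png / jpeg output
    "selector": None,                        # scrape the whole page
    "proxy_country": "ANY",                  # default: no geo pinning
}

VALID_RESPONSE_TYPES = {"html", "plaintext", "markdown", "png", "jpeg"}
assert params["response_type"] in VALID_RESPONSE_TYPES
```

With the package installed, these would typically be passed as keyword arguments when constructing `ScrapelessUniversalScrapingTool`; consult the `langchain-scrapeless` reference for the authoritative signature.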