This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. For more custom logic for loading webpages, look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader.
If you don’t want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using FireCrawlLoader or the faster option SpiderLoader.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| WebBaseLoader | langchain-community | ✅ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| WebBaseLoader | ✅ | ✅ |
Setup
Credentials
WebBaseLoader does not require any credentials.
Installation
To use WebBaseLoader you first need to install the langchain-community Python package.
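A typical install command (using pip; any Python package manager works). WebBaseLoader relies on BeautifulSoup for parsing, so installing beautifulsoup4 alongside it is a safe bet:

```shell
pip install -U langchain-community beautifulsoup4
```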
Initialization
Now we can instantiate our loader object and load documents. To bypass SSL verification errors during fetching, you can set `loader.requests_kwargs = {'verify': False}`.
Initialization with multiple pages
You can also pass in a list of pages to load from.
Load
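A sketch with two illustrative URLs; load() returns one Document per page:

```python
from langchain_community.document_loaders import WebBaseLoader

# Illustrative URLs; substitute the pages you actually want to scrape.
loader = WebBaseLoader([
    "https://www.example.com/",
    "https://www.example.org/",
])
docs = loader.load()
print(len(docs))  # one Document per URL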
Load multiple urls concurrently
You can speed up the scraping process by scraping and parsing multiple URLs concurrently. There are reasonable limits to concurrent requests, defaulting to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can increase the requests_per_second parameter to raise the maximum number of concurrent requests. Note that while this will speed up the scraping process, it may cause the server to block you. Be careful!
Loading an XML file, or using a different BeautifulSoup parser
You can also look at SitemapLoader for an example of how to load a sitemap file, which uses this feature.
Lazy Load
You can use lazy loading to load only one page at a time in order to minimize memory requirements.
Async
Using proxies
Sometimes you might need to use proxies to get around IP blocks. You can pass in a dictionary of proxies to the loader (and to requests underneath) to use them.
API reference
For detailed documentation of all WebBaseLoader features and configurations, head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html