Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation including document layout, tables, and more, making them ready for generative AI workflows like RAG. This integration provides Docling’s capabilities via the DoclingLoader document loader.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| langchain_docling.DoclingLoader | langchain-docling | ✅ | ❌ | ❌ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| DoclingLoader | ✅ | ❌ |
The DoclingLoader component enables you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling’s rich format for advanced, document-native grounding.
DoclingLoader supports two different export modes:
- ExportType.DOC_CHUNKS (default): if you want to have each input document chunked and to then capture each individual chunk as a separate LangChain Document downstream, or
- ExportType.MARKDOWN: if you want to capture each input document as a separate LangChain Document.
The example allows exploring both modes via the parameter EXPORT_TYPE; depending on the value set, the example pipeline is then set up accordingly.
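The branching on the export mode can be sketched as follows. Plain strings stand in for the ExportType enum here, and split_by_headers is a hypothetical stand-in for a real Markdown header splitter, so the sketch stays self-contained:

```python
# Sketch: set up downstream splits depending on the export mode.
EXPORT_TYPE = "doc_chunks"  # or "markdown"


def split_by_headers(markdown_text):
    # Hypothetical stand-in: split a Markdown string on top-level headers.
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


docs = ["# Title\nIntro text\n# Section\nBody text"]  # stand-in loader output

if EXPORT_TYPE == "doc_chunks":
    # Chunks arrive pre-split from the loader; use them directly.
    splits = docs
elif EXPORT_TYPE == "markdown":
    # One Markdown document per input; split it downstream.
    splits = [chunk for doc in docs for chunk in split_by_headers(doc)]
```

In DOC_CHUNKS mode the loader already emits chunk-level documents, so no further splitting is needed; in MARKDOWN mode, splitting is deferred to a downstream text splitter of your choice.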
Setup
For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.
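The integration package (named langchain-docling, per the table above) can be installed via pip:

```shell
pip install -qU langchain-docling
```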
Initialization
Basic initialization looks as follows. DoclingLoader has the following parameters:
- file_path: source as single str (URL or local file) or iterable thereof
- converter (optional): any specific Docling converter instance to use
- convert_kwargs (optional): any specific kwargs for conversion execution
- export_type (optional): export mode to use: ExportType.DOC_CHUNKS (default) or ExportType.MARKDOWN
- md_export_kwargs (optional): any specific Markdown export kwargs (for Markdown mode)
- chunker (optional): any specific Docling chunker instance to use (for doc-chunk mode)
- meta_extractor (optional): any specific metadata extractor to use
Load
Note: a message saying "Token indices sequence length is longer than the specified maximum sequence length..." can be ignored in this case.
Inspecting some sample docs:
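Loading is a single call, docs = loader.load(), after which each item is a LangChain Document with page_content and metadata. A self-contained inspection sketch, using a stand-in document class and illustrative contents in place of real loader output:

```python
# Stand-in for `docs = loader.load()`; contents are illustrative only.
class Document:
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


docs = [
    Document("Docling converts documents into a unified representation.",
             {"source": "https://arxiv.org/pdf/2408.09869"}),
    Document("Tables and layout are preserved in the conversion.",
             {"source": "https://arxiv.org/pdf/2408.09869"}),
]


def clip(text, limit=60):
    # Truncate long page content for a readable printout.
    return text if len(text) <= limit else text[:limit] + "..."


for d in docs[:3]:
    print(f"- {clip(d.page_content)!r}")
```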
Lazy Load
Documents can also be loaded in a lazy fashion, using the loader's lazy_load() method.

End-to-end Example
- The following example pipeline uses HuggingFace’s Inference API; for increased LLM quota, a token can be provided via env var HF_TOKEN.
- Dependencies for this pipeline can be installed as shown below (--no-warn-conflicts is meant for Colab’s pre-populated Python env; feel free to remove it for stricter usage):
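For instance, an install line along these lines (the exact package set depends on the vector store and LLM provider the pipeline uses; treat this as an assumed example, not the canonical list):

```shell
pip install -q --no-warn-conflicts langchain-docling langchain-huggingface langchain-core
```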