HTML
The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser.
This guide covers how to load HTML documents into a document format that we can use downstream.
from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
data
[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]
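UnstructuredHTMLLoader also accepts an optional mode argument. The sketch below, which assumes the same example_data/fake-content.html file, uses mode="elements" to return one Document per detected element instead of a single concatenated Document; the exact metadata fields depend on the unstructured library version.

from langchain_community.document_loaders import UnstructuredHTMLLoader

# mode="elements" yields one Document per element (title, paragraph, ...)
# rather than one Document for the whole file.
loader = UnstructuredHTMLLoader("example_data/fake-content.html", mode="elements")
docs = loader.load()
for doc in docs:
    print(doc.metadata.get("category"), "->", doc.page_content)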
Loading HTML with BeautifulSoup4
We can also use BeautifulSoup4 to load HTML documents with the BSHTMLLoader. This extracts the text from the HTML into page_content, and the page title as title into metadata.
from langchain_community.document_loaders import BSHTMLLoader
loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
data
[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
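BSHTMLLoader also exposes a few optional arguments. The sketch below is a minimal example assuming a UTF-8 encoded file: open_encoding is passed through to open(), and get_text_separator is inserted between the text fragments that BeautifulSoup extracts.

from langchain_community.document_loaders import BSHTMLLoader

# Read the file as UTF-8 and join extracted text fragments with a space.
loader = BSHTMLLoader(
    "example_data/fake-content.html",
    open_encoding="utf-8",
    get_text_separator=" ",
)
data = loader.load()
print(data[0].metadata["title"])  # "Test Title", as in the output above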
Loading HTML with SpiderLoader
Spider is a high-performance crawler that converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom actions using AI.
Spider supports high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.
Prerequisite
You need a Spider API key to use this loader. You can get one at spider.cloud.
%pip install --upgrade --quiet langchain langchain-community spider-client
from langchain_community.document_loaders import SpiderLoader
loader = SpiderLoader(
    api_key="YOUR_API_KEY", url="https://spider.cloud", mode="crawl"
)
data = loader.load()
For guides and documentation, visit Spider.
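Besides crawling, the loader can also scrape a single page. The sketch below is illustrative: the API key is a placeholder, mode="scrape" fetches only the given URL, and the params dict is forwarded to the Spider API (the "return_format" key shown here is an assumption based on Spider's documented parameters).

from langchain_community.document_loaders import SpiderLoader

# mode="scrape" fetches just this URL; mode="crawl" follows links to subpages.
loader = SpiderLoader(
    api_key="YOUR_API_KEY",
    url="https://spider.cloud",
    mode="scrape",
    params={"return_format": "markdown"},  # forwarded to the Spider API
)
data = loader.load()
print(data[0].metadata)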
Loading HTML with FireCrawlLoader
FireCrawl crawls and converts any website into markdown. It crawls all accessible subpages and gives you clean markdown and metadata for each.
FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript.
Prerequisite
You need to have a FireCrawl API key to use this loader. You can get one by signing up at FireCrawl.
%pip install --upgrade --quiet langchain langchain-community firecrawl-py
from langchain_community.document_loaders import FireCrawlLoader
loader = FireCrawlLoader(
    api_key="YOUR_API_KEY", url="https://firecrawl.dev", mode="crawl"
)
data = loader.load()
For more information on how to use FireCrawl, visit FireCrawl.
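FireCrawlLoader can likewise target a single page instead of a full crawl. A minimal sketch, assuming a placeholder API key, uses mode="scrape" to fetch only the given URL:

from langchain_community.document_loaders import FireCrawlLoader

# mode="scrape" fetches just this URL; mode="crawl" walks accessible subpages.
loader = FireCrawlLoader(
    api_key="YOUR_API_KEY", url="https://firecrawl.dev", mode="scrape"
)
data = loader.load()
print(data[0].page_content[:200])  # first 200 characters of the scraped markdown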
Loading HTML with AzureAIDocumentIntelligenceLoader
Azure AI Document Intelligence (formerly known as Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, document structures (e.g., titles, section headings), and key-value pairs from digital or scanned PDFs, images, Office and HTML files.
Document Intelligence supports PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, and HTML.
The current implementation of the loader uses Document Intelligence to incorporate content page-wise and turn it into LangChain Documents. The default output format is markdown, which can be easily chained with MarkdownHeaderTextSplitter for semantic document chunking. You can also use mode="single" or mode="page" to return plain text in a single page or as a document split by page.
Prerequisite
You need an Azure AI Document Intelligence resource in one of the three preview regions: East US, West US2, or West Europe. Follow this document to create one if you don't have one. You will pass <endpoint> and <key> as parameters to the loader.
%pip install --upgrade --quiet langchain langchain-community azure-ai-documentintelligence
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
file_path = "<filepath>"
endpoint = "<endpoint>"
key = "<key>"
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model="prebuilt-layout"
)
documents = loader.load()
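Because the default output is markdown, the result can be chained with MarkdownHeaderTextSplitter for semantic chunking. The sketch below assumes the langchain-text-splitters package (installed as a dependency of langchain) and uses illustrative header names.

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split the markdown returned by Document Intelligence on its heading levels.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(documents[0].page_content)
print(len(chunks))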