Skip to main content

How to load HTML

The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.

This covers how to load HTML documents into a LangChain Document objects that we can use downstream.

Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured. Head over to the integrations page to find integrations with additional services, such as FireCrawl.

Prerequisites

This guide assumes familiarity with the following concepts:

Installation​

yarn add @langchain/community

Setup​

Although Unstructured has an open source offering, you’re still required to provide an API key to access the service. To get everything up and running, follow these two steps:

  1. Download & start the Docker container:
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
  1. Get a free API key & API URL here, and set it in your environment (as per the Unstructured website, it may take up to an hour to allocate your API key & URL.):
export UNSTRUCTURED_API_KEY="..."
# Replace with your `Full URL` from the email
export UNSTRUCTURED_API_URL="https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general"

Loading HTML with Unstructured​

import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";

const filePath =
"../../../../libs/langchain-community/src/tools/fixtures/wordoftheday.html";

const loader = new UnstructuredLoader(filePath, {
apiKey: process.env.UNSTRUCTURED_API_KEY,
apiUrl: process.env.UNSTRUCTURED_API_URL,
});

const data = await loader.load();
console.log(data.slice(0, 5));
[
Document {
pageContent: 'Word of the Day',
metadata: {
category_depth: 0,
languages: [Array],
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
},
Document {
pageContent: ': April 10, 2023',
metadata: {
emphasized_text_contents: [Array],
emphasized_text_tags: [Array],
languages: [Array],
parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'UncategorizedText'
}
},
Document {
pageContent: 'foible',
metadata: {
category_depth: 1,
languages: [Array],
parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
},
Document {
pageContent: 'play',
metadata: {
category_depth: 0,
link_texts: [Array],
link_urls: [Array],
link_start_indexes: [Array],
languages: [Array],
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
},
Document {
pageContent: 'noun',
metadata: {
category_depth: 0,
emphasized_text_contents: [Array],
emphasized_text_tags: [Array],
languages: [Array],
filename: 'wordoftheday.html',
filetype: 'text/html',
category: 'Title'
}
}
]

Was this page helpful?


You can leave detailed feedback on GitHub.