Document Loaders
DocumentLoader provides a unified interface for loading Document values from various sources.
Document
data Document = Document
  { pageContent :: Text
  , metadata :: Map Text Value
  }
The Document type is a simple data structure that contains two fields:
- pageContent: A Text field that contains the content of the document.
- metadata: A Map that contains metadata about the document. The keys are of type Text, and the values are of type Value (from the Aeson library).
Document has a Monoid instance, so two documents can be appended together.
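To make the appending behaviour concrete, here is a minimal sketch of what such an instance could look like. It redefines Document locally so it compiles on its own, and it assumes that appending concatenates pageContent and takes a left-biased union of metadata; the instance shipped with langchain-hs may behave differently. The module name and the combined value are illustrative only.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module DocumentMonoidSketch where

import Data.Aeson (Value (..))
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Text (Text)

-- Local copy of the record shown above, so the sketch is self-contained.
data Document = Document
  { pageContent :: Text
  , metadata :: Map Text Value
  }
  deriving (Eq, Show)

-- ASSUMPTION: content is concatenated and metadata is merged with a
-- left-biased union. The library's real instance may differ.
instance Semigroup Document where
  Document c1 m1 <> Document c2 m2 = Document (c1 <> c2) (Map.union m1 m2)

instance Monoid Document where
  mempty = Document "" Map.empty

-- Two documents appended into one.
combined :: Document
combined =
  Document "Hello, " (Map.fromList [("source", String "a.txt")])
    <> Document "world" (Map.fromList [("source", String "b.txt"), ("page", Number 1)])
```

Under these assumptions, pageContent combined is "Hello, world", and the "source" key keeps the value from the left-hand document.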
BaseLoader
class BaseLoader m where
  load :: m -> IO (Either String [Document])
  loadAndSplit :: m -> IO (Either String [Text])
The BaseLoader typeclass defines two methods:
- load: This method takes a BaseLoader instance and returns an IO action that produces either an error message or a list of Documents.
- loadAndSplit: This method takes a BaseLoader instance and returns an IO action that produces either an error message or a list of Text chunks. This is useful for splitting the document into smaller pieces.
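The two methods above can be illustrated with a small custom loader. The sketch below copies the class and the Document record locally so that it compiles without the library; InMemoryLoader and splitParagraphs are hypothetical names, and splitting on blank lines is an assumption made for illustration, not necessarily how the built-in loaders chunk text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module InMemoryLoaderSketch where

import Data.Aeson (Value)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Text (Text)
import qualified Data.Text as T

-- Local copies of the definitions shown above.
data Document = Document
  { pageContent :: Text
  , metadata :: Map Text Value
  }
  deriving (Eq, Show)

class BaseLoader m where
  load :: m -> IO (Either String [Document])
  loadAndSplit :: m -> IO (Either String [Text])

-- Hypothetical loader that serves texts already held in memory.
newtype InMemoryLoader = InMemoryLoader [Text]

instance BaseLoader InMemoryLoader where
  -- Report an error when there is nothing to load, otherwise wrap
  -- each text in a Document with empty metadata.
  load (InMemoryLoader texts)
    | null texts = pure (Left "InMemoryLoader: no texts supplied")
    | otherwise = pure (Right [Document t Map.empty | t <- texts])

  -- Reuse load, then split every document's content into chunks.
  loadAndSplit loader = do
    result <- load loader
    pure (concatMap (splitParagraphs . pageContent) <$> result)

-- ASSUMPTION: chunks are paragraphs separated by a blank line.
splitParagraphs :: Text -> [Text]
splitParagraphs = filter (not . T.null) . map T.strip . T.splitOn "\n\n"
```

Defining loadAndSplit in terms of load keeps error handling in one place: any Left produced by load is passed through unchanged.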
Integrations
Right now, langchain-hs provides the following integrations, with more planned in the roadmap:
- FileLoader: Loads documents from a file path.
- PdfLoader: Loads documents from a PDF file path.
- DirectoryLoader: Loads documents from a directory. It can recursively load files from subdirectories and filter files based on their extensions.
DirectoryLoaderOptions
data DirectoryLoaderOptions = DirectoryLoaderOptions
  { recursiveDepth :: Maybe Int
  , extensions :: [String]
  , excludeHidden :: Bool
  , useMultithreading :: Bool
  }
The DirectoryLoaderOptions type is a data structure that contains options for loading documents from a directory. It has the following fields:
- recursiveDepth: An optional Int that specifies the maximum depth of recursion when loading files from subdirectories. If Nothing, it will load files from all subdirectories.
- extensions: A list of String that specifies the file extensions to include when loading files. If empty, it will load all files.
- excludeHidden: A Bool that specifies whether to exclude hidden files (files starting with a dot) when loading files. The default is True.
- useMultithreading: A Bool that specifies whether to use multithreading when loading files. The default is False.
A defaultDirectoryLoaderOptions value is also provided, so you only need to override the fields you want to change.
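The way the options above interact can be sketched as a pure predicate that decides whether a file is picked up. This is not the library's implementation: shouldLoad is a hypothetical helper, the record is copied locally, and two details are assumptions, namely that depth is counted from the root directory and that extensions may be written with or without a leading dot.

```haskell
module DirectoryFilterSketch where

import Data.List (isPrefixOf)
import System.FilePath (takeExtension, takeFileName)

-- Local copy of the options record shown above.
data DirectoryLoaderOptions = DirectoryLoaderOptions
  { recursiveDepth :: Maybe Int
  , extensions :: [String]
  , excludeHidden :: Bool
  , useMultithreading :: Bool
  }

-- Hypothetical helper: should a file found at the given depth be loaded?
shouldLoad :: DirectoryLoaderOptions -> Int -> FilePath -> Bool
shouldLoad opts depth path = depthOk && extensionOk && hiddenOk
  where
    -- Nothing means no depth limit.
    depthOk = maybe True (depth <=) (recursiveDepth opts)
    -- ASSUMPTION: "md" and ".md" are treated as the same extension.
    normalise = dropWhile (== '.')
    -- An empty extension list accepts every file.
    extensionOk =
      null (extensions opts)
        || normalise (takeExtension path) `elem` map normalise (extensions opts)
    -- Hidden files are those whose file name starts with a dot.
    hiddenOk = not (excludeHidden opts && "." `isPrefixOf` takeFileName path)

-- Example options: at most two levels deep, Markdown only, skip hidden files.
exampleOptions :: DirectoryLoaderOptions
exampleOptions = DirectoryLoaderOptions (Just 2) ["md"] True False
```

useMultithreading does not appear in the predicate because it affects how files are read, not which files are selected.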
Examples
- FileLoader
{-# LANGUAGE OverloadedStrings #-}
module LangchainLib (runApp) where

import Langchain.DocumentLoader.FileLoader
import Langchain.DocumentLoader.Core

runApp :: IO ()
runApp = do
  let loader = FileLoader "/home/user/haskell/sample-proj/README.md"
  -- load yields Either String [Document]
  docs <- load loader
  -- loadAndSplit yields Either String [Text]
  chunks <- loadAndSplit loader
  print docs
  print chunks
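Because load and loadAndSplit return an Either, real code will usually branch on the result instead of printing it directly. A minimal sketch of that pattern follows; describeLoad is a hypothetical helper, not part of langchain-hs, and it is kept polymorphic so it works for both documents and text chunks.

```haskell
module HandleLoadResult where

-- Hypothetical helper: turn a loader result into a one-line status message.
describeLoad :: Either String [a] -> String
describeLoad (Left err) = "load failed: " ++ err
describeLoad (Right items) = "loaded " ++ show (length items) ++ " item(s)"
```

With the loader from the example above, this could be used as `load loader >>= putStrLn . describeLoad`.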
- PdfLoader
{-# LANGUAGE OverloadedStrings #-}
module LangchainLib (runApp) where

import Langchain.DocumentLoader.Core
import Langchain.DocumentLoader.PdfLoader

runApp :: IO ()
runApp = do
  let loader = PdfLoader "/home/user/Documents/TS/langchain/SOP.pdf"
  docs <- load loader
  chunks <- loadAndSplit loader
  print docs
  print chunks
- DirectoryLoader
{-# LANGUAGE OverloadedStrings #-}
module LangchainLib (runApp) where

import Langchain.DocumentLoader.Core
import Langchain.DocumentLoader.DirectoryLoader

runApp :: IO ()
runApp = do
  let loader = DirectoryLoader
        { dirPath = "/home/user/Documents/TS/langchain"
        , directoryLoaderOptions = defaultDirectoryLoaderOptions
            { recursiveDepth = Just 2
            }
        }
  docs <- load loader
  chunks <- loadAndSplit loader
  print docs
  print chunks
The loaded documents can then be embedded into a vector store and used to build RAG tools.