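Manual Document Processing

With all the innovation happening in artificial intelligence, it is easy to forget that many aspects of people's jobs remain manual and tedious. Much of this tedium stems from repetitive document processing. During my time working at a bank, one of my former stakeholders (a risk manager) was inundated with manual document processing tasks. She would spend her time trawling through lengthy fund prospectuses to identify key information and transfer it into Excel spreadsheets (for record keeping and downstream analysis).

I expect that many readers will encounter similar workloads. So, in this article, I introduce a simple way to automate this type of workload using large language models and a retrieval pipeline. To give you a realistic use case, I demonstrate this with the fund prospectus for Morgan Stanley Funds UK, a 164-page publicly available document providing information on some of their funds. This is similar to the type of prospectus a risk manager might have looked at. We will extract and record the following information from the prospectus: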

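FCA Product Reference Number: a 6-digit number used to identify the fund.
Fund Name: the name of the fund.
Investment Objective: the fund's goal, e.g. to grow the investment over 10 years.
Investment Policy: the investment rules set by the fund manager.
Investment Strategy: the fund manager's investment "philosophy".
ESG: whether the fund is pursuing an ESG strategy.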

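Take a look at the demo below to get a better idea of what we will be doing.

Demo of the pipeline extracting information from a fund prospectus

I chose this example because I know it is relevant to many people in banking. However, the approach can be easily transferred to other domains with similar workloads, including insurance, law, surveying, medicine, and more.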

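Libraries and Prerequisites

The main libraries you need to make this example work for you are OpenAI, Haystack, Pandas, and sentence-transformers from Hugging Face. For OpenAI, you will need to sign up for an API key; if you don't have one yet, you can follow this tutorial.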

How to Build

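Before continuing, take a look at this high-level architecture diagram showing the process. The sections that follow explain each process on the diagram.

Step 1 — Query

The query here is simply a request from the user (or, in our example, the risk manager). The manager might ask something like: "I need to report on the Global Sustainability Fund."

Step 2 — Pre-processing

We are dealing with a large prospectus (168 pages of text) that needs to be pre-processed and stored for downstream use. Pre-processing happens in two stages: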

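Read in and reformat: We need to break our prospectus into smaller chunks or sentences, which we call documents. We can do this with "out of the box" functions in Haystack. By pointing [convert_files_to_docs](https://docs.haystack.deepset.ai/reference/utils-api) at the location where the prospectus is stored, and initialising a [PreProcessor](https://docs.haystack.deepset.ai/reference/preprocessor-api) object with our document-splitting requirements, we can easily reformat the prospectus into smaller, more manageable chunks.
Transform and store: Models don't actually interpret words; they interpret numbers. Therefore, we use a sentence embedding model to transform our sentences into dense vectors (i.e. numerical representations that preserve the semantic structure of the sentences). We use a pre-trained sentence embedding model, all-mpnet-base-v2, from Hugging Face's sentence-transformers library to do this. Once we have the embeddings, we need to index them with FAISS. We also store the documents and associated metadata in text format in a SQLite database. Haystack provides an API, [FAISSDocumentStore](https://docs.haystack.deepset.ai/reference/document-store-api#faissdocumentstore), that enables us to index the embeddings and store the documents on disk for later retrieval (we discuss retrieval in the next section).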

from haystack.nodes import PreProcessor
from haystack.utils import convert_files_to_docs
from haystack.document_stores import FAISSDocumentStore
from sqlalchemy import create_engine
from haystack.nodes import EmbeddingRetriever

# pre-process docs 
def preprocess_docs(doc_dir):
    all_docs = convert_files_to_docs(dir_path=doc_dir)
    preprocessor = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        clean_header_footer=False,
        split_by="word",
        split_respect_sentence_boundary=True,
        split_overlap=30, 
        split_length=100
    )
    docs = preprocessor.process(all_docs)
    print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(docs)}")
    return docs

# create the FAISS index and the SQLite document store
DB_URL = 'sqlite:///C:/Users/johna/anaconda3/envs/longfunctioncall_env/long_functioncall/database/database.db'  # change to your local directory

def vector_stores(docs):
    engine = create_engine(DB_URL)
    try:
        # Drop any table left over from a previous run so the store starts empty.
        # Note: Engine.execute() was removed in SQLAlchemy 2.0, so this call needs a 1.x release.
        engine.execute("DROP TABLE document")
    except Exception as e:
        # Catch any exceptions, likely due to the table not existing
        print(f"Exception occurred while trying to drop the table: {e}")

    # embedding_dim=768 matches the output size of all-mpnet-base-v2
    document_store = FAISSDocumentStore(sql_url=DB_URL, faiss_index_factory_str="Flat", embedding_dim=768)
    document_store.write_documents(docs)

    return document_store

def generate_embeddings(document_store):
    # Embed every stored document with the sentence embedding model
    retriever = EmbeddingRetriever(
        document_store=document_store,
        embedding_model="sentence-transformers/all-mpnet-base-v2"
    )
    document_store.update_embeddings(retriever)
    return retriever
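
To make the splitting settings above more concrete, here is a minimal, standard-library sketch of word-window chunking with overlap. It is not Haystack's implementation: the helper name `split_words` is hypothetical, and the sketch ignores sentence boundaries, which the PreProcessor respects when `split_respect_sentence_boundary=True`. It only illustrates how a window of 100 words with an overlap of 30 advances through a text.

```python
from typing import List


def split_words(text: str, split_length: int = 100, split_overlap: int = 30) -> List[str]:
    """Split text into windows of split_length words, each sharing
    split_overlap words with the previous window (hypothetical helper)."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 250 numbered "words" standing in for prospectus text.
sample = " ".join(f"w{i}" for i in range(250))
chunks = split_words(sample)
print(len(chunks))                # number of chunks produced
print(chunks[1].split()[0])       # first word of the second chunk
```

The overlap means the last 30 words of one chunk reappear at the start of the next, so a sentence cut by a chunk boundary still appears whole in at least one chunk.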

The next two steps define our pipeline, which consists of two nodes: a retriever and a custom OpenAI function calling node.

Step 3 — Retriever

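The retriever node uses FAISS to perform a similarity search between the sentence embeddings of the documents and the sentence embedding representing the user's query (recall that we indexed the sentence embeddings with FAISS). Based on the similarity score, the retriever node returns the top k documents most similar to our query. Essentially, retrieval enables us to efficiently search for and isolate the relevant sections of the lengthy fund document. If you tried to process the entire 168-page document, you would quickly exceed the maximum context length of any large language model (at the time of writing).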
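
As a rough illustration of top-k similarity search, the sketch below ranks a few toy vectors against a query vector using cosine similarity. The 3-dimensional vectors, the document ids, and the helper names are made up for the example, and cosine similarity is an assumption here: the measure FAISS actually applies depends on how the document store is configured. Real embeddings from all-mpnet-base-v2 have 768 dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, one per document chunk.
doc_vecs = {
    "objective": [0.9, 0.1, 0.0],
    "fees": [0.0, 1.0, 0.0],
    "esg_policy": [0.7, 0.0, 0.7],
}
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in results])
```

This brute-force loop compares the query with every vector. An index such as FAISS performs the same kind of search but is built to stay fast when there are many vectors.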

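Note: Research suggests that large language models, even those with explicitly long context windows, perform worse when operating on such documents without retrieval.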

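Step 4 — OpenAI Function Calling

For those unfamiliar with function calling, you can think of it as a way of structuring unstructured text data using OpenAI's large language models. Given some unstructured text as input, function calling returns structured JSON output. Read my article on function calling for a deep dive into the details. The function call takes the k most relevant documents returned by the retriever as input. Within the function call, we define the key information we want to extract from the prospectus as prompts.

We must be sufficiently precise with the prompts here to guide the large language model to extract the right information. The function call is powered by OpenAI's gpt-3.5-turbo, which can use its reasoning ability to identify and record the key information based on our prompts.

Finally, the function call returns a JSON containing the key details we asked it to extract, recorded as arguments. From here, writing these to a DataFrame is straightforward.

Note: Function calling is not available as a node in Haystack, so we have to create one as a custom component. The script below shows how this can be done.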

import json
from typing import List

import openai
from haystack.nodes.base import BaseComponent
from haystack.schema import Document

class OpenAIFunctionCall(BaseComponent):
    outgoing_edges = 1

    def __init__(self, API_KEY):
        super().__init__()
        self.API_KEY = API_KEY

    def run(self, documents: List[Document]):
        # Concatenate the text of the retrieved documents into a single prompt
        try:
            document_content_list = [doc.content for doc in documents]
            print("documents extracted")
            document_content = " ".join(document_content_list)
        except Exception as e:
            print("Error extracting content:", e)
            raise  # re-raise: a bare return would hand None to the pipeline
        functions = [
            {
                "name": "update_dataframe",
                "description": "write the fund details to a dataframe",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "Product_reference_num": {
                            "type": "string",
                            "description": "The FCA product reference number which will be six or seven digits"
                        },
                        "investment_objective": {
                            "type": "string",
                            "description": "You should return the investment objective of the fund. This is likely to be something like this: The Fund aims to grow your investment over t – t + delta t years"
                        },
                        "investment_policy": {
                            "type": "string",
                            "description": "Return a summary of the fund's investment policy, no more than two sentences."
                        },
                        "investment_strategy": {
                            "type": "string",
                            "description": "Return a summary of the fund's investment strategy, no more than two sentences."
                        },
                        "ESG": {
                            "type": "string",
                            "description": "Return either True, or False. True if the fund is an ESG fund, False otherwise."
                        },
                        "fund_name": {
                            "type": "string",
                            "description": "Return the name of the fund"
                        },
                    },
                    # "required" is part of the JSON Schema, so it belongs inside "parameters"
                    "required": ["Product_reference_num",
                                 "investment_objective",
                                 "investment_policy",
                                 "investment_strategy",
                                 "ESG",
                                 "fund_name"]
                },
            }
        ]
        
        openai.api_key = self.API_KEY
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=[{"role": "system", "content": document_content}],
            functions=functions,
            function_call={"name": "update_dataframe"},  # force the call so the response always contains "function_call"
        )

        function_call_args = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
        
        return function_call_args, "output_1"

    def run_batch(self, documents: List[list]):
        # Each element of `documents` is the list of documents retrieved for one query;
        # process them one at a time by looping over the run method
        results = []
        for docs_for_query in documents:
            result, _ = self.run(docs_for_query)
            results.append(result)
        return results, "output_1"
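
Before the extracted arguments are written to a table, it can help to check them. The sketch below is one possible way to do that with the standard library only, and is not part of the original script. The helper `parse_function_call` and the sample message are hypothetical, with made-up values in the shape the ChatCompletion API returns. The sketch returns `None` when the model replied in plain text or produced invalid JSON, and converts the "True"/"False" string requested for ESG into a boolean.

```python
import json
from typing import Any, Dict, Optional

# Field names taken from the function schema used in this article.
REQUIRED_KEYS = [
    "Product_reference_num",
    "fund_name",
    "investment_objective",
    "investment_policy",
    "investment_strategy",
    "ESG",
]


def parse_function_call(message: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return one table row from a chat message, or None if it is unusable."""
    call = message.get("function_call")
    if call is None:
        return None  # the model answered in plain text instead of calling the function
    try:
        args = json.loads(call["arguments"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return None  # arguments missing or not valid JSON
    if not isinstance(args, dict):
        return None
    row = {key: args.get(key) for key in REQUIRED_KEYS}  # missing fields become None
    # The schema asks for ESG as the string "True" or "False"; store a real boolean.
    row["ESG"] = str(row["ESG"]).strip().lower() == "true"
    return row


# Hypothetical message with made-up values.
message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "update_dataframe",
        "arguments": json.dumps({
            "Product_reference_num": "123456",
            "fund_name": "Example Fund",
            "investment_objective": "Grow your investment over 5 years.",
            "investment_policy": "Invests mainly in company shares.",
            "investment_strategy": "Long-term, bottom-up stock selection.",
            "ESG": "True",
        }),
    },
}
rows = [row for row in [parse_function_call(message)] if row is not None]
print(rows[0]["fund_name"], rows[0]["ESG"])
```

A list of such rows can be passed straight to `pandas.DataFrame(rows)` to build the table.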

Once the custom function call is defined, bringing it together with the retriever in a pipeline is simple.

from haystack import Pipeline
from function_call import OpenAIFunctionCall

def create_pipeline(retriever, API_KEY):
    p = Pipeline()
    p.add_node(component=retriever, name="retriever", inputs=["Query"])
    p.add_node(component=OpenAIFunctionCall(API_KEY), name="OpenAIFunctionCall", inputs=["retriever"])
    return p

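This GitHub repository is available for you to explore; it contains a notebook and a prototype version with a Streamlit front end.

Conclusion

This approach works quite well; however, this demo was designed for this particular fund prospectus. In a real business use case, each prospectus might use different terminology or apply different standards. Although large language models tend to be robust in such situations, I suspect you will have to design the process in a more sophisticated way to handle the nuances. What I have shown is that combining function calling with retrieval opens up many automation possibilities that are relatively easy to implement.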

Original article: Automate Tedious Document Processing

My Study Notes

Automating document processing with AI (large language models)