An Introductory Guide to Retrieval-Augmented Generation (RAG)

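Large language models (LLMs) let us process large volumes of text efficiently, reliably, and quickly. One of the most popular use cases to emerge over the past two years is retrieval-augmented generation (RAG).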

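RAG lets us take a set of documents (anywhere from a handful to hundreds of thousands), build a knowledge database from them, and then query it to get answers grounded in those documents, with the relevant sources cited. Instead of searching by hand, which can take hours or even days, we can have an LLM do the search for us in seconds.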

Cloud vs. Local Deployment

A RAG system has two parts: the knowledge database and the LLM. Think of the former as a library and the latter as a very efficient librarian.

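The first design decision when building such a system is whether to host it in the cloud or run it locally. Local deployment has a cost advantage at scale and helps protect privacy. The cloud, on the other hand, offers low startup costs and little to no maintenance.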

To demonstrate the RAG concepts clearly, this guide uses a cloud deployment, with notes on going local at the end.

Knowledge (Vector) Database

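The first step is to create the knowledge database (technically, a vector database). We do this by running the documents through an embedding model, which produces a vector for each document. Embedding models capture the meaning of text well, so the vectors they produce place similar documents close together in vector space.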

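This is very convenient. We can illustrate it by plotting the vectors of four documents from a hypothetical organization in a two-dimensional vector space: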

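As the figure shows, the two HR documents are grouped together and sit far from the other types of documents. So when we receive a question about HR, we can compute the embedding vector of that question, and it will also land close to the two HR documents.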

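With a simple Euclidean distance calculation, we can then pick the most relevant documents and hand them to the LLM so it can answer the question.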
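
To make the matching step concrete, here is a minimal sketch using made-up two-dimensional vectors. The file names, coordinates, and the `nearest_documents` helper are all hypothetical and chosen only for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 2-D embeddings for four documents (made-up values for illustration).
doc_vectors = {
    "hr_leave_policy.txt": (0.9, 0.8),
    "hr_onboarding.txt": (0.8, 0.9),
    "finance_q3_report.txt": (0.1, 0.7),
    "engineering_runbook.txt": (0.7, 0.1),
}

def nearest_documents(query_vector, vectors, top_k=2):
    """Return the top_k (name, distance) pairs closest to query_vector."""
    scored = [(name, math.dist(query_vector, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest Euclidean distance first
    return scored[:top_k]

# A made-up embedding for an HR question such as "How many vacation days do I get?"
hr_query = (0.85, 0.85)
for name, distance in nearest_documents(hr_query, doc_vectors):
    print(f"{name}: {distance:.3f}")
```

With these values, the two HR documents come out closest to the query, which is the behavior described above.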

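There are many embedding models to choose from, and a large number of them are compared on the MTEB leaderboard. An interesting fact: many open-source models rank ahead of proprietary providers such as OpenAI.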

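Beyond the overall score, two other columns on the leaderboard are worth considering: model size and the maximum number of tokens each model accepts.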

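Model size determines how much (V)RAM is needed to load the model into memory, as well as how fast embeddings are computed. Each model can only embed a limited number of tokens, so very large files may need to be split before embedding.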
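
As a rough illustration of splitting a large file before embedding, the sketch below cuts text into overlapping chunks. It assumes whitespace-separated words as a stand-in for tokens, which is only an approximation; the `split_into_chunks` helper and its `max_tokens` and `overlap` parameters are hypothetical names, and a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def split_into_chunks(text, max_tokens=200, overlap=20):
    """Split text into overlapping chunks of at most max_tokens words.

    Words are only a rough stand-in for model tokens (an assumption of this
    sketch); use the embedding model's tokenizer for exact limits.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks

sample = " ".join(f"word{i}" for i in range(450))
chunks = split_into_chunks(sample, max_tokens=200, overlap=20)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

Each chunk can then be embedded and stored under its own key (for example, file name plus chunk index). The overlap keeps sentences that straddle a boundary visible in both neighboring chunks.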

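Finally, these models can only embed text, so any PDFs need to be converted to text, and rich elements such as images should either be captioned (using an AI image-captioning model) or discarded.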

Open-source embedding models can be run locally using transformers. For OpenAI embedding models, you need an OpenAI API key.

Below is Python code that creates embeddings with the OpenAI API and stores them in a simple vector database backed by a pickle file:

import os
import pickle

from openai import OpenAI

# Enter your OpenAI API key here
openai = OpenAI(
  api_key="your_openai_api_key"
)

# Directory containing the .txt files
directory = "doc1"

# In-memory store mapping each file name to its embedding vector
embeddings_store = {}

def embed_text(text):
    """Embed text using an OpenAI embedding model."""
    response = openai.embeddings.create(
        input=text,
        model="text-embedding-3-large" # use text-embedding-3-small for better cost efficiency
    )
    return response.data[0].embedding

def process_and_store_files(directory):
    """Process the .txt files, embed them, and store the embeddings in memory."""
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            file_path = os.path.join(directory, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
                embedding = embed_text(content)
                embeddings_store[filename] = embedding
                print(f"Stored embedding for {filename}")

def save_embeddings_to_file(file_path):
    """Save the embeddings dictionary to a file."""
    with open(file_path, 'wb') as f:
        pickle.dump(embeddings_store, f)
        print(f"Embeddings saved to {file_path}")

def load_embeddings_from_file(file_path):
    """Load the embeddings dictionary from a file."""
    with open(file_path, 'rb') as f:
        loaded_store = pickle.load(f)
        print(f"Embeddings loaded from {file_path}")
        return loaded_store

# Run the processing step
process_and_store_files(directory)

# Save the embeddings to a file
save_embeddings_to_file("embeddings_store.pkl")

# To load the embeddings later, use:
# embeddings_store = load_embeddings_from_file("embeddings_store.pkl")

LLM

Now that our documents are stored in the database, let's create a function that retrieves the top 3 most relevant documents for a query:

import numpy as np

def get_top_k_relevant(query, embeddings_store, top_k=3):
    """
    Given a query string and a dictionary of document embeddings,
    return the top_k most relevant documents (lowest Euclidean distance).
    """
    query_embedding = embed_text(query)

    distances = []
    for doc_id, doc_embedding in embeddings_store.items():
        dist = np.linalg.norm(np.array(query_embedding) - np.array(doc_embedding))
        distances.append((doc_id, dist))

    distances.sort(key=lambda x: x[1])

    return distances[:top_k]

# Usage example:
# best_matches = get_top_k_relevant("What is natural language processing?", embeddings_store, top_k=3)
# print(best_matches)
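
A note on the choice of metric: OpenAI's documentation describes its embeddings as normalized to length 1. For unit-length vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so ranking by either measure gives the same order. The sketch below checks this on made-up three-dimensional vectors; the `normalize` and `cosine_similarity` helpers and all values are hypothetical, and models that do not normalize their output may rank differently under the two metrics.

```python
import math

def normalize(vec):
    """Scale vec to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, normalized to length 1 to mimic normalized embeddings.
query = normalize([0.2, 0.9, 0.4])
docs = {
    "doc_a": normalize([0.1, 0.8, 0.5]),
    "doc_b": normalize([0.9, 0.1, 0.2]),
    "doc_c": normalize([0.3, 0.7, 0.1]),
}

# Smallest distance first versus largest cosine similarity first.
by_distance = sorted(docs, key=lambda name: math.dist(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(by_distance)
print(by_cosine)
```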

Now that we have the documents, the easy part follows: prompting our LLM (GPT-4o in this example) to give an answer based on them:

from openai import OpenAI

# Enter your OpenAI API key here
openai = OpenAI(
  api_key="your_openai_api_key"
)

# Example doc_store and embeddings_store:
# doc_store = {
#    "doc1.txt": "Full text of doc1",
#    "doc2.txt": "Full text of doc2",
#    ...
# }
#
# embeddings_store = {
#    "doc1.txt": [embedding_vector],
#    "doc2.txt": [embedding_vector],
#    ...
# }

def answer_query_with_context(query, doc_store, embeddings_store, top_k=3):
    """
    Given a query, find the top_k most relevant documents and prompt GPT-4o
    to answer the query using those documents as context.
    """
    best_matches = get_top_k_relevant(query, embeddings_store, top_k)

    context = ""
    for doc_id, distance in best_matches:
        doc_content = doc_store.get(doc_id, "")
        context += f"--- Document: {doc_id} (Distance: {distance:.4f}) ---\n{doc_content}\n\n"

    completion = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Use the provided context to answer the user’s query. "
                    "If the answer isn't in the provided context, say you don't have enough information."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Context:\n{context}\n"
                    f"Question:\n{query}\n\n"
                    "Please provide a concise, accurate answer based on the above documents."
                )
            }
        ],
        temperature=0.7 # adjustable; lower values make answers more deterministic
    )

    answer = completion.choices[0].message.content
    return answer

# Usage example:
# query = "What are the key points from the documents?"
# response = answer_query_with_context(query, doc_store, embeddings_store, top_k=3)
# print("GPT-4o Response:", response)
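
The answer_query_with_context function expects a doc_store dictionary holding the full text of each document, which the embedding script earlier in this guide does not build. A minimal sketch of one way to create it is shown here; the `load_doc_store` helper is a hypothetical name, and it assumes the same directory of UTF-8 .txt files that was used for embedding, so the keys line up with those in embeddings_store.

```python
import os

def load_doc_store(directory):
    """Read every .txt file in directory into a {filename: text} dictionary."""
    doc_store = {}
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".txt"):
            file_path = os.path.join(directory, filename)
            with open(file_path, "r", encoding="utf-8") as file:
                doc_store[filename] = file.read()
    return doc_store

# Usage example (assumes the "doc1" directory from the embedding step):
# doc_store = load_doc_store("doc1")
```

Keeping the text in a separate dictionary means the pickle file only holds vectors; for larger collections, storing text and vectors together in a proper vector database is likely the more practical route.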

Conclusion

That's it! This is a straightforward RAG implementation with plenty of room for improvement. Here are some suggested next steps:

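Use a local LLM, or even add voice support.
Fine-tune the LLM with Direct Preference Optimization (DPO).
For highly specialized domains such as medicine or law, fine-tune the embedding model so it matches documents more accurately.
For large-scale applications, use an enterprise-grade vector database such as Pinecone or Milvus.
If you are not satisfied with the out-of-the-box results, fine-tune the LLM to answer questions in a more suitable way.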