CrewAI's Knowledge feature gives us convenient access to external information sources. These sources can be domain-specific knowledge data, or the session-context data an Agent needs in order to complete a particular Task. In other words, the Knowledge feature lets us build RAG-style AI Agent applications on top of an LLM.
The Knowledge Sources supported by CrewAI fall into two broad categories: text sources (raw strings, text files, PDF documents) and structured data (CSV files, Excel spreadsheets, JSON documents).
Beyond Knowledge, CrewAI also supports a variety of other external information sources that an Agent's Task can access during execution. The "Tools" section of the official CrewAI documentation lists these built-in tools, which make it easy to implement RAG-like functionality. Some of the more commonly used ones include:
- CSV RAG Search
- Directory RAG Search
- DOCX RAG Search
- JSON RAG Search
- MDX RAG Search
- MySQL RAG Search
- PDF RAG Search
- PG RAG Search
- Qdrant Vector Search Tool
- TXT RAG Search
- XML RAG Search
- Website RAG Search
- YouTube Channel RAG Search
- YouTube Video RAG Search
PDFSearchTool Usage Example
Next, we use CrewAI's PDFSearchTool to build a simple question-answering system: user questions are answered based on the content of a given PDF document, which effectively turns that document into the session's knowledge base. (This example is lightly adapted from https://github.com/drewzeee/crewai-rag.git; the business logic is unchanged, but inference runs on a local DeepSeek-R1 model managed by Ollama.) The example works as follows:
Two Agents are set up. The Research Agent searches the given PDF document and retrieves the content relevant to the user's question; the Professional Writer Agent then takes the Research Agent's findings and writes an email that meets the specified requirements.
The complete implementation is shown below:
```python
from crewai import Agent, Crew, Process, Task, LLM
from crewai_tools import PDFSearchTool
import litellm

litellm._logging._turn_on_debug()

# Initialize local LLM
llm = LLM(
    api_key="NA",
    temperature=0,
    model="ollama/deepseek-r1:1.5b",
    base_url="http://localhost:11434")
print("Local LLM: ", llm)

# PDF SOURCE: https://www.gpinspect.com/wp-content/uploads/2021/03/sample-home-report-inspection.pdf
pdf_search_tool = PDFSearchTool(
    pdf="./example_home_inspection.pdf",
    config=dict(
        llm=dict(
            provider="ollama",
            config=dict(
                model="deepseek-r1:1.5b",
            ),
        ),
        embedder=dict(
            provider="ollama",
            config=dict(
                model="nomic-embed-text:latest",
            ),
        ),
    )
)

research_agent = Agent(
    role="Research Agent",
    goal="Search through the PDF to find relevant answers",
    allow_delegation=False,
    verbose=True,
    backstory=(
        """
        The research agent is adept at searching and extracting data from documents,
        ensuring accurate and prompt responses.
        """
    ),
    tools=[pdf_search_tool],
    llm=llm
)

professional_writer_agent = Agent(
    role="Professional Writer",
    goal="Write professional emails based on the research agent's findings",
    allow_delegation=False,
    verbose=True,
    backstory=(
        """
        The professional writer agent has excellent writing skills and is able to craft
        clear and concise emails based on the provided information.
        """
    ),
    tools=[],
    llm=llm
)

# --- Tasks ---
answer_customer_question_task = Task(
    description=(
        """
        Answer the customer's questions based on the home inspection PDF.
        The research agent will search through the PDF to find the relevant answers.
        Your final answer MUST be clear and accurate, based on the content of the
        home inspection PDF.

        Here is the customer's question:
        {customer_question}
        """
    ),
    expected_output="""
        Provide clear and accurate answers to the customer's questions based on
        the content of the home inspection PDF.
        """,
    tools=[pdf_search_tool],
    agent=research_agent,
)

write_email_task = Task(
    description=(
        """
        - Write a professional email to a contractor based on the research agent's findings.
        - The email should clearly state the issues found in the specified section of the
          report and request a quote or action plan for fixing these issues.
        - Ensure the email is signed with the following details:

            Best regards,

            Brandon Hancock,
            Hancock Realty
        """
    ),
    expected_output="""
        Write a clear and concise email that can be sent to a contractor to address
        the issues found in the home inspection report.
        """,
    tools=[],
    agent=professional_writer_agent,
)

crew = Crew(
    agents=[research_agent, professional_writer_agent],
    tasks=[answer_customer_question_task, write_email_task],
    process=Process.sequential,
)

customer_question = input(
    "Which section of the report would you like to generate a work order for?\n"
)
# NOTE: the question is hard-coded here, so whatever is typed at the prompt
# above is not actually used.
result = crew.kickoff_for_each(inputs=[{"customer_question": "Bathroom and Components"}])
print(result)
```
In the code above, the user question is "Bathroom and Components": relevant content is retrieved from that section of the PDF, and a Work Order email is ultimately generated.
Running the code, we can follow the logs to see some key moments of the execution:
1. Using a custom LLM and embedding model
In PDFSearchTool, the config argument specifies the LLM and embedding model you want to use, as shown below:
```python
config=dict(
    llm=dict(
        provider="ollama",
        config=dict(
            model="deepseek-r1:1.5b",
        ),
    ),
    embedder=dict(
        provider="ollama",
        config=dict(
            model="nomic-embed-text:latest",
        ),
    ),
)
```
If no embedding model is specified, PDFSearchTool falls back to OpenAI's embedding model by default, so running it locally will likely fail when the OpenAI API call cannot be made. Also note that the nomic-embed-text model must first be pulled locally via Ollama, and the Ollama service must be running, for the program to work.
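Before running the example, it can help to verify that the required models have actually been pulled. A minimal sketch using only the standard library; it queries Ollama's `/api/tags` HTTP endpoint (part of Ollama's REST API), and the model names match this article's setup:

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # default Ollama endpoint

def model_available(tags_response: dict, name: str) -> bool:
    """Check whether a model name appears in an Ollama /api/tags response."""
    return any(m.get("name", "").startswith(name)
               for m in tags_response.get("models", []))

def missing_models(required=("deepseek-r1:1.5b", "nomic-embed-text")):
    """Query the local Ollama server and return any models not yet pulled."""
    with urllib.request.urlopen(f"{OLLAMA_BASE_URL}/api/tags") as resp:
        tags = json.load(resp)
    return [name for name in required if not model_available(tags, name)]
```

Calling `missing_models()` with Ollama running returns an empty list once both models are pulled with `ollama pull`.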
2. Writing the PDF document's embeddings to ChromaDB
While the program runs, the embeddings of the PDF document's content are written to ChromaDB, as shown below:
```
Local LLM:  <crewai.llm.LLM object at 0x122f65e20>
/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/embedchain/embedder/ollama.py:27: LangChainDeprecationWarning: The class `OllamaEmbeddings` was deprecated in LangChain 0.3.1 and will be removed in 1.0.0. An updated version of the class exists in the :class:`~langchain-ollama package and should be used instead. To use it run `pip install -U :class:`~langchain-ollama` and import as `from :class:`~langchain_ollama import OllamaEmbeddings``.
  embeddings = OllamaEmbeddings(model=self.config.model, base_url=config.base_url)
Inserting batches in chromadb: 100%|██████████| 3/3 [02:14<00:00, 44.84s/it]
```
We can see that a db directory is created under the current working directory, holding the corresponding data:

```
db
├── 3b95287f-b015-43cd-937c-4a45617dff1c
│   ├── data_level0.bin
│   ├── header.bin
│   ├── length.bin
│   └── link_lists.bin
└── chroma.sqlite3
```
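If you are curious what ChromaDB stores in that directory, you can peek inside chroma.sqlite3 with Python's built-in sqlite3 module. A small sketch (the db path below comes from the run above; the helper itself works on any SQLite file):

```python
import sqlite3

def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# e.g. list_tables("db/chroma.sqlite3") lists ChromaDB's internal tables
# (collections, embeddings, and so on).
```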
3. The Research Agent searches the PDF document
The Research Agent searches the PDF document for content related to the question ("Bathroom and Components"); the retrieved result can be seen in the run logs:
```
# Agent: Research Agent
## Final Answer:
The home inspection PDF provides detailed information about the bathroom system, including water heater,龙头, Tank, Shower, and Sink. It also outlines fixtures like showerheads, vanities, and basic components such as flushers and toilet tanks.
```

This is the content the Research Agent retrieved from the PDF. (Note that the small DeepSeek-R1 1.5B model occasionally mixes a Chinese word into its English output, e.g. 龙头, "faucet"; the log is reproduced verbatim.)
4. The Professional Writer Agent's interaction with the LLM
The Research Agent's retrieval result becomes the session context for the Professional Writer Agent's task. In the logs we can see the request sent to the LLM at this point:
```
POST Request Sent from LiteLLM:
curl -X POST \
http://localhost:11434/api/generate \
-d '{'model': 'deepseek-r1:1.5b', 'prompt': "### System:\nYou are Professional Writer. \n The professional writer agent has excellent writing skills and is able to craft \n clear and concise emails based on the provided information.\n \nYour personal goal is: Write professional emails based on the research agent's findings\nTo give my best complete final answer to the task respond using the exact following format:\n\nThought: I now can give a great answer\nFinal Answer: Your final answer must be the great and the most complete as possible, it must be outcome described.\n\nI MUST use these formats, my job depends on it!\n\n### User:\n\nCurrent Task: \n - Write a professional email to a contractor based \n on the research agent's findings.\n - The email should clearly state the issues found in the specified section \n of the report and request a quote or action plan for fixing these issues.\n - Ensure the email is signed with the following details:\n \n Best regards,\n\n Brandon Hancock,\n Hancock Realty\n \n\nThis is the expected criteria for your final answer: \n Write a clear and concise email that can be sent to a contractor to address the \n issues found in the home inspection report.\n \nyou MUST return the actual complete content as the final answer, not a summary.\n\nThis is the context you're working with:\nThe home inspection PDF provides detailed information about the bathroom system, including water heater,龙头, Tank, Shower, and Sink. It also outlines fixtures like showerheads, vanities, and basic components such as flushers and toilet tanks.\n\nBegin! This is VERY important to you, use the tools available and give your best Final Answer, your job depends on it!\n\nThought:\n\n### User:\nI did it wrong. Invalid Format: I missed the 'Action:' after 'Thought:'. I will do right next, and don't use a tool I have already used.\n\nIf you don't need to use any more tools, you must give your best complete final answer, make sure it satisfies the expected criteria, use the EXACT format below:\n\n```\nThought: I now can give a great answer\nFinal Answer: my best complete final answer to the task.\n\n```\n\n", 'options': {'temperature': 0, 'stop': ['\nObservation:']}, 'stream': False}'
```
Notice that the content beginning with "This is the context you're working with:" is exactly the retrieval result produced by the Research Agent's task. It serves as the context for the Professional Writer Agent's task, supplying the information relevant to the user's question.
5. The Professional Writer Agent produces the final result
The run logs show that the Professional Writer Agent generated the final Work Order email according to the specified requirements:
```
# Agent: Professional Writer
## Final Answer:
Brandon Hancock, Hancock Realty

Dear Brandon Hancock,

I am writing to address the findings from the bathroom system inspection report. The key issues identified include water heater performance,龙头 pressure, Tank leaks, Shower fixtures not functioning properly, and potential issues with vanities or flushers. We request quotes for component replacements and an action plan to address these problems promptly. Please let us know when you can provide more details on the specific components that need attention.

Best regards,
Brandon Hancock
Hancock Realty
```
That completes the PDFSearchTool example: two Agents collaborating to answer the user's question.
CrewAI Knowledge Usage Example
In CrewAI, both Agent and Crew expose a knowledge_sources attribute, so Knowledge is supported at two different levels, Agent and Crew, each confining the knowledge to its own scope.
Various issues can come up when using the Knowledge feature; the examples below illustrate them concretely using StringKnowledgeSource and PDFKnowledgeSource.
StringKnowledgeSource Example
Here we answer the user's question based on a given string. Example code:
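To make the two scopes concrete, here is a fragment sketching Crew-level knowledge (the `...` placeholders elide the agents and tasks; it assumes, per the CrewAI docs, that the Crew constructor accepts knowledge_sources and embedder just as Agent does):

```python
# Crew-level knowledge: every agent in the crew can draw on these sources,
# whereas agent-level knowledge_sources are visible to that agent only.
crew = Crew(
    agents=[...],
    tasks=[...],
    process=Process.sequential,
    knowledge_sources=[string_source],  # shared across the whole crew
    embedder=dict(
        provider="ollama",
        config=dict(model="nomic-embed-text:latest", base_url="http://localhost:11434"),
    ),
)
```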
```python
...
from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

content = "Users name is John. He is 30 years old and lives in San Francisco."
string_source = StringKnowledgeSource(
    content=content,
)

# Create an agent with the knowledge store
agent = Agent(
    role="About User",
    goal="You know everything about the user.",
    backstory="""You are a master at understanding people and their preferences.""",
    verbose=True,
    allow_delegation=False,
    llm=llm,
    knowledge_sources=[string_source],
    embedder=dict(
        provider="ollama",
        config=dict(
            model="nomic-embed-text:latest",
            base_url="http://localhost:11434",
        ),
    ),
)
task = Task(
    description="Answer the following questions about the user: {question}",
    expected_output="An answer to the question.",
    agent=agent,
)
crew = Crew(
    agents=[agent],
    tasks=[task],
    verbose=True,
    process=Process.sequential,
)
result = crew.kickoff(inputs={"question": "What city does John live in and how old is he?"})
```
We can see that the Agent, using the input string as its base context, answers the user's question correctly:

```
# Agent: About User
## Final Answer:
John is 30 years old and lives in San Francisco.
```
Two points deserve attention:
1. Specify your own embedding model configuration
Use the Agent's embedder argument to configure the embedding model you want; see the earlier explanation for details.
2. The length of the Knowledge Source text affects the Agent
The string used with StringKnowledgeSource above is fairly short, so it causes no problems at run time. If the text is too long, however, an error is raised, and the StringKnowledgeSource needs an explicit chunking configuration, for example:
```python
content = "... ...(a very long string of text)"
string_source = StringKnowledgeSource(
    content=content,
    chunk_size=256,   # Maximum size of each chunk (default: 4000)
    chunk_overlap=32  # Overlap between chunks (default: 200)
)
```
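To make the meaning of chunk_size and chunk_overlap concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap. It illustrates the idea only and is not CrewAI's internal implementation:

```python
def chunk_text(text: str, chunk_size: int = 256, chunk_overlap: int = 32) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters, where each
    chunk repeats the last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap preserves context across chunk boundaries, so a sentence cut in half by one chunk is still seen whole in the next.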
Note that this applies to every kind of Knowledge Source: overly long content needs a chunking configuration, or the program fails. In my case, the failure showed up as a timeout while upserting into ChromaDB:
```
[2025-03-12 13:08:02][ERROR]: Failed to upsert documents: timed out in upsert.
Traceback (most recent call last):
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
    yield
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpx/_transports/default.py", line 236, in handle_request
    resp = self._pool.handle_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/connection.py", line 103, in handle_request
    return self._connection.handle_request(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 136, in handle_request
    raise exc
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 106, in handle_request
    ) = self._receive_response_headers(**kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 177, in _receive_response_headers
    event = self._receive_event(timeout=timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_sync/http11.py", line 217, in _receive_event
    data = self._network_stream.read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_backends/sync.py", line 126, in read
    with map_exceptions(exc_map):
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/Users/dean/anaconda3/envs/crewai/lib/python3.12/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ReadTimeout: timed out
```
Adjusting the chunking configuration to your data resolves this.
PDFKnowledgeSource Example
Example code using PDFKnowledgeSource:
```python
...
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

# Create a PDF knowledge source
pdf_source = PDFKnowledgeSource(
    file_paths=["kde_example_home_inspection.pdf"],
    # chunk_size=256,   # Maximum size of each chunk (default: 4000)
    # chunk_overlap=32  # Overlap between chunks (default: 200)
)

# Create an agent with the knowledge store
agent = Agent(
    role="Report Analyst",
    goal="You know everything about the report.",
    backstory="""You are a master at understanding the content of the given report.""",
    verbose=True,
    allow_delegation=False,
    llm=llm,
    knowledge_sources=[pdf_source],
    embedder=dict(
        provider="ollama",
        config=dict(
            model="nomic-embed-text:latest",
            base_url="http://localhost:11434",
        ),
    ),
)
task = Task(
    description="Answer the following questions about the report: {question}",
    expected_output="An answer to the question.",
    agent=agent,
)
crew = Crew(
    agents=[agent],
    tasks=[task],
    verbose=True,
    process=Process.sequential,
)
crew.kickoff(inputs={"question": "What is the percentage of the lighting use typically?"})
```
Summary
While working with CrewAI, I found that if an Agent's role is written in Chinese, the program errors out. Tracing through the code shows that the role content, after some processing, is used as the ChromaDB collection name; a collection name derived from Chinese role text fails validation, and execution stops there. This may be environment-related or have some other cause; as a workaround, writing the role in English avoids the problem.
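The failure is easy to reproduce conceptually: ChromaDB validates collection names against rules that permit only ASCII alphanumerics plus `.`, `_`, and `-`. Below is a simplified re-implementation of those rules (assumed from ChromaDB's documented naming constraints, not copied from its source), showing why a Chinese role name cannot survive as a collection name:

```python
import re

def is_valid_collection_name(name: str) -> bool:
    """Simplified sketch of ChromaDB's collection-name rules: 3-63 characters,
    starts and ends with an alphanumeric, only [a-zA-Z0-9._-], no '..'."""
    if not (3 <= len(name) <= 63):
        return False
    if ".." in name:
        return False
    return re.fullmatch(r"[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]", name) is not None

# An English-derived name passes; Chinese characters fall outside the
# allowed character set and fail validation.
assert is_valid_collection_name("research_agent")
assert not is_valid_collection_name("研究员")
```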
When an AI Agent application falls short of what we planned, possible causes include:
1. The languages the chosen LLM supports: some LLMs handle English better, others Chinese;
2. The chosen LLM's reasoning ability: with weak reasoning, good results are hard to reach;
3. Prompt-writing skill for Agents and Tasks: unclear descriptions tend to produce poor results.
This article is published under the Attribution-NonCommercial-ShareAlike 4.0 license. You are welcome to repost, use, and republish it, provided you retain the author attribution 时延军 (including the link: http://shiyanjun.cn), do not use it for commercial purposes, and release any works based on modifications of this article under the same license. If you have any questions, please contact me.