How I Used Clustering to Improve Chunking and Build Better RAGs

Original article: towardsdatascience.com/improving-rag-chunking-with-clustering-03c1cf41f1cd

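Semantic chunking broke down when related sentences sat far apart.

My first attempt at semantic chunking failed when related sentences sat far apart in the text. Agentic approaches, on the other hand, are expensive and too slow in most cases. I wanted a sweet spot: something reasonably accurate and cheap.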

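RAG applications depend heavily on the chunking strategy. Better chunking leads to better responses. There are many ways to chunk text. The simplest and most popular is recursive character chunking; slightly more complex but very helpful is semantic chunking. An even more human-like approach is agentic chunking.

If you're new to this, check out my previous articles on chunking.

How to Achieve Near Human-Level Performance in Chunking
Why Does Position-Based Chunking Lead to Poor Performance in RAGs?

In this post, I'll share why I use clustering as an effective alternative to both semantic and agentic chunking. Let's first look at how semantic chunking works.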

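Semantic chunking, in a nutshell

Semantic chunking splits a document wherever the semantic meaning of two consecutive sentences differs significantly.

For example, when you talk about climate change, you might start with rising temperatures, move on to whale populations, and then turn to politics. A position-based method such as recursive character splitting doesn't care about the topics in your text; it splits at a fixed token length no matter what.

Semantic chunking, however, separates them cleanly. As long as you stay on one topic, the chunk stays the same; it changes only when you start discussing a new topic.

The challenge comes when you return to an earlier topic. Suppose that, in the same example, you talk about rising temperatures again. Semantic chunking creates a new chunk even though a chunk on that topic already exists.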
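To make the limitation concrete, here is a minimal sketch of the idea, not a production splitter. The three-dimensional "embeddings" are hand-made (one axis per topic), the 0.8 threshold is arbitrary, and the function names are hypothetical. It only compares each sentence with the one right before it, which is why the returning topic ends up in a chunk of its own.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_chunks(sentences, vectors, threshold=0.8):
    """Open a new chunk whenever two consecutive sentences are dissimilar."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # topic changed: start a new chunk
        chunks[-1].append(sentences[i])
    return chunks

# Hand-made vectors; axes stand for (temperature, whales, politics).
sentences = [
    "Temperatures are rising.",
    "Heat records keep falling.",
    "Whale populations are shrinking.",
    "Leaders are divided on climate action.",
    "Yet the temperature keeps rising.",
]
vectors = [
    (1.0, 0.1, 0.0),
    (0.9, 0.0, 0.1),
    (0.1, 1.0, 0.0),
    (0.0, 0.1, 1.0),
    (1.0, 0.0, 0.1),
]

chunks = semantic_chunks(sentences, vectors)
for chunk in chunks:
    print(chunk)
```

The last sentence is about temperature again, yet it lands in a fourth chunk instead of joining the first one, because only neighbouring sentences are ever compared.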


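Agentic chunking, in a nutshell

We often suggest agentic techniques (or, more precisely, agent-like approaches) to overcome this problem.

In the agentic approach, we process the text sentence by sentence. A large language model (LLM) decides whether the sentence belongs to an existing group of similar sentences. If it finds no such group, it creates a new one.

One necessary adjustment before processing the sentences is propositioning. This means converting your sentences into standalone statements. Put more simply, propositioning replaces "he," "she," "they," and other references with the things they actually refer to.

As you may have guessed, though, we rely heavily on LLM calls for this. If you've read my previous article, you'll know there are at least two LLM calls per sentence: one to decide which chunk fits the sentence best, and another to update the chunk's summary and title because a new sentence has been added. Sometimes there is an additional call to create a new chunk.

This is bad for two reasons. The obvious one is cost: more LLM calls mean a higher bill. To cut costs, you might try a smaller model (such as GPT-4o-mini) for these tasks, but that's a decision you have to make based on your specific needs.

The other drawback is latency. Unless you use a locally hosted model, the network round trip on every call makes chunking a larger document take a long time.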
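The call pattern described above can be sketched with stubs in place of the LLM. Everything here is hypothetical: the `fake_llm_*` functions stand in for real model calls by matching keywords, and the 1.5 seconds per call is an assumed figure, not a measurement. The point is only to show how quickly the number of calls grows.

```python
from dataclasses import dataclass, field

TOPICS = ("temperature", "whale", "leaders")  # keywords the stub "LLM" recognises
ASSUMED_SECONDS_PER_CALL = 1.5  # hypothetical latency, not a measurement

@dataclass
class Chunk:
    title: str
    propositions: list = field(default_factory=list)

@dataclass
class CallCounter:
    calls: int = 0

def fake_llm_pick_chunk(proposition, chunks, counter):
    """Stand-in for the LLM call that decides which chunk fits a proposition."""
    counter.calls += 1
    topic = next((t for t in TOPICS if t in proposition.lower()), "misc")
    match = next((c for c in chunks if c.title == topic), None)
    return topic, match

def fake_llm_create_chunk(topic, counter):
    """Stand-in for the occasional extra call that creates a new chunk."""
    counter.calls += 1
    return Chunk(title=topic)

def fake_llm_update_summary(chunk, counter):
    """Stand-in for the LLM call that refreshes a chunk's title and summary."""
    counter.calls += 1

def agentic_chunk(propositions):
    counter = CallCounter()
    chunks = []
    for prop in propositions:
        topic, chunk = fake_llm_pick_chunk(prop, chunks, counter)
        if chunk is None:
            chunk = fake_llm_create_chunk(topic, counter)
            chunks.append(chunk)
        chunk.propositions.append(prop)
        fake_llm_update_summary(chunk, counter)
    return chunks, counter.calls

propositions = [
    "The temperature is rising.",
    "Whale populations are shrinking.",
    "Political leaders are divided.",
    "The temperature keeps rising.",
    "Whale food sources are dwindling.",
]
chunks, calls = agentic_chunk(propositions)
print(f"{calls} LLM calls, roughly {calls * ASSUMED_SECONDS_PER_CALL:.1f} s if run sequentially")
```

Note that the agentic loop does put the second temperature sentence back into the existing temperature chunk; the price is two or three sequential model calls for every single proposition.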


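The clustering approach

I wanted a cheap and fast technique that overcomes the problem with semantic chunking, a middle ground between semantic and agentic chunking.

I tried K-means clustering on the vectorized sentences (sorry, propositions), and it worked!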
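Why clustering helps is easiest to see on toy data. The sketch below is a bare-bones version of Lloyd's algorithm written for illustration only; it is not what scikit-learn does internally. The vectors are hand-made (one axis per topic), and seeding the centroids with the first k points is a simplification I chose to keep the example deterministic (scikit-learn defaults to k-means++ seeding).

```python
import math

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels

# Hand-made vectors; axes stand for (temperature, whales, politics).
propositions = [
    "The temperature is rising.",
    "Whale populations are shrinking.",
    "Political leaders are divided.",
    "The temperature keeps rising.",
    "Whale food sources are dwindling.",
]
vectors = [
    (1.0, 0.1, 0.0),
    (0.1, 1.0, 0.0),
    (0.0, 0.1, 1.0),
    (0.9, 0.0, 0.1),
    (0.0, 0.9, 0.1),
]

labels = kmeans(vectors, k=3)
for label, prop in zip(labels, propositions):
    print(label, prop)
```

Because clustering looks at all vectors at once rather than at neighbouring pairs, the two temperature propositions share a label even though other topics sit between them in the text.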

First, here's a block of synthetic text to experiment with.

The Earth is warming, and the consequences are becoming increasingly dire. Rising temperatures are disrupting ecosystems, with the oceans being among the hardest hit. The warming seas are threatening whale populations, as their migratory patterns shift and their food sources dwindle. Entire species are at risk as the delicate balance of marine life is thrown off by the relentless heat. On land, the effects are just as devastating. Communities are being displaced by the increasing frequency of extreme weather events and rising sea levels. These climate refugees are forced to leave their homes behind, seeking shelter in regions ill-prepared to accommodate them. Despite the mounting evidence, political leaders remain divided on climate action. Some push for immediate change, advocating for aggressive policies to curb emissions, while others downplay the severity of the crisis, prioritizing short-term economic gains over long-term survival. Yet, the temperature keeps rising, driving home the urgency of the situation. Each fraction of a degree brings us closer to irreversible damage, and the window to act is closing rapidly. The future depends on the choices we make now.

As we know, the first step is to convert these sentences into self-explanatory propositions. The following lines of code do the job.

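    from langchain import hub
    from langchain.pydantic_v1 import BaseModel
    from langchain_openai import ChatOpenAI

    # Assumption: any chat model that supports structured output works here;
    # the original snippet does not show how `llm` is defined.
    llm = ChatOpenAI(model="gpt-4o-mini")


    def create_propositions(paragraph: str) -> list[str]:
        print(paragraph)

        # Pull a ready-made propositioning prompt from the LangChain hub
        propositioning_prompt = hub.pull("wfh/proposal-indexing")

        # Schema for the structured output: a plain list of sentences
        class Sentences(BaseModel):
            sentences: list[str]

        propositioning_llm = llm.with_structured_output(Sentences)
        propositioning_chain = propositioning_prompt | propositioning_llm

        sentences = propositioning_chain.invoke(paragraph)
        return sentences.sentences


    # `text` holds the sample passage above; split it into paragraphs on blank lines
    props = [prop for para in text.split("\n\n") for prop in create_propositions(para)]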

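In the code above, we use an LLM prompt to convert sentences into propositions. You could get creative and write your own prompt, but there is a very good one in the LangChain hub.

We also use a Pydantic model to extract the sentences as structured output. This is the most reliable way to extract sentences from an unstructured source.

Finally, we split the text into paragraphs and pass each one to our create_propositions function. This ensures that a reference keeps the same meaning within a paragraph but can mean something different in another paragraph.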
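The split-then-flatten step can be tried without an API key. In this sketch, `fake_create_propositions` is a hypothetical stand-in that simply treats every sentence as a proposition, and the two-paragraph text is a shortened sample; only the paragraph handling mirrors the real code.

```python
def fake_create_propositions(paragraph):
    """Stand-in for the LLM step: returns one 'proposition' per sentence."""
    return [s.strip() + "." for s in paragraph.split(".") if s.strip()]

text = (
    "The Earth is warming. Oceans are among the hardest hit.\n\n"
    "Political leaders remain divided. Some push for immediate change."
)

# Paragraphs are separated by a blank line, i.e. two newline characters.
paragraphs = text.split("\n\n")

# Each paragraph is processed on its own; the results are flattened into one list.
props = [
    prop
    for para in paragraphs
    for prop in fake_create_propositions(para)
]
print(props)
```

Processing one paragraph per call keeps each request small and means a pronoun is only ever resolved against the paragraph it appears in.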

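The next step is to create embeddings for our sentences. Embeddings convert our text into vectors that preserve its semantic meaning. There are many embedding models you can use. Here, I'm using the OpenAI embedding model.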

    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings()
    prop_embeddings = embeddings.embed_documents(props)

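Now that we have vector embeddings for our propositions, we can cluster them. Again, there are numerous clustering techniques. The first one anybody tries is probably K-means: it is easy to understand, easy to implement, and fast to run. In most cases, K-means is a good enough algorithm, and I prefer it here for the same reasons.

The following code creates K-means clusters of our embeddings, along with a list of dictionaries that stores each proposition, its embedding, and its cluster label.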

    from sklearn.cluster import KMeans

    # Four clusters, to match the output shown below
    num_clusters = 4

    # Cluster the embeddings and assign a cluster to each proposition
    kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(prop_embeddings)
    cluster_assignments = kmeans.labels_

    # Create a list of dicts to store the embeddings, the text, and the cluster assignment
    props_clustered = [
        {"text": prop, "embeddings": emb, "cluster": cluster}
        for prop, emb, cluster in zip(props, prop_embeddings, cluster_assignments)
    ]

    # Display clusters and their propositions
    for cluster in range(num_clusters):
        print(f"Cluster {cluster}:")
        for prop in props_clustered:
            if prop["cluster"] == cluster:
                print(f" - {prop['text']}")
        print()

Here is the output.

    Cluster 0:
     - Communities are being displaced by the increasing frequency of extreme weather events and rising sea levels.
     - These displaced communities are climate refugees.
     - Climate refugees are forced to leave their homes behind.
     - Climate refugees seek shelter in regions ill-prepared to accommodate them.

    Cluster 1:
     - The Earth is warming.
     - There is mounting evidence supporting the need for climate action.
     - The future depends on the choices humanity makes now.

    Cluster 2:
     - Political leaders remain divided on climate action.
     - Some political leaders push for immediate change.
     - These political leaders advocate for aggressive policies to curb emissions.
     - Other political leaders downplay the severity of the climate crisis.
     - These political leaders prioritize short-term economic gains over long-term survival.

    Cluster 3:
     - The oceans are among the hardest hit ecosystems.
     - The warming seas are threatening whale populations.
     - The migratory patterns of whale populations are shifting.
     - The food sources of whale populations are dwindling.
     - Entire species are at risk.
     - On land, the effects are just as devastating.

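These chunks are highly relevant and accurate. Best of all, creating the clusters took hardly any time. With an agentic technique, every LLM call would have taken a fair amount of time, even for this short passage.

The first cluster covers the impact of climate change on communities, the third covers the political perspective, and the last covers its impact on the oceans and whales.

That's impressive for a chunking method that is both cheap and fast.

Let's create chunks from the clusters we made.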

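    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import PromptTemplate

    chunk_maker_prompt = PromptTemplate.from_template(
        """
        Summarize the following text into a concise paragraph.
        It should preserve any key information and statistics.

        Text: {text}
        """
    )

    # Return the model's reply as a plain string
    output_parser = StrOutputParser()
    chunk_maker_chain = chunk_maker_prompt | llm | output_parser

    # Collect the proposition texts that belong to each cluster
    clusters = [
        [prop["text"] for prop in props_clustered if prop["cluster"] == cluster]
        for cluster in range(num_clusters)
    ]

    for i, c in enumerate(clusters):
        print(f"Cluster {i}:")
        # Join the propositions so the prompt receives plain text, not a Python list
        print(chunk_maker_chain.invoke({"text": " ".join(c)}))
        print()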
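The grouping step can also be exercised offline. The sketch below is an illustration under assumptions: `group_by_cluster` and `fake_summarize` are hypothetical helpers, the stub just joins the propositions instead of calling an LLM, and the three sample records are made up to mirror the shape of `props_clustered`.

```python
from collections import defaultdict

def group_by_cluster(props_clustered):
    """Collect proposition texts per cluster label, keeping their original order."""
    groups = defaultdict(list)
    for prop in props_clustered:
        groups[int(prop["cluster"])].append(prop["text"])
    return dict(sorted(groups.items()))

def fake_summarize(texts):
    """Stand-in for the LLM summarization chain: simply joins the propositions."""
    return " ".join(texts)

# Made-up records in the same shape as `props_clustered` (embeddings omitted).
props_clustered = [
    {"text": "The Earth is warming.", "cluster": 1},
    {"text": "Whale populations are threatened.", "cluster": 0},
    {"text": "The temperature keeps rising.", "cluster": 1},
]

chunks = {
    cluster: fake_summarize(texts)
    for cluster, texts in group_by_cluster(props_clustered).items()
}
for cluster, chunk in chunks.items():
    print(f"Cluster {cluster}: {chunk}")
```

Grouping by the labels that actually occur, rather than iterating over a fixed cluster count, avoids empty groups and keeps the loop correct if the number of clusters is changed later.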