spaCy Chunks: a document chunking component for spaCy

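spaCy Chunks is a custom pipeline component for spaCy that generates overlapping chunks of sentences or tokens from a document. It is useful for NLP tasks that need to work on smaller, possibly overlapping segments of text.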

Features

Chunking by sentence or by token
Configurable chunk size
Adjustable overlap between chunks
Option to drop incomplete trailing chunks

Installation

To use spaCy Chunks, first install spaCy, then the package itself:

pip install spacy
pip install spacy_chunks

Download a spaCy model:

python -m spacy download en_core_web_sm

Usage

import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Add the chunking component to the end of the pipeline
nlp.add_pipe("spacy_chunks", last=True, config={
    "chunking_method": "sentence",
    "chunk_size": 2,
    "overlap": 1,
    "truncate": True
})

# Process the text
text = "This is the first sentence. This is the second one. And here's the third. The fourth is here. And a fifth."
doc = nlp(text)

# Print the resulting chunks
print("Chunks:")
for i, chunk in enumerate(doc._.chunks, 1):
    print(f"Chunk {i}: {[sent.text for sent in chunk]}")

Output:

Chunks:
Chunk 1: ['This is the first sentence.', 'This is the second one.']
Chunk 2: ['This is the second one.', "And here's the third."]
Chunk 3: ["And here's the third.", 'The fourth is here.']
Chunk 4: ['The fourth is here.', 'And a fifth.']

Configuration

The following parameters can be set when adding the chunking component to the pipeline:

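chunking_method: "sentence" or "token" (default: "sentence")
chunk_size: number of sentences or tokens per chunk (default: 3)
overlap: number of sentences or tokens shared between consecutive chunks (default: 0)
truncate: whether to drop an incomplete chunk at the end (default: True)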
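
To see how chunk_size, overlap, and truncate interact, the sliding-window idea can be sketched in plain Python. This is not the component's actual implementation; sliding_chunks is a hypothetical helper, and the sketch assumes that each chunk starts chunk_size - overlap items after the previous one and that truncate drops a final window shorter than chunk_size. Under those assumptions it produces the same grouping as the usage example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_chunks(
    items: Sequence[T],
    chunk_size: int = 3,
    overlap: int = 0,
    truncate: bool = True,
) -> List[List[T]]:
    """Hypothetical helper: split items into overlapping windows.

    Assumption: consecutive windows start (chunk_size - overlap) items apart.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(items), step):
        window = list(items[start:start + chunk_size])
        # Assumption: truncate drops a trailing window that is too short.
        if truncate and len(window) < chunk_size:
            break
        chunks.append(window)
    return chunks


sentences = ["S1", "S2", "S3", "S4", "S5"]
for chunk in sliding_chunks(sentences, chunk_size=2, overlap=1):
    print(chunk)
```

With five items, chunk_size=2 and overlap=1, the sketch yields four two-item windows, mirroring the four chunks in the output above. Passing truncate=False would additionally keep the short final window containing only the last item.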

Changing the configuration at runtime

The chunking component's settings can be changed after it has been added to the pipeline:

# Change the chunk size
nlp.get_pipe("spacy_chunks").chunk_size = 3

# Disable truncation
nlp.get_pipe("spacy_chunks").truncate = False

# Re-process the text with the new settings
doc = nlp(text)
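
When changing settings at runtime, it can help to sanity-check how many chunks to expect before re-running the pipeline. The following is a rough sketch, not part of spaCy Chunks: expected_chunk_count is a hypothetical helper that assumes windows start every chunk_size - overlap items and that a shorter final window is kept only when truncate is off. The component's real behaviour should be confirmed against doc._.chunks.

```python
def expected_chunk_count(
    n_items: int, chunk_size: int, overlap: int, truncate: bool = True
) -> int:
    """Hypothetical helper: estimate the number of chunks under the
    sliding-window assumptions stated above."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    if truncate:
        # Only full-size windows count.
        if n_items < chunk_size:
            return 0
        return (n_items - chunk_size) // step + 1
    # Every window start below n_items counts (ceiling division).
    return -(-n_items // step)


# Five sentences with the original settings from the usage example
print(expected_chunk_count(5, chunk_size=2, overlap=1, truncate=True))

# Five sentences after setting chunk_size = 3 and truncate = False
print(expected_chunk_count(5, chunk_size=3, overlap=1, truncate=False))
```

The check on overlap is the point most worth keeping: raising chunk_size is always safe under this model, but lowering it to a value at or below the current overlap would leave no room for the window to move forward.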

Contributing

Contributions to spaCy Chunks are welcome. Feel free to submit a pull request.

License

This project is licensed under the MIT License.
