Common document chunking methods and their trade-offs
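| Chunking method | Description | Advantages | Disadvantages | Use cases |
|---|---|---|---|---|
| Fixed-length chunking | Each chunk has a fixed number of characters | Simple, fast, predictable granularity | Tends to cut text mid-thought, which hurts retrieval quality | Combined with sliding-window chunking; suited to FAQ question-answering systems |
| Sliding-window chunking | Fixed length plus an overlapping window | Preserves context, reduces semantic fragmentation | Redundant data, extra storage | Combined with fixed-length chunking; suited to FAQ question-answering systems |
| Structure-based chunking | Splits on headings, paragraphs, or lists | Preserves the document's natural semantic structure | Depends on document structure; more complex to process | Combined with recursive chunking; suited to technical documentation (manuals, API references) |
| Recursive chunking | Multi-level splitting (paragraph → sentence → character) | Adapts granularity to the content, preserves semantics | Complex to configure, slower to run | Combined with structure-based chunking; suited to technical documentation (manuals, API references) |
| Sentence chunking | Splits by sentence | Natural granularity, fits most text | Uneven chunk sizes; sentences need to be aggregated | Dialogue, chat logs |
| Embedding-aware chunking | Clusters sentence embeddings to merge semantically similar content | Strong semantic coherence, precise retrieval | Complex algorithm, high compute cost | Multi-turn Q&A over knowledge-dense content |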
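To make the trade-off in the first two rows concrete, here is a minimal, dependency-free sketch of fixed-length chunking with an optional overlap. The function name `sliding_window_chunks` and its parameters are hypothetical, chosen only for this illustration; it is not how any particular library implements the idea.

```python
def sliding_window_chunks(text, chunk_size, overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters; overlap=0 gives
    plain fixed-length chunking.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghij"
plain = sliding_window_chunks(text, chunk_size=4)                 # ['abcd', 'efgh', 'ij']
overlapped = sliding_window_chunks(text, chunk_size=4, overlap=2)  # ['abcd', 'cdef', 'efgh', 'ghij']

# The overlap keeps context across boundaries but stores more characters.
print(sum(len(c) for c in plain), sum(len(c) for c in overlapped))  # 10 16
```

The last line shows the cost listed in the table: with a 2-character overlap, the same 10-character input is stored as 16 characters.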
Code implementation
Fixed-length + sliding-window chunking
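```python
# Note: recent LangChain releases also ship this class in the separate
# `langchain_text_splitters` package.
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",    # split on newlines, then merge pieces up to chunk_size
    chunk_size=500,    # target size: 500 characters per chunk
    chunk_overlap=50,  # 50 characters of overlap between adjacent chunks
)

content = "test content"
chunks = text_splitter.split_text(content)
```

Recursive chunking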
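```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # Tried in order: paragraph breaks, line breaks, periods, spaces,
    # and finally individual characters.
    separators=["\n\n", "\n", ".", " ", ""],
)

content = "test content"
chunks = text_splitter.split_text(content)
```

Sentence chunking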
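```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer data
# (newer NLTK releases may ask for "punkt_tab" instead).
nltk.download("punkt")

content = "test content"
sentences = sent_tokenize(content)

# Group every 5 sentences into one chunk.
chunk_size = 5
chunks = [
    " ".join(sentences[i:i + chunk_size])
    for i in range(0, len(sentences), chunk_size)
]
```

Structure-based chunking (for example, by Markdown headings)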
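```python
import re


def split_by_markdown_headings(content):
    # Match heading lines only ("# " through "###### " at the start of a line),
    # so a "#" in the middle of a sentence does not trigger a split.
    pattern = r"^(#{1,6} .+)$"
    parts = re.split(pattern, content, flags=re.MULTILINE)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    chunks = [parts[i] + parts[i + 1] for i in range(1, len(parts) - 1, 2)]
    # Keep any text that precedes the first heading instead of dropping it.
    if parts[0].strip():
        chunks.insert(0, parts[0])
    return chunks
```

Embedding-aware chunking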
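```python
from collections import defaultdict

from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Assumes the NLTK Punkt data has already been downloaded.
model = SentenceTransformer("all-MiniLM-L6-v2")

content = "test content"
sentences = sent_tokenize(content)
embeddings = model.encode(sentences)

# Use fewer clusters than sentences: with n_clusters == len(sentences),
# every sentence ends up alone and nothing is merged.
# About 3 sentences per cluster is an illustrative choice, not a recommendation.
n_clusters = max(1, len(sentences) // 3)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Merge semantically similar sentences into one chunk per cluster.
grouped = defaultdict(list)
for label, sentence in zip(labels, sentences):
    grouped[label].append(sentence)

chunks = [" ".join(grouped[i]) for i in grouped]
```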
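One caveat with clustering is that it ignores sentence order, so a chunk can mix sentences from distant parts of the document. A commonly used alternative, swapped in here purely as an illustration, is to compare each sentence with the one before it and start a new chunk when the cosine similarity drops. The sketch below assumes the embeddings are already computed (for example by `model.encode`); the function names and the `threshold=0.5` default are hypothetical and would need tuning on real data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjacent_similarity_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; break where neighbours stop being similar."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []

    groups = [[sentences[0]]]
    # Walk over (previous embedding, current embedding, current sentence).
    for prev_vec, vec, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev_vec, vec) >= threshold:
            groups[-1].append(sentence)  # same topic: extend the current chunk
        else:
            groups.append([sentence])    # topic shift: start a new chunk
    return [" ".join(group) for group in groups]


# Toy 2-D vectors stand in for real sentence embeddings.
sentences = ["Cats purr.", "Cats nap a lot.", "Rust has ownership.", "Rust has lifetimes."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(adjacent_similarity_chunks(sentences, embeddings))
```

Because chunks are built from neighbouring sentences only, the original reading order is preserved, at the price of never merging related passages that sit far apart.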