
Common Document Chunking Strategies and Their Pros and Cons

| Chunking method | Description | Pros | Cons | Typical usage |
| --- | --- | --- | --- | --- |
| Fixed-length chunking | Each chunk has a fixed number of characters | Simple, fast, easy to control granularity | Prone to cutting across semantic units; weaker retrieval quality | Combined with sliding-window chunking; FAQ / Q&A systems |
| Sliding-window chunking | Fixed length plus an overlapping window | Preserves context, reduces semantic breaks | Redundant data, extra storage overhead | Combined with fixed-length chunking; FAQ / Q&A systems |
| Structure-based chunking | Splits by headings / paragraphs / lists | Preserves the document's natural semantic structure | Depends on document structure; more complex to implement | Combined with recursive chunking; technical docs (manuals, API references) |
| Recursive chunking | Multi-level splitting (paragraph → sentence → character) | Adapts granularity to the content; preserves semantics | More configuration; slower | Combined with structure-based chunking; technical docs (manuals, API references) |
| Sentence chunking | Splits by sentence | Natural granularity; fits most text | Uneven chunk sizes; sentences need to be aggregated | Dialogue, chat logs |
| Embedding-aware chunking | Clusters sentence embeddings to merge semantically similar content | Strong semantic coherence; precise retrieval | Complex algorithm; high compute cost | Multi-turn Q&A over knowledge-dense content |

Code Examples

  • Fixed-length + sliding-window chunking

    python
    from langchain.text_splitter import CharacterTextSplitter

    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=500,      # 500 characters per chunk
        chunk_overlap=50     # 50 characters of overlap between consecutive chunks
    )

    content = "Sample text to split."
    chunks = text_splitter.split_text(content)
  • Recursive chunking

    python
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        # separators are tried in order: paragraph breaks, then line breaks, sentences, words, characters
        separators=["\n\n", "\n", ".", " ", ""]
    )

    content = "Sample text to split."
    chunks = text_splitter.split_text(content)
  • Sentence chunking

    python
    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download('punkt')   # newer NLTK releases may also require the 'punkt_tab' resource
    content = "Sample text to split."
    sentences = sent_tokenize(content)

    chunk_size = 5           # group five sentences per chunk
    chunks = [" ".join(sentences[i:i+chunk_size]) for i in range(0, len(sentences), chunk_size)]
  • Structure-based chunking (e.g. by Markdown headings; a combined structure-based + recursive example is added at the end of this list)

    python
    import re

    def split_by_markdown_headings(content):
        # match heading lines only (anchored to the start of a line), e.g. "# Title", "## Section"
        pattern = r"^(#{1,6} .+)$"
        parts = re.split(pattern, content, flags=re.MULTILINE)
        # parts = [preamble, heading1, body1, heading2, body2, ...]; attach each heading to its body
        chunks = [parts[i] + parts[i+1] for i in range(1, len(parts)-1, 2)]
        return chunks
  • Embedding-aware chunking (an adjacent-similarity variant is sketched at the end of this list)

    python
    from collections import defaultdict

    from nltk.tokenize import sent_tokenize
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    model = SentenceTransformer("all-MiniLM-L6-v2")
    content = "Sample text to split."
    sentences = sent_tokenize(content)
    embeddings = model.encode(sentences)

    # aim for roughly five sentences per cluster; one cluster per sentence would defeat the merging
    n_clusters = max(1, len(sentences) // 5)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    labels = kmeans.fit_predict(embeddings)

    # merge semantically similar sentences into chunks (original order is preserved within each cluster)
    grouped = defaultdict(list)
    for label, sentence in zip(labels, sentences):
        grouped[label].append(sentence)

    chunks = [" ".join(grouped[label]) for label in grouped]
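  • Structure-based + recursive chunking combined — as the table suggests for technical documents, split by headings first, then recursively split any section that is still too long. A minimal sketch, assuming the split_by_markdown_headings helper defined above; the chunk_size / chunk_overlap values are illustrative, not recommendations.

    python
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    section_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", " ", ""]
    )

    content = "# Title\nSample body text"
    chunks = []
    for section in split_by_markdown_headings(content):
        if len(section) <= 1000:
            chunks.append(section)                                # short sections stay intact
        else:
            chunks.extend(section_splitter.split_text(section))   # long sections are split further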
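  • Embedding-aware chunking, adjacent-similarity variant — a minimal sketch of an alternative to the KMeans approach above, assuming the same all-MiniLM-L6-v2 model: a new chunk starts whenever the cosine similarity between neighbouring sentences drops below a threshold, so chunks stay contiguous in the original document. The 0.5 threshold is an illustrative value to tune on your own corpus.

    python
    from nltk.tokenize import sent_tokenize
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    content = "Sample text to split."
    sentences = sent_tokenize(content)
    embeddings = model.encode(sentences)

    threshold = 0.5
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # cosine similarity between this sentence and the previous one
        sim = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if sim >= threshold:
            current.append(sentences[i])          # same topic: extend the current chunk
        else:
            chunks.append(" ".join(current))      # topic shift: close the current chunk
            current = [sentences[i]]
    chunks.append(" ".join(current))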
