Integrating the IK Analyzer with Spring Boot
POM dependencies
```xml
<!-- ikanalyzer: Chinese word segmenter -->
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- lucene-queryparser: query parser module -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.3.0</version>
</dependency>
```
Sample code
```java
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public static List<String> iKSegmenterToList(String rawText) throws Exception {
    List<String> resultList = new ArrayList<>();
    StringReader sr = new StringReader(rawText);
    // Second argument: whether to enable smart (coarse-grained) segmentation
    IKSegmenter ik = new IKSegmenter(sr, true);
    Lexeme lex;
    while ((lex = ik.next()) != null) {
        resultList.add(lex.getLexemeText());
    }
    return resultList;
}
```
Although the code above enables smart segmentation, two cases still come up that it cannot handle on its own, and they require extending the IK analyzer's configuration separately:
- Special terms (such as company names) that must be kept as a single token
- Unimportant words (such as prepositions), or any words you simply want filtered out
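The second case can be illustrated in plain Java, without IK, to show what stopword filtering does to a segmenter's output (the token list and stopword set below are made-up examples):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilterSketch {
    // Drop every token that appears in the stopword set
    static List<String> filterStopwords(List<String> tokens, Set<String> stopwords) {
        return tokens.stream()
                .filter(t -> !stopwords.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical tokens, as a segmenter might emit them
        List<String> tokens = Arrays.asList("我", "的", "公司", "在", "北京");
        // A tiny stopword set; real lists (see below) contain hundreds of entries
        Set<String> stopwords = Set.of("的", "在");
        System.out.println(filterStopwords(tokens, stopwords)); // [我, 公司, 北京]
    }
}
```

With the configuration described next, IK performs this filtering internally, so your code never sees the stopwords at all.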
To extend the IK analyzer's configuration, create three files under the project's resources folder: IKAnalyzer.cfg.xml, ext_dict.dic, and ext_stopwords.dic.
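Assuming a standard Maven layout, the resulting structure looks like this:

```
src/main/resources/
├── IKAnalyzer.cfg.xml     <- points IK at the two dictionaries below
├── ext_dict.dic           <- extension dictionary (custom words)
└── ext_stopwords.dic      <- stopword dictionary (words to filter out)
```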
IKAnalyzer.cfg.xml
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">ext_dict.dic</entry>
    <entry key="ext_stopwords">ext_stopwords.dic</entry>
</properties>
```
- ext_dict: extension dictionary; adds words that are missing from the system's default dictionary (company names, proper nouns, and so on)
- ext_stopwords: stopword dictionary; specifies which words are stopwords, to be filtered out during segmentation (such as 的, 了, 和)
ext_dict.dic: define its contents yourself; if you have no custom words, it can be left unconfigured.
ext_stopwords.dic: stopword lists collected from GitHub
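Both .dic files are plain UTF-8 text with one entry per line. A hypothetical ext_dict.dic might look like this (the entries are made-up examples):

```
阿里巴巴
机器智能实验室
大语言模型
```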
| Stopword list | Description | Local path |
| --- | --- | --- |
| cn_stopwords.txt | Chinese stopword list | https://docs.qnmdmyy.top/resources/后端/spring全家桶/springboot整合ik分词器/cn_stopwords.txt |
| hit_stopwords.txt | Harbin Institute of Technology stopword list | https://docs.qnmdmyy.top/resources/后端/spring全家桶/springboot整合ik分词器/hit_stopwords.txt |
| baidu_stopwords.txt | Baidu stopword list | https://docs.qnmdmyy.top/resources/后端/spring全家桶/springboot整合ik分词器/baidu_stopwords.txt |
| scu_stopwords.txt | Machine Intelligence Laboratory (SCU) stopword list | https://docs.qnmdmyy.top/resources/后端/spring全家桶/springboot整合ik分词器/scu_stopwords.txt |
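If you download more than one of these lists, they can be merged and de-duplicated into a single ext_stopwords.dic with a small stdlib sketch (the file names and paths below are assumptions; adjust them to wherever you saved the lists):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MergeStopwords {
    // Merge several one-word-per-line stopword files, dropping blanks and duplicates
    static Set<String> merge(List<Path> sources) throws IOException {
        Set<String> merged = new LinkedHashSet<>(); // keeps first-seen order
        for (Path p : sources) {
            for (String line : Files.readAllLines(p, StandardCharsets.UTF_8)) {
                String word = line.strip();
                if (!word.isEmpty()) {
                    merged.add(word);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Assumed download locations; change to your own paths
        Set<String> merged = merge(List.of(
                Path.of("cn_stopwords.txt"),
                Path.of("hit_stopwords.txt")));
        Files.write(Path.of("src/main/resources/ext_stopwords.dic"),
                merged, StandardCharsets.UTF_8);
    }
}
```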