HanLP's word segmentation
I have been studying HanLP. Its segmentation quality is indeed decent, and it is fairly fast as well: segmenting 10 of data took roughly 9,000 ms.
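For reference, here is a minimal sketch of calling the segmenter directly, assuming the HanLP 1.x portable Java API (com.hankcs.hanlp) is on the classpath; the sample sentence is the one from HanLP's own documentation:

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class SegmentDemo {
    public static void main(String[] args) {
        // HanLP.segment returns a list of Term objects, each carrying
        // the token text (word) and its part of speech (nature)
        List<Term> terms = HanLP.segment("商品和服务");
        for (Term term : terms) {
            System.out.println(term.word + "\t" + term.nature);
        }
    }
}

The method below builds on this: it segments the input text, filters out numbers, line breaks, and stop words, and counts the remaining words.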
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;
import lombok.SneakyThrows;
import org.springframework.util.ClassUtils;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

@SneakyThrows
@Override
public LinkedHashMap<String, Integer> hotWordsCount(String text) {
    // Map holding each word and its frequency
    LinkedHashMap<String, Integer> wordCounts = new LinkedHashMap<>();
    // Load the stop-word list from the classpath; reading via an InputStream
    // (rather than a file path) also works when packaged inside a jar,
    // and a Set gives O(1) membership checks
    Set<String> stopWords = new HashSet<>();
    InputStream in = ClassUtils.getDefaultClassLoader()
            .getResourceAsStream("static/dictionary/stopwords.txt");
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String stopWord;
        // Read the stop-word file one line at a time
        while ((stopWord = br.readLine()) != null) {
            stopWords.add(stopWord.trim());
        }
    }
    // Segment the input text with HanLP.segment()
    List<Term> terms = HanLP.segment(text);
    // Compile the digit pattern once instead of on every loop iteration
    Pattern digits = Pattern.compile("[0-9]+");
    for (Term term : terms) {
        String word = term.word;
        // Skip tokens that are pure numbers
        if (digits.matcher(word).matches()) {
            continue;
        }
        // Skip line breaks
        if (word.equals("\n") || word.equals("\r")) {
            continue;
        }
        // Skip stop words (checked with surrounding and embedded spaces removed)
        if (stopWords.contains(word.trim()) || stopWords.contains(word.replace(" ", ""))) {
            continue;
        }
        // Skip tokens containing a slash
        if (word.contains("/")) {
            continue;
        }
        // Count only words of length >= 2; merge() increments the
        // existing count or inserts 1 on the first occurrence
        if (word.length() >= 2) {
            wordCounts.merge(word, 1, Integer::sum);
        }
    }
    return wordCounts;
}
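The returned map is unsorted, so a caller will typically rank the entries by count. A sketch of such a call, where hotWordsService and articleText are hypothetical names standing in for the owning service and the input text:

import java.util.LinkedHashMap;
import java.util.Map;

LinkedHashMap<String, Integer> counts = hotWordsService.hotWordsCount(articleText);
// Sort by frequency, highest first, and print the top 10 hot words
counts.entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
        .limit(10)
        .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));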