Lucene stop word filter

Question

I have about 500 sentences from which I would like to compile a set of n-grams. I am having trouble removing the stop words. I tried adding the Lucene StandardFilter and StopFilter, but I still have the same problem. Here is my code:

for (String curS : Sentences)
{
    // Build the analysis chain: tokenize, normalize, remove stop words, then emit 2- to 3-word shingles
    reader = new StringReader(curS);
    tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
    tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    tokenizer.reset();                          // prepare the stream before consuming it
    while (tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString();
        nGrams.add(curNGram);                   // store each token into an ArrayList
    }
    tokenizer.end();
    tokenizer.close();                          // release the underlying reader
}

For example, the first phrase I am testing is: "For every person that listens to". In this example, curNGram is set to "For", which is a stop word in my list stopWords. Also, "every" is a stop word in this example, so "person" should be the first n-gram.

  1. Why are stop words being added to my list when I am using a StopFilter?

Any help is appreciated!

Recommended answer

What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.

Try something like:

// Let's say we read the stop words into an array list (a simple array or any List implementation should be fine)
List<String> words = new ArrayList<String>();
// Read the file into words.
// The third argument (ignoreCase = true) makes stop word matching case-insensitive.
Set<?> stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);

Assuming the list of stop words you generated (the one I've named 'words') looks like you think it does, this should put them into a format usable by the StopFilter.

Were you already generating stopWords like that?
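
One possible cause that fits the symptom (the question does not confirm it) is case sensitivity: if the stop set holds "for" but the token is "For", a case-sensitive lookup will not match, which is what the ignoreCase argument above is meant to address. The following is a minimal, library-free sketch of that effect. It does not use Lucene; StopWordSketch and removeStopWords are hypothetical names, and the lower-case stop word list is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Library-free illustration; StopWordSketch and removeStopWords are hypothetical names, not Lucene APIs.
class StopWordSketch {

    // Returns the tokens that are not stop words.
    // With ignoreCase set, both the stop words and the tokens are lower-cased before comparison.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords, boolean ignoreCase) {
        Set<String> lookup = new HashSet<String>();
        for (String word : stopWords) {
            lookup.add(ignoreCase ? word.toLowerCase(Locale.ROOT) : word);
        }
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (!lookup.contains(key)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("For", "every", "person", "that", "listens", "to");
        // Assumption: the stop word file stores its entries in lower case.
        Set<String> stopWords = new HashSet<String>(Arrays.asList("for", "every"));

        // Case-sensitive: "For" does not equal "for", so it survives the filter.
        System.out.println(removeStopWords(tokens, stopWords, false)); // [For, person, that, listens, to]
        // Case-insensitive: "For" is dropped, so "person" comes first.
        System.out.println(removeStopWords(tokens, stopWords, true));  // [person, that, listens, to]
    }
}
```

If the second result is what you expect, check whether your stop set was built with case-insensitive matching or whether the tokens are lower-cased before they reach the StopFilter.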
