Question
I have about 500 sentences from which I would like to compile a set of ngrams. I am having trouble removing the stop words. I tried adding the Lucene StandardFilter and StopFilter, but I still have the same problem. Here is my code:
for (String curS : Sentences)
{
    reader = new StringReader(curS);
    tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
    tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
    while (tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString();
        nGrams.add(curNGram); // store each token in an ArrayList
    }
}
For example, the first phrase I am testing is: "For every person that listens to". In this example curNGram is set to "For" which is a stop word in my list stopWords. Also, in this example "every" is a stop word and so "person" should be the first ngram.
- Why are stop words still being added to my list even though I am using StopFilter?
Any help is appreciated!
Answer
What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.
Try something like this:
// Let's say we read the stop words into a list (a simple array or any List implementation should be fine)
List<String> words = new ArrayList<>();
// Read the file into words.
// The final argument (true) tells makeStopSet to ignore case when matching.
Set<?> stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);
Assuming the list of stop words you generated (the one I've named 'words') looks the way you think it does, this should put them into a format usable by the StopFilter.
Were you already generating stopWords like that?
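To sanity-check the expected behaviour without Lucene on the classpath, here is a minimal, library-free sketch of the same pipeline. The class name `ShingleSketch` and the `shingles` helper are illustrative, not Lucene APIs: tokens are lower-cased for the stop check (mirroring the `ignoreCase` flag of `makeStopSet`), and word n-grams of size 2-3 are then built from the surviving tokens, with unigrams included as ShingleFilter does by default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShingleSketch {
    // Remove stop words, then build word n-grams of sizes minSize..maxSize.
    public static List<String> shingles(String sentence, Set<String> stopWords,
                                        int minSize, int maxSize) {
        List<String> tokens = new ArrayList<>();
        for (String t : sentence.split("\\s+")) {
            // Case-insensitive stop-word check, like makeStopSet(..., true).
            if (!stopWords.contains(t.toLowerCase())) {
                tokens.add(t);
            }
        }
        List<String> nGrams = new ArrayList<>(tokens); // unigrams first
        for (int n = minSize; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                nGrams.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return nGrams;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("for", "every", "that", "to"));
        System.out.println(shingles("For every person that listens to", stop, 2, 3));
        // Prints: [person, listens, person listens]
        // "person" is the first surviving token, as the question expects.
    }
}
```

If the stop words survive in the Lucene version but not in a sketch like this, the stop set itself (its contents or its case handling) is the likely culprit rather than the filter chain.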