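I have a Sentences class. Each instance of this class represents one sentence from a text file.
I read each sentence from the file and create an instance of the Sentences class for it. For each sentence, I need to check how many stop words/function words it contains.
I have a text file containing English stop words (stopwords.txt).
How should I design the program so that stopwords.txt does not have to be read again and again for every sentence? Instead, I should save the contents of that file (the stop words) "somehow" and then check which words in each sentence are stop words.
I have a very large number of sentences, so I need the program to be as fast as possible.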
Edit:
I have created a StopWords class
public class StopWords
In this class I read the stopwords.txt file and store its contents in a HashSet.
....
while ((entries = br.readLine()) != null){
stopWordSet.add(entries.toLowerCase());
...
Then I create an instance of the StopWords class inside the Sentences class:
public class Sentences {
...
private static StopWords stopList = new StopWords("languageresources/stopword.txt");
...
}
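I read the sentences from a file and create instances of the Sentences class. The words of each sentence are stored in an ArrayList named wordList, which is passed to the dealStopWord() method of the StopWords class to check which words are stop words. Finally, I use the getStopWordCount() method to get the number of stop words: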
stopList.dealStopWord(wordList);
this.totalFunctionWords = stopList.getStopWordCount();
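Edit: If I make the stopList variable a local variable in the Sentences class, the constructor is called for every sentence (that is, the stopwords.txt file is read for every sentence), yet this is much faster than when stopList is static (that is, when stopwords.txt is read only once).
Edit
The StopWords.java class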
public class StopWords {
//Instance variables
private String stopWordFile = ""; // name of the stopword file
private Set<String> stopWordSet;
private int count = 0; //number of stopwords found in a given sentence
private String[] sortedStopWords;
private ArrayList <String> noStopWordArray = new ArrayList <String> ();
//Constructor: takes the file containing stopwords
public StopWords (String fileName){
System.out.println("Stoplist constructor called");
this.stopWordFile = fileName;
FileReader stopWordFile = null;
try {
stopWordFile = new FileReader(this.stopWordFile);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
BufferedReader br = new BufferedReader(stopWordFile);
String entries;
stopWordSet = new TreeSet<String>();
try {
while ((entries = br.readLine()) != null){
stopWordSet.add(entries.toLowerCase());
}
} catch (IOException e) {
e.printStackTrace();
}
try {
br.close();
} catch (IOException e) {
e.printStackTrace();
}
sortedStopWords = new String[stopWordSet.size()];
int i = 0;
Iterator<String> itr = stopWordSet.iterator();
while (itr.hasNext()){
sortedStopWords[i++] = itr.next();
}//end while
}//public StopWords (String fileName)
//count the stopwords in a sentence (the sentence comes in as an arraylist of words); read the result with getStopWordCount()
public void dealStopWord(ArrayList <String> wordArray){
this.count = 0;
String temp = "";
int size = wordArray.size();
for(int i = 0; i < size; i++){
temp = wordArray.get(i).toLowerCase();
int found = Arrays.binarySearch(sortedStopWords, temp);
if(found >= 0){
this.count++;
}//end if
else{
this.noStopWordArray.add(wordArray.get(i));
}
}//end for
}
public ArrayList <String> getNoStopWordArray(){
return this.noStopWordArray;
}//public ArrayList <String> getNoStopWordArray()
public int getStopWordCount(){
return this.count;
}//public int getStopWordCount()
}//public class StopWords
Part of the Sentences.java class:
public class Sentences {
static StopWords stopList = new StopWords("languageresources/stopword.txt");
public void setFunctionAndContentWords(){
//If I make stopList variable locally here, the code is much faster
stopList.dealStopWord(this.wordList); //at this point, the # of stop words and the sentence without stop word is generated
this.totalFunctionWords = stopList.getStopWordCount(); //setting the feature here.
//...set up done.
}// end method
}
This is how I create the instances of the Sentences class:
Sentences[] s = new Sentences[totalSentences]; //sentence object..
for (int i = 0; i < totalSentences; i++){
System.out.println("Processing sentence # " + (i+1));
s[i].setFunctionAndContentWords();
}
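Best answer
Make sure your StopWords instance either does not accumulate information or gets reset. I would make it completely stateless (no counter, and especially no list of non-matching words).
This also has the advantage that it can be used from multiple threads.
In your case: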
this.noStopWordArray.add(wordArray.get(i));
causes the list to keep growing (this is a bigger problem in the static case, because you reuse the same list for multiple sentences).
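A minimal sketch of what a stateless version might look like follows. The names StopWordFilter, fromFile, countStopWords and withoutStopWords are hypothetical and do not appear in the question. The sketch also assumes a HashSet lookup in place of the sorted array with binary search, which is a swap made here for simplicity and not something the answer prescribes. The stop words are loaded once, and every call returns its result instead of storing it in a field, so nothing accumulates between sentences.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stateless replacement for the StopWords class in the question.
final class StopWordFilter {
    // Filled once in the constructor and never modified afterwards.
    private final Set<String> stopWords;

    StopWordFilter(Iterable<String> words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            String trimmed = w.trim();
            if (!trimmed.isEmpty()) {
                set.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
        this.stopWords = set;
    }

    // Assumes one stop word per line, as the reading loop in the question suggests.
    static StopWordFilter fromFile(String fileName) throws IOException {
        return new StopWordFilter(
                Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8));
    }

    boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Returns the count directly; no counter field is kept between calls.
    int countStopWords(List<String> sentence) {
        int count = 0;
        for (String w : sentence) {
            if (isStopWord(w)) {
                count++;
            }
        }
        return count;
    }

    // Builds a fresh list on every call, so results never pile up across sentences.
    List<String> withoutStopWords(List<String> sentence) {
        List<String> kept = new ArrayList<>();
        for (String w : sentence) {
            if (!isStopWord(w)) {
                kept.add(w);
            }
        }
        return kept;
    }
}

class StopWordDemo {
    public static void main(String[] args) {
        // In-memory stop words keep the demo independent of any file.
        StopWordFilter filter = new StopWordFilter(Arrays.asList("the", "is", "a"));
        List<String> sentence = Arrays.asList("The", "cat", "is", "here");
        System.out.println(filter.countStopWords(sentence));   // prints 2
        System.out.println(filter.withoutStopWords(sentence)); // prints [cat, here]
    }
}
```

With a design like this, a single static instance could be shared by all Sentences objects, each of which would store the returned count in its own totalFunctionWords field.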
On the topic of java - optimizing dictionary-based search in Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28154799/