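I have a Sentences class. Each instance of this class represents one sentence from a text file.

I read each sentence from the file and create an instance of my Sentences class for it. For each sentence, I need to check how many stop words (function words) it contains.

I have a text file containing English stop words (stopwords.txt).

How should I design the program so that stopwords.txt does not have to be read again and again for every sentence? Instead, I would like to store the contents of that file (the stop words) "somehow" and then check which words in a sentence are stop words.

I have a very large number of sentences, so I need this program to be as fast as possible.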

Edit:

I have created a StopWords class:

public class StopWords


I read the stopwords.txt file in this class and store its contents in a HashSet.

....
while ((entries = br.readLine()) != null){
    stopWordSet.add(entries.toLowerCase());
...


Then I create an instance of the StopWords class in the Sentences class:

public class Sentences {
...
    private static StopWords stopList = new StopWords("languageresources/stopword.txt");
...
}


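I read the sentences from a file and create instances of the Sentences class. The words of each sentence are stored in an ArrayList named wordList, which is passed to the dealStopWord() method of the StopWords class to check which words are stop words. Finally, I use the getStopWordCount() method to get the number of stop words: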

stopList.dealStopWord(wordList);
this.totalFunctionWords = stopList.getStopWordCount();


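Edit: If I make the stopList variable local in the Sentences class, the constructor is called for every sentence (that is, stopwords.txt is read for every sentence), yet this is much faster than when the stopList variable is static (that is, when stopwords.txt is read only once).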

Edit:

The StopWords.java class:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

public class StopWords {

    //Instance variables
    private String stopWordFile = ""; // name of the stopword file
    private Set<String> stopWordSet;
    private int count = 0; //number of stopwords found in a given sentence
    private String[] sortedStopWords;
    private ArrayList<String> noStopWordArray = new ArrayList<String>();

    //Constructor: takes the file containing stopwords
    public StopWords(String fileName){
        System.out.println("Stoplist constructor called");
        this.stopWordFile = fileName;
        FileReader stopWordFile = null;
        try {
            stopWordFile = new FileReader(this.stopWordFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        BufferedReader br = new BufferedReader(stopWordFile);
        String entries;
        stopWordSet = new TreeSet<String>();
        try {
            while ((entries = br.readLine()) != null){
                stopWordSet.add(entries.toLowerCase());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        //copy the sorted set into an array for binary search
        sortedStopWords = new String[stopWordSet.size()];
        int i = 0;
        Iterator<String> itr = stopWordSet.iterator();
        while (itr.hasNext()){
            sortedStopWords[i++] = itr.next();
        }//end while

    }//public StopWords (String fileName)

    //count the stopwords in a sentence (the sentence comes in as an arraylist of words)
    public void dealStopWord(ArrayList<String> wordArray){

        this.count = 0;
        String temp = "";
        int size = wordArray.size();
        for(int i = 0; i < size; i++){
            temp = wordArray.get(i).toLowerCase();
            int found = Arrays.binarySearch(sortedStopWords, temp);
            if(found >= 0){
                this.count++;
            }//end if
            else{
                this.noStopWordArray.add(wordArray.get(i));
            }

        }//end for

    }

    public ArrayList<String> getNoStopWordArray(){

        return this.noStopWordArray;

    }//public ArrayList <String> getNoStopWordArray()

    public int getStopWordCount(){

        return this.count;

    }//public int getStopWordCount()

}//public class StopWords


Part of the Sentences.java class:

public class Sentences {
    static StopWords stopList = new StopWords("languageresources/stopword.txt");

    public void setFunctionAndContentWords(){
        //If I make stopList variable locally here, the code is much faster
        stopList.dealStopWord(this.wordList); //at this point, the # of stop words and the sentence without stop word is generated
        this.totalFunctionWords = stopList.getStopWordCount(); //setting the feature here.
        //...set up done.
    }// end method
}


This is how I create the instances of the Sentences class:

Sentences[] s = new Sentences[totalSentences]; //sentence object..
for (int i = 0; i < totalSentences; i++){

    System.out.println("Processing sentence # " + (i+1));


    s[i].setFunctionAndContentWords();
}

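Best answer

Make sure that your StopWords instance either does not accumulate information or gets reset. I would make it completely stateless (no counter, and especially no list of non-matching words).

This also has the advantage that it can be used from multiple threads.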
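To illustrate what "stateless" could look like here, the following is a minimal sketch rather than the asker's actual class. The names StatelessStopWords, fromFile and countStopWords are hypothetical, and the sketch assumes a UTF-8 file with one stop word per line. The set is filled once in the constructor and never modified afterwards, and the count lives in a local variable, so nothing carries over from one sentence to the next.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stateless variant of StopWords: the set is built once and
// never modified, and no per-sentence data is stored in fields.
final class StatelessStopWords {

    private final Set<String> stopWordSet;

    StatelessStopWords(Collection<String> words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            String trimmed = w.trim();
            if (!trimmed.isEmpty()) {
                set.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
        this.stopWordSet = set;
    }

    // Reads the stop word file once (assumption: UTF-8, one word per line).
    static StatelessStopWords fromFile(String fileName) throws IOException {
        return new StatelessStopWords(
                Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8));
    }

    // Counts the stop words in one sentence. The counter is a local
    // variable, so consecutive calls cannot influence each other.
    int countStopWords(List<String> words) {
        int count = 0;
        for (String word : words) {
            if (stopWordSet.contains(word.toLowerCase(Locale.ROOT))) {
                count++;
            }
        }
        return count;
    }
}
```

With something like this, a single static instance could be shared by all Sentences objects, and each sentence would simply store the value returned by countStopWords(wordList).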

In your case:

this.noStopWordArray.add(wordArray.get(i));


causes the list to keep growing (this is a bigger problem in the static case, because you reuse the same list for many sentences).
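One way to avoid the shared, ever-growing list is to build a fresh list on every call and return it to the caller. The sketch below is only an illustration under that assumption. StopWordFilter and removeStopWords are hypothetical names, and the constructor expects the stop words to be lower-cased already.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: filters one sentence without touching shared state.
final class StopWordFilter {

    private final Set<String> stopWordSet;

    // Assumption: the given stop words are already lower-cased.
    StopWordFilter(Set<String> lowerCaseStopWords) {
        this.stopWordSet = new HashSet<>(lowerCaseStopWords);
    }

    // Returns a new list holding the words of this sentence that are not
    // stop words. The list is local to the call, so it cannot grow across
    // sentences the way a field on a static instance does.
    List<String> removeStopWords(List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWordSet.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

The number of stop words in a sentence then follows as words.size() minus the size of the returned list, so no counter field is needed either.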

Source: "java - Dictionary-based search optimization in Java", a similar question found on Stack Overflow: https://stackoverflow.com/questions/28154799/
