python - Porter Stemmer algorithm not returning the expected output? When modified into a def

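I am using the PorterStemmer Python port.
The Porter stemming algorithm (or "Porter stemmer") is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
For the following purpose..
Another thing you need to do is reduce each word to its stem. For example, the words sing, sings, singing
all have the same stem, which is sing. There is a reasonably accepted way to do this, called Porter's
algorithm. You can download something that performs it from http://tartarus.org/martin/PorterStemmer/.
And I've modified the code..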

if __name__ == '__main__':
    p = PorterStemmer()
    if len(sys.argv) > 1:
        for f in sys.argv[1:]:
            infile = open(f, 'r')
            while 1:
                output = ''
                word = ''
                line = infile.readline()
                if line == '':
                    break
                for c in line:
                    if c.isalpha():
                        word += c.lower()
                    else:
                        if word:
                            output += p.stem(word, 0,len(word)-1)
                            word = ''
                        output += c.lower()
                print output,
            infile.close()

The modified version reads input from a pre-processed string instead of a file and returns the output.

def algorithm(input):
    p = PorterStemmer()
    while 1:
        output = ''
        word = ''
        if input == '':
            break
        for c in input:
            if c.isalpha():
                word += c.lower()
            else:
                if word:
                    output += p.stem(word, 0,len(word)-1)
                    word = ''
                output += c.lower()
        return output

Note that if I place return output at the same indentation as while 1:, it turns into an infinite loop.
Usage (example)

import PorterStemmer as ps
ps.algorithm("Michael is Singing");

Output
michael is
Expected output
michael is sing
What am I doing wrong?

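Best answer

It looks like the culprit is that the function never writes the final part of the input to output. A word is only stemmed and appended when a non-letter character follows it, so a word at the very end of the string is never flushed. For example, try "michael is sing stuff": everything comes out correctly except the final "stuff", which is dropped. There is probably a more elegant way to handle this, but one thing you could try is adding an else clause to the for loop. Since the problem is that the final word is not included in output, we can use else to make sure the final word gets added once the for loop completes: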

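def algorithm(input):
    print(input)  # debug: echo the raw input
    p = PorterStemmer()
    while 1:
        output = ''
        word = ''
        if input == '':
            break  # nothing to stem; falls through and returns None
        for c in input:
            if c.isalpha():
                word += c.lower()
            else:
                if word:
                    # a non-letter ends the current word: stem and emit it
                    output += p.stem(word, 0,len(word)-1)
                    word = ''
                # keep the separator even when no word precedes it
                output += c.lower()
        else:
            # the loop finished without break: flush a trailing word, if any
            if word:
                output += p.stem(word, 0, len(word)-1)
        print(output)  # debug: echo the result
        return output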
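
For readers unfamiliar with for ... else: the else block runs once when the loop finishes without hitting break, which is why it can serve as a "flush the last word" hook. Below is a minimal sketch of that behavior on its own, with no stemmer involved. The helper name collect_words is made up for illustration.

```python
def collect_words(text):
    """Split text into lowercase alphabetic runs, flushing the last run via for/else."""
    words = []
    word = ''
    for c in text:
        if c.isalpha():
            word += c.lower()
        elif word:
            # a non-letter ends the current word
            words.append(word)
            word = ''
    else:
        # Runs after the loop ends normally (no break), so a trailing word
        # that was never followed by a separator is still kept.
        if word:
            words.append(word)
    return words
```

Because this loop never uses break, the body of the else clause could equally be written as plain code after the loop; the behavior would be the same.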

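This version of algorithm has been extensively tested with all of two test cases, so clearly it is bulletproof :) There are probably some edge cases lurking in there, but hopefully it gets you started.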
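
One possibly tidier alternative, sketched under assumptions: let a regular expression find the alphabetic runs and hand each one to the stemmer, so no manual flushing of the last word is needed. The names stem_text and toy_stem are hypothetical. toy_stem is only a stand-in so the snippet runs on its own; it is not the Porter algorithm.

```python
import re

def toy_stem(word):
    """Stand-in for a real stemmer: strips a trailing 'ing' or 's'. NOT the Porter algorithm."""
    for suffix in ('ing', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

def stem_text(text, stem=toy_stem):
    """Lowercase text and replace every run of ASCII letters with stem(run).

    Separators (spaces, punctuation, digits) pass through unchanged, and a
    word at the very end of the string is handled like any other match.
    """
    return re.sub(r'[a-z]+', lambda m: stem(m.group(0)), text.lower())

if __name__ == '__main__':
    print(stem_text("Michael is Singing"))
```

With the tartarus port, the stem argument would presumably be something like lambda w: p.stem(w, 0, len(w) - 1), matching the call used in the code above. Note that [a-z]+ only covers ASCII letters, while str.isalpha() also accepts non-ASCII letters, so the two approaches can differ on accented text.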