

I have a document in which each line is a string. A line might contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces). My code is below; it is the map function of my map-reduce job. However, judging by the final result, this mapper only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks.

import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s	%s' % (word, 1)



You've actually got two problems.


line = re.sub("[^A-Za-z]", "", line.strip())


This removes all non-letters from the line. Which means you no longer have any spaces to split on, and therefore no way to separate it into words.
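
A quick check makes the effect visible. This is only a small sketch, written for Python 3, and the sample line is made up for illustration rather than taken from the original data:

```python
import re

# Hypothetical sample line; any text with spaces and punctuation would do.
line = "Hello, world 123!"

# Same substitution as in the question: every non-letter is deleted,
# and a space is a non-letter too.
stripped = re.sub("[^A-Za-z]", "", line.strip())

print(repr(stripped))    # 'Helloworld'
print(stripped.split())  # ['Helloworld']
```

With the spaces gone, split() has nothing to split on and returns the whole line as one "word".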


Next, even if you didn't do that, you do this:

words = ' '.join(line.split())


This doesn't give you a list of words, this gives you a single string, with all those words concatenated back together. (Basically, the original line with all runs of whitespace converted into a single space.)


So, in the next line, when you do this:

for word in words:

You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
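
The difference between the joined string and the split list is easy to see side by side. A minimal Python 3 sketch, using a made-up sample line:

```python
line = "foo  bar baz"  # hypothetical sample line

joined = ' '.join(line.split())  # a single str with runs of whitespace collapsed
as_list = line.split()           # a list of words

# Iterating over a str yields one character at a time.
print(type(joined).__name__, list(joined)[:4])  # str ['f', 'o', 'o', ' ']

# Iterating over the list yields whole words.
print(type(as_list).__name__, as_list)          # list ['foo', 'bar', 'baz']
```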


If you want each word (as your variable names imply), you already had them; the problem is that you joined them back into a string. Just skip the join:

words = line.split()
for word in words:


Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:

line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
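
As a rough illustration of what keeping whitespace in the character class does, here is a Python 3 sketch with a made-up sample line. Note the `\s` inside the brackets: it stands for any whitespace character, so spaces survive the substitution.

```python
import re

line = "good-bye, world! 42"  # hypothetical sample line

# Delete everything that is neither an ASCII letter nor whitespace.
cleaned = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = cleaned.split()

print(repr(cleaned))  # 'goodbye world '
print(words)          # ['goodbye', 'world']
```

The hyphen is simply deleted here, so "good-bye" collapses into the single word "goodbye".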


However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:

line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:

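
Or this:

# The + treats a run of non-letters as one separator; the filter drops
# the empty strings re.split leaves at the start or end of the line.
words = [w for w in re.split(r"[^A-Za-z]+", line.strip()) if w]
for word in words: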
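
Putting the pieces together, a complete mapper might look roughly like the sketch below. It is written for Python 3 (the question uses the Python 2 print statement), and it swaps in re.findall instead of re.split, which collects the letter runs directly and never produces empty strings. The names extract_words and map_lines are hypothetical helpers introduced for illustration, not part of the original code.

```python
import re
import sys


def extract_words(line):
    """Return the runs of ASCII letters in one line, lowercased."""
    # Any non-letter acts as a separator, so 'abc1def' gives two words.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", line)]


def map_lines(lines):
    """Yield one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in extract_words(line):
            yield "%s\t%s" % (word, 1)


if __name__ == "__main__":
    # Read lines from standard input and emit one record per word.
    for record in map_lines(sys.stdin):
        print(record)
```

This only filters by character set: it keeps any run of ASCII letters and does not check whether the result is actually an English word.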

