Question
I have a large set of real-world text that I need to pull words out of to feed into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there are plenty of regex ninjas around here, so hopefully someone can help me out.
Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it.
Ideally I would like some regex (it doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and ignores any alphabetical sequences with illegal bounds.
However, I'd also be happy to just get all alphabetical sequences that ARE NOT adjacent to a number. So for instance 'pie21' would NOT extract 'pie', but 'http://foo.com' would extract ['http', 'foo', 'com'].
I tried lookahead and lookbehind assertions, but they were applied per-character (so for example re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') would return 'pi' when I want it to return nothing). I tried wrapping the alpha part as a single group ((?:[a-z]+)) but it didn't help.
More detail: the data is an email database, so it's mostly plain English with normal numbers, but occasionally there are rubbish strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I'd like to ignore completely. I'm assuming any alphabetical sequence with a number in it is rubbish.
Recommended answer
If you restrict yourself to ASCII letters, then use (with the re.I option set)
\b[a-z]+\b
\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
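As a rough illustration of the difference, here is a small Python sketch that compares the \b pattern with the lookaround attempt from the question. The helper name extract_words and the sample strings are made up for the example; only the patterns come from the thread.

```python
import re

# Pattern from the answer; re.IGNORECASE is the long name for re.I.
WORD = re.compile(r"\b[a-z]+\b", re.IGNORECASE)

# Pattern from the question, kept only to show why it misbehaves.
LOOKAROUND = re.compile(r"(?<!\d)[a-z]+(?!\d)", re.IGNORECASE)


def extract_words(text):
    """Return every run of ASCII letters that stands alone as a word."""
    return WORD.findall(text)


print(extract_words("pie21"))            # []
print(extract_words("http://foo.com"))   # ['http', 'foo', 'com']
print(extract_words("Ref AC7A21C0: please call me"))
# ['Ref', 'please', 'call', 'me']

# The lookaround version backtracks from 'pie' to 'pi', because 'pi' is
# followed by 'e', which is not a digit.
print(LOOKAROUND.findall("pie21"))       # ['pi']
```

One caveat: \b counts the underscore as a word character, so a string such as foo_bar yields no matches with this pattern, even though the question lists _ among the separators.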
To also allow other non-ASCII letters, you can use something like this:
\b[^\W\d_]+\b
which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.
The negated character class [^\W\d_] matches any word character (anything \w would match) except digits and the underscore, which leaves only letters.
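A minimal sketch of the Unicode-aware variant, assuming Python 3, where str patterns treat \w and \d as Unicode-aware by default (on Python 2 you would pass re.UNICODE and use unicode strings). The helper name extract_unicode_words and the sample text are invented for illustration.

```python
import re

# [^\W\d_] = "a word character that is neither a digit nor an underscore",
# i.e. a letter. No flag is needed on Python 3 for str patterns.
LETTERS = re.compile(r"\b[^\W\d_]+\b")


def extract_unicode_words(text):
    """Return every stand-alone run of letters, including accented ones."""
    return LETTERS.findall(text)


sample = "caf\u00e9 na\u00efve 21pie pie21 r\u00e9sum\u00e9"
print(extract_unicode_words(sample))
```

With this sample, the accented words are kept while 21pie and pie21 are dropped, for the same word-boundary reason as in the ASCII version.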