Problem description
I have to parse some strings based on PCRE in Python, and I have no idea how to do that.
The strings I want to parse look like this:
match mysql m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s p/MySQL/ i/$1/
In this example, I need to extract these separate items:
"m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s" ; "p/MySQL/" ; "i/$1/"
The only thing I've found relating to PCRE manipulation in Python is this module: http://pydoc.org/2.2.3/pcre.html (but it says it is a .so file ...)
Do you know whether a Python module exists to parse this kind of string?
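For the narrow task of splitting such a line into its fields, the standard `re` module may already be enough. The following is only a minimal sketch: the helper name `split_match_line`, the set of accepted delimiters, and the sample line (with its backslash escapes written out the way I assume they were meant) are all assumptions, not something the question specifies.

```python
import re

# Hypothetical field grammar: a lowercase field name, a delimiter, a body that
# runs to the next occurrence of the same delimiter, then optional flags.
FIELD = re.compile(
    r"(?P<name>[a-z]+)"      # field name such as m, p, i
    r"(?P<delim>[/|])"       # delimiter (assumed to be / or |)
    r"(?P<body>.*?)"         # body, matched lazily
    r"(?P=delim)"            # the same delimiter closes the field
    r"(?P<flags>[a-z]*)"     # optional trailing flags such as s
    r"(?=\s|$)",             # a field ends at whitespace or end of line
    re.DOTALL,
)

def split_match_line(line):
    """Split 'match <service> <fields...>' into directive, service, fields."""
    directive, service, rest = line.split(None, 2)
    fields = [m.group(0) for m in FIELD.finditer(rest)]
    return directive, service, fields

# Assumed sample line; the raw string keeps the backslashes literal.
sample = r"match mysql m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s p/MySQL/ i/$1/"
directive, service, fields = split_match_line(sample)
for field in fields:
    print(field)
```

This only separates the fields. Whether the Perl pattern inside `m/.../s` then compiles under a Python engine is a separate question, and that is what the answer below is about.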
Be Especially Careful with non‐ASCII in Python
There are some really subtle issues with how Python deals with, or fails to deal with, non-ASCII in patterns and strings. Worse, these disparities vary substantially according not just to which version of Python you are using, but also to whether you have a "wide build".
In general, when you’re doing Unicode stuff, Python 3 with a wide build works best and Python 2 with a narrow build works worst, but all combinations are still a far cry from how Perl regexes work vis‐à‐vis Unicode. If you’re looking for ᴘᴄʀᴇ patterns in Python, you may have to look a bit further afield than its old `re` module.
The vexing "wide-build" issues have finally been fixed once and for all, provided you use a sufficiently advanced release of Python. The fix is described in the v3.3 release notes.
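A quick way to see which situation you are in is to look at how the interpreter treats a code point outside the Basic Multilingual Plane. This is a small sketch that assumes Python 3.3 or later; the variable names are made up for illustration.

```python
import re
import sys

# One code point outside the BMP: MUSICAL SYMBOL G CLEF.
astral = "\U0001D11E"

# On Python 3.3+ every build covers the full Unicode range.
full_range = sys.maxunicode == 0x10FFFF

# A narrow build would report length 2 (a surrogate pair) here.
length = len(astral)

# "." should consume the whole character, not half of a surrogate pair.
dot_matches_whole_char = re.fullmatch(r".", astral) is not None

print(full_range, length, dot_matches_whole_char)
```

On an old narrow build the length would be 2 and the single `.` would fail to match the whole string, which is the kind of disparity described above.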
The Future of Python Regexes
In contrast to what’s currently available in the standard Python distribution’s `re` library, Matthew Barnett’s `regex` module for both Python 2 and Python 3 alike is much, much better in pretty much all possible ways and will quite probably replace `re` eventually. Its particular relevance to your question is that his `regex` library is far more ᴘᴄʀᴇ (i.e. it’s much more Perl‐compatible) in every way than `re` now is, which will make porting Perl regexes to Python easier for you. Because it is a ground‐up rewrite (as in from‐scratch, not as in hamburger :), it was written with non-ASCII in mind, which `re` was not.
The `regex` library therefore much more closely follows the (current) recommendations of UTS#18: Unicode Regular Expressions in how it approaches things. It meets or exceeds the UTS#18 Level 1 requirements in most if not all regards, something you normally need the ICU regex library or Perl itself for (or, if you are especially courageous, the new Java 7 update to its regexes, as that also conforms to the Level One requirements from UTS#18).
Beyond meeting those Level One requirements, which are all absolutely essential for basic Unicode support, but which are not met by Python’s current `re` library, the awesome `regex` library also meets the Level Two requirements for RL2.5 Named Characters (`\N{...}`), RL2.2 Extended Grapheme Clusters (`\X`), and the new RL2.7 on Full Properties from revision 14 of UTS#18.
Matthew’s `regex` module also does Unicode casefolding so that case-insensitive matches work reliably on Unicode, which `re` does not.
An earlier caveat about casefolding no longer applies, because `regex` now supports full Unicode casefolding, like Perl and Ruby.
Under full casefolding, this means that (for example) "ß" now correctly matches "SS", "ss", "ſſ", "ſs" (etc.) when case-insensitive matching is selected. (This is admittedly more important in the Greek script than the Latin one.)
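The difference between simple and full casefolding can be seen with a few lines of Python. The sketch below uses only documented calls, but the helper name `casefold_equal` is made up, and the `regex` part runs only if that third-party module happens to be installed (it asks for full casefolding explicitly with the `FULLCASE` flag).

```python
import re

try:
    import regex  # third-party module: pip install regex
except ImportError:
    regex = None

SHARP_S = "\N{LATIN SMALL LETTER SHARP S}"  # ß

def casefold_equal(a, b):
    """Compare two strings under full Unicode casefolding (str.casefold)."""
    return a.casefold() == b.casefold()

# The stdlib engine maps characters one to one, so one "ß" cannot match "SS".
stdlib_match = re.fullmatch(SHARP_S, "SS", re.IGNORECASE)

# Outside of regexes, str.casefold already folds "ß" and "ſſ" to "ss".
folded_ok = casefold_equal(SHARP_S, "SS") and casefold_equal("\u017f\u017f", "ss")

if regex is not None:
    # FULLCASE turns on full casefolding for IGNORECASE matching.
    full_match = regex.fullmatch(SHARP_S, "SS", regex.IGNORECASE | regex.FULLCASE)
    print("regex full casefold match:", full_match is not None)

print("re match:", stdlib_match is not None, "casefold equal:", folded_ok)
```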
See also the slides or doc source code from my third OSCON2011 talk entitled "Unicode Support Shootout: The Good, the Bad, and the (mostly) Ugly" for general issues in Unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl. If you can’t use either Perl regexes or possibly the ICU regex library (which doesn’t have named captures, alas!), then Matthew’s `regex` for Python is probably your best shot.
Nᴏᴛᴀ Bᴇɴᴇ s.ᴠ.ᴘ. (= s’il vous plaît, et même s’il ne vous plaît pas :) The following unsolicited noncommercial nonadvertisement was not actually put here by the author of the Python `regex` library. :)
Cool `regex` Features
The Python `regex` library has a cornucopia of superneat features, some of which are found in no other regex system anywhere. These make it very much worth checking out no matter whether you happen to be using it for its ᴘᴄʀᴇ‐ness or its stellar Unicode support.
A few of this module’s outstanding features of interest are:
- Variable‐width lookbehind, a feature which is quite rare in regex engines and very frustrating not to have when you really want it. This may well be the most frequently requested feature in regexes.
- Backwards searching so you don’t have to reverse your string yourself first.
- Scoped `ismx`‐type options, so that `(?i:foo)` only casefolds for foo, not overall, or `(?-i:foo)` to turn it off just on foo. This is how Perl works (or can).
- Fuzzy matching based on edit‐distance (which Udi Manber’s `agrep` and `glimpse` also have).
- Implicit shortest‐to‐longest sorted named lists via `\L<list>` interpolation.
- Metacharacters that specifically match only the start or only the end of a word rather than either side (`\m`, `\M`).
- Support for all Unicode line separators (Java can do this, as can Perl albeit somewhat begrudgingly with `\R`) per RL1.6.
- Full set operations (union, intersection, difference, and symmetric difference) on bracketed character classes per RL1.3, which is much easier than getting at it in Perl.
- Allows for repeated capture groups like `(\w+\s+)+` where you can get all separate matches of the first group, not just its last match. (I believe C# might also do this.)
- A more straightforward way to get at overlapping matches than sneaky capture groups in lookaheads.
- Start and end positions for all groups for later slicing/substring operations, much like Perl’s `@+` and `@-` arrays.
- The branch‐reset operator via `(?|...|...|...|)` to reset group numbering in each branch the way it works in Perl.
- Can be configured to have your coffee waiting for you in the morning.
- Support for the more sophisticated word boundaries from RL2.3.
- Assumes Unicode strings by default, and fully supports RL1.2a so that `\w`, `\b`, `\s`, and such work on Unicode.
- Supports `\X` for graphemes.
- Supports the `\G` continuation point assertion.
- Works correctly for 64‐bit builds (`re` only has 32‐bit indices).
- Supports multithreading.
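A handful of the features listed above can be tried in a few lines. This is a hedged sketch rather than a reference: it assumes the third-party `regex` module is installed (`pip install regex`) and skips the tour otherwise, and the function names and sample strings are invented for illustration.

```python
import re

try:
    import regex  # third-party module: pip install regex
except ImportError:
    regex = None

def stdlib_rejects_variable_lookbehind():
    """Return True if the stdlib re module refuses a variable-width lookbehind."""
    try:
        re.compile(r"(?<=a+)b")
    except re.error:
        return True
    return False

def regex_feature_tour():
    """Exercise a few regex-only features and return the results in a dict."""
    results = {}
    # Variable-width lookbehind: position of the "b" preceded by one or more "a".
    results["lookbehind"] = regex.search(r"(?<=a+)b", "aaab").start()
    # Repeated capture group: every capture of group 1, not only the last one.
    results["captures"] = regex.match(r"(\w+\s+)+", "one two three ").captures(1)
    # Overlapping matches without lookahead tricks.
    results["overlapped"] = regex.findall(r"\w\w", "abcd", overlapped=True)
    # Set difference (version 1 behaviour): lowercase ASCII letters minus vowels.
    results["set_difference"] = regex.findall(r"(?V1)[[a-z]--[aeiou]]", "hello")
    # Named list interpolation via \L<name>, supplied as a keyword argument.
    results["named_list"] = regex.findall(
        r"\L<words>", "cat dog bird", words=["cat", "bird"]
    )
    # Fuzzy matching: accept the string with at most one edit.
    results["fuzzy"] = regex.fullmatch(r"(?:MySQL){e<=1}", "MySQ") is not None
    return results

print("re rejects variable-width lookbehind:", stdlib_rejects_variable_lookbehind())
if regex is not None:
    for name, value in regex_feature_tour().items():
        print(name, value)
```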
Ok, that’s enough hype. :)
Yet Another Fine Alternate Regex Engine
One final alternative that is worth looking at if you are a regex geek is the Python library bindings to Russ Cox’s awesome RE2 library. It also supports Unicode natively, including simple char‐based casefolding, and unlike `re` it notably provides for both the Unicode General Category and the Unicode Script character properties, which are the two key properties you most often need for the simpler kinds of Unicode processing.
Although RE2 misses out on a few Unicode features like the `\N{...}` named character support found in ICU, Perl, and Python, it has extremely serious computational advantages that make it the regex engine of choice whenever you’re concerned with starvation‐based denial‐of‐service attacks through regexes in web queries and such. It manages this by forbidding backreferences, which cause a regex to stop being regular and risk super‐exponential explosions in time and space.
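To make concrete what RE2 gives up: a backreference re-matches whatever text an earlier group captured. The sketch below uses the standard `re` module, not RE2 itself, and the function name and sample sentence are made up. A pattern like this is exactly the kind that an engine which forbids backreferences will not accept.

```python
import re

# \1 must repeat exactly what group 1 captured, which a true regular
# language cannot express.
DOUBLED_WORD = re.compile(r"\b(\w+) \1\b")

def find_doubled_words(text):
    """Return each word that is immediately repeated in the text."""
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]

doubled = find_doubled_words("it was was fine, the the end")
print(doubled)
```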
Library bindings for RE2 are available not just for C/C++ and Python, but also for Perl and most especially for Go, where it is slated to very shortly replace the standard regex library there.