Problem description
In Lua, I'm trying to pattern match and capture:
+384 Critical Strike (Reforged from Parry Chance)
as
(+384) (Critical Strike)
The suffix (Reforged from %s) is optional.
I'm trying to match a string in Lua using patterns (i.e. strfind).
Example strings:
+384 Critical Strike
+1128 Hit
This breaks down into two parts that I want to capture:
- The number, with its leading positive or negative indicator; in this case, +384.
- The string; in this case, Critical Strike.
I can capture these using a fairly simple pattern, and this pattern works in Lua:
local text = "+384 Critical Strike";
local pattern = "([%+%-]%d+) (.+)";
local _, _, value, stat = strfind(text, pattern);
- value = +384
- stat = Critical Strike
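For readers on stock Lua 5.x, where the Lua 4 style global strfind is spelled string.find, the same capture can be sketched as follows. The alias line and the function name captureStat are assumptions made for this illustration and are not part of the original snippet.

```lua
-- Assumption: fall back to string.find when the Lua 4 style global strfind is absent.
local strfind = strfind or string.find

-- Hypothetical helper: returns the signed number and the rest of the line.
function captureStat(text)
  local _, _, value, stat = strfind(text, "([%+%-]%d+) (.+)")
  return value, stat
end

print(captureStat("+384 Critical Strike"))
print(captureStat("+1128 Hit"))
```

The first two return values of the find function are the start and end indices of the match, which is why they are discarded with underscores.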
Now I need to expand that pattern to include an optional suffix:
+384 Critical Strike (Reforged from Parry Chance)
which should break down into the value +384, the stat Critical Strike, and the optional suffix (Reforged from Parry Chance).
Note: I don't particularly care about the optional trailing suffix; I have no requirement to capture it, although capturing it would be handy.
This is where I start to run into issues with greedy capturing. Right away, the pattern I already have does what I don't want it to:
- pattern = ([%+%-]%d+) (.+)
- value = +384
- stat = Critical Strike (Reforged from Parry Chance)
But let's try to include the suffix in the pattern:
pattern = "([%+%-]%d+) (.+)( %(Reforged from .+%))?"
I'm using the ? operator to indicate 0 or 1 appearances of the suffix, but that matches nothing.
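The likely reason this matches nothing is that Lua patterns are not regular expressions: the quantifiers ?, *, + and - apply only to a single character class, never to a parenthesized capture. A ? that follows a closing parenthesis is therefore treated as a literal question mark. A small sketch, in which the second input string is made up for illustration:

```lua
local pattern = "([%+%-]%d+) (.+)( %(Reforged from .+%))?"
local line = "+384 Critical Strike (Reforged from Parry Chance)"

-- No match: the pattern demands a literal "?" after the closing parenthesis.
print(string.match(line, pattern))

-- Appending a literal "?" to the input makes the very same pattern match.
print(string.match(line .. "?", pattern))
```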
I blindly tried changing the optional suffix group from parentheses ( to brackets [:
pattern = "([%+%-]%d+) (.+)[ %(Reforged from .+%)]?"
But now the match is greedy again:
- value = +384
- stat = Critical Strike (Reforged from Parry Chance)
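Square brackets in a Lua pattern always denote a character set, so [ %(Reforged from .+%)]? means at most one character drawn from the set made up of space, the two parentheses, dot, plus, and the letters of "Reforged from". It never matches the phrase as a whole, which is why the greedy (.+) in front of it still swallows everything. A minimal sketch:

```lua
local set = "[ %(Reforged from .+%)]"

-- A single character from the set matches.
print(string.match("R", "^" .. set .. "$"))

-- The whole word does not match a single set item.
print(string.match("Reforged", "^" .. set .. "$"))

-- It only matches as a run of individual set members.
print(string.match("Reforged", "^" .. set .. "+$"))
```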
Based on the Lua pattern reference:
For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.
The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.
and the magic matchers:
- *, which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
- +, which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
- -, which also matches 0 or more repetitions of characters in the class. Unlike "*", these repetition items will always match the shortest possible sequence;
- ?, which matches 0 or 1 occurrence of a character in the class;
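The difference between the longest-match and shortest-match repetitions quoted above can be seen on a small made-up string:

```lua
local sample = "<<a>><<b>>"

-- ".*" is greedy: it runs to the last ">>" it can still match.
print(string.match(sample, "<<(.*)>>"))  -- a>><<b

-- ".-" is lazy: it stops at the first ">>".
print(string.match(sample, "<<(.-)>>"))  -- a
```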
I noticed that there's a greedy * modifier and a non-greedy - modifier. Since my middle string matcher:
(%d) (%s) (%s)
seems to be absorbing text until the end, perhaps I should try to make it non-greedy by changing the * to a -:
oldPattern = "([%+%-]%d+) (.*)[ %(Reforged from .+%)]?"
newPattern = "([%+%-]%d+) (.-)[ %(Reforged from .+%)]?"
Except that now it fails to match:
- value = +384
- stat = nil
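A lazy .- only grows when whatever follows it in the pattern fails to match. Here everything after it is optional, so the shortest successful match is zero characters; on stock Lua 5.x the capture comes back as an empty string (the nil reported above may be specific to the asker's environment). Anchoring the pattern with $ is one way to force the lazy capture to expand, sketched below. It does not yet deal with the suffix.

```lua
local line = "+384 Critical Strike"

-- Nothing after (.-) is mandatory, so it settles for zero characters.
local value, stat = string.match(line, "([%+%-]%d+) (.-)[ %(Reforged from .+%)]?")
print(value, "[" .. stat .. "]")

-- With a "$" anchor, (.-) has to grow until the end of the line is reached.
value, stat = string.match(line, "^([%+%-]%d+) (.-)$")
print(value, "[" .. stat .. "]")
```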
Rather than have the middle group capture "any" character (i.e. .), I tried a set that contains everything except (:
pattern = "([%+%-]%d+) ([^%(]*)( %(Reforged from .+%))?"
And then the wheels came off the wagon:
local pattern = "([%+%-]%d+) ([^%(]*)( %(Reforged from .+%))?"
local pattern = "([%+%-]%d+) ((^%()*)( %(Reforged from .+%))?"
local pattern = "([%+%-]%d+) (%a )+)[ %(Reforged from .+%)]?"
I thought I was close with:
local pattern = "([%+%-]%d+) ([%a ]+)[ %(Reforged from .+%)]?"
which captures:
- value = "+384"
- stat = "Critical Strike " (notice the trailing space)
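One way to lose the trailing space without a second pass is to require the captured run to end in a letter, so that the greedy [%a ]* has to give back the final space. This is only a sketch under the question's assumption that stat names consist of Latin letters and spaces; the function name statOnly is hypothetical.

```lua
function statOnly(text)
  -- [%a ]* grabs letters and spaces greedily; the final %a forces the capture to end on a letter.
  return string.match(text, "^([%+%-]%d+) ([%a ]*%a)")
end

print(statOnly("+384 Critical Strike (Reforged from Parry Chance)"))
print(statOnly("+384 Critical Strike"))
```

The suffix is simply ignored here, which fits the stated requirement that capturing it is optional.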
So this is where I bang my head against the pillow and go to sleep; I can't believe I've spent four hours on this regex... er, pattern.
@NicolBolas The set of all possible strings, defined using a pseudo-regular-expression language, is:
+%d %s (Reforged from %s)
where
- + represents either the plus sign (+) or the minus sign (-)
- %d represents any Latin digit character (e.g. 0..9)
- %s represents any Latin uppercase or lowercase letters, or embedded spaces (e.g. A-Za-z)
- the remaining characters are literals.
If I had to write a regular expression that obviously tries to do what I want:
[+-]\d+ [\w\s]+( \(Reforged from [\w\s]+\))?
But if I didn't explain it well enough, I can give you a nearly complete list of the values I'm likely to encounter in the wild:
- +123 Parry: positive number, single word
- +123 Critical Strike: positive number, two words
- -123 Parry: negative number, single word
- -123 Critical Strike: negative number, two words
- +123 Parry (Reforged from Dodge): positive number, single word, optional suffix present with a single word
- +123 Critical Strike (Reforged from Dodge): positive number, two words, optional suffix present with a single word
- -123 Parry (Reforged from Hit Chance): negative number, single word, optional suffix present with two words
- -123 Critical Strike (Reforged from Hit Chance): negative number, two words, optional suffix present with two words
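Because a Lua pattern cannot mark a whole group as optional, a common workaround is to try the pattern that includes the suffix first and fall back to the plain one. The sketch below is one possible approach rather than the only one; parseStat and the cases table are names invented here, and the cases are taken from the list above.

```lua
function parseStat(text)
  -- First attempt: suffix present. Both (.-) captures are lazy and bounded by literal text.
  local value, stat, source = string.match(text, "^([%+%-]%d+) (.-) %(Reforged from (.-)%)$")
  if not value then
    -- Fallback: no suffix, so the stat runs to the end of the line.
    value, stat = string.match(text, "^([%+%-]%d+) (.+)$")
  end
  return value, stat, source
end

local cases = {
  "+123 Parry",
  "-123 Critical Strike",
  "+123 Critical Strike (Reforged from Dodge)",
  "-123 Parry (Reforged from Hit Chance)",
}
for _, line in ipairs(cases) do
  print(parseStat(line))
end
```

The third return value is nil when the line carries no suffix, so callers can tell the two shapes apart.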
There are also bonus strings that, it would seem obvious, the pattern should match as well:
- +1234 Critical Strike Chance: four-digit number, three words
- +12345 Mount and run speed increase: five-digit number, five words
- +123456 Mount and run speed increase: six-digit number, five words
- -1 MoUnT aNd RuN sPeEd InCrEaSe: one-digit number, five words
- -1 HiT (Reforged from CrItIcAl StRiKe ChAnCe): negative one-digit number, one word, optional suffix present with three words
And while the ideal pattern should match the bonus entries above, it does not have to.
In reality, all the "numbers" I am attempting to parse will be localized, e.g.:
- +123,456 in English (en-US)
- +123.456 in German (de-DE)
- +123'456 in French (fr-CA)
- +123 456 in Estonian (et-EE)
- +1,23,456 in Assamese (as-IN)
Any answer must not attempt to account for these localization issues. You do not know the locale a number will be presented in, which is why number localization has been removed from the question. You must strictly assume that numbers contain only the plus sign, the hyphen-minus, and the Latin digits 0 through 9. I already know how to parse localized numbers. This question is about trying to match the optional suffix with a greedy pattern parser.
Edit: You really didn't have to try to handle localized numbers. At some level, trying to handle them without knowing the locale is wrong. For one thing, I didn't include all possible localizations of numbers. For another, I don't know what localizations might exist in the future.
Recommended answer
Hmm, I don't have Lua 4 installed, but this pattern works under Lua 5. I would expect it to work for Lua 4 as well.
Update 1: Since additional requirements have been specified (localization), I've adapted the pattern and the tests to reflect these.
Update 2: Updated the pattern and tests to deal with an additional class of text containing a number, as mentioned by @IanBoyd in the comments. Added an explanation of the string pattern.
Update 3: Added a variation for the case where the localized number is dealt with separately, as mentioned in the last update to the question.
Try:
"(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
or (with no attempt to validate number localization tokens) just take anything that is not a letter, with a digit sentinel at the end of the number capture:
"(([%+%-][^%a]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
Neither of the patterns above is meant to deal with numbers in scientific notation (e.g. 1.23e+10).
Lua 5 test (edited to clean up, as the tests were getting cluttered):
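function test(tab, pattern)
  for i, v in ipairs(tab) do
    -- f1 is the whole match; f2..f4 are the number, the stat, and the suffix
    local f1, f2, f3, f4 = v:match(pattern)
    -- tostring() keeps string.format from raising an error on Lua 5.1 when a line does not match
    print(string.format("Test{%d} - Whole:{%s}\nFirst:{%s}\nSecond:{%s}\nThird:{%s}\n",
      i, tostring(f1), tostring(f2), tostring(f3), tostring(f4)))
  end
end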
local pattern = "(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
local testing = {"+123 Parry",
"+123 Critical Strike",
"-123 Parry",
"-123 Critical Strike",
"+123 Parry (Reforged from Dodge)",
"+123 Critical Strike (Reforged from Dodge)",
"-123 Parry (Reforged from Hit Chance)",
"-123 Critical Strike (Reforged from Hit Chance)",
"+122384 Critical Strike (Reforged from parry chance)",
"+384 Critical Strike ",
"+384Critical Strike (Reforged from parry chance)",
"+1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+12345 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123456 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
"-1 MoUnT aNd RuN sPeEd InCrEaSe (Reforged from CrItIcAl StRiKe ChAnCe)",
"-1 HiT (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123,456 +1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123.456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123'456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123 456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+1,23,456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+9 mana every 5 sec",
"-9 mana every 20 min (Does not occurr in data but gets captured if there)"}
test(testing, pattern)
Here is a breakdown of the pattern:
local explainPattern =
"(" -- start whole string capture
..
--[[
capture localized number with sign -
take at first as few digits and separators as you can
ensuring the capture ends with at least 1 digit
(the last digit is our sentinel enforcing the boundary)]]
"([%+%-][',%.%d%s]-[%d]+)"
..
--[[
gobble as much space as you can]]
"%s*"
..
--[[
capture start with letters, followed by anything which is not a bracket
ending with at least 1 letter]]
"([%a]+[^%(^%)]+[%a]+)"
..
--[[
gobble as much space as you can]]
"%s*"
..
--[[
capture an optional bracket
followed by 0 or more letters and spaces
ending with an optional bracket]]
"(%(?[%a%s]*%)?)"
..
")" -- end whole string capture
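As a usage sketch, the answer's first pattern can be wrapped in a small function that drops the whole-match capture and returns only the three parts of interest. The name parseLine is hypothetical. Note that a line without a suffix yields an empty string, not nil, for the last capture, because every item inside that capture is optional.

```lua
local pattern = "(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"

function parseLine(text)
  -- The first capture is the whole match; discard it and return the number, the stat and the suffix.
  local _, number, stat, suffix = string.match(text, pattern)
  return number, stat, suffix
end

print(parseLine("+123 Parry (Reforged from Dodge)"))
print(parseLine("+123 Parry"))
```

Since the stat sub-pattern needs a leading letter, at least one non-bracket character, and a trailing letter, it only accepts stat names of three or more characters.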