Problem description
While processing a large textual chemical database with Perl, I ran into the problem of using a regex to match chemical formulae. I have seen these two previous topics, but the answers suggested there are too loose for my requirements.
Specifically, my (admittedly limited) research has led me to this posting, which gives a regex for the currently accepted chemical symbols. I'll copy it here for reference:
[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb
(Thus, e.g., C, Cm, and Cn will pass, but not Cg or Cx.)
As with the previous questions, I also need to match numbers, complete sets of parentheses and complete sets of square brackets, so that both, e.g., C2H6O and (CH3)2CFCOO(CH2)2Si(CH3)2Cl are matched.
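As a quick sanity check of the symbol regex quoted above, here is a minimal sketch in Python rather than Perl, using only the standard re module. It assumes the C branch of the character classes includes m and n, so that Cm and Cn pass as the question states; the names SYMBOL and is_symbol are made up for illustration.

```python
import re

# Symbol regex from the question. Assumption: the C branch contains "m" and
# "n" so that Cm (curium) and Cn (copernicium) are accepted.
SYMBOL = re.compile(
    r"[BCFHIKNOPSUVWY]|[ISZ][nr]|[ACELP][ru]|A[cglmst]|B[aehikr]|C[adeflmnos]"
    r"|D[bsy]|Es|F[elmr]|G[ade]|H[efgos]|Kr|L[aiv]|M[cdgnot]|N[abdehiop]"
    r"|O[gs]|P[abdmot]|R[abe-hnu]|S[bcegim]|T[abcehilms]|Xe|Yb"
)


def is_symbol(text):
    """Return True if text is exactly one element symbol."""
    return SYMBOL.fullmatch(text) is not None


for candidate in ["C", "Cm", "Cn", "Cg", "Cx"]:
    print(candidate, is_symbol(candidate))
```

Using fullmatch matters here: an unanchored search would accept Cg by matching only its leading C.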
So how do I combine the previous solutions with the grand regex for matching valid chemical elements to strictly match a chemical formula?
(If it's not too much trouble to add, a blow-by-blow account of how to humanly parse the regex would be appreciated greatly, though not strictly necessary.)
Brief
I decided to build one large regex that does what you want while keeping it readable by defining each piece as a named group. It is meant to be used together with a loop that walks over the matches for square-bracket or parenthesis groups.
Assumptions
I am assuming the following since the OP has not given a full list of positive and negative matches:
- Nested parentheses aren't possible
- Nested square brackets aren't possible
- Square bracket groups that surround a single parentheses group are redundant and therefore incorrect
- Square bracket groups must contain at least 2 groups, of which 1 such group must be a parentheses group
If any of these assumptions are incorrect, please let me know so that I can fix the regex accordingly.
Answer
Code
(?(DEFINE)
(?# Periodic elements )
(?<Hydrogen>H)
(?<Helium>He)
(?<Lithium>Li)
(?<Beryllium>Be)
(?<Boron>B)
(?<Carbon>C)
(?<Nitrogen>N)
(?<Oxygen>O)
(?<Fluorine>F)
(?<Neon>Ne)
(?<Sodium>Na)
(?<Magnesium>Mg)
(?<Aluminum>Al)
(?<Silicon>Si)
(?<Phosphorus>P)
(?<Sulfur>S)
(?<Chlorine>Cl)
(?<Argon>Ar)
(?<Potassium>K)
(?<Calcium>Ca)
(?<Scandium>Sc)
(?<Titanium>Ti)
(?<Vanadium>V)
(?<Chromium>Cr)
(?<Manganese>Mn)
(?<Iron>Fe)
(?<Cobalt>Co)
(?<Nickel>Ni)
(?<Copper>Cu)
(?<Zinc>Zn)
(?<Gallium>Ga)
(?<Germanium>Ge)
(?<Arsenic>As)
(?<Selenium>Se)
(?<Bromine>Br)
(?<Krypton>Kr)
(?<Rubidium>Rb)
(?<Strontium>Sr)
(?<Yttrium>Y)
(?<Zirconium>Zr)
(?<Niobium>Nb)
(?<Molybdenum>Mo)
(?<Technetium>Tc)
(?<Ruthenium>Ru)
(?<Rhodium>Rh)
(?<Palladium>Pd)
(?<Silver>Ag)
(?<Cadmium>Cd)
(?<Indium>In)
(?<Tin>Sn)
(?<Antimony>Sb)
(?<Tellurium>Te)
(?<Iodine>I)
(?<Xenon>Xe)
(?<Cesium>Cs)
(?<Barium>Ba)
(?<Lanthanum>La)
(?<Cerium>Ce)
(?<Praseodymium>Pr)
(?<Neodymium>Nd)
(?<Promethium>Pm)
(?<Samarium>Sm)
(?<Europium>Eu)
(?<Gadolinium>Gd)
(?<Terbium>Tb)
(?<Dysprosium>Dy)
(?<Holmium>Ho)
(?<Erbium>Er)
(?<Thulium>Tm)
(?<Ytterbium>Yb)
(?<Lutetium>Lu)
(?<Hafnium>Hf)
(?<Tantalum>Ta)
(?<Tungsten>W)
(?<Rhenium>Re)
(?<Osmium>Os)
(?<Iridium>Ir)
(?<Platinum>Pt)
(?<Gold>Au)
(?<Mercury>Hg)
(?<Thallium>Tl)
(?<Lead>Pb)
(?<Bismuth>Bi)
(?<Polonium>Po)
(?<Astatine>At)
(?<Radon>Rn)
(?<Francium>Fr)
(?<Radium>Ra)
(?<Actinium>Ac)
(?<Thorium>Th)
(?<Protactinium>Pa)
(?<Uranium>U)
(?<Neptunium>Np)
(?<Plutonium>Pu)
(?<Americium>Am)
(?<Curium>Cm)
(?<Berkelium>Bk)
(?<Californium>Cf)
(?<Einsteinium>Es)
(?<Fermium>Fm)
(?<Mendelevium>Md)
(?<Nobelium>No)
(?<Lawrencium>Lr)
(?<Rutherfordium>Rf)
(?<Dubnium>Db)
(?<Seaborgium>Sg)
(?<Bohrium>Bh)
(?<Hassium>Hs)
(?<Meitnerium>Mt)
(?<Darmstadtium>Ds)
(?<Roentgenium>Rg)
(?<Copernicium>Cn)
(?<Nihonium>Nh)
(?<Flerovium>Fl)
(?<Moscovium>Mc)
(?<Livermorium>Lv)
(?<Tennessine>Ts)
(?<Oganesson>Og)
(?# Regex )
(?<Element>(?&Actinium)|(?&Silver)|(?&Aluminum)|(?&Americium)|(?&Argon)|(?&Arsenic)|(?&Astatine)|(?&Gold)|(?&Barium)|(?&Beryllium)|(?&Bohrium)|(?&Bismuth)|(?&Berkelium)|(?&Bromine)|(?&Boron)|(?&Calcium)|(?&Cadmium)|(?&Cerium)|(?&Californium)|(?&Chlorine)|(?&Curium)|(?&Copernicium)|(?&Cobalt)|(?&Chromium)|(?&Cesium)|(?&Copper)|(?&Carbon)|(?&Dubnium)|(?&Darmstadtium)|(?&Dysprosium)|(?&Erbium)|(?&Einsteinium)|(?&Europium)|(?&Iron)|(?&Flerovium)|(?&Fermium)|(?&Francium)|(?&Fluorine)|(?&Gallium)|(?&Gadolinium)|(?&Germanium)|(?&Helium)|(?&Hafnium)|(?&Mercury)|(?&Holmium)|(?&Hassium)|(?&Hydrogen)|(?&Indium)|(?&Iridium)|(?&Iodine)|(?&Krypton)|(?&Potassium)|(?&Lanthanum)|(?&Lithium)|(?&Lawrencium)|(?&Lutetium)|(?&Livermorium)|(?&Moscovium)|(?&Mendelevium)|(?&Magnesium)|(?&Manganese)|(?&Molybdenum)|(?&Meitnerium)|(?&Sodium)|(?&Niobium)|(?&Neodymium)|(?&Neon)|(?&Nihonium)|(?&Nickel)|(?&Nobelium)|(?&Neptunium)|(?&Nitrogen)|(?&Oganesson)|(?&Osmium)|(?&Oxygen)|(?&Protactinium)|(?&Lead)|(?&Palladium)|(?&Promethium)|(?&Polonium)|(?&Praseodymium)|(?&Platinum)|(?&Plutonium)|(?&Phosphorus)|(?&Radium)|(?&Rubidium)|(?&Rhenium)|(?&Rutherfordium)|(?&Roentgenium)|(?&Rhodium)|(?&Radon)|(?&Ruthenium)|(?&Antimony)|(?&Scandium)|(?&Selenium)|(?&Seaborgium)|(?&Silicon)|(?&Samarium)|(?&Tin)|(?&Strontium)|(?&Sulfur)|(?&Tantalum)|(?&Terbium)|(?&Technetium)|(?&Tellurium)|(?&Thorium)|(?&Titanium)|(?&Thallium)|(?&Thulium)|(?&Tennessine)|(?&Uranium)|(?&Vanadium)|(?&Tungsten)|(?&Xenon)|(?&Ytterbium)|(?&Yttrium)|(?&Zirconium)|(?&Zinc))
(?<Num>(?:[1-9]\d*)?)
(?<ElementGroup>(?:(?&Element)(?&Num))+)
(?<ElementParenthesesGroup>\((?&ElementGroup)+\)(?&Num))
(?<ElementSquareBracketGroup>\[(?:(?:(?&ElementParenthesesGroup)(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+)|(?:(?:(?&ElementGroup)|(?&ElementParenthesesGroup))+(?&ElementParenthesesGroup)))\](?&Num))
)
^((?<Brackets>(?&ElementSquareBracketGroup))|(?<Parentheses>(?&ElementParenthesesGroup))|(?<Group>(?&ElementGroup)))+$
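For readers who want to try the same grammar outside Perl/PCRE, the following is a rough Python sketch. Python's standard re module does not support (?(DEFINE)...) or (?&name) calls, so the sketch swaps in plain string composition to build an equivalent pattern. All names (SYMBOLS, UNIT, is_formula and so on) are hypothetical, and this is one possible translation, not the answer's own code.

```python
import re

# All 118 symbols in atomic-number order, as in the DEFINE block above.
SYMBOLS = """
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split()

# Two-letter symbols first, so that "Ca" is tried before "C".
ELEMENT = "(?:" + "|".join(sorted(SYMBOLS, key=len, reverse=True)) + ")"
NUM = r"(?:[1-9]\d*)?"            # optional count, no leading zero
UNIT = ELEMENT + NUM              # one element with its optional count

# ElementParenthesesGroup: "(" units ")" count
PAREN_GROUP = rf"\((?:{UNIT})+\){NUM}"

# ElementSquareBracketGroup: starts or ends with a parentheses group and
# contains at least one other item.
ITEM = f"(?:{UNIT}|{PAREN_GROUP})"
BRACKET_GROUP = rf"\[(?:{PAREN_GROUP}{ITEM}+|{ITEM}+{PAREN_GROUP})\]{NUM}"

FORMULA = re.compile(f"(?:{BRACKET_GROUP}|{PAREN_GROUP}|{UNIT})+")


def is_formula(text):
    """Return True if the whole string matches the grammar sketched above."""
    return FORMULA.fullmatch(text) is not None


for formula in ["C2H6O", "(CH3)2CFCOO(CH2)2Si(CH3)2Cl", "[(NO4)]", "N01"]:
    print(formula, is_formula(formula))
```

The sketch repeats a single element-plus-count unit instead of repeating ElementGroup, which already repeats itself. The accepted strings should be the same, and it avoids nested quantifiers, which can backtrack badly on non-matching input. fullmatch takes the place of the ^ and $ anchors.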
Explanation
- The first part of the (?(DEFINE)) section lists each element of the periodic table, ordered by atomic number for easy lookup.
- The Element group is a simple alternation (|) of the elements listed above. Within it, symbols are ordered alphabetically by first character and then by length, longest first, so that, for example, Calcium (Ca) is tried before Carbon (C) and C is not matched where Ca was intended.
- ElementGroup specifies a run of one or more Element, each optionally followed by a count. The count, defined by the group Num, is a positive integer with no leading zero.
  - Valid examples
    - C: an Element
    - CH: an Element followed by another Element
    - CH3: an Element followed by another Element and a Num
    - O2: an Element followed by a Num
  - Invalid examples
    - N0: 0 cannot be used explicitly
    - N01: the Num group requires the number to begin with 1-9 or be absent
    - A: element does not exist
    - c: element does not exist (the regex is case sensitive)
- ElementParenthesesGroup specifies one or more ElementGroup between parentheses ( ), optionally followed by a Num.
  - Valid examples
    - (CH): ElementGroup surrounded by parentheses
    - (CH3): ElementGroup surrounded by parentheses
    - (CH3NO4): multiple ElementGroup surrounded by parentheses
    - (CH3NO4)2: multiple ElementGroup surrounded by parentheses, followed by a Num
  - Invalid examples
    - (CH[NO4]): only ElementGroup is valid inside ElementParenthesesGroup
- ElementSquareBracketGroup specifies a sequence of ElementParenthesesGroup and ElementGroup between square brackets [ ], containing at least one ElementParenthesesGroup and at least one other group (ElementParenthesesGroup or ElementGroup), optionally followed by a Num.
  - Valid examples
    - [CH3(NO4)]: contains an ElementParenthesesGroup and one other group (an ElementGroup)
    - [(NO4)CH]2: contains an ElementParenthesesGroup and one other group (an ElementGroup), followed by a Num
    - [(NO4)(CH3)]: contains an ElementParenthesesGroup and one other group (a second ElementParenthesesGroup)
  - Invalid examples
    - [(NO4)]: does not contain a second group, so the brackets [ ] are redundant
    - [NO4]: does not contain an ElementParenthesesGroup
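Two of the points above, the ordering of alternatives inside Element and the behaviour of Num, can be checked in isolation. The snippet below is a small Python sketch (standard re module only) with throwaway patterns; the variable names are made up for illustration.

```python
import re

# Alternation is tried left to right, so the shorter symbol wins if it
# is listed first.
short_first = re.match(r"C|Ca", "Ca").group()
long_first = re.match(r"Ca|C", "Ca").group()
print(short_first)  # C
print(long_first)   # Ca

# Num: an optional positive integer without a leading zero.
NUM = re.compile(r"(?:[1-9]\d*)?")
for text in ["", "3", "10", "0", "01"]:
    print(repr(text), NUM.fullmatch(text) is not None)
```

The empty string is accepted because the count is optional, while 0 and 01 are rejected because the first digit must be 1-9.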
Additional Information
I realize this is a very long answer, but the OP is asking a very specific question and wants to ensure specific criteria are met.
Ensure the following flags are set:
- g: ensures global matches
- x: ensures whitespace is ignored
- if the data is spread across multiple lines (separated by a newline character), use m for multi-line
Note: The regex will only capture the last group of type X that it finds (overwriting the previously captured group of that type). This is the default behaviour of regex, and there is currently no way to override it. This may give you undesirable results. You can see this with the last example in the linked regex as well as with your example of (CH3)2CFCOO(CH2)2Si(CH3)2Cl, since it contains multiple groups of each type.
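The overwrite behaviour, and one way to loop over the pieces instead, can be sketched in Python as follows. This is an illustration only: it uses loose stand-in patterns ([A-Z][a-z]? for any symbol) rather than the strict groups, and so assumes the formula has already been validated by the strict regex. The names UNIT, PAREN, BRACKET, TOKEN and pieces are hypothetical.

```python
import re

FORMULA = "(CH3)2CFCOO(CH2)2Si(CH3)2Cl"

# Loose stand-ins for the strict groups (assumption: input already validated).
UNIT = r"[A-Z][a-z]?\d*"
PAREN = r"\([A-Za-z0-9]+\)\d*"
BRACKET = r"\[[^\]]+\]\d*"

# One repeated match: each named group keeps only its last repetition.
whole = re.fullmatch(
    rf"(?:(?P<Brackets>{BRACKET})|(?P<Parentheses>{PAREN})|(?P<Group>{UNIT}))+",
    FORMULA,
)
print(whole.group("Parentheses"))  # (CH3)2, the last parentheses group only
print(whole.group("Group"))        # Cl

# Looping over successive matches keeps every piece, in order.
TOKEN = re.compile(
    rf"(?P<Brackets>{BRACKET})|(?P<Parentheses>{PAREN})|(?P<Group>(?:{UNIT})+)"
)
pieces = [(m.lastgroup, m.group()) for m in TOKEN.finditer(FORMULA)]
for kind, text in pieces:
    print(kind, text)
```

Here pieces holds six entries, alternating Parentheses and Group, starting with (CH3)2 and ending with Cl.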