Problem description
I was looking for a way to extract the plain text from an RTF string, and I found the following regex:
({\\)(.+?)(})|(\\)(.+?)(\b)
However, the resulting string still contains two closing curly braces "}".
Before: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }
After: } can u send me info for the call pls }
Any thoughts on how to improve the regex?
Edit: A more complicated string such as this one does not work: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }
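In RTF, { and } mark a group. Groups can be nested. \ marks the beginning of a control word. Control words end with either a space or a non-alphabetic character. A control word can have a numeric parameter following it, without any delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.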
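As a rough illustration of these rules, the sketch below pulls control words and their numeric parameters out of a fragment and measures how deeply groups are nested. The names `CONTROL_WORD` and `max_group_depth` are made up for this example, and the regex is a simplification rather than a full RTF tokenizer.

```python
import re

# Fragment taken from the sample string in the question (sample text shortened).
snippet = r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 some text\par"

# Backslash, letters, then an optional signed numeric parameter with no delimiter.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)?")

# Each entry is (control word, parameter); the parameter is '' when absent.
control_words = CONTROL_WORD.findall(snippet)
print(control_words)


def max_group_depth(rtf):
    """Return the deepest nesting level of { } groups, ignoring escaped braces."""
    depth = deepest = 0
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it introduces
            continue
        if ch == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "}":
            depth -= 1
        i += 1
    return deepest


print(max_group_depth(r"{\rtf1{\fonttbl{\f0 Arial;}}text}"))
```

Skipping the character after every backslash is enough here to keep `\{` and `\}` from being counted as group delimiters.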
I think I have managed to make a pattern that takes care of most of the cases.
\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?
It leaves a few spaces when run on your sample string, though.
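For reference, this is one way the pattern could be applied in Python with `re.sub`, using the first sample string from the question. The variable names are only for illustration; the two leftover spaces are the ones that separate the top-level groups in the input.

```python
import re

# The pattern from the answer, unchanged.
RTF_MARKUP = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# First sample string from the question.
sample = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}}"
    r" {\colortbl ;\red0\green0\blue0;}"
    r" {\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }"
)

stripped = RTF_MARKUP.sub("", sample)
print(repr(stripped))          # two leading spaces remain
print(repr(stripped.strip()))  # strip() removes them
```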
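Going through the RTF specification (some of it), I see that there are a lot of pitfalls for purely regex-based strippers. The most obvious one is that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).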
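A small experiment makes this pitfall concrete. The two inputs below are hypothetical, hand-written RTF fragments (not from the question): with the regex from this answer, the text of a formatting group is dropped, while part of a header leaks into the output.

```python
import re

# The regex-only stripper from this answer.
RTF_MARKUP = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")


def regex_strip(rtf):
    return RTF_MARKUP.sub("", rtf)


# A formatting group should be rendered, but the whole group is removed.
lost = regex_strip(r"{\rtf1 Some {\b bold} text\par}")
print(repr(lost))

# A header should be ignored, but its text outside the inner group survives.
leaked = regex_strip(r"{\rtf1{\header Page {\b 1} of 3}Body\par}")
print(repr(leaked))
```

The regex cannot tell the two kinds of group apart because it never looks at which control word opened the group.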
I have written a Python script, the striprtf function listed in this post, that should work better than my regex above.
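It works by parsing the RTF code and skipping any group that has a "destination" specified, as well as all "ignorable" groups ({\* ...}). I also added handling of some special characters.
There are lots of features missing to make this a full parser, but it should be enough for simple documents.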
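The core idea, a stack that saves and restores the "ignorable" flag at group boundaries, can be shown in a much reduced form. This is only a sketch: `visible_text`, `TOKEN` and the tiny `DESTINATIONS` set are assumptions made for the example, and it leaves out the Unicode, hex-escape and special-character handling of the full script.

```python
import re

# Hypothetical, heavily reduced destination set; the full script lists many more.
DESTINATIONS = frozenset({"fonttbl", "colortbl", "generator", "info", "header", "footer"})

# Control word | control symbol | brace | any other character.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\(.)|([{}])|(.)", re.I | re.S)


def visible_text(rtf):
    stack = []         # saved "ignorable" flags of the enclosing groups
    ignorable = False  # whether the current group is being skipped
    out = []
    for match in TOKEN.finditer(rtf):
        word, _arg, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(ignorable)   # entering a group: remember outer state
        elif brace == "}":
            ignorable = stack.pop()   # leaving a group: restore outer state
        elif symbol == "*":
            ignorable = True          # {\* ...} marks an ignorable group
        elif symbol:
            if not ignorable and symbol in "{}\\":
                out.append(symbol)    # escaped literal brace or backslash
        elif word:
            if word in DESTINATIONS:
                ignorable = True
            elif word == "par" and not ignorable:
                out.append("\n")
        elif char and not ignorable and char not in "\r\n":
            out.append(char)
    return "".join(out)


sample = r"{\rtf1{\fonttbl{\f0 Arial;}}{\*\generator Foo;}\f0 Hello\par}"
print(repr(visible_text(sample)))
```

Because the flag is restored on every closing brace, text after an ignored group is rendered again without any extra bookkeeping.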
Update: the following URL has this script updated to run on Python 3.x:
https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676
import re

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # Control words which specify a "destination".
    destinations = frozenset((
        'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
        'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
        'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
        'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
        'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
        'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
        'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
        'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
        'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
        'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
        'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
        'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
        'listoverridetable','listpicture','liststylename','listtable','listtext',
        'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
        'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
        'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
        'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
        'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
        'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
        'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
        'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
        'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
        'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
        'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
        'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
        'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
        'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
        'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
        'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
        'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
        'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
        'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
        'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
        'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
        'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
        'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
        'svb','tc','template','themedata','title','txe','ud','upr','userprops',
        'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
        'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
        'xmlopen',
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': '\u2014',
        'endash': '\u2013',
        'emspace': '\u2003',
        'enspace': '\u2002',
        'qmspace': '\u2005',
        'bullet': '\u2022',
        'lquote': '\u2018',
        'rquote': '\u2019',
        'ldblquote': '\u201C',  # was '\201C' (missing "u"), which is not a left double quote
        'rdblquote': '\u201D',
    }
    stack = []
    ignorable = False  # Whether this group (and all inside it) are "ignorable".
    ucskip = 1  # Number of ASCII characters to skip after a unicode character.
    curskip = 0  # Number of ASCII characters left to skip.
    out = []  # Output buffer.
    for match in pattern.finditer(text):
        # "hexcode" avoids shadowing the built-in hex().
        word, arg, hexcode, char, brace, tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip, ignorable))
            elif brace == '}':
                # Pop state
                ucskip, ignorable = stack.pop()
        elif char:  # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append('\xA0')
            elif char in '{}\\':
                if not ignorable:
                    out.append(char)
            elif char == '*':
                ignorable = True
        elif word:  # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                if c < 0:
                    c += 0x10000
                # Python 3: chr() covers all code points (the Python 2 original
                # used unichr() for values above 127).
                out.append(chr(c))
                curskip = ucskip
        elif hexcode:  # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hexcode, 16)
                out.append(chr(c))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)