Problem description
I am having trouble with the regex in the following code:
import mechanize
import re
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open("http://www.gfsc.gg/The-Commission/Pages/Regulated-Entities.aspx?auto_click=1")
html = response.read()
br.select_form(nr=0)
#print br.form
br.set_all_readonly(False)
next = re.search(r"""<a href="javascript:__doPostBack('(.*?)','(.*?)')">""",html)
if next:
    print 'group(1):', next.group(1)
    print 'group(2):', next.group(2)
If the single quotes around both instances of (.*?) are removed from the regex, these are the results:
group(1): ('ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee','Page$2')
group(2): ('ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee'
These results are not quite right. The parentheses and single quotes need to be removed (not my question), and I would like group(1) and group(2) to look like this:
group(1): ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee
group(2): Page$2
Recommended answer
You need to escape the literal parentheses, since unescaped parentheses have a special meaning in a regex (they delimit capture groups):
<a href="javascript:__doPostBack\('(.*?)','(.*?)'\)">
                           HERE ^^          HERE ^^
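As a minimal sketch of the escaped pattern in action, the snippet below runs it against a hard-coded string instead of the live page. The sample markup is an assumption built from the values shown in the question, and the code is written for Python 3 (the question's code is Python 2), so `print` is a function here.

```python
import re

# Hypothetical sample markup standing in for the downloaded page;
# the values are copied from the output shown in the question.
html = """<a href="javascript:__doPostBack('ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee','Page$2')">2</a>"""

# \( and \) match literal parentheses; the unescaped pairs are capture groups.
pattern = re.compile(r"""<a href="javascript:__doPostBack\('(.*?)','(.*?)'\)">""")

match = pattern.search(html)
if match:
    print('group(1):', match.group(1))
    print('group(2):', match.group(2))
```

With the parentheses escaped, the single quotes stay outside the capture groups, so the groups hold only the postback target and the argument.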
Note that, ideally, you should not parse HTML with regex (although your pattern is quite specific, so I don't think it is that bad here). Instead, parse the HTML with, say, BeautifulSoup, locate the a element, get its href attribute value, and then extract the desired substrings with regex.
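The parse-first approach can be sketched roughly as follows. To keep the example free of third-party dependencies it swaps in the standard library's `html.parser` for BeautifulSoup; with BeautifulSoup the link lookup would be something like `soup.find_all('a', href=True)`. The sample markup, the `PostBackLinks` class, and the `POSTBACK_RE` name are all illustrative assumptions, and the code targets Python 3.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the downloaded page.
html = """
<table><tr>
  <td><span>1</span></td>
  <td><a href="javascript:__doPostBack('ctl00$ctl20$g_af5ce308_e786_4734_ad4c_9829087cffbd$ctl00$gvWebLicensee','Page$2')">2</a></td>
</tr></table>
"""

# Applied to the href value only, so no HTML syntax appears in the pattern.
POSTBACK_RE = re.compile(r"__doPostBack\('(.*?)','(.*?)'\)")


class PostBackLinks(HTMLParser):
    """Collect (target, argument) pairs from __doPostBack links."""

    def __init__(self):
        super().__init__()
        self.postbacks = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get('href') or ''
        match = POSTBACK_RE.search(href)
        if match:
            self.postbacks.append(match.groups())


parser = PostBackLinks()
parser.feed(html)
for target, argument in parser.postbacks:
    print('target:', target)
    print('argument:', argument)
```

Because the parser hands over the attribute value on its own, the regex no longer depends on attribute order, spacing, or quoting style inside the tag.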