Question
I am trying to parse and submit a form on a website using mechanize, but the built-in form parser does not seem to detect the form or its elements. I suspect it is choking on poorly formed HTML, and I'd like to try pre-parsing the page with a parser better designed to handle bad HTML (say, lxml or BeautifulSoup) and then feeding the prettified, cleaned-up output to the form parser. I need mechanize not only for submitting the form but also for maintaining the session (I'm submitting this form from within a login session).
I'm not sure how to go about doing this, if it is indeed possible. I'm not very familiar with the details of the HTTP protocol or how to get the various parts to work together. Any pointers?
Answer
From the mechanize website:
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
So it seems entirely possible to preprocess the response with another parser that regenerates well-formed HTML, and then feed it back to mechanize for further processing.
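For concreteness, here is a minimal sketch of that approach, assuming BeautifulSoup 4 (bs4) is installed; the URL, form index, and field name below are placeholders, not details from the original question:

import mechanize
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is available

br = mechanize.Browser()
# ... log in first; session cookies are stored on the Browser object ...
br.open("https://example.com/form-page")  # placeholder URL

# Take a copy of the response and run its body through BeautifulSoup,
# which tolerates malformed markup and can serialize it back out cleaned up.
response = br.response()
soup = BeautifulSoup(response.get_data(), "html.parser")
response.set_data(soup.prettify().encode("utf-8"))

# Hand the cleaned copy back to the Browser. Cookies live on the Browser
# itself, not on the response, so the login session is unaffected.
br.set_response(response)

# mechanize's form parser now runs over well-formed HTML.
br.select_form(nr=0)        # placeholder: select the first form on the page
br["some_field"] = "value"  # placeholder field name
br.submit()

The same pattern works with lxml if you prefer it: parse the raw body with lxml.html.fromstring() and serialize it back with lxml.html.tostring() before calling set_data().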