In [1]: from lxml import etree
I have an HTML document:
In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser())
Its doctype is parsed correctly:
In [3]: root.getroottree().docinfo.doctype
Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">'
But when I serialize the document, the doctype is lost:
In [4]: etree.tostring(root.getroottree(), method='html')
Out[4]: '<html></html>'
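What do I have to do to get the doctype serialized?
Debian GNU/Linux, Sid. Python 2.6.6. lxml 2.2.8-2.
Best answer
So far, the only way I have been able to get this to work is to use the default XML parser and add a non-empty system URL to the doctype: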
>>> from StringIO import StringIO  # Python 2, as in the question; on Python 3 use io.StringIO
>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''))
>>> etree.tostring(html, method="xml")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML/>'
>>> etree.tostring(html, method="html")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'
Using HTMLParser yields the same docinfo, but not the desired output:
>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''), etree.HTMLParser())
>>> etree.tostring(html, method="html")
'<html></html>'
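As an alternative to changing the parser, the parsed doctype can be passed back in explicitly at serialization time. The following is only a minimal sketch, not part of the original answer: it assumes an lxml release whose `etree.tostring()` accepts the `doctype` keyword argument, which the lxml 2.2.8 mentioned in the question may not support. The helper name `serialize_with_doctype` is hypothetical.

```python
from lxml import etree


def serialize_with_doctype(root):
    """Serialize an HTML root element, prepending the doctype stored in docinfo.

    Hypothetical helper. Assumes this lxml version supports the `doctype`
    keyword of etree.tostring().
    """
    doctype = root.getroottree().docinfo.doctype
    # Serialize the root element rather than the tree, so that only the
    # explicitly passed doctype string is written in front of the markup.
    # An empty doctype string is mapped to None, meaning "no doctype".
    return etree.tostring(root, method="html", doctype=doctype or None)


source = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>'
root = etree.fromstring(source, etree.HTMLParser())
result = serialize_with_doctype(root)
print(result)
```

The result is a byte string because no `encoding` argument is given. Passing the root element instead of the tree is a deliberate choice here: it keeps the serializer from also writing whatever doctype the tree itself carries.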
This content is based on the Stack Overflow question "python - lxml, missing doctype when serializing": https://stackoverflow.com/questions/3916766/