Problem description
I'm currently a bit out of ideas, and I really hope that you can give me a hint. It's probably best to explain my question with a small piece of sample code:
from lxml import etree
from io import StringIO
testStr = "<b>text0<i>text1</i><ul><li>item1</li><li>item2</li></ul>text2<b/><b>sib</b>"
parser = etree.HTMLParser()
# generate html tree
htmlTree = etree.parse(StringIO(testStr), parser)
print(etree.tostring(htmlTree, pretty_print=True).decode("utf-8"))
bElem = htmlTree.getroot().find("body/b")
print(".text only contains the first part: "+bElem.text+ " (which makes sense in some way)")
for text in bElem.itertext():
    print(text)
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<b>text0<i>text1</i><ul><li>item1</li><li>item2</li></ul>text2<b/><b>sib</b></b>
</body>
</html>
.text only contains the first part: text0 (which makes sense in some way)
text0
text1
item1
item2
text2
sib
My question:
I would like to access "text2" directly, or get a list of all text parts, including only the ones that can be found in the parent tag. So far I have only found itertext(), which does display "text2".
Is there any other way I could retrieve "text2"?
Now you might be asking why I need this. Basically, itertext() already does pretty much what I want:
- Create a list that contains all text found in an element's children
- However, I want to process any tables and lists that are encountered with a different function (which subsequently creates a list structure like this:
["text0 text1",["item1","item2"],"text2"]
or, for a table (row 1 with 1 column, row 2 with 2 columns): ["1. row, 1 col",["2. row, 1. col","2. row, 2.col"]]
)
Maybe I'm taking a completely wrong approach?
Recommended answer
You could just reimplement the itertext() function and insert special handlers for ul and table if necessary:
from lxml import html

def itertext(root, handlers=dict(ul=lambda el: (list(el.itertext()),
                                                el.tail))):
    if root.text:
        yield root.text
    for el in root:
        yield from handlers.get(el.tag, itertext)(el)
    if root.tail:
        yield root.tail

print(list(itertext(html.fromstring(
    "<b>text0<i>text1</i><ul><li>item1</li>"
    "<li>item2</li></ul>text2<b/><b>sib</b>"))))
Output
['text0', 'text1', ['item1', 'item2'], 'text2', 'sib']
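The generator above works because lxml keeps the text that follows a child's closing tag in that child's .tail attribute, so "text2" is simply the tail of the <ul> element. A minimal sketch of reading it directly, assuming well-formed input so that the standard-library ElementTree can act as a fallback when lxml is not installed; direct_text is a hypothetical helper name, not part of lxml:

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib ElementTree exposes the same .text/.tail model
    import xml.etree.ElementTree as etree


def direct_text(el):
    """Return only the text parts that sit directly inside el (hypothetical helper)."""
    parts = [el.text] + [child.tail for child in el]
    return [p for p in parts if p]  # drop None and empty strings


b = etree.fromstring(
    "<b>text0<i>text1</i><ul><li>item1</li>"
    "<li>item2</li></ul>text2</b>")

ul_tail = b.find("ul").tail   # the text right after </ul>
own_parts = direct_text(b)    # text of <b> itself, nothing from <i> or <ul>
print(ul_tail, own_parts)
```

With lxml specifically, b.xpath("text()") should return the same direct text nodes without a helper.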
Note: on Python versions older than 3.3, yield from X can be replaced by for x in X: yield x.
To join adjacent strings:
def joinadj(iterable, join=' '.join):
    adj = []
    for item in iterable:
        if isinstance(item, str):
            adj.append(item)  # save for later
        else:
            if adj:  # yield items accumulated so far
                yield join(adj)
                del adj[:]  # remove yielded items
            yield item  # not a string, yield as is
    if adj:  # yield the rest
        yield join(adj)

print(list(joinadj(itertext(html.fromstring(
    "<b>text0<i>text1</i><ul><li>item1</li>"
    "<li>item2</li></ul>text2<b/><b>sib</b>")))))
Output
['text0 text1', ['item1', 'item2'], 'text2 sib']
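As an aside, the same grouping of adjacent strings can probably be expressed with itertools.groupby from the standard library. This is an alternative sketch, not the approach used above, and joinadj_groupby is a hypothetical name:

```python
from itertools import groupby


def joinadj_groupby(iterable, join=" ".join):
    """Join runs of adjacent strings; pass every other item through unchanged."""
    for is_text, run in groupby(iterable, key=lambda item: isinstance(item, str)):
        if is_text:
            yield join(run)   # one joined string per run of strings
        else:
            yield from run    # lists (or other non-strings) as they are


result = list(joinadj_groupby(
    ["text0", "text1", ["item1", "item2"], "text2", "sib"]))
print(result)
```

It trades the explicit accumulator for a key function; which version reads better is largely a matter of taste.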
To allow tables and nested lists inside <ul>, the handler should call itertext() recursively:
def ul_handler(el):
    yield list(itertext(el, with_tail=False))
    if el.tail:
        yield el.tail

def itertext(root, handlers=dict(ul=ul_handler), with_tail=True):
    if root.text:
        yield root.text
    for el in root:
        yield from handlers.get(el.tag, itertext)(el)
    if with_tail and root.tail:
        yield root.tail

print(list(joinadj(itertext(html.fromstring(
    "<b>text0<i>text1</i><ul><li>item1</li>"
    "<li>item2<ul><li>sub1<li>sub2</li></ul></ul>"
    "text2<b/><b>sib</b>")))))
Output
['text0 text1', ['item1', 'item2', ['sub1', 'sub2']], 'text2 sib']
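The question also asks for tables, which the answer mentions but does not show. A table handler could be registered next to the ul one along these lines. This is only a sketch under stated assumptions: table_handler and HANDLERS are hypothetical names, nested tables are not handled, the input is well-formed, and the standard-library ElementTree is used as a fallback when lxml is not installed:

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib ElementTree exposes the same .text/.tail model
    import xml.etree.ElementTree as etree


def itertext(root, handlers=None, with_tail=True):
    handlers = HANDLERS if handlers is None else handlers
    if root.text:
        yield root.text
    for el in root:
        yield from handlers.get(el.tag, itertext)(el)
    if with_tail and root.tail:
        yield root.tail


def ul_handler(el):
    yield list(itertext(el, with_tail=False))
    if el.tail:
        yield el.tail


def table_handler(el):
    """Hypothetical handler: one entry per <tr>; single-cell rows become a string."""
    rows = []
    for tr in el.iter("tr"):
        cells = ["".join(cell.itertext()) for cell in tr]
        rows.append(cells[0] if len(cells) == 1 else cells)
    yield rows          # the whole table as one nested list
    if el.tail:
        yield el.tail   # text following </table>


HANDLERS = dict(ul=ul_handler, table=table_handler)

doc = etree.fromstring(
    "<div>intro<table><tr><td>1. row, 1 col</td></tr>"
    "<tr><td>2. row, 1. col</td><td>2. row, 2. col</td></tr></table>outro</div>")
result = list(itertext(doc))
print(result)
```

The handler yields the table as a single nested list, mirroring how ul_handler treats a list, so joinadj() leaves it untouched while joining the surrounding strings.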