Problem description
Here is part of my XML file:
<a:p>
  <a:pPr lvl="2">
    <a:spcBef>
      <a:spcPts val="200" />
    </a:spcBef>
  </a:pPr>
  <a:r>
    <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" />
    <a:t>The</a:t>
  </a:r>
  <a:r>
    <a:rPr lang="en-US" sz="1400" dirty="0" />
    <a:t>world</a:t>
  </a:r>
  <a:r>
    <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" />
    <a:t>is small</a:t>
  </a:r>
</a:p>
<a:p>
  <a:pPr lvl="2">
    <a:spcBef>
      <a:spcPts val="200" />
    </a:spcBef>
  </a:pPr>
  <a:r>
    <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0" />
    <a:t>The</a:t>
  </a:r>
  <a:r>
    <a:rPr lang="en-US" sz="1400" dirty="0" b="0" />
    <a:t>world</a:t>
  </a:r>
  <a:r>
    <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0" />
    <a:t>is too big</a:t>
  </a:r>
</a:p>
I have written code using lxml to extract the text. However, because each sentence is split across several a:r runs, I want to join the pieces into a single sentence such as "The world is small...". Here is my code:
path4 = file.xpath('/p:sld/p:cSld/p:spTree/p:sp/p:txBody/a:p/a:r/a:rPr',
                   namespaces={'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
                               'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
if path4:
    for a in path4:
        if a.get('sz') == '1400' and a.xpath('node()') == [] and a.get('b') != '0':
            b = a.getparent()  # the a:r run that owns this a:rPr
            c = b.getparent()  # the a:p paragraph that owns the run
            d = c.xpath('./a:r/a:t/text()', namespaces={'p': 'http://schemas.openxmlformats.org/presentationml/2006/main', 'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
            print(''.join(d))
        elif a.get('sz') == '1400' and a.xpath('node()') == [] and a.get('b') == '0':
            b = a.getparent()
            c = b.getparent()
            d = c.xpath('./a:r/a:t/text()', namespaces={'p': 'http://schemas.openxmlformats.org/presentationml/2006/main', 'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'})
            print(''.join(d))
The output I get:
The world is small...
The world is small...
The world is small...
Expected output:
the world is small...
Any suggestions?
Recommended answer
You are building the sentence once for every a:rPr found in the loop, so each paragraph is printed as many times as it has runs.
Here's an example of what you should do instead:
test.xml:
<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
      xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main">
  <a:p>
    <a:pPr lvl="2">
      <a:spcBef>
        <a:spcPts val="200"/>
      </a:spcBef>
    </a:pPr>
    <a:r>
      <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/>
      <a:t>The</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US" sz="1400" dirty="0"/>
      <a:t>world</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/>
      <a:t>is small</a:t>
    </a:r>
  </a:p>
  <a:p>
    <a:pPr lvl="2">
      <a:spcBef>
        <a:spcPts val="200"/>
      </a:spcBef>
    </a:pPr>
    <a:r>
      <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/>
      <a:t>The</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US" sz="1400" dirty="0" b="0"/>
      <a:t>world</a:t>
    </a:r>
    <a:r>
      <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" b="0"/>
      <a:t>is too big</a:t>
    </a:r>
  </a:p>
</body>
test.py:
from lxml import etree

tree = etree.parse('test.xml')
NAMESPACES = {'p': 'http://schemas.openxmlformats.org/presentationml/2006/main',
              'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# Select each paragraph (a:p) once, instead of each a:rPr
path = tree.xpath('/body/a:p', namespaces=NAMESPACES)
for outer_item in path:
    parts = []
    # Collect the text of every run (a:r) that has an a:rPr in this paragraph
    for item in outer_item.xpath('./a:r/a:rPr', namespaces=NAMESPACES):
        parts.append(item.getparent().xpath('./a:t/text()', namespaces=NAMESPACES)[0])
    # Print one joined sentence per paragraph
    print(" ".join(parts))
Output:
The world is small
The world is too big
So, just loop over the a:p items, collect the text of each run into parts, and print the joined result once after processing each a:p. I've removed the if statements for clarity.
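If you still need the attribute checks from the question, they can go inside the inner loop. The following is only a sketch under a few assumptions: the function name paragraph_sentences and its size and skip_bold_off parameters are made up for illustration, the XML is a trimmed inline copy of test.xml, and it uses find/findall rather than xpath so that it also runs with the standard library's xml.etree.ElementTree when lxml is not installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback; find/findall/get behave the same for this use
    import xml.etree.ElementTree as etree

NS = {'a': 'http://schemas.openxmlformats.org/drawingml/2006/main'}

# Trimmed inline copy of test.xml (only the attributes the filter looks at)
XML = """<body xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
  <a:p>
    <a:r><a:rPr sz="1400"/><a:t>The</a:t></a:r>
    <a:r><a:rPr sz="1400"/><a:t>world</a:t></a:r>
    <a:r><a:rPr sz="1400"/><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr sz="1400" b="0"/><a:t>The</a:t></a:r>
    <a:r><a:rPr sz="1400" b="0"/><a:t>world</a:t></a:r>
    <a:r><a:rPr sz="1400" b="0"/><a:t>is too big</a:t></a:r>
  </a:p>
</body>"""


def paragraph_sentences(root, size='1400', skip_bold_off=True):
    """Return one joined sentence per a:p, using only runs that pass the filter.

    Hypothetical helper: `size` and `skip_bold_off` are illustrative names.
    """
    sentences = []
    for paragraph in root.findall('a:p', NS):
        parts = []
        for run in paragraph.findall('a:r', NS):
            props = run.find('a:rPr', NS)
            text = run.find('a:t', NS)
            # Skip runs with no properties or no text
            if props is None or text is None or not text.text:
                continue
            # Keep only runs with the requested font size
            if props.get('sz') != size:
                continue
            # Optionally drop runs that carry b="0"
            if skip_bold_off and props.get('b') == '0':
                continue
            parts.append(text.text)
        # Emit the sentence once per paragraph, not once per run
        if parts:
            sentences.append(' '.join(parts))
    return sentences


root = etree.fromstring(XML)
for sentence in paragraph_sentences(root):
    print(sentence)
```

With the default arguments only the first paragraph passes the filter. Calling paragraph_sentences(root, skip_bold_off=False) keeps both paragraphs. Either way the filtering happens per run while the join happens per paragraph, which is what avoids the repeated lines.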
Hope that helps.