Problem description
I am trying to extract text from an HTML file. The HTML
file looks like this:
<li class="toclevel-1 tocsection-1">
<a href="#Baden-Württemberg"><span class="tocnumber">1</span>
<span class="toctext">Baden-Württemberg</span>
</a>
</li>
<li class="toclevel-1 tocsection-2">
<a href="#Bayern">
<span class="tocnumber">2</span>
<span class="toctext">Bayern</span>
</a>
</li>
<li class="toclevel-1 tocsection-3">
<a href="#Berlin">
<span class="tocnumber">3</span>
<span class="toctext">Berlin</span>
</a>
</li>
I want to extract the text from the last span tag in each list item.
For the first item that would be "Baden-Württemberg", the text inside the span with class="toctext".
I then want to put these strings into a Python list.
In Python I tried the following:
names = soup.find_all("span",{"class":"toctext"})
My output is this list:
[<span class="toctext">Baden-Württemberg</span>, <span class="toctext">Bayern</span>, <span class="toctext">Berlin</span>]
So how can I extract only the text between the tags?
Thanks, everyone.
Recommended answer
The find_all method returns a list. Iterate over the list to get the text of each element.
for name in names:
    print(name.text)
This prints:
Baden-Württemberg
Bayern
Berlin
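Since the question asks for a Python list rather than printed output, the loop above can also be written as a list comprehension. The following is a minimal sketch, assuming the HTML from the question is held in a string (the variable names html and name_list are illustrative, not from the original post) and parsed with Python's built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; `html` is an illustrative name.
html = """
<li class="toclevel-1 tocsection-1">
<a href="#Baden-Württemberg"><span class="tocnumber">1</span>
<span class="toctext">Baden-Württemberg</span>
</a>
</li>
<li class="toclevel-1 tocsection-2">
<a href="#Bayern">
<span class="tocnumber">2</span>
<span class="toctext">Bayern</span>
</a>
</li>
<li class="toclevel-1 tocsection-3">
<a href="#Berlin">
<span class="tocnumber">3</span>
<span class="toctext">Berlin</span>
</a>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
names = soup.find_all("span", {"class": "toctext"})

# Collect the text of each matching span into a plain list of strings.
name_list = [name.get_text(strip=True) for name in names]
print(name_list)
```

Here get_text(strip=True) is used instead of .text so that any surrounding whitespace is trimmed; for this markup both give the same strings.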
The built-in Python functions dir() and type() are always handy for inspecting an object.
print(dir(names))
[...,
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'append',
'clear',
'copy',
'count',
'extend',
'index',
'insert',
'pop',
'remove',
'reverse',
'sort',
'source']
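The dir() listing shows the usual list methods plus a source attribute, which suggests that find_all returns a list subclass rather than a plain list. As a small sketch of the type() half of that advice, assuming a one-element document built only for illustration (the names result_type, is_list, and element_type are hypothetical):

```python
from bs4 import BeautifulSoup

# A tiny document, just enough to get one matching element back.
soup = BeautifulSoup('<span class="toctext">Bayern</span>', "html.parser")
names = soup.find_all("span", {"class": "toctext"})

# Inspect the container returned by find_all and the elements inside it.
result_type = type(names).__name__
is_list = isinstance(names, list)
element_type = type(names[0]).__name__
print(result_type, is_list, element_type)
```

Knowing that the container behaves like a list and that each element is a tag object explains why the text has to be read from each element individually.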