Problem description
I am having trouble getting data from a website. The page source is here:
view-source:http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO
There is something like this in it:
I want to extract the data from this page into a Python list of strings:
[[Tytuł, "La mer à boire"]
[Ocena, "IMDB - 6.3/10 (24)"]
[Produkcja, Francja]
[Gatunek, Dramat]
[Czas trwania, 98 min.]
[Premiera, "22.02.2012 - Świat"]
[Reżyseria, "Jacques Maillot"]
[Scenariusz, "Pierre Chosson, Jacques Maillot"]
[Aktorzy, "Daniel Auteuil, Maud Wyler, Yann Trégouët, Alain Beigel"]]
I wrote some code using BeautifulSoup, but I can't get any further. I don't know how to get the rest of the data from the page source or how to convert it to strings. Please help!
My code:
# -*- coding: utf-8 -*-
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

try:
    web_page = urllib2.urlopen("http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011%29+FRENCH.DVDRip.XviD-AYMO").read()
    soup = BeautifulSoup(web_page)
    c = soup.find('span', {'class': 'vi'}).contents
    print(c)
except urllib2.HTTPError:
    print("HTTPERROR!")
except urllib2.URLError:
    print("URLERROR!")
Recommended answer
The secret of using BeautifulSoup is to find the hidden patterns in your HTML document. For example, your loop
for ul in soup.findAll('p'):
    print(ul)
is in the right direction, but it returns all paragraphs, not only the ones you are looking for. The paragraphs you want, however, have the helpful property of having the class i. Inside each of these paragraphs there are two spans, one with the class i and another with the class vi. We are lucky, because those spans contain the data you are looking for:
<p class="i">
<span class="i">Tytuł............................................</span>
<span class="vi">: La mer à boire</span>
</p>
So, first get all the paragraphs with the given class:
>>> ps = soup.findAll('p', {'class': 'i'})
>>> ps
[<p class="i"><span class="i">Tytuł... <LOTS OF STUFF> ...pan></p>]
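As a self-contained sketch of this step, the snippet below runs the same query against a small inline HTML string instead of the live page. The sample markup is made up to mirror the structure shown above, `find_all` is the current bs4 spelling of `findAll`, and the `html.parser` argument is an assumption chosen so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure of the page; not the real source.
SAMPLE_HTML = """
<p>Unrelated paragraph</p>
<p class="i"><span class="i">Produkcja.....</span><span class="vi">: Francja</span></p>
<p class="i"><span class="i">Gatunek.......</span><span class="vi">: Dramat</span></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only paragraphs whose class is "i" are returned; the plain <p> is skipped.
ps = soup.find_all("p", {"class": "i"})
print(len(ps))
```

Working on an inline string like this makes it easier to experiment with the query before pointing it at the real page.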
Now, using a list comprehension, we can generate a list of pairs, where each pair contains the first and the second span of a paragraph:
>>> spans = [(p.find('span', {'class': 'i'}), p.find('span', {'class': 'vi'})) for p in ps]
>>> spans
[(<span class="i">Tyt... ...</span>, <span class="vi">: La mer à boire</span>),
(<span class="i">Ocena... ...</span>, <span class="vi">: IMDB - 6.3/10 (24)</span>),
(<span class="i">Produkcja.. ...</span>, <span class="vi">: Francja</span>),
# and so on
]
Now that we have the spans, we can get the text from them:
>>> texts = [(span_i.text, span_vi.text) for span_i, span_vi in spans]
>>> texts
[(u'Tytu\u0142............................................', u': La mer \xe0 boire'),
(u'Ocena.............................................', u': IMDB - 6.3/10 (24)'),
(u'Produkcja.........................................', u': Francja'),
# and so on
]
These texts are still not quite right, but they are easy to correct. To remove the dots from the first one, we can use rstrip():
>>> u'Produkcja.........................................'.rstrip('.')
u'Produkcja'
The leading ': ' of the second one can be removed with lstrip():
>>> u': Francja'.lstrip(': ')
u'Francja'
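One detail worth keeping in mind is that rstrip() and lstrip() treat their argument as a set of characters to remove, not as a literal suffix or prefix. The short sketch below uses made-up values in the same shape as the texts above to show how that differs from replace():

```python
# Made-up values in the same shape as the scraped texts.
label = "Produkcja.........."
value = ": Francja"
no_space = ":Obejrzyj zwiastun"
inner = ": Note: draft"

# rstrip('.') removes every trailing '.', however many there are.
print(label.rstrip("."))            # Produkcja

# lstrip(': ') removes any leading run of ':' and ' ' characters.
print(value.lstrip(": "))           # Francja
print(no_space.lstrip(": "))        # Obejrzyj zwiastun
print(inner.lstrip(": "))           # Note: draft

# replace(': ', '') removes only the exact two-character sequence,
# and it removes it everywhere in the string, not just at the start.
print(no_space.replace(": ", ""))   # :Obejrzyj zwiastun
print(inner.replace(": ", ""))      # Notedraft
```

Which of the two is appropriate depends on whether the values themselves can contain a colon followed by a space.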
To apply this to all the content, we just need another list comprehension:
>>> result = [(text_i.rstrip('.'), text_vi.replace(': ', '')) for text_i, text_vi in texts]
>>> result
[(u'Tytu\u0142', u'La mer \xe0 boire'),
(u'Ocena', u'IMDB - 6.3/10 (24)'),
(u'Produkcja', u'Francja'),
(u'Gatunek', u'Dramat'),
(u'Czas trwania', u'98 min.'),
(u'Premiera', u'22.02.2012 - \u015awiat'),
(u'Re\u017cyseria', u'Jacques Maillot'),
(u'Scenariusz', u'Pierre Chosson, Jacques Maillot'),
(u'Aktorzy', u'Daniel Auteuil, Maud Wyler, Yann Tr\xe9gou\xebt, Alain Beigel'),
(u'Wi\u0119cej na', u':'),
(u'Trailer', u':Obejrzyj zwiastun')]
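The steps above can be put together roughly as follows. This is only a sketch under a few assumptions: it targets Python 3, it parses a made-up inline sample rather than the live page, the function name extract_pairs is hypothetical, and it cleans the values with lstrip(': ') rather than the replace() call used in the final comprehension above. It also skips paragraphs where one of the spans is missing, since find() returns None in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the page structure; not the real source.
SAMPLE_HTML = """
<p class="i"><span class="i">Tytuł.........</span><span class="vi">: La mer à boire</span></p>
<p class="i"><span class="i">Produkcja.....</span><span class="vi">: Francja</span></p>
<p class="i"><span class="i">Trailer.......</span></p>
"""


def extract_pairs(html):
    """Return a list of [label, value] lists from <p class="i"> blocks."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for p in soup.find_all("p", {"class": "i"}):
        label = p.find("span", {"class": "i"})
        value = p.find("span", {"class": "vi"})
        # find() returns None when a span is missing, so skip incomplete rows.
        if label is None or value is None:
            continue
        pairs.append([
            label.get_text().rstrip("."),   # drop the trailing dot leaders
            value.get_text().lstrip(": "),  # drop the leading colon and space
        ])
    return pairs


result = extract_pairs(SAMPLE_HTML)
print(result)
```

Returning lists of two strings matches the shape asked for in the question; tuples, as in the interactive session above, would work equally well.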
And that is it. I hope this step-by-step example makes the use of BeautifulSoup clearer for you.