Problem description
I am trying to scrape this website using Scrapy. The page structure looks like this:
<div class="list">
<a id="follows" name="follows"></a>
<h4 class="li_group">Follows</h4>
<div class="soda odd"><a href="...">Star Trek</a></div>
<div class="soda even"><a href="...">Star Trek: The Animated Series</a></div>
<div class="soda odd"><a href="..">Star Trek: The Motion Picture</a></div>
<div class="soda even"><a href="..">Star Trek II: The Wrath of Khan</a></div>
<div class="soda odd"><a href="..">Star Trek III: The Search for Spock</a></div>
<div class="soda even"><a href="..">Star Trek IV: The Voyage Home</a></div>
<a id="followed_by" name="followed_by"></a>
<h4 class="li_group">Followed by</h4>
<div class="soda odd"><a href="..">Star Trek V: The Final Frontier</a></div>
<div class="soda even"><a href="..">Star Trek VI: The Undiscovered Country</a></div>
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div>
<div class="soda even"><a href="..">Star Trek: Generations</a></div>
<div class="soda odd"><a href="..">Star Trek: Voyager</a></div>
<div class="soda even"><a href="..">First Contact</a></div>
<a id="spin_off" name="spin_off"></a>
<h4 class="li_group">Spin-off</h4>
<div class="soda odd"><a href="..">Star Trek: The Next Generation - The Transinium Challenge</a></div>
<div class="soda even"><a href="..">A Night with Troi</a></div>
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div>
</div>
I want to select and extract the texts between: <h4 class="li_group">Follows</h4>
and <h4 class="li_group">Followed by</h4>
then texts between <h4 class="li_group">Followed by</h4>
and <h4 class="li_group">Spin-off</h4>
I tried this code:
def parse(self, response):
    for sel in response.css("div.list"):
        item = ImdbcoItem()
        item['Follows'] = sel.css("a#follows+h4.li_group ~ div a::text").extract()
        item['Followed_by'] = sel.css("a#followed_by+h4.li_group ~ div a::text").extract()
        item['Spin_off'] = sel.css("a#spin_off+h4.li_group ~ div a::text").extract()
        return item
But with this, the first item extracts all the divs, not just the divs between <h4 class="li_group">Follows</h4>
and <h4 class="li_group">Followed by</h4>
Any help would be really appreciated!
Recommended answer
An extraction pattern I like to use for these cases is:
- loop over the "boundaries" (here, h4 elements),
- while enumerating them starting from 1,
- using XPath's following-sibling axis, like in @Andersson's answer, to get elements before the next boundary,
- and filtering them by counting the number of preceding "boundary" elements, since we know from our enumeration where we are
This would be the loop:
$ scrapy shell 'http://www.imdb.com/title/tt0092455/trivia?tab=mc&ref_=tt_trv_cnn'
(...)
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...
1 Follows
2 Followed by
3 Edited into
4 Spun-off from
5 Spin-off
6 Referenced in
7 Featured in
8 Spoofed in
And this is one example of using the enumeration to get the elements between boundaries (note that this uses XPath variables, with $cnt in the expression and cnt=cnt passed to .xpath()):
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...     print(h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                    cnt=cnt).xpath(
...                        'string(.//a)').getall())
...
1 Follows
['Star Trek', 'Star Trek: The Animated Series', 'Star Trek: The Motion Picture', 'Star Trek II: The Wrath of Khan', 'Star Trek III: The Search for Spock', 'Star Trek IV: The Voyage Home']
2 Followed by
['Star Trek V: The Final Frontier', 'Star Trek VI: The Undiscovered Country', 'Star Trek: Deep Space Nine', 'Star Trek: Generations', 'Star Trek: Voyager', 'First Contact', 'Star Trek: Insurrection', 'Star Trek: Enterprise', 'Star Trek: Nemesis', 'Star Trek', 'Star Trek Into Darkness', 'Star Trek Beyond', 'Star Trek: Discovery', 'Untitled Star Trek Sequel']
3 Edited into
['Reading Rainbow: The Bionic Bunny Show', 'The Unauthorized Hagiography of Vincent Price']
4 Spun-off from
['Star Trek']
5 Spin-off
['Star Trek: The Next Generation - The Transinium Challenge', 'A Night with Troi', 'Star Trek: Deep Space Nine', "Star Trek: The Next Generation - Future's Past", 'Star Trek: The Next Generation - A Final Unity', 'Star Trek: The Next Generation: Interactive VCR Board Game - A Klingon Challenge', 'Star Trek: Borg', 'Star Trek: Klingon', 'Star Trek: The Experience - The Klingon Encounter']
6 Referenced in
(...)
Here's how you could use that to populate an item (here, I'm using a simple dict just for illustration):
>>> item = {}
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     key = h4.xpath('normalize-space()').get().strip()  # there are some non-breaking spaces
...     if key in ['Follows', 'Followed by', 'Spin-off']:
...         values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                           cnt=cnt).xpath(
...                               'string(.//a)').getall()
...         item[key] = values
...
>>> from pprint import pprint
>>> pprint(item)
{'Followed by': ['Star Trek V: The Final Frontier',
                 'Star Trek VI: The Undiscovered Country',
                 'Star Trek: Deep Space Nine',
                 'Star Trek: Generations',
                 'Star Trek: Voyager',
                 'First Contact',
                 'Star Trek: Insurrection',
                 'Star Trek: Enterprise',
                 'Star Trek: Nemesis',
                 'Star Trek',
                 'Star Trek Into Darkness',
                 'Star Trek Beyond',
                 'Star Trek: Discovery',
                 'Untitled Star Trek Sequel'],
 'Follows': ['Star Trek',
             'Star Trek: The Animated Series',
             'Star Trek: The Motion Picture',
             'Star Trek II: The Wrath of Khan',
             'Star Trek III: The Search for Spock',
             'Star Trek IV: The Voyage Home'],
 'Spin-off': ['Star Trek: The Next Generation - The Transinium Challenge',
              'A Night with Troi',
              'Star Trek: Deep Space Nine',
              "Star Trek: The Next Generation - Future's Past",
              'Star Trek: The Next Generation - A Final Unity',
              'Star Trek: The Next Generation: Interactive VCR Board Game - A '
              'Klingon Challenge',
              'Star Trek: Borg',
              'Star Trek: Klingon',
              'Star Trek: The Experience - The Klingon Encounter']}
>>>