Question
We are trying to get product URLs from this page of Forever 21's site (http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1). For some reason, BeautifulSoup is not getting the elements with class "item_pic", even though they appear in the page's HTML when we inspect it in a browser. We have tried requests, mechanize, and selenium, with no luck. All the commented-out code is from previous attempts to fetch the HTML (none of which worked). Here is our code:
from bs4 import BeautifulSoup
import urllib
import urllib2
import requests
#driver = webdriver.Firefox()
url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1"
#r = driver.get(url)
#html = r.read()
#headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#html = requests.get(url, headers=headers)
#response = opener.open(url)
#html = response.read()
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print soup
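A quick sanity check (a minimal sketch with requests, assuming the class name would appear verbatim in the served markup if it were present) is to test whether the static response contains the class at all:

import requests

url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1"
raw = requests.get(url).text
# False here would mean the product grid is generated client-side,
# so no plain HTTP fetch will ever contain the item_pic divs.
print "item_pic" in raw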
Any ideas?
Recommended Answer
In order to scrape the product URLs here, you need to use Selenium: the listing is rendered by JavaScript, so the elements are not in the HTML a plain HTTP fetch returns. The following code should give you the product links. It works by first getting the dynamically generated source through Selenium and then parsing the link in the first anchor of each "item_pic" div you specified.
from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1"

# Let Firefox execute the page's JavaScript, then grab the rendered source.
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.close()

soup = BeautifulSoup(html, "lxml")
itemList = soup.find_all('div', {'class': 'item_pic'})
for element in itemList:
    # element.a is the first <a> inside the div; its href is the product URL.
    print element.a['href']
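Note that page_source is read as soon as get() returns; Firefox blocks until the initial page load finishes, but if the product grid is injected afterwards, the parse can come up empty. A more defensive variant (a sketch using Selenium's explicit-wait helpers; the 20-second timeout is an arbitrary choice, not part of the original answer) waits for the first item_pic div before reading the source:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "http://www.forever21.com/Product/Category.aspx?br=f21&category=dress&pagesize=100&page=1"
driver = webdriver.Firefox()
try:
    driver.get(url)
    # Block (up to an arbitrary 20 s) until at least one product tile exists.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "item_pic")))
    html = driver.page_source
finally:
    driver.quit()  # quit() tears the browser down even if the wait times out

soup = BeautifulSoup(html, "lxml")
for element in soup.find_all('div', {'class': 'item_pic'}):
    print element.a['href']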