Problem Description
I'm new to web scraping, have had little exposure to HTML page structure, and wanted to know if there is a better, more efficient way to search for the required content in the HTML version of a web page. Currently, I want to scrape the reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem
For this, I have the following code:
import re
import sys
import urllib2

from bs4 import BeautifulSoup

url = 'http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem'
review_url = url
#print review_url

#-------------------------------------------------------------------------
# Scrape the ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True
while more:
    # Open the URL to get the review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
        elif hasattr(e, 'code'):
            if e.code == 404:
                print 'Error: ', e.code
                sys.exit()

    content = page.read()
    #print content
    soup = BeautifulSoup(content)
    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
With this, I'm not able to read anything; results comes back empty. I'm guessing I have to point it at a more precise location within the page. Any suggestions?
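As a quick sanity check (a minimal sketch, reusing the content string and the re import from the code above), you can confirm whether the star-rating markup exists in the static HTML at all; if the count is zero, the reviews are injected by JavaScript after the page loads, so urllib2 never sees them:

# Hedged check: count occurrences of the star-rating class pattern in the raw
# HTML returned by urllib2. A result of 0 means the review markup is not part
# of the static page and is added later by JavaScript/AJAX.
matches = re.findall(r's_star_\d_0', content)
print 'star-rating spans found in static HTML:', len(matches)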
Recommended Answer
I understand that the question was initially about BeautifulSoup, but since you haven't had any success using it in this particular situation, I suggest taking a look at selenium.
Selenium uses a real browser - you don't have to deal with parsing the results of ajax calls. For example, here's how you can get the list of review titles and ratings from the first reviews page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

# Each review sits in its own container; pull the title text and the
# "x out of 5" rating from the title attribute of the rating image.
for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()
It prints:
Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
clorox wipes 5 out of 5
...
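Because the reviews are rendered dynamically, the elements may not exist the instant the page is requested. Here is a minimal sketch of an explicit wait, keyed on the same BVRRReviewDisplayStyle3Main class as above, that blocks until at least one review container appears before you start reading titles and ratings:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for at least one review container to be rendered;
# raises a TimeoutException if nothing shows up in time.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'BVRRReviewDisplayStyle3Main')))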
Also, note that you can use the headless PhantomJS browser instead, for example.
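A minimal sketch of that option, assuming PhantomJS is installed and available on your PATH; the scraping logic stays the same, only the driver construction changes:

from selenium import webdriver

# PhantomJS is a headless browser: no window is opened, but JavaScript still
# runs, so the AJAX-loaded reviews are rendered as usual.
driver = webdriver.PhantomJS()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
print driver.title
driver.quit()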
Another option is to make use of …

Hope that helps.