I am trying to collect data from a web page that has a number of select lists I need to pull the data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
This is what I have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
for option in option_tags:
    open(url + option['value'])
html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)
all = soup.find('div', id="accordion")
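I am not sure whether I am on the right track, as all the select menus make it confusing. Basically, I need to grab all the data from the selected results, such as images, price, description, etc. They are all contained within one div tag that holds all the results, named "accordion". Would this still gather all the data, or would I need to dig deeper and search through the tags inside this div? I would also prefer to search by id rather than class, as I could fetch all the data in one go. How would I go about this from what I have above? Thanks. I am also unsure whether I am using the glob function correctly.
EDIT
Here is my edited code. No errors are returned, but I am not sure whether it returns all the models for the E series.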
import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url
##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags
for option in option_tags:
    url + option['redirectvalue']
    print " " + url + option['redirectvalue']
Best answer
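First, I would like to point out a few problems with the code you posted. First of all, the glob module is not generally used for making HTTP requests. It is useful for iterating over a subset of files on a specified path; you can read more about it in its docs.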
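To make the glob point concrete, here is a minimal sketch showing that glob only matches paths on the local filesystem. Given a URL pattern it simply finds nothing (assuming no local directory literally named "http:" exists), so the loop body would never run even with the indentation fixed. The temporary directory and file names are made up for illustration.

```python
import glob
import os
import tempfile

# glob expands wildcards against the local filesystem only, never over HTTP.
url_matches = glob.glob("http://www.asusparts.eu/partfinder/*")
print(url_matches)  # nothing on disk matches the pattern

# For comparison, the same call against a real (temporary) directory.
tmp_dir = tempfile.mkdtemp()
for name in ("a.html", "b.html", "notes.txt"):
    with open(os.path.join(tmp_dir, name), "w") as handle:
        handle.write("placeholder")

html_files = sorted(os.path.basename(path)
                    for path in glob.glob(os.path.join(tmp_dir, "*.html")))
print(html_files)
```

Fetching a remote page is a job for an HTTP client such as urllib2, not for glob.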
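The second problem is on this line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
You have an indentation error, because no indented code follows it. This raises an error and prevents the rest of the code from executing.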
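Another problem is that you are using some of Python's "reserved" names for your variables. Words such as "all" and "file" are built-ins, and assigning to them shadows the originals, so never use them as variable names.
Finally, when you loop over option_tags:
for option in option_tags:
    open(url + option['value'])
the open statement will try to open a local file whose path is url + option['value']. This will likely raise an error, since I doubt you have a file at that location. You should also be aware that nothing is done with this opened file.
OK, enough criticism. I took a look at the Asus page and I think I have an idea of what you want to accomplish. As I understand it, you want to scrape the parts list (images, text, price, etc.) for every computer model on the Asus page. Each model's parts list lives at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means you need to be able to build that unique URL for each model. To complicate matters, each parts category is loaded dynamically; for example, the parts for the "Cooling" section are not loaded until you click the "Cooling" link. This means we have a two-part problem: 1) get all the valid (brand, type, family, model) combinations, and 2) figure out how to load all the parts for a given model.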
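As a rough illustration of building the per-model URL, the sketch below percent-encodes each path segment on its own and then joins them. The helper name build_model_url is hypothetical and not part of the program in this answer. It also shows why the edited code in the question, which passes the whole URL to urllib.quote, ends up with an unusable address. The import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

BASE_URL = 'http://www.asusparts.eu/partfinder/'


def build_model_url(brand, type_, family, model):
    """Join the four path segments, percent-encoding each one separately."""
    segments = [quote(part, safe='') for part in (brand, type_, family, model)]
    return BASE_URL + '/'.join(segments)


model_url = build_model_url('Asus', 'Desktop', 'B Series', 'BM2220')
print(model_url)

# Quoting the complete URL also escapes the ':' after the scheme,
# so the result no longer starts with 'http://'.
broken_url = quote(BASE_URL + 'Asus/Desktop/B Series/BM2220')
print(broken_url)
```

Encoding segment by segment keeps the scheme and the slashes intact while still turning the spaces in names like "B Series" into %20.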
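I was kind of bored, so I decided to write a simple program to take care of most of the heavy lifting. It is not the most elegant thing, but it gets the job done. Step 1) is accomplished in get_model_information(). Step 2) is handled in parse_models(), but is a little less obvious. Looking at the Asus website, whenever you click a parts subsection, the JavaScript function getProductsBasedOnCategoryID() is run, which makes an ajax call to a formatted PRODUCTS_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2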
import json
import urlparse
from bs4 import BeautifulSoup
BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']
def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        # select is None when no element with this id exists on the page
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text
def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.
    These are all added as tuples (brand, type, family, model) to the list
    models.
    """
    model_info = []
    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')
    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')
        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')
            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')
                model_info.extend((brand, _type, family, m) for m in models)
    return model_info
def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """
    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                    accessory=accessory,
                                                    brand=brand))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...
def main():
    models = get_model_information()
    parse_models(models)

if __name__ == '__main__':
    main()
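To see which addresses parse_models() ends up requesting, the sketch below formats PRODUCTS_URL for a single model without touching the network. The helper product_urls is hypothetical. Percent-encoding the values is my assumption about what the API expects for names containing spaces; the program above passes the raw values. The model used here is the BM2220 example mentioned earlier.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

PRODUCTS_URL = ('http://json.zandparts.com/api/category/GetCategories/'
                '44/EUR/{model}/{family}/{accessory}/{brand}/null/')
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def product_urls(brand, family, model):
    """Return one request URL per accessory category for a single model."""
    return [PRODUCTS_URL.format(model=quote(model, safe=''),
                                family=quote(family, safe=''),
                                accessory=quote(accessory, safe=''),
                                brand=quote(brand, safe=''))
            for accessory in ACCESSORIES]


urls = product_urls('Asus', 'B Series', 'BM2220')
for url in urls:
    print(url)
```

Printing the URLs first makes it easy to paste one into a browser and inspect the JSON before writing any parsing code.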
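Finally, one last note. I tend to drop urllib2 in favor of the requests library. I personally think it offers more functionality and has better semantics, but use whichever you like.
Source: a similar question on Stack Overflow about python - obtaining data from a web page using a select menu with python and beautifulsoup: https://stackoverflow.com/questions/16017725/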