This article looks at "urllib2 & BeautifulSoup: nice couple but too slow - urllib3 & threads?". Hopefully it is a useful reference for anyone dealing with the same problem.

Problem Description

I was looking for a way to optimize my code when I heard some good things about threads and urllib3. Apparently, people disagree about which solution is best.

The problem with my script below is its execution time: it is so slow!

Step 1: I fetch this page
http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on

Step 2: I parse the page with BeautifulSoup

Step 3: I put the data into an Excel document

Step 4: I do it again, and again, and again for every country in my list (a big list).
(I just change 'Afghanistan' in the URL to another country.)

Here is my code:

import urllib
import xlwt
from BeautifulSoup import BeautifulSoup as soup  # assumed imports; the original snippet omits them

wb = xlwt.Workbook()   # assumed setup: wb and name_excel are defined earlier in the script
name_excel = 'bulats'  # hypothetical placeholder for the real output name

ws = wb.add_sheet('BULATS_IA')  # We add a new tab in the Excel doc
x = 0  # We need x and y to pull the data into the Excel doc
y = 0
Countries_List = ['Afghanistan', 'Albania', 'Andorra', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belgium', 'Belize', 'Bolivia', 'Bosnia and Herzegovina', 'Brazil', 'Brunei Darussalam', 'Bulgaria', 'Cameroon', 'Canada', 'Central African Republic', 'Chile', 'China', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt', 'Eritrea', 'Estonia', 'Ethiopia', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Polynesia', 'Georgia', 'Germany', 'Gibraltar', 'Greece', 'Grenada', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kuwait', 'Latvia', 'Lebanon', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macau', 'Macedonia', 'Malaysia', 'Maldives', 'Malta', 'Mexico', 'Monaco', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar (Burma)', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Palestine', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Saudi Arabia', 'Serbia', 'Singapore', 'Slovakia', 'Slovenia', 'South Africa', 'South Korea', 'Spain', 'Sri Lanka', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Thailand', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan', 'Venezuela', 'Vietnam']
Longueur = len(Countries_List)

for Countries in Countries_List:
    y = 0
    # We open the page whose URL matches the current country name
    htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()
    s = soup(htmlSource)
    tableGood = s.findAll('table')
    try:
        rows = tableGood[3].findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            y = 0
            x = x + 1
            for td in cols:
                hum = td.text
                ws.write(x, y, hum)
                y = y + 1
                wb.save("%s.xls" % name_excel)
    except IndexError:
        pass

So I know not everything is perfect, but I'm looking forward to learning new things in Python! The script is really slow, because urllib2 is not that fast, and neither is BeautifulSoup. For the soup part, I guess I can't really make it better, but for urllib2, I don't know.
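One aside about the snippet above, separate from the networking question: wb.save() sits in the innermost loop, so the whole workbook is rewritten to disk for every single cell. Whatever happens with urllib2, hoisting the save out of the loops is an easy win; a minimal sketch, using the same variables as above:

for Countries in Countries_List:
    # ... fetch, parse and ws.write(...) exactly as above ...
    pass

wb.save("%s.xls" % name_excel)  # one save at the end instead of one per cell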

EDIT 1:
Multiprocessing useless with urllib2?
It seems interesting in my case.
What do you think about this potential solution?!

# Make sure that the queue is thread-safe!!

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.enqueue(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.dequeue()
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh:  # gotta put it somewhere
            fh.write(dh.read())
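For completeness, here is a self-contained version of that producer/consumer idea using the standard library's thread-safe Queue (whose actual methods are put/get rather than enqueue/dequeue) and threading. The file name urllist.txt comes from the snippet above; the worker count is an assumption:

import threading
import urllib2
from Queue import Queue  # Python 2; thread-safe by design

queue = Queue()
NUM_CONSUMERS = 4  # hypothetical; tune to your bandwidth

def producer():
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            queue.put(line.strip())

def consumer():
    while True:
        url = queue.get()
        try:
            html = urllib2.urlopen(url).read()
            # ... parse/store html here ...
        finally:
            queue.task_done()  # lets queue.join() know this item is done

threads = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for t in threads:
    t.daemon = True  # workers won't block interpreter exit
    t.start()

producer()
queue.join()  # wait until every queued URL has been processed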

EDIT 2: URLLIB3
Can anyone tell me more about it?

Since I am requesting the same website 122 times for different pages, I guess reusing the same socket connection could be interesting, am I wrong? Can't it be faster? ...

import urllib3  # assumed import; Pages_List and soup() come from earlier in the script

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.bulats.org')
for Pages in Pages_List:
    r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
    s = soup(r.data)
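On the connection-reuse question: that is exactly what urllib3's pooling provides. A PoolManager keeps per-host connections alive, so 122 requests against www.bulats.org can share sockets instead of paying a TCP handshake each time. Since every request here targets a single host, a host-specific pool is an option too; a minimal sketch (the maxsize value is an assumption):

import urllib3

# One pool for the single host; up to 4 keep-alive connections get reused.
pool = urllib3.HTTPConnectionPool('www.bulats.org', maxsize=4)

r = pool.request('GET', '/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=0')
# r.status and r.data work as with PoolManager; subsequent requests
# on `pool` reuse the already-open sockets.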
Solution

Consider using something like workerpool. Referring to the Mass Downloader example, combined with urllib3 would look something like:

import workerpool
import urllib3

URL_LIST = [] # Fill this from somewhere

NUM_SOCKETS = 3
NUM_WORKERS = 5

# We want a few more workers than sockets so that they have extra
# time to parse things and such.

http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
workers = workerpool.WorkerPool(size=NUM_WORKERS)

class MyJob(workerpool.Job):
    def __init__(self, url):
       self.url = url

    def run(self):
        r = http.request('GET', self.url)
        # ... do parsing stuff here


for url in URL_LIST:
    workers.put(MyJob(url))

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
# (If you don't do this, the script might hang due to a rogue undead thread.)
workers.shutdown()
workers.wait()

You may note from the Mass Downloader examples that there are multiple ways of doing this. I chose this particular example just because it's less magical, but any of the other strategies are valid also.
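As one example of those other strategies: the same pattern can be written with the standard library's concurrent.futures (available on Python 2 as the futures backport, via pip install futures), which avoids the explicit shutdown dance. A rough equivalent of the example above, under those assumptions:

import urllib3
from concurrent.futures import ThreadPoolExecutor  # `pip install futures` on Python 2

URL_LIST = []  # Fill this from somewhere

http = urllib3.PoolManager(maxsize=3)  # shared across threads; urllib3 pools are thread-safe

def fetch(url):
    r = http.request('GET', url)
    # ... do parsing stuff here
    return r.status

# map() blocks until every URL has been fetched, so no manual shutdown is needed.
with ThreadPoolExecutor(max_workers=5) as pool:
    statuses = list(pool.map(fetch, URL_LIST))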

Disclaimer: I am the author of both urllib3 and workerpool.

