Problem description
I've written a script in Python, using pyppeteer together with asyncio, to scrape the links to different posts from a landing page and then get the title of each post by following the URL to its inner page. The content I parse here is not dynamic. However, I used pyppeteer and asyncio to see how efficiently the scraping performs asynchronously.
The following script runs fine for a few moments but then encounters an error:
  File "C:\Users\asyncio\tasks.py", line 526, in ensure_future
    raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required
This is what I have written so far:
import asyncio
from pyppeteer import launch

link = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    tasks = [await browse_all_links(link, page) for link in linkstorage]
    results = await asyncio.gather(*tasks)
    return results

async def browse_all_links(link, page):
    await page.goto(link)
    title = await page.querySelectorEval('.question-hyperlink','(e => e.innerText)')
    print(title)

async def main(url):
    browser = await launch(headless=True,autoClose=False)
    page = await browser.newPage()
    await fetch(page,url)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(main(link))
    loop.run_until_complete(future)
    loop.close()
My question: how can I get rid of that error and run the scraping asynchronously?
Recommended answer
The problem is in the following lines:
tasks = [await browse_all_links(link, page) for link in linkstorage]
results = await asyncio.gather(*tasks)
The intention is for tasks to be a list of awaitable objects, such as coroutine objects or futures. The list is then passed to gather so that the awaitables can run concurrently until they all complete. However, the list comprehension contains an await, which means that it:
- executes each browse_all_links call to completion in series rather than concurrently;
- places the return values of the browse_all_links invocations into the list.
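
To make the two effects in the list concrete, here is a minimal sketch that uses only asyncio. The coroutine visit and the events log are hypothetical stand-ins for browse_all_links and are not part of the original script; the sleep merely simulates waiting on a page load.

```python
import asyncio

events = []  # records the order in which the stand-in coroutines start and finish

async def visit(n):
    # Hypothetical stand-in for browse_all_links; sleep simulates a page load.
    events.append(f"start {n}")
    await asyncio.sleep(0.01)
    events.append(f"end {n}")
    return n * 10

async def demo():
    # await inside the comprehension: each call finishes before the next one
    # starts, and the list ends up holding return values, not awaitables.
    values = [await visit(n) for n in range(3)]
    serial_events = list(events)
    events.clear()

    # No await: the list holds coroutine objects, which gather runs concurrently.
    coros = [visit(n) for n in range(3)]
    gathered = await asyncio.gather(*coros)
    return values, serial_events, gathered, list(events)

values, serial_events, gathered, concurrent_events = asyncio.run(demo())
print(serial_events)
print(concurrent_events)
```

In the first log every "start" is immediately followed by its own "end"; in the second, all three coroutines start before any of them finishes.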
Since browse_all_links doesn't return a value, you are passing a list of None objects to asyncio.gather, which complains that it didn't get an awaitable object.
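
The error can be reproduced without a browser. The sketch below is illustrative only: visit is a hypothetical stand-in for browse_all_links that awaits something and returns nothing, so the comprehension produces a list of None values for gather to reject.

```python
import asyncio

async def visit(n):
    # Hypothetical stand-in for browse_all_links: awaits, then implicitly returns None.
    await asyncio.sleep(0.01)

async def broken():
    # Each await runs visit() to completion, so this is [None, None, None].
    results = [await visit(n) for n in range(3)]
    try:
        # gather receives None objects instead of awaitables and raises TypeError.
        await asyncio.gather(*results)
    except TypeError as exc:
        return results, str(exc)
    return results, None

results, message = asyncio.run(broken())
print(results)
print(message)
```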
To resolve the issue, just drop the await from the list comprehension.
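
A possible shape for the corrected code is sketched below. It goes one step beyond the answer, and that step is an assumption: once the coroutines really run concurrently, having all of them call goto on the same page object is likely to make the navigations interfere with each other, so this sketch opens a separate page per link. The names fetch_title and fetch_titles are hypothetical; browser is expected to be the object returned by pyppeteer's launch().

```python
import asyncio

async def fetch_title(browser, link):
    # Assumption: one page (tab) per link, so concurrent navigations do not collide.
    page = await browser.newPage()
    try:
        await page.goto(link)
        # Same selector and page function as in the question; return the title
        # so that gather can collect it.
        return await page.querySelectorEval('.question-hyperlink', '(e => e.innerText)')
    finally:
        await page.close()

async def fetch_titles(browser, links):
    # No await inside the comprehension: tasks is a list of coroutine objects.
    tasks = [fetch_title(browser, link) for link in links]
    # gather runs them concurrently and returns the titles in input order.
    return await asyncio.gather(*tasks)
```

Inside main from the question, this could be called as titles = await fetch_titles(browser, linkstorage) after the links have been collected. Opening one page per link at once may be heavy for long link lists, so limiting concurrency (for example with asyncio.Semaphore) might be worth considering.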