Problem description
I'm trying to render a JavaScript-driven web page into populated HTML for scraping. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it working. BTW, I am new to Python, basically at the cut/paste/experiment stage. I got past the installation and indentation issues, but I'm stuck now.
In the test code below, print(sample_html) works and returns the original HTML of the target page, but print(render(sample_html)) always returns the word "None".
Interestingly, if you run this on amazon.com, they detect that it is not a real browser and return HTML with a warning about automated access. However, the other test pages provide true HTML that should render, except it doesn't.
How do I troubleshoot the result always being "None"?
    def render(source_html):
        """Fully render HTML, JavaScript and all."""
        import sys
        from PyQt5.QtWidgets import QApplication
        from PyQt5.QtWebEngineWidgets import QWebEngineView

        class Render(QWebEngineView):
            def __init__(self, html):
                self.html = None
                self.app = QApplication(sys.argv)
                QWebEngineView.__init__(self)
                self.loadFinished.connect(self._loadFinished)
                self.setHtml(html)
                self.app.exec_()

            def _loadFinished(self, result):
                # This is an async call, you need to wait for this
                # to be called before closing the app
                self.page().toHtml(self.callable)

            def callable(self, data):
                self.html = data
                # Data has been stored, it's safe to quit the app
                self.app.quit()

        return Render(source_html).html

    import requests

    #url = 'http://webscraping.com'
    #url='http://www.amazon.com'
    url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
    sample_html = requests.get(url).text
    print(sample_html)
    print(render(sample_html))
Thanks for the responses, which I have incorporated into the code. But now it returns an error and the script hangs until I kill the Python launcher, which then causes a segfault:
Here is the revised code:
    def render(source_url):
        """Fully render HTML, JavaScript and all."""
        import sys
        from PyQt5.QtWidgets import QApplication
        from PyQt5.QtCore import QUrl
        from PyQt5.QtWebEngineWidgets import QWebEngineView

        class Render(QWebEngineView):
            def __init__(self, url):
                self.html = None
                self.app = QApplication(sys.argv)
                QWebEngineView.__init__(self)
                self.loadFinished.connect(self._loadFinished)
                # self.setHtml(html)
                self.load(QUrl(url))
                self.app.exec_()

            def _loadFinished(self, result):
                # This is an async call, you need to wait for this
                # to be called before closing the app
                self.page().toHtml(self._callable)

            def _callable(self, data):
                self.html = data
                # Data has been stored, it's safe to quit the app
                self.app.quit()

        return Render(source_url).html

    # url = 'http://webscraping.com'
    # url='http://www.amazon.com'
    url = "https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
    print(render(url))
It throws these errors:
    $ python3 -tt fees-pkg-v2.py
    Traceback (most recent call last):
      File "fees-pkg-v2.py", line 30, in _callable
        self.html = data
    AttributeError: 'method' object has no attribute 'html'
    None (hangs here until force-quit python launcher)
    Segmentation fault: 11
    $
I have already started reading up on Python classes to fully understand what I'm doing (always a good thing). I'm thinking something in my environment could be the problem (OS X Yosemite, Python 3.4.3, Qt 5.4.1, sip 4.16.6). Any other suggestions?
Recommended answer
The problem was the environment. I had manually installed Python 3.4.3, Qt 5.4.1, and sip 4.16.6 and must have mucked something up. After installing Anaconda, the script started working. Thanks again.
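For anyone hitting the same symptoms, a quick sanity check of the environment may help before digging into the code itself. The sketch below is not part of the original fix. It only reports which interpreter is running and whether the import system can locate the modules the script needs. The helper names `module_available` and `environment_report` are made up for illustration, and the default module list is an assumption based on the imports used in the question.

```python
import importlib.util
import platform
import sys


def module_available(name):
    """Return True if the import system can locate the module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # For dotted names, find_spec imports the parent package first,
        # which raises if that parent is missing or broken.
        return False


def environment_report(modules=("PyQt5", "PyQt5.QtWebEngineWidgets", "sip")):
    """Collect the interpreter details and the availability of each module."""
    report = {
        "python": platform.python_version(),
        "executable": sys.executable,
    }
    for name in modules:
        report[name] = module_available(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the reported executable is not the interpreter you expected, or one of the modules shows False, the installation is a more likely culprit than the rendering code. A module that is found can still be broken at runtime (for example, built against a different Qt), so a True here is only a first filter.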