Problem description
I am attempting to extract data from an HTML table, but the HTML doesn't seem to load correctly when I use requests.get(). Instead, a line in the source reads:
When I navigate to the page in Google Chrome, the HTML appears as it should.
How do I get a Python script to load the proper HTML?
Welcome to the wonderful world of web crawling. The problem you are experiencing is that requests.get() only gets you the initial page that the browser receives at the beginning of a page load. That is not the page you see in the browser, because a lot more can go into forming the final web page: JavaScript function calls, AJAX calls, etc.
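To make the difference concrete, here is a small offline sketch using only the standard library. Both HTML strings are made up for illustration (they are not taken from the page in the question): INITIAL_HTML stands in for what requests.get() might return, and RENDERED_HTML for what the browser holds after its scripts have run. The loadRows function named inside the script tag is hypothetical as well.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts the <tr> start tags found in an HTML string."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_rows(html):
    """Return the number of table rows present in the given markup."""
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical initial response: the table is empty and a script is
# expected to fill it in later. requests.get() never runs that script.
INITIAL_HTML = """
<html><body>
<table id="data"></table>
<script>
  // In a browser, this (hypothetical) call would fetch and insert the rows.
  loadRows("data");
</script>
</body></html>
"""

# Hypothetical markup as the browser holds it after the script has run.
RENDERED_HTML = """
<html><body>
<table id="data">
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>
</body></html>
"""

print(count_rows(INITIAL_HTML))   # 0
print(count_rows(RENDERED_HTML))  # 2
```

A parser pointed at the first string finds no rows at all, which matches the symptom in the question: the table data simply is not in the response that requests receives.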
If you want to programmatically get the HTML that the browser ends up with after the page has loaded (what you see in the developer tools, for instance) - you would need a real browser. This is where selenium could be a good option:
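from selenium import webdriver

browser = webdriver.Firefox()  # starts a real Firefox instance
browser.get(url)  # url: address of the page you want to scrape
print(browser.page_source)  # HTML of the page as the browser currently holds it
browser.quit()  # close the browser when done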
Note that selenium itself is very powerful in terms of locating elements - you don't need a separate HTML parser for extracting the data out of the page.
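As a rough sketch of what that could look like for a table, the snippet below collects the text of every cell, row by row, using selenium's own locators. The function names fetch_table_rows and cell_texts are made up for this example, and it assumes Selenium 4 with Firefox and its driver available; the selenium imports sit inside the function only so the helper can be used without a browser.

```python
def cell_texts(cells):
    """Return the stripped text of each element in cells."""
    return [cell.text.strip() for cell in cells]


def fetch_table_rows(url):
    """Open url in Firefox and return every table row as a list of cell texts.

    Hypothetical helper: it looks at all <tr> elements on the page, so a
    page with several tables would need a narrower locator.
    """
    # Imported here so the module loads even where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        rows = []
        for tr in browser.find_elements(By.TAG_NAME, "tr"):
            # Header cells (<th>) and data cells (<td>) are both collected.
            rows.append(cell_texts(tr.find_elements(By.CSS_SELECTOR, "th, td")))
        return rows
    finally:
        # Always close the browser, even if the page fails to load.
        browser.quit()
```

If the rows are filled in by an AJAX call some time after the page loads, an explicit wait (selenium's WebDriverWait) before reading the rows would likely be needed; that part is left out here.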
Hope that helps.