Why does requests.get() in Python retrieve different HTML than a browser?

Problem description

I am attempting to extract data from an HTML table, but it appears that the HTML isn't loading correctly when using requests.get(). Instead, a line in the source reads:

When I navigate to the page in Google Chrome, the HTML appears as it should.

How do I get a Python script to load the proper HTML?

Solution

Welcome to the wonderful world of web-crawling. The problem you are experiencing is that requests.get() only gets you the initial page that the browser receives at the beginning of a page load. That is often not the page you see in the browser, because a lot can be involved in building the final page: JavaScript function calls, AJAX calls, etc.
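To check whether this is what is happening on your page, you can count the table rows in the raw response and compare that with what the browser shows. The sketch below uses only the standard library for the parsing. The two HTML strings, the script path and the helper names (count_table_rows, fetch_initial_html) are made up for illustration, not taken from the page in the question.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


def count_table_rows(html: str) -> int:
    """Return how many <tr> elements the given HTML text contains."""
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_initial_html(url: str) -> str:
    """Return the raw HTML the server sends, before any JavaScript runs."""
    import requests  # third-party; imported here so the parser above needs only the stdlib

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


# Hypothetical initial response: the table is empty and a script fills it in later.
INITIAL_HTML = """
<html><body>
<table id="data"><tbody></tbody></table>
<script src="/static/fill_table.js"></script>
</body></html>
"""

# Hypothetical state of the same page after the script has run in a browser.
RENDERED_HTML = """
<html><body>
<table id="data"><tbody>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</tbody></table>
</body></html>
"""

print(count_table_rows(INITIAL_HTML), count_table_rows(RENDERED_HTML))
```

If count_table_rows(fetch_initial_html(your_url)) comes back as 0 while the browser shows a populated table, the rows are almost certainly being added by JavaScript after the initial load.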

If you want to programmatically get the HTML the browser ends up with after the page has loaded (what you see when you inspect the page, rather than the raw response), you need a real browser. This is where selenium could be a good option:

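from selenium import webdriver

url = "https://example.com/"  # placeholder: replace with the page you want to load

browser = webdriver.Firefox()  # needs Firefox installed locally
browser.get(url)
print(browser.page_source)  # the page's HTML as the browser currently has it
browser.quit()  # close the browser when done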
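One caveat: browser.get() can return before AJAX calls have finished, so page_source may still be missing the data if you read it straight away. Selenium ships WebDriverWait (in selenium.webdriver.support.ui) for this. The idea behind it is plain polling, which the following stdlib-only sketch shows. wait_until is a hypothetical helper written for this answer, not part of selenium.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value; raises TimeoutError if the condition is never met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Usage with the browser object from the snippet above (not run here):
#
#   rows = wait_until(lambda: browser.find_elements("css selector", "table tr"))
#   html = browser.page_source  # read only once the rows exist
```

find_elements returns an empty list while nothing matches, so the lambda stays falsy until the first table row shows up.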

Note that selenium itself is very powerful in terms of locating elements - you don't need a separate HTML parser for extracting the data out of the page.
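As a rough sketch of what that looks like for a table: the function below pulls the cell text out of every row using only selenium's own element lookup. The function name extract_table and the default "table" selector are assumptions for illustration, so adjust the selector to match your page. The locator is passed as the plain string "css selector", which is the value of By.CSS_SELECTOR in selenium.

```python
CSS = "css selector"  # same string as selenium.webdriver.common.by.By.CSS_SELECTOR


def extract_table(driver, table_selector="table"):
    """Return the table as a list of rows, each row a list of cell strings.

    `driver` is a selenium WebDriver that has already loaded the page.
    """
    rows = []
    for tr in driver.find_elements(CSS, f"{table_selector} tr"):
        # Header cells (th) and data cells (td) are both collected, in document order.
        cells = tr.find_elements(CSS, "th, td")
        rows.append([cell.text.strip() for cell in cells])
    return rows


# Usage with a real browser (not run here):
#
#   from selenium import webdriver
#   browser = webdriver.Firefox()
#   browser.get(url)
#   for row in extract_table(browser):
#       print(row)
#   browser.quit()
```

Because the function only relies on find_elements and the .text attribute, you can pass it any object with the same interface, which makes it easy to try out without starting a browser.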

Hope that helps.


08-06 03:26