Problem Description
I have been using requests and BeautifulSoup in Python to scrape HTML from basic websites, but most modern websites don't just deliver HTML as a result. I believe they run JavaScript or something (I'm not very familiar with this, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, i.e. the cheapest price?
If this were simple HTML, I could just parse the HTML tree and find the text result, but the price does not appear when you view the page source. If you inspect the element in your browser, though, you can see the price inside HTML tags, just as if you were looking at the regular page source of a basic website.
What is going on here that Inspect Element shows the HTML but the page source doesn't? And does anyone know how to scrape this kind of data?
Thanks so much!
Recommended Answer
You're spot on: the page markup is added by JavaScript after the initial server response. I haven't used BeautifulSoup, but from its documentation, it looks like it doesn't execute JavaScript, so you're out of luck on that front.
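To make the "page source vs. Inspect Element" difference concrete, here is a minimal sketch. It is not the asker's code: the `PAGE_SOURCE` string, the `price` id, and the `static_text` helper are all made up for illustration, and it uses the standard-library `html.parser` as a stand-in for BeautifulSoup. The point is that a parser which does not run scripts only sees the empty placeholder the server sent.

```python
from html.parser import HTMLParser

# Hypothetical page source: the server sends an empty <div>, and a script
# fills in the price only once a browser executes it.
PAGE_SOURCE = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "$123";
  </script>
</body></html>
"""


class TextInTag(HTMLParser):
    """Collect text directly inside the element with a given id.

    Deliberately simple: it does not handle nested child elements.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Any closing tag ends the capture (good enough for flat markup).
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data.strip())


def static_text(html, target_id):
    """Return the text a non-JavaScript parser finds in the target element."""
    parser = TextInTag(target_id)
    parser.feed(html)
    return "".join(parser.chunks)


if __name__ == "__main__":
    # The price string exists in the source, but only inside the script body.
    print("$123" in PAGE_SOURCE)
    # The element itself is empty until a browser runs the script.
    print(repr(static_text(PAGE_SOURCE, "price")))
```

Inspect Element shows the DOM after scripts have run, which is why the price is visible there but not in what requests downloads.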
You might try Selenium, which is basically a browser you control from code; people use it for front-end testing. It executes JavaScript, so it might be able to give you what you want.
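A rough sketch of what the Selenium route could look like, assuming Selenium 4 and a local Chrome install. The URL and CSS selector in the usage section are placeholders, not real Google Flights values, and `fetch_rendered_text` and `parse_price` are hypothetical helper names; you would need to find the right selector with Inspect Element yourself.

```python
import re


def parse_price(text):
    """Return the first number in a string like '$1,234' as a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def fetch_rendered_text(url, css_selector, timeout=15):
    """Load a page in a real browser and return one element's rendered text."""
    # Imported lazily so parse_price stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until JavaScript has added the element to the DOM.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return element.text
    finally:
        driver.quit()


if __name__ == "__main__":
    # Placeholder URL and selector: replace with the page and element you need.
    text = fetch_rendered_text("https://example.com/flights", "span.price")
    print(parse_price(text))
```

The explicit wait matters here: reading the page immediately after `driver.get` can still return the markup from before the script has inserted the price.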
But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/