Problem description
Using Python, I want to crawl data from a web page whose source is quite big (it is the Facebook page of some user).
Say url is the URL I am trying to crawl. I run the following code:
import urllib2
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
The variable data is supposed to contain the source of the page I am crawling, but for some reason it does not contain all the characters I see when I compare it directly with the page source in the browser. I don't know what I am doing wrong. I know that the page I am trying to crawl has not been updated recently, so this is not because I am missing some very recent data.
Does someone have a clue?
EDIT: The kind of information I am missing looks like this:
<code class="hidden_elem" id="up82eq_33"><!-- <div class="mbs profileInfoSection"><div class="uiHeader uiHeaderTopAndBottomBorder uiHeaderSection infoSectionHeader"><div class="clearfix uiHeaderTop"><div><h4 tabindex="0" class="uiHeaderTitle">Basic Information</h4></div></div></div><div class="phs"><table class="uiInfoTable mtm profileInfoTable uiInfoTableFixed"><tbody><tr><th class="label">Networks</th><td class="data"><div class="uiCollapsedList uiCollapsedListHidden" id="up82eq_32"><span class="visible">XXXX</span></div></td></tr></tbody></table></div></div> --></code>
These are basically some of the fields I am interested in. What surprises me is that I can get some fields, but not all.
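One way to pin down exactly which fields are absent from the fetched bytes is to search them for a few class names taken from the snippet above. The following is a minimal sketch for Python 3, where urllib2 became urllib.request; the helper names fetch and missing_markers are hypothetical and not part of the original code.

```python
import urllib.request


def fetch(url):
    """Return the raw bytes the server sends for url. No JavaScript is run."""
    with urllib.request.urlopen(url) as usock:
        return usock.read()


def missing_markers(data, markers):
    """Return the markers (byte strings) that do not occur in the fetched data."""
    return [marker for marker in markers if marker not in data]


# Hypothetical usage, with markers copied from the snippet in the question:
# data = fetch(url)
# print(missing_markers(data, [b"profileInfoSection", b"uiHeaderTitle"]))
```

If the markers are reported as missing, the server never sent that markup to the script, so reading more of the response will not help.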
Solution
Facebook is heavily JavaScript-oriented. The page source you see in the browser is the DOM after any JS code has run (and the page source will frequently be changing anyway). You may have to automate a browser (using Selenium), or try other tools such as mechanize. Alternatively, look into a proper FB app and use the FB API.
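The browser-automation route could look roughly like the sketch below. It assumes the third-party selenium package and a Chrome driver are installed; make_driver and rendered_source are hypothetical helper names, and a real Facebook page would also need a login step and probably an explicit wait, neither of which is shown here.

```python
def make_driver():
    """Start a headless Chrome session (assumes selenium and Chrome are installed)."""
    from selenium import webdriver  # third-party, imported lazily

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_source(driver, url):
    """Load url in the browser and return the page source as the browser sees it."""
    driver.get(url)
    # page_source reflects the page at the moment of the call; content added by
    # scripts later on may need an explicit wait before reading it.
    return driver.page_source


# Hypothetical usage:
# driver = make_driver()
# try:
#     html = rendered_source(driver, url)
# finally:
#     driver.quit()
```

Passing the driver in as an argument keeps the fetching step independent of how the browser was started, so the same function works with Firefox or any other Selenium driver.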