Lots of sites use this (imo) annoying "infinite scrolling" style. Examples include Tumblr, Twitter, 9gag, etc.
I recently tried to scrape some pics off of these sites programmatically with HtmlAgilityPack, like this:
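// Requires: using System.Linq; using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
// SelectNodes can return null (not an empty list) when nothing matches.
var primary = doc.DocumentNode.SelectNodes("//img[@class='badge-item-img']");
// Take the src of the first matching image, or null if there is none.
var picstring = primary?.Select(r => r.GetAttributeValue("src", null)).FirstOrDefault();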
This works fine, but when I tried to load the HTML from certain sites, I noticed that I only got back a small amount of content (let's say the first 10 "posts" or "pictures", or whatever). This made me wonder if it would be possible to simulate "scrolling down to the bottom" of the page in C#.
This isn't just the case when I load the HTML programmatically. When I simply go to sites like Tumblr and check Firebug or just "view source", I expect all the content to be in there somewhere, but a lot of it seems to be hidden/inserted with JavaScript. Only the content that is actually visible on my screen is present in the HTML source.
(I know that I can use APIs for Tumblr and Twitter, but I'm just trying to have some fun hacking stuff together with HtmlAgilityPack.)
There is no way to reliably do this for all such websites in one shot, short of embedding a web browser (which typically won't work in headless environments).
What you should consider doing instead is looking at the site's JavaScript in order to see what AJAX queries are used to fetch content as the user scrolls down.
Alternatively, use a web debugger in your browser (such as the one included in Chrome). These debuggers usually have a "network" pane you can use to inspect AJAX requests performed by the page. Looking at these requests as you scroll down should give you enough information to write C# code that simulates those requests.
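As a rough sketch of what "simulating those requests" might look like, the code below fetches successive pages with HttpClient. The endpoint and the `?page=N` query shape are purely hypothetical; the real URL, parameters, and any required headers or cookies have to be copied from what the network pane shows for the site in question.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class ScrollSimulator
{
    // Hypothetical URL shape (an assumption, not any real site's API).
    // Replace it with the request pattern seen in the browser's network pane.
    public static string BuildPageUrl(string endpoint, int page) =>
        $"{endpoint}?page={page}";

    // Requests pages 1..maxPages in order and collects the raw response bodies.
    // Stops early if the server returns an empty body.
    public static async Task<List<string>> FetchPagesAsync(
        HttpClient client, string endpoint, int maxPages)
    {
        var bodies = new List<string>();
        for (int page = 1; page <= maxPages; page++)
        {
            string body = await client.GetStringAsync(BuildPageUrl(endpoint, page));
            if (string.IsNullOrWhiteSpace(body))
            {
                break;
            }
            bodies.Add(body);
        }
        return bodies;
    }
}
```

Calling `await ScrollSimulator.FetchPagesAsync(new HttpClient(), "https://example.com/api/posts", 5)` would then stand in for five "scrolls". Some sites page with an offset or a "last seen id" cursor rather than a page number, so the loop may need adjusting to match.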
You will then have to parse the response from those requests as whatever type of content that particular API delivers, which will probably be JSON or XML, but almost certainly not HTML. (This may be better for you anyway, since it will save you having to parse out display-oriented HTML, whereas the AJAX API will give you data objects that should be much easier to use.)
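For the JSON case, a minimal sketch of pulling image URLs out of such a response could look like this. It assumes a made-up payload of the form `{"posts":[{"image_url":"..."}]}`; the property names `posts` and `image_url` are placeholders for whatever the real response contains. It uses System.Text.Json, which ships with .NET Core 3.0 and later (on older frameworks a library such as Json.NET would fill the same role).

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class FeedParser
{
    // Returns every string-valued "image_url" found under the top-level "posts" array.
    // Both property names are assumptions for illustration only.
    public static List<string> ExtractImageUrls(string json)
    {
        var urls = new List<string>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        // TryGetProperty is only valid on JSON objects, so check the kind first.
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("posts", out JsonElement posts)
            || posts.ValueKind != JsonValueKind.Array)
        {
            return urls;
        }

        foreach (JsonElement post in posts.EnumerateArray())
        {
            if (post.ValueKind == JsonValueKind.Object
                && post.TryGetProperty("image_url", out JsonElement url)
                && url.ValueKind == JsonValueKind.String)
            {
                urls.Add(url.GetString());
            }
        }
        return urls;
    }
}
```

Posts without an `image_url`, or with a non-string value there, are simply skipped, so the method returns an empty list rather than throwing when the payload has a different shape than expected.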