Scraping pages with dynamically loaded content using PhantomJS from R


Question

Background
I'm currently scraping product information from several websites in R using rvest. This works on all but one website, where the content seems to be loaded dynamically via AngularJS (?), so the pages cannot be loaded iteratively, e.g. via URL parameters (as I did for the other websites). The specific URL is as follows:

Please keep in mind that I don't have admin rights on my machine and can only implement solutions that require either no admin rights or only a one-time grant of admin rights.

Desired output
In the end, a table in R with product information (e.g. label, price, rating). In this question, though, I purely need help to dynamically load and store the website; I can handle the postprocessing in R on my own.
It'd be absolutely great if you could push me in the right direction; maybe one of my approaches listed below is on the right track, but I just seem unable to transfer them to the stated website.

Current approach
I found PhantomJS, a headless browser that, as far as I know, should be able to handle this issue. I have close to no knowledge of JavaScript, and its syntax differs (at least for me) so heavily from the languages I'm more used to (R, Matlab, SQL) that I really struggle to adapt approaches suggested elsewhere to my code.
Based on this example (thanks a lot) I managed to retrieve at least the information from the first page shown, with this code:

R:

require(rvest)

## change Phantom.js scrape file
url <- 'http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html'

lines <- readLines("scrape_final.js")
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, "scrape_final.js")

## Download website
system("phantomjs scrape_final.js")

### use Rvest to scrape the downloaded website.
web <- read_html("1.html")
content <- html_nodes(web, 'div.paging-indicator')# %>% html_attr('href')
content <- html_text(content) %>% as.data.frame()

and the corresponding PhantomJS script:

var url ='http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
var page = new WebPage()
var fs = require('fs');

page.open(url, function (status) {
         just_wait();
});


function just_wait() {
    setTimeout(function() {
              fs.write('1.html', page.content, 'w');
           phantom.exit();
    }, 2500);
}

What does not work // research
While this code retrieves information from the first page, I need to iterate through all x product pages. I tried to extend the code above using the following examples:

[Scrape dynamic loading pages with phantomjs][3]

[Web scraping dynamically loading data in R][4]

The examples led me to the idea of

either clicking the "next page" button
or somehow injecting the correct pagination value

Either click on the "next page" button

This looks as follows:

    var url ='http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
    var page = require('webpage').create();
    var fs = require('fs');

    page.open(url, function (status) {
    age.open(url, function() {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
          $("paging-btn right").click();
            just_wait();
        });
        phantom.exit()
      });

    });


    function just_wait() {
        setTimeout(function() {
                  fs.write('1.html', page.content, 'w');
               phantom.exit();
        }, 2500);
    }

But that doesn't get me anywhere, due to poor syntax and maybe other things.
Calling this script from R unfortunately doesn't produce an error; it just runs for ages, so I have to quit it (while the working script takes only a few seconds).

I used the inspector in Firefox to retrieve the button name, but that might also be wrong:

<a class="paging-btn right" rel="next" ng-click="goToNextPage()"
ng-hide="articleData.pageNumber == articleData.pageCount"
href="javascript:void(0);">right</a>
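
One possible reason the click in the script above never fires: in jQuery, as in `document.querySelector`, the string `"paging-btn right"` is read as a descendant selector (a `<right>` element inside a `<paging-btn>` element), not as two class names. Each class needs its own leading dot. The following is a minimal sketch of turning a `class` attribute value into a selector; `classSelector` is a hypothetical helper name and not part of jQuery or PhantomJS.

```javascript
// Hypothetical helper: build a CSS selector from a tag name and the
// value of its class attribute, as found in the button markup above.
function classSelector(tag, classAttr) {
  var classes = classAttr.split(/\s+/).filter(function (c) {
    return c.length > 0; // drop empty pieces from leading/trailing spaces
  });
  return tag + classes.map(function (c) { return '.' + c; }).join('');
}

var selector = classSelector('a', 'paging-btn right');
console.log(selector); // a.paging-btn.right
```

Inside `page.evaluate`, the resulting string could then be passed to `document.querySelector`. Whether the AngularJS handler on this site reacts to a scripted click would still need to be checked.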

Change the pagination parameter on load

I tried to work from the example given here: Passing variable into page.evaluate - PhantomJS

but again just got a script that never finished when called from R.

Additional notes
It looks like I'm only allowed to post two links, so unfortunately I couldn't link all the sources I've researched and tested.

I'm well aware this is a lot of messy information at once; if you can help me improve or better structure my question, please let me know. I'll do my best to be responsive and get you anything you need to assist.

Recommended answer

I've split the PhantomJS code into two parts, which avoids the error messages. I'm quite confident it is possible to first read and store the website and afterwards click on the "next page" button and output the new URL in a single script, but unfortunately that didn't work out without an error message.

The following R code is the innermost scraping loop (it retrieves info from the pages of one sub-sub-category and calls / changes the PhantomJS scripts accordingly).

   for (i3 in 1:num_prod_pages) {

      system('phantomjs readhtml.js') # return html of current page via PhantomJS

      ### Use Rvest to scrape the downloaded website.

      html_filename <- paste0(as.character(i3), '.html') # file generated in PhantomJS readhtml.js
      web <- read_html(html_filename)
      content <- html_nodes(web, 'div.article-pricing') # %>% html_attr('href')
      content <- html_text(content) %>% as.data.frame()

      ### generate URL of next page

      url_i3 <- capture.output(system("phantomjs nextpage.js", intern = TRUE)) %>%
         .[length(.)] %>% # last line of output contains the new URL
         str_sub(str_locate(., 'http')[1], -2) # keep from 'http' onwards and drop the closing quote (needs stringr)

      # Adapt PhantomJS scripts to new url

      lines <- readLines("readhtml.js")
      lines[2] <- paste0("var url ='", url_i3 ,"';")
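      # line 12 of readhtml.js (as listed below) holds the fs.write call; the next iteration reads (i3 + 1).html
      lines[12] <- paste0("        fs.write('", as.character(i3 + 1), ".html', page.content, 'w');")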
      writeLines(lines, "readhtml.js")

      lines <- readLines("nextpage.js")
      lines[2] <- paste0("var url ='", url_i3 ,"';")
      writeLines(lines, "nextpage.js")
   }

The following PhantomJS script, "readhtml.js", stores the page at the current URL locally.

var webPage = require('webpage');
var url ='http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';
var fs = require('fs');
var page = webPage.create();
var system = require('system');

//page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0'
page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

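page.open(url, function(status) {
    if (status === 'success') {
        fs.write('1.html', page.content, 'w');
        console.log('htmlfile ready');
        phantom.exit();
    } else {
        // exit on failure too; otherwise PhantomJS keeps running and the R call never returns
        console.log('failed to load ' + url);
        phantom.exit(1);
    }
});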
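
The R loop above rewrites fixed line numbers of this script before every run, which breaks as soon as a line is added or removed. One alternative, sketched here under assumptions, is to pass the URL and the output file name on the command line. In PhantomJS the argument list is available as `require('system').args`, with the script name as its first element. `parseArgs` is a hypothetical helper, and a plain array stands in for `system.args` so the sketch also runs outside PhantomJS.

```javascript
// Hypothetical helper: read the target URL and output file from the
// argument list instead of hard-coding them in the script.
// In PhantomJS the list would come from require('system').args, whose
// first element is the script name; a plain array is used here.
function parseArgs(args) {
  if (args.length < 3) {
    throw new Error('usage: phantomjs readhtml.js <url> <output-file>');
  }
  return { url: args[1], outFile: args[2] };
}

var opts = parseArgs([
  'readhtml.js',
  'http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html',
  '1.html'
]);
console.log(opts.url + ' -> ' + opts.outFile);
```

With this in place, the R side would build the call with `paste` (for example `phantomjs readhtml.js <url> 2.html`) rather than editing the script file, and `opts.url` and `opts.outFile` would replace the hard-coded values in `page.open` and `fs.write`.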

The following PhantomJS script, "nextpage.js", clicks the "next page" button and returns the new URL.

var webPage = require('webpage');
var url ='http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';
var fs = require('fs');
var page = webPage.create();
var system = require('system');

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0';

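page.open(url, function(status) {
    if (status === 'success') {
        page.evaluate(function() {
            document.querySelector('a.right:nth-child(3)').click();
        });
        // give the page time to react to the click before reading the new URL
        setTimeout(function() {
            var new_url = page.url;
            console.log('URL: ' + new_url);
            phantom.exit();
        }, 2000);
    } else {
        // exit on failure too, so the calling R session does not hang
        phantom.exit(1);
    }
});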
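
The script above waits a fixed 2000 ms after the click, which may be too short on a slow connection and is longer than needed on a fast one. A common alternative is to poll until a condition holds. The sketch below is an assumption-laden illustration, not part of the original answer: `waitFor` is a hypothetical helper built only on `setInterval`, and `fakePage` stands in for the PhantomJS `page` object so that the example runs on its own.

```javascript
// Hypothetical helper: call `check` every `intervalMs` until it returns
// true, then call `onReady`; give up after `timeoutMs` and call
// `onTimeout` instead. Uses only setInterval/clearInterval.
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}

// Stand-in for the page: its "URL" changes after roughly 300 ms.
var fakePage = { url: 'page-1' };
setTimeout(function () { fakePage.url = 'page-2'; }, 300);

var oldUrl = fakePage.url;
waitFor(
  function () { return fakePage.url !== oldUrl; },    // has the URL changed?
  function () { console.log('URL: ' + fakePage.url); },
  function () { console.log('timed out'); },
  5000,
  100
);
```

In nextpage.js the check would compare `page.url` with the URL recorded before the click, and both callbacks would end with `phantom.exit()` so the calling R session always returns.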

All in all not really elegant, but lacking other input I'm closing this one, as it works without any error messages.
