本文介绍了使用 MapReduce 实现 PageRank的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在努力解决使用 MapReduce 实现 PageRank 的理论问题.

I'm trying to get my head around an issue with the theory of implementing the PageRank with MapReduce.

我有以下三个节点的简单场景:A B C.

I have the following simple scenario with three nodes: A B C.

邻接矩阵在这里:

A { B, C }
B { A }

例如,B 的 PageRank 等于:

The PageRank for B for example is equal to:

(1-d)/N + d ( PR(A) / C(A) )

N     = number of incoming links to B
PR(A) = PageRank of incoming link A
C(A)  = number of outgoing links from page A

我对所有原理图以及映射器和减速器的工作方式都很好,但我无法弄清楚在减速器计算时如何知道 C(A).在通过聚合到 B 的传入链接来计算 B 的 PageRank 时,reducer 将如何知道每个页面的传出链接数.这是否需要在某些外部数据源中查找?

I am fine with all the schematics and how the mapper and reducer would work but I cannot get my head around how at the time of calculation by the reducer, C(A) would be known. How will the reducer, when calculating the PageRank of B by aggregating the incoming links to B will know the number of outgoing links from each page. Does this require a lookup in some external data source?

推荐答案

这是一个伪代码:

map( key: [url, pagerank], value: outlink_list )
    for each outlink in outlink_list
        emit( key: outlink, value: pagerank/size(outlink_list) )

    emit( key: url, value: outlink_list )

reducer( key: url, value: list_pr_or_urls )
    outlink_list = []
    pagerank = 0

    for each pr_or_urls in list_pr_or_urls
        if is_list( pr_or_urls )
            outlink_list = pr_or_urls
        else
            pagerank += pr_or_urls

    pagerank = 1 - DAMPING_FACTOR + ( DAMPING_FACTOR * pagerank )

    emit( key: [url, pagerank], value: outlink_list )

正如 intenret 上的一些文章所建议的那样,在 reduce 中您应该输出外链接而不是内链接,这一点很重要.这样,连续迭代也将外链作为映射器的输入.

It is important that in the reduce you should output outlinks and not inlinks, as some articles on the intenret suggests. This way the consecutive iterations will also have outlinks as input of the mapper.

请注意,来自同一页面的具有相同地址的多个外链算作一个.另外,不要计算循环(页面链接到自身).

Pay attention that multiple outlinks with the same address from the same page count as one. Also, don't count loops (page linking to itself).

阻尼系数通常为 0.85,但您也可以使用其他值.

The damping factor is traditionally 0.85, although you can play around with other values, too.

这篇关于使用 MapReduce 实现 PageRank的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

05-29 03:27
查看更多