Question
I am new to GraphX and have a Spark DataFrame with four columns, as shown below:
src_ip dst_ip flow_count sum_bytes
8.8.8.8 1.2.3.4 435 1137
... ... ... ...
Basically, I want to map both src_ip and dst_ip to vertices and assign flow_count and sum_bytes as edge attributes. As far as I know, we cannot add edge attributes in GraphX, since only vertex attributes are permitted. Hence, I am thinking about adding flow_count as the edge weight:
// create edges
val trafficEdges = trafficsFromTo.map(x => Edge(MurmurHash3.stringHash(x(0).toString), MurmurHash3.stringHash(x(1).toString), x(2)))
However, can I add sum_bytes as an edge weight as well?
Recommended answer
It is possible to add both variables to the edge. The simplest solution is to use a tuple, for example:
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
val data = Array(Edge(3L, 7L, (123, 456)), Edge(5L, 3L, (41, 34)))
val edges: RDD[Edge[(Int, Int)]] = spark.sparkContext.parallelize(data)
Alternatively, you can use a case class:
case class EdgeWeight(flow_count: Int, sum_bytes: Int)
val data2 = Array(Edge(3L, 7L, EdgeWeight(123, 456)), Edge(5L, 3L, EdgeWeight(41, 34)))
val edges2: RDD[Edge[EdgeWeight]] = spark.sparkContext.parallelize(data2)
A case class is easier to use and maintain if more attributes need to be added later.
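As a minimal sketch of how such an edge attribute can be used once it is on a graph, the snippet below builds a graph with Graph.fromEdges and reads both fields back. The object name EdgeAttrSketch, the helper names buildGraph, totalBytes and busyEdges, the local master setting, and the "unknown" default vertex attribute are assumptions made for illustration; none of them come from the question.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Edge attribute carrying both values.
case class EdgeWeight(flow_count: Int, sum_bytes: Int)

object EdgeAttrSketch {
  // Build a graph whose vertices are implied by the edge endpoints.
  // "unknown" is a placeholder vertex attribute (assumption).
  def buildGraph(spark: SparkSession): Graph[String, EdgeWeight] = {
    val edges: RDD[Edge[EdgeWeight]] = spark.sparkContext.parallelize(Seq(
      Edge(3L, 7L, EdgeWeight(123, 456)),
      Edge(5L, 3L, EdgeWeight(41, 34))
    ))
    Graph.fromEdges(edges, defaultValue = "unknown")
  }

  // Sum of sum_bytes over all edges.
  def totalBytes(graph: Graph[String, EdgeWeight]): Long =
    graph.edges.map(_.attr.sum_bytes.toLong).reduce(_ + _)

  // Keep only edges whose flow_count exceeds minFlows.
  def busyEdges(graph: Graph[String, EdgeWeight], minFlows: Int): Graph[String, EdgeWeight] =
    graph.subgraph(epred = t => t.attr.flow_count > minFlows)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local run for illustration only
      .appName("edge-attr-sketch")
      .getOrCreate()
    val graph = buildGraph(spark)
    println(totalBytes(graph))
    busyEdges(graph, 100).edges.collect().foreach(println)
    spark.stop()
  }
}
```

Because the attribute is an ordinary Scala value, both fields are available wherever GraphX exposes an edge or a triplet, so filtering on flow_count and aggregating sum_bytes need no extra bookkeeping.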
I believe that in this specific case, it is most elegantly solved by:
// assumes x(2) and x(3) are Ints (flow_count, sum_bytes)
val trafficEdges = trafficsFromTo.map { x =>
  Edge(MurmurHash3.stringHash(x(0).toString),
       MurmurHash3.stringHash(x(1).toString),
       EdgeWeight(x(2), x(3)))
}
trafficEdges.sortBy(edge => edge.attr.flow_count) // sort by flow_count
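If trafficsFromTo holds Row objects taken from the DataFrame, x(2) and x(3) have static type Any, so the fields need typed getters before they fit into EdgeWeight. The following is a sketch under assumptions the question does not confirm: that flow_count and sum_bytes are integer columns and that the DataFrame is called traffic. FlowEdgeSketch, toEdges and ipToId are hypothetical names. Note that MurmurHash3.stringHash yields a 32-bit Int, so two distinct IPs could in principle hash to the same vertex id.

```scala
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.hashing.MurmurHash3

case class EdgeWeight(flow_count: Int, sum_bytes: Int)

object FlowEdgeSketch {
  // Hash an IP string to a vertex id, as the question does.
  def ipToId(ip: String): Long = MurmurHash3.stringHash(ip).toLong

  // Assumes columns src_ip, dst_ip (strings) and flow_count, sum_bytes (ints).
  def toEdges(traffic: DataFrame): RDD[Edge[EdgeWeight]] =
    traffic.rdd.map { row =>
      Edge(
        ipToId(row.getAs[String]("src_ip")),
        ipToId(row.getAs[String]("dst_ip")),
        EdgeWeight(row.getAs[Int]("flow_count"), row.getAs[Int]("sum_bytes"))
      )
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local run for illustration only
      .appName("flow-edge-sketch")
      .getOrCreate()
    import spark.implicits._
    // One sample row shaped like the table in the question.
    val traffic = Seq(("8.8.8.8", "1.2.3.4", 435, 1137))
      .toDF("src_ip", "dst_ip", "flow_count", "sum_bytes")
    toEdges(traffic).collect().foreach(println)
    spark.stop()
  }
}
```

If the columns turn out to be of type Long (for example after a count or sum aggregation), the case class fields and the getAs type parameters would need to change to Long accordingly.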