Giraph: best vertex input format for an input file with String ids

Problem description

I have a multi-node Giraph cluster working properly on my PC. I ran the SimpleShortestPathExample from Giraph and it executed fine.

The algorithm was run with this file (tiny_graph.txt):

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

This file has the following input format:

[source_id,source_value,[[dest_id, edge_value],...]]
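As an illustration of this layout, each line of tiny_graph.txt happens to be a valid JSON array, so it can be unpacked with a few lines of Python. This is only a sketch: `parse_vertex_line` is a hypothetical helper written for this example and is not part of Giraph, and the choice of `int` ids and `float` values is an assumption based on the sample data.

```python
import json
from typing import List, Tuple


def parse_vertex_line(line: str) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Split '[source_id,source_value,[[dest_id,edge_value],...]]' into its parts."""
    # The whole line is one JSON array with exactly three elements.
    source_id, source_value, raw_edges = json.loads(line)
    # Each edge is itself a two-element array: [dest_id, edge_value].
    edges = [(int(dest), float(weight)) for dest, weight in raw_edges]
    return int(source_id), float(source_value), edges


if __name__ == "__main__":
    print(parse_vertex_line("[0,0,[[1,1],[3,3]]]"))
```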

Now I'm trying to run the same algorithm on the same cluster, but with a different input file. My own file looks like this:

[Portada,0,[[Sugerencias para la cita del día,1]]]
[Proverbios españoles,0,[]]
[Neil Armstrong,0,[[Luna,1][ideal,1][verdad,1][Categoria:Ingenieros,2,[Categoria:Estadounidenses,2][Categoria:Astronautas,2]]]
[Categoria:Ingenieros,1,[[Neil Armstrong,2]]]
[Categoria:Estadounidenses,1,[[Neil Armstrong,2]]]
[Categoria:Astronautas,1,[[Neil Armstrong,2]]]

It's very similar to the original, but the ids are Strings and the vertex and edge values are Long. My question is which TextInputFormat I should use for this, because I have already tried org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat and org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat and couldn't get either of them working.
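One likely reason the JSON-based format rejects this file, independent of the id type, is that unquoted strings are not valid JSON. The sketch below checks this with Python's standard `json` module; `is_valid_json_line` is a hypothetical helper for illustration, and how Giraph itself reports such a line is not shown here.

```python
import json


def is_valid_json_line(line: str) -> bool:
    """Return True if the line parses as JSON, False otherwise."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    # Numeric ids: valid JSON.
    print(is_valid_json_line("[0,0,[[1,1],[3,3]]]"))
    # Unquoted String ids: not valid JSON.
    print(is_valid_json_line("[Portada,0,[[Sugerencias para la cita del día,1]]]"))
    # Quoted String ids would be valid JSON, although a format with Long in
    # its name presumably still expects numeric ids.
    print(is_valid_json_line('["Portada",0,[["Sugerencias para la cita del día",1]]]'))
```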

With this problem solved, I could adapt the original shortest path example algorithm to work with my file, but until I have a solution I can't get to that point.

If this format is not a good choice, I could perhaps adapt it, but I don't know what my best option is. My knowledge of text input and output formats in Giraph is really poor, which is why I'm here asking for advice.

Solution

I solved this by adapting my own file to fit org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat. My original file should look like this:

Portada 0.0     Sugerencias     1.0
Proverbios      0.0
Neil    0.0     Luna    1.0     ideal   1.0     verdad  1.0     Categoria:Ingenieros    2.0     Categoria:Estadounidenses       2.0     Categoria:Astronautas   2.0
Categoria:Ingenieros    1.0     Neil    2.0
Categoria:Estadounidenses       1.0     Neil    2.0
Categoria:Astronautas   1.0     Neil    2.0

The gaps between the fields are tab characters ('\t'), because this format uses the tab as its default delimiter for splitting each input line into tokens.
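A conversion from the bracketed layout to this tab-separated layout could be scripted along the following lines. This is a minimal sketch, not the method actually used for the file above: `convert_line` and its regular expression are hypothetical, ids are kept unchanged (the sample above uses shortened single-word ids), and numbers are written as doubles on the assumption that the format expects Double vertex and edge values.

```python
import re

# One "[name,number" pair: text up to a comma, then text up to a bracket or comma.
# Matching pairs individually tolerates missing commas between edge entries.
_PAIR = re.compile(r"\[([^\[\],]+),\s*([^\[\],]+)")


def convert_line(line: str) -> str:
    """Turn '[id,value,[[dest,weight],...]]' into 'id<TAB>value<TAB>dest<TAB>weight...'."""
    pairs = _PAIR.findall(line.strip())
    if not pairs:
        raise ValueError(f"not a vertex line: {line!r}")
    fields = []
    for name, number in pairs:
        fields.append(name.strip())
        fields.append(str(float(number)))  # "1" becomes "1.0"
    return "\t".join(fields)


if __name__ == "__main__":
    for sample in (
        "[Portada,0,[[Sugerencias para la cita del día,1]]]",
        "[Proverbios españoles,0,[]]",
    ):
        print(convert_line(sample))
```

The first pair on each line is the vertex id and value; every later pair is an edge target and its weight, which matches the field order shown in the converted file.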

Thanks @masoud-sagharichian for your help anyway!! :D

