Problem description
I am a beginner with Hadoop and have been asked to create a custom InputFormat class to read JSON data. I have searched around and learned how to create a custom InputFormat class that reads data from a file, but I am stuck on parsing the JSON data. My JSON data looks like this:
[
{
"_count": 30,
"_start": 0,
"_total": 180,
"values": [
{
"attachment": {
"contentDomain": "techcarnival2013.eventbrite.com",
"contentUrl": "http://techcarnival2013.eventbrite.com/",
"imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
"summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
"title": "Tech Carnival @ Candlestick Park"
},
"comments": {
"_total": 0
},
"creationTimestamp": 1373908436000,
"creator": {
"firstName": "Clayton",
"headline": "Director of Operations",
"secondname":{
"name":"myname"
},
"lastName": "K.",
"pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
},
"likes": {
"_total": 0
},
"relationToViewer": {
"availableActions": {
"_total": 7,
"values": [
{
"code": "add-comment"
},
{
"code": "categorize-as-job"
},
{
"code": "categorize-as-promotion"
},
{
"code": "flag-as-inappropriate"
},
{
"code": "follow"
},
{
"code": "like"
},
{
"code": "reply-privately"
}
]
},
"isFollowing": false,
"isLiked": false
},
"summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
"title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
},
{
"attachment": {
"contentDomain": "lifebeyondnumbers.com",
"contentUrl": "http://bit.ly/10VTqMu",
"imageUrl": "http://lifebeyondnumbers.com/wp-content/uploads/2013/07/lurnq_Online_Courses.jpg",
"summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
"title": "LurnQ - making lifelong learning clutter free, fun and a social..."
},
"comments": {
"_total": 0
},
"creationTimestamp": 1373883177000,
"creator": {
"firstName": "Syed",
"headline": "Founder and CEO at QubiqSquare",
"lastName": "Muksit",
"pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_Y5gdzlRCbQBTqIa-pXYnz-01b6KinDO-pFWnz-ZCZLk1WWdt-_SLUt2uWmrpzo0OxQxcVv6pRjbE"
},
"likes": {
"_total": 0
},
"relationToViewer": {
"availableActions": {
"_total": 7,
"values": [
{
"code": "add-comment"
},
{
"code": "categorize-as-job"
},
{
"code": "categorize-as-promotion"
},
{
"code": "flag-as-inappropriate"
},
{
"code": "follow"
},
{
"code": "like"
},
{
"code": "reply-privately"
}
]
},
"isFollowing": false,
"isLiked": false
},
"summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
"title": "There is so much to learn and most of the times, we don\u2019t even know that this-and-that good stuff exists. http://bit.ly/10VTqMu"
},
{
"attachment": {
"contentDomain": "techcarnival2013.eventbrite.com",
"contentUrl": "http://techcarnival2013.eventbrite.com/",
"imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
"summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
"title": "Tech Carnival @ Candlestick Park"
},
"comments": {
"_total": 0
},
"creationTimestamp": 1373654758000,
"creator": {
"firstName": "Clayton",
"headline": "Director of Operations",
"lastName": "K.",
"pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
},
"likes": {
"_total": 0
},
"relationToViewer": {
"availableActions": {
"_total": 7,
"values": [
{
"code": "add-comment"
},
{
"code": "categorize-as-job"
},
{
"code": "categorize-as-promotion"
},
{
"code": "flag-as-inappropriate"
},
{
"code": "follow"
},
{
"code": "like"
},
{
"code": "reply-privately"
}
]
},
"isFollowing": false,
"isLiked": false
},
"summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
"title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
}
..........
........ so on
]
So I am confused about how to read the JSON objects in my custom InputFormat class. Any ideas on how to parse this? I want to read each individual JSON object inside the JSON array, that is, read a complete JSON string and pass that string to map, where I would use a JSON parser to construct my own key-value pairs. Any help on this? Thanks in advance.
If your question matches what Magham Ravi commented, that answer is fine.
But if you have a single file containing all the JSON data, as shown above, you might want to read the whole file, retrieve it as a String from the value (a BytesWritable) in the map function, and feed it to a JSON parser inside the same map() function.
Please have a look at WholeFileInputFormat.
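As a rough illustration of the "retrieve it as a String" step, here is a minimal plain-Java sketch. It assumes the mapper receives the file as a byte buffer plus a valid length, which is how BytesWritable exposes its data through getBytes() and getLength(): the backing array can be longer than the valid data, so only the first getLength() bytes should be decoded. The class name WholeFileValueDecoder is hypothetical, and UTF-8 is an assumption about the file's encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper showing how a whole-file value could be turned into a String. */
class WholeFileValueDecoder {

    /**
     * Decodes only the valid prefix of the buffer.
     * In a real mapper the arguments would be value.getBytes() and value.getLength().
     */
    static String decode(byte[] buffer, int validLength) {
        // Assumption: the JSON file is UTF-8 encoded.
        return new String(buffer, 0, validLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] json = "[{\"_count\":30}]".getBytes(StandardCharsets.UTF_8);
        // Simulate a backing array that is larger than the valid data.
        byte[] padded = Arrays.copyOf(json, json.length + 16);
        System.out.println(decode(padded, json.length));
    }
}
```

The resulting String can then be handed to whichever JSON parser you use inside map(). As far as I know, WholeFileInputFormat is usually copied from published examples rather than shipped with Hadoop, so check where yours comes from.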
Furthermore, if you have multiple JSON objects in a single file and want to get each one as a value in the mapper, you can use something like XMLInputFormat with start and end tags defined. For JSON, you need unique start and end tags that mark exactly the start and end of the single JSON object you want. Merely using start-tag = "[{" and end-tag = "}]" might not help if you want the whole JSON object above returned as one value, because those same sequences occur many times in the nested data and would confuse the InputFormat.
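One way around the nesting problem is to count brace depth instead of matching fixed tags. The sketch below is plain Java, not Hadoop API code: it shows only the scanning logic a custom RecordReader could use, and it swaps the start/end-tag matching described above for depth counting. The class name JsonObjectSplitter and the level parameter are hypothetical. It assumes the text fits in memory and is well-formed JSON.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: extracts the JSON objects found at a given nesting depth. */
class JsonObjectSplitter {

    /**
     * Returns every object whose opening brace is at the given brace depth
     * (1 = outermost objects). Braces inside string literals are ignored.
     */
    static List<String> splitObjectsAtDepth(String text, int level) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current '{' nesting depth
        int start = -1;           // start index of the object being collected
        boolean inString = false; // true while inside a "..." literal
        boolean escaped = false;  // true right after a backslash in a string

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
                if (depth == level) {
                    start = i;
                }
            } else if (c == '}' && depth > 0) {
                if (depth == level && start >= 0) {
                    objects.add(text.substring(start, i + 1));
                    start = -1;
                }
                depth--;
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        String json = "[{\"values\":[{\"code\":\"follow\"},{\"creator\":{\"name\":\"x}\"}}]}]";
        // Depth 2 corresponds to the elements of "values" in the question's layout.
        for (String obj : splitObjectsAtDepth(json, 2)) {
            System.out.println(obj);
        }
    }
}
```

In the question's data the individual posts sit inside "values", so level 2 would be the depth of interest there. A real RecordReader would also have to deal with objects that straddle input split boundaries, which this sketch does not attempt.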
If you cannot achieve the above, try building your own custom TextInputFormat that overrides the LineReader used by TextInputFormat.
In the LineReader class you'll find these two constants (I may be a little outdated; please check whether this is now configurable through a configuration property. I know CDH has made it configurable; if not, you need to override):
private static final byte CR = '\r';
private static final byte LF = '\n';
You can then drop CR and use "]\n[" as the record delimiter in place of LF, since each of your independent JSON documents would be in the form shown below (or you will know its layout better than I do):
[
...JSON 1
]
[
...JSON 2
]
[
...JSON N
]
(NOTE: There is a \n between ] and [ that marks the boundary between different JSON objects' data.)
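To make the delimiter idea concrete, here is a small plain-Java sketch of splitting on "]\n[". It is not the LineReader implementation; it only mimics what a reader with that delimiter would hand to the mapper. Because the delimiter is consumed by the split, each record loses a bracket that has to be put back before parsing. The class name BracketDelimitedRecords is hypothetical. On the configuration question: newer Hadoop releases read the record delimiter for TextInputFormat from the textinputformat.record.delimiter property, though this is worth verifying against the version you run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a file of back-to-back JSON arrays on "]\n[". */
class BracketDelimitedRecords {

    static final String DELIMITER = "]\n[";

    static List<String> split(String content) {
        List<String> records = new ArrayList<>();
        // Pattern.quote keeps '[' and ']' from being treated as regex syntax.
        String[] parts = content.split(Pattern.quote(DELIMITER), -1);
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i];
            if (i > 0) {
                part = "[" + part;   // restore the '[' consumed by the delimiter
            }
            if (i < parts.length - 1) {
                part = part + "]";   // restore the ']' consumed by the delimiter
            }
            records.add(part.trim());
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "[\n{\"id\":1}\n]\n[\n{\"id\":2}\n]\n";
        for (String record : split(content)) {
            System.out.println(record);
            System.out.println("----");
        }
    }
}
```

This only works if "]\n[" never occurs inside a record, for example in a nested array that happens to be formatted that way or inside a string value.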
Hope this makes sense.