Problem description
Hi, I am very new to HBase. I downloaded some Twitter data and stored it in MongoDB. Now I need to move that data into HBase to speed up Hadoop processing, but I am not able to design its schema. Here is a sample of the Twitter data in JSON format:
{
"_id" : ObjectId("512b71e6e4b02a4322d1c0b0"),
"id" : NumberLong("306044618179506176"),
"source" : "<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook</a>",
"user" : {
"name" : "Dada Bhagwan",
"location" : "India",
"url" : "http://www.dadabhagwan.org",
"id" : 191724440,
"protected" : false,
"timeZone" : null,
"description" : "Founder of Akram Vignan - Practical Spiritual Science of Self Realization",
"screenName" : "dadabhagwan",
"geoEnabled" : false,
"profileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034_normal.jpg",
"biggerProfileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034_bigger.jpg",
"profileImageUrlHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_normal.jpg",
"profileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_normal.jpg",
"biggerProfileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_bigger.jpg",
"miniProfileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034_mini.jpg",
"originalProfileImageURLHttps" : "https://si0.twimg.com/profile_images/1647956820/M_DSC_0034.jpg",
"followersCount" : 499,
"profileBackgroundColor" : "EEE4C1",
"profileTextColor" : "333333",
"profileLinkColor" : "990000",
"lang" : "en",
"profileSidebarFillColor" : "FCF9EC",
"profileSidebarBorderColor" : "CBC09A",
"profileUseBackgroundImage" : true,
"showAllInlineMedia" : false,
"friendsCount" : 1,
"favouritesCount" : 0,
"profileBackgroundImageUrl" : "http://a0.twimg.com/profile_background_images/396759326/dadabhagwan-twitter.jpg",
"profileBackgroundImageURL" : "http://a0.twimg.com/profile_background_images/396759326/dadabhagwan-twitter.jpg",
"profileBackgroundImageUrlHttps" : "https://si0.twimg.com/profile_background_images/396759326/dadabhagwan-twitter.jpg",
"profileBannerURL" : null,
"profileBannerRetinaURL" : null,
"profileBannerIPadURL" : null,
"profileBannerIPadRetinaURL" : null,
"miniProfileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034_mini.jpg",
"originalProfileImageURL" : "http://a0.twimg.com/profile_images/1647956820/M_DSC_0034.jpg",
"utcOffset" : -1,
"contributorsEnabled" : false,
"status" : null,
"createdAt" : NumberLong("1284700143000"),
"profileBannerMobileURL" : null,
"profileBannerMobileRetinaURL" : null,
"profileBackgroundTiled" : false,
"statusesCount" : 1713,
"verified" : false,
"translator" : false,
"listedCount" : 6,
"followRequestSent" : false,
"descriptionURLEntities" : [ ],
"urlentity" : {
"url" : "http://www.dadabhagwan.org",
"start" : 0,
"end" : 26,
"expandedURL" : "http://www.dadabhagwan.org",
"displayURL" : "http://www.dadabhagwan.org"
},
"rateLimitStatus" : null,
"accessLevel" : 0
},
"contributors" : [ ],
"geoLocation" : null,
"place" : null,
"favorited" : false,
"retweet" : false,
"retweetedStatus" : null,
"retweetCount" : 0,
"userMentionEntities" : [ ],
"retweetedByMe" : false,
"currentUserRetweetId" : -1,
"possiblySensitive" : false,
"urlentities" : [
{
"url" : "http://t.co/gR1GohGjaj",
"start" : 113,
"end" : 135,
"expandedURL" : "http://fb.me/2j2HKHJrM",
"displayURL" : "fb.me/2j2HKHJrM"
}
],
"hashtagEntities" : [ ],
"mediaEntities" : [ ],
"truncated" : false,
"inReplyToStatusId" : -1,
"text" : "Spiritual Quote of the Day :\n\n‘I am Chandubhai’ is an illusion itself and from that are \nkarmas charged. When... http://t.co/gR1GohGjaj",
"inReplyToUserId" : -1,
"inReplyToScreenName" : null,
"createdAt" : NumberLong("1361801697000"),
"rateLimitStatus" : null,
"accessLevel" : 0
}
How should I divide this data into columns and column families? I thought of making one "twitter" column family containing source, geoLocation, place, retweet, etc., and another "user" column family containing name, location, etc. (the user's data), i.e. a new column family for each inner-level sub-document.
Is this approach correct? If so, how will I differentiate the urlentity data for the "user" column family from that for the "twitter" column family?
And how should I handle keys that contain a list of sub-documents (e.g. urlentities)?
There are many ways to model this in HBase, ranging from storing everything in a single column to having a different table for each sub-entity, with several other tables for "indexing".
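As a rough illustration of the simplest end of that range, the sketch below serializes a whole tweet into one cell. It is only a sketch under assumptions: the tweet id is used as the row key, and the `d:json` family:qualifier name and the `to_single_cell` helper are hypothetical, not taken from the question. It only builds the row key and cell in memory and does not talk to HBase.

```python
import json


def to_single_cell(tweet):
    """Return (row_key, cells) with the whole document stored in one cell."""
    # Assumption: the tweet id makes a reasonable row key for this data.
    row_key = str(tweet["id"]).encode("utf-8")
    # "d:json" is a hypothetical family:qualifier chosen for illustration.
    cells = {b"d:json": json.dumps(tweet, sort_keys=True).encode("utf-8")}
    return row_key, cells


# A trimmed-down version of the document from the question.
tweet = {
    "id": 306044618179506176,
    "retweetCount": 0,
    "user": {"id": 191724440, "screenName": "dadabhagwan"},
}
row_key, cells = to_single_cell(tweet)
```

The trade-off is that every read has to fetch and parse the full document, even when only one field is needed.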
Generally speaking, you model the data in HBase based on your read and write access patterns. For example, column families are stored in separate files on disk, so a reason to divide data into two column families is that there are many cases where you need data from one and not the other.
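To make the two-family idea from the question concrete, here is a minimal sketch of one possible mapping. It is not a recommendation from this answer. It assumes the "twitter" and "user" family names proposed in the question, flattens nested keys into dotted qualifiers, and encodes list positions as an index in the qualifier. The helper names `flatten` and `to_cells` are hypothetical, and whether two families are worthwhile still depends on the access patterns.

```python
def flatten(value, prefix=""):
    """Yield (qualifier, leaf_value) pairs for every leaf of a nested document."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # List elements get their position as part of the qualifier.
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}.{index}")
    else:
        yield prefix, value


def to_cells(tweet):
    """Map a tweet document to {"family:qualifier": value} cells."""
    cells = {}
    for key, value in tweet.items():
        if key == "user":
            # The user sub-document goes into its own family.
            family, pairs = "user", flatten(value)
        else:
            family, pairs = "twitter", flatten(value, key)
        for qualifier, leaf in pairs:
            if leaf is not None:  # skip nulls rather than storing empty cells
                cells[f"{family}:{qualifier}"] = str(leaf)
    return cells


tweet = {
    "id": 306044618179506176,
    "retweetCount": 0,
    "place": None,
    "hashtagEntities": [],
    "urlentities": [{"url": "http://t.co/gR1GohGjaj", "start": 113}],
    "user": {
        "name": "Dada Bhagwan",
        "urlentity": {"url": "http://www.dadabhagwan.org"},
    },
}
cells = to_cells(tweet)
```

With this mapping the user's urlentity lands under `user:urlentity.url` and the tweet's list under `twitter:urlentities.0.url`, so the family prefix keeps the two apart.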
There's a good presentation about HBase schema design by Ian Varley from HBaseCon 2012; both the slides and the video are available online.