Problem description
$ b p sqrt(E [AB ^ 2] - (E [AB])^ 2)。 E [AB ^ 2]是(sum(A ^ 2)+ sum(B ^ 2))/(| A | + | B |)I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000s of datasets
- dataset keys:
- all datasets have the same keys
- 1 million keys (this may later be 10 or 20 million)
- dataset columns:
- each dataset has the same columns
- 10 to 20 columns
- most columns are numerical values that we need to aggregate on (avg, stddev, and statistics computed in R)
- a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
- web application
- user can choose which datasets they are interested in (anywhere from 15 to 1000)
- application needs to present: key, and aggregated results (avg, stddev) of each column
- updates of data:
- an entire dataset can be added, dropped, or replaced/updated
- would be cool to be able to add columns. But, if required, can just replace the entire dataset.
- never add rows/keys to a dataset - so don't need a system with lots of fast writes
- infrastructure:
- currently two machines with 24 cores each
- eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since I can easily add/drop/replace partitions
- database is nice for filtering based on type_id
- databases aren't easy to run parallel queries against (see the sketch after this list)
- databases are good for structured data, and my data is not structured
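On the parallel-query point, one workaround is to fan the per-dataset aggregates out from the application side rather than inside the database. A minimal sketch, assuming one table per dataset named dataset_<id>, a numeric column value1, and psycopg2; these names and the DSN are placeholders, not from the original post:

```python
# Sketch: run one aggregate query per dataset partition in parallel from the
# client, since the database won't parallelize across partitions for us.
# Table names (dataset_<id>), the column name, and the DSN are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import psycopg2

DSN = "dbname=stats"   # placeholder connection string
DATASETS = [1, 2, 3]   # ids of the datasets the user selected

def aggregate(dataset_id):
    conn = psycopg2.connect(DSN)  # one connection per worker
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT key, avg(value1), stddev(value1) "
            "FROM dataset_%d GROUP BY key" % dataset_id
        )
        return cur.fetchall()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=24) as pool:  # 24 cores per machine
    per_dataset = list(pool.map(aggregate, DATASETS))
```

Each worker scans its own partition independently, which lines up with the fact that every dataset (and key) is independent.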
As a proof of concept I tried out hadoop (a runnable sketch follows these steps):
- created a tab separated file per dataset for a particular type_id
- uploaded to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
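A local, runnable sketch of that map/reduce (the real job would run under Hadoop Streaming; the tab-separated layout with the key in field 0 and the numeric column in field 1 is an assumption):

```python
# Local sketch of the streaming map/reduce described above.
# Layout assumption (placeholder): tab-separated rows, key in field 0,
# the numeric column of interest in field 1.
import math
from itertools import groupby

def mapper(lines):
    """map: retrieve a (key, value) pair per row for one column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[0], float(fields[1])

def reducer(pairs):
    """reduce: average and population stddev per key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        n = len(values)
        mean = sum(values) / n
        var = sum(v * v for v in values) / n - mean * mean
        yield key, mean, math.sqrt(var)

rows = ["k1\t1.0", "k1\t2.0", "k2\t5.0"]
for key, mean, sd in reducer(mapper(rows)):
    print(key, mean, sd)
```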
From my crude proof of concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency: I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
Thanks!
Solution
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
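For example (a minimal sketch, not from the original answer), Welford's single-pass algorithm keeps a running mean/variance per column, so each dataset's summary can be maintained incrementally whenever the dataset is loaded or replaced:

```python
# Welford's single-pass algorithm: running mean/variance without storing rows.
# Sketch only; class and field names are illustrative.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # population stddev; divide by (n - 1) instead for the sample version
        return (self.m2 / self.n) ** 0.5 if self.n else 0.0

rs = RunningStats()
for x in (1.0, 2.0, 3.0):
    rs.add(x)
print(rs.mean, rs.stddev())  # 2.0 and ~0.816
```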
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the combined dataset AB by computing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, similarly, E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
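In code, that means each dataset only needs (count, sum, sum of squares) stored per column; any user-selected subset of datasets can then be combined without touching the raw rows. A sketch (function names are illustrative):

```python
import math

def summarize(values):
    """Precompute a dataset's sufficient statistics: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(x * x for x in values)

def combined_stats(stats_list):
    """Merge per-dataset statistics into (mean, stddev) of the union."""
    n = sum(s[0] for s in stats_list)
    total = sum(s[1] for s in stats_list)
    total_sq = sum(s[2] for s in stats_list)
    mean = total / n        # E[AB]
    ex2 = total_sq / n      # E[AB^2]
    return mean, math.sqrt(ex2 - mean * mean)

# stddev of A and B combined, matching the formula above:
a = summarize([1.0, 2.0, 3.0])
b = summarize([4.0, 5.0])
print(combined_stats([a, b]))  # mean 3.0, stddev sqrt(2)
```

With these per-dataset summaries precomputed, answering a query over 15 to 1000 selected datasets is a cheap merge rather than a scan of millions of rows.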