Problem description
$ b p sqrt(E [AB ^ 2] - (E [AB])^ 2)。 E [AB ^ 2]是(sum(A ^ 2)+ sum(B ^ 2))/(| A | + | B |)I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000s of datasets
- dataset keys:
- all datasets have the same keys
- 1 million keys (this may later be 10 or 20 million)
- dataset columns:
- each dataset has the same columns
- 10 to 20 columns
- most columns are numerical values that we need to aggregate on (avg, stddev, and statistics computed in R)
- a few columns are "type_id" columns, since in a particular query we may want to only include certain type_ids
- web application
- user can choose which datasets they are interested in (anywhere from 15 to 1000)
- application needs to present: key, and aggregated results (avg, stddev) of each column
- updates of data:
- an entire dataset can be added, dropped, or replaced/updated
- would be cool to be able to add columns. But, if required, can just replace the entire dataset.
- never add rows/keys to a dataset - so don't need a system with lots of fast writes
- infrastructure:
- currently two machines with 24 cores each
- eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since I can easily add/drop/replace partitions
- database is nice for filtering based on type_id
- databases aren't easy to run parallel queries against (see the sketch after this list)
- databases are good for structured data, and my data is not structured
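On the parallel-query point, one workaround is to fan the per-dataset aggregates out from the application side rather than inside the database. A minimal sketch, assuming one table per dataset named dataset_<id>, a numeric column value1, and psycopg2; these names and the DSN are placeholders, not from the original post:

```python
# Sketch: run one aggregate query per dataset partition in parallel from the
# client, since the database won't parallelize across partitions for us.
# Table names (dataset_<id>), the column name, and the DSN are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import psycopg2

DSN = "dbname=stats"   # placeholder connection string
DATASETS = [1, 2, 3]   # ids of the datasets the user selected

def aggregate(dataset_id):
    conn = psycopg2.connect(DSN)  # one connection per worker
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT key, avg(value1), stddev(value1) "
            "FROM dataset_%d GROUP BY key" % dataset_id
        )
        return cur.fetchall()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=24) as pool:  # 24 cores per machine
    per_dataset = list(pool.map(aggregate, DATASETS))
```

Each worker scans its own partition independently, which lines up with the fact that every dataset (and key) is independent.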
As a proof of concept I tried out hadoop (a runnable sketch follows these steps):
- created a tab separated file per dataset for a particular type_id
- uploaded to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
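A local, runnable sketch of that map/reduce (the real job would run under Hadoop Streaming; the tab-separated layout with the key in field 0 and the numeric column in field 1 is an assumption):

```python
# Local sketch of the streaming map/reduce described above.
# Layout assumption (placeholder): tab-separated rows, key in field 0,
# the numeric column of interest in field 1.
import math
from itertools import groupby

def mapper(lines):
    """map: retrieve a (key, value) pair per row for one column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[0], float(fields[1])

def reducer(pairs):
    """reduce: average and population stddev per key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        n = len(values)
        mean = sum(values) / n
        var = sum(v * v for v in values) / n - mean * mean
        yield key, mean, math.sqrt(var)

rows = ["k1\t1.0", "k1\t2.0", "k2\t5.0"]
for key, mean, sd in reducer(mapper(rows)):
    print(key, mean, sd)
```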
From my crude proof of concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency: I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
Thanks!
Solution
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
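For example (a minimal sketch, not from the original answer), Welford's single-pass algorithm keeps a running mean/variance per column, so each dataset's summary can be maintained incrementally whenever the dataset is loaded or replaced:

```python
# Welford's single-pass algorithm: running mean/variance without storing rows.
# Sketch only; class and field names are illustrative.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # population stddev; divide by (n - 1) instead for the sample version
        return (self.m2 / self.n) ** 0.5 if self.n else 0.0

rs = RunningStats()
for x in (1.0, 2.0, 3.0):
    rs.add(x)
print(rs.mean, rs.stddev())  # 2.0 and ~0.816
```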
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of the combined dataset AB by computing
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and, similarly, E[AB] = (sum(A) + sum(B)) / (|A| + |B|).
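In code, that means each dataset only needs (count, sum, sum of squares) stored per column; any user-selected subset of datasets can then be combined without touching the raw rows. A sketch (function names are illustrative):

```python
import math

def summarize(values):
    """Precompute a dataset's sufficient statistics: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(x * x for x in values)

def combined_stats(stats_list):
    """Merge per-dataset statistics into (mean, stddev) of the union."""
    n = sum(s[0] for s in stats_list)
    total = sum(s[1] for s in stats_list)
    total_sq = sum(s[2] for s in stats_list)
    mean = total / n        # E[AB]
    ex2 = total_sq / n      # E[AB^2]
    return mean, math.sqrt(ex2 - mean * mean)

# stddev of A and B combined, matching the formula above:
a = summarize([1.0, 2.0, 3.0])
b = summarize([4.0, 5.0])
print(combined_stats([a, b]))  # mean 3.0, stddev sqrt(2)
```

With these per-dataset summaries precomputed, answering a query over 15 to 1000 selected datasets is a cheap merge rather than a scan of millions of rows.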