Problem Description
I have a solution that can be parallelized, but I don't (yet) have experience with Hadoop/NoSQL, and I'm not sure which solution best fits my needs. In theory, if I had unlimited CPUs, my results should return instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000 datasets
- dataset keys:
  - all datasets have the same keys
  - 1 million keys (possibly 10 or 20 million later)
- each dataset has the same columns
  - 10 to 20 columns
  - most columns are numerical values that we need to aggregate (avg, stddev, and statistics computed with R)
  - a few columns are "type_id" columns, since in a particular query we may want to include only certain type_ids
- users can select which datasets they're interested in (anywhere from 15 to 1000)
- the application needs to present: the key, and the aggregated results (avg, stddev) for each column
- entire datasets may be added, removed, or replaced/updated
- it would be cool to be able to add columns, but if necessary the whole dataset can be replaced instead
- rows/keys are never added to a dataset, so a system with lots of fast writes isn't needed
- currently I have two machines, each with 24 cores
- eventually, I'd like to be able to run this on Amazon as well
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a Postgres database, where each dataset is in its own partition.
- partitions are nice, since it's easy to add/remove/replace partitions
- the database is nice for filtering based on type_id
- databases aren't easy to write parallel queries for
- databases are good for structured data, and my data isn't structured
As a proof of concept I tried out Hadoop:
- created a tab-separated file per dataset for a particular type_id
- uploaded the files to HDFS
- map: retrieved a value/column for each key
- reduce: computed the average and standard deviation (a rough sketch of the job follows this list)
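For concreteness, here is a minimal sketch of what that proof of concept might look like as a Hadoop Streaming job in Python. The input layout (tab-separated key, type_id, value) and the type_id filter value are assumptions for illustration, not the author's actual files:

```python
#!/usr/bin/env python
# A mapper and reducer for Hadoop Streaming, kept in one file for brevity;
# in practice they would be passed to the streaming jar as separate scripts.
import sys
import math

def run_mapper(wanted_type_id="3"):          # "3" is a made-up type_id filter
    for line in sys.stdin:
        key, type_id, value = line.rstrip("\n").split("\t")
        if type_id == wanted_type_id:        # keep only the requested type_id
            print("%s\t%s" % (key, value))

def run_reducer():
    # Streaming delivers mapper output sorted by key, so all values for
    # one key arrive contiguously and can be aggregated in a single pass.
    cur, n, s, s2 = None, 0, 0.0, 0.0
    def flush():
        if n:
            mean = s / n
            std = math.sqrt(max(s2 / n - mean * mean, 0.0))
            print("%s\t%f\t%f" % (cur, mean, std))
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != cur:
            flush()                          # emit stats for the previous key
            cur, n, s, s2 = key, 0, 0.0, 0.0
        v = float(value)
        n, s, s2 = n + 1, s + v, s2 + v * v
    flush()                                  # emit stats for the last key

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()
```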
From my crude proof of concept, I can see that this will scale nicely, but I can also see that Hadoop/HDFS has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestions on how I should approach this? I was thinking of trying HBase next to get a feel for it. Should I instead look at Hive? Cassandra? Voldemort?
Thanks!
Recommended Answer
Hive or Pig don't seem like they would help you. Essentially, each of them compiles down to one or more map/reduce jobs, so the response can't come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't precompute summary statistics for each column. You should look up how to compute running averages so that you don't have to do heavyweight reduces.
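To illustrate the running-statistics idea, here is a minimal sketch (my illustration, not part of the answer) that keeps a per-column accumulator of (count, sum, sum of squares), which can be updated incrementally and matches the identity the answer cites next:

```python
import math

class ColumnStats:
    """Running (count, sum, sum-of-squares) accumulator for one column.
    Values can be added one at a time, so no heavyweight reduce is needed."""
    def __init__(self):
        self.n = 0
        self.s = 0.0   # sum(x)
        self.s2 = 0.0  # sum(x^2)

    def add(self, x):
        self.n += 1
        self.s += x
        self.s2 += x * x

    def mean(self):
        return self.s / self.n

    def stddev(self):
        # population stddev: sqrt(E[X^2] - (E[X])^2)
        return math.sqrt(max(self.s2 / self.n - self.mean() ** 2, 0.0))

cs = ColumnStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    cs.add(x)
print(cs.mean(), cs.stddev())  # 5.0 2.0
```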
See http://en.wikipedia.org/wiki/Standard_deviation:
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (datasets A and B combined) by doing
sqrt(E[AB^2] - (E[AB])^2). E[AB^2] is (sum(A^2) + sum(B^2)) / (|A| + |B|).
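A small worked illustration (mine, under the same assumptions) of this merging step: given each dataset's precomputed (count, sum, sum of squares), the combined mean and stddev fall out of the identity above without rescanning the raw rows:

```python
import math

def merge_stats(parts):
    """parts: list of (count, sum_x, sum_x2) tuples, one per dataset."""
    n = sum(p[0] for p in parts)
    sum_x = sum(p[1] for p in parts)
    sum_x2 = sum(p[2] for p in parts)
    mean = sum_x / n                                      # E[AB]
    std = math.sqrt(max(sum_x2 / n - mean * mean, 0.0))   # sqrt(E[AB^2] - (E[AB])^2)
    return mean, std

# A = [1, 2, 3] -> (3, 6, 14); B = [4, 5] -> (2, 9, 41)
print(merge_stats([(3, 6.0, 14.0), (2, 9.0, 41.0)]))  # 3.0, 1.4142... (sqrt(2))
```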