This article covers the question "Real-time querying/aggregating millions of records - hadoop? hbase? cassandra?" and its answer. It should be a useful reference for anyone facing a similar problem; interested readers can follow along below.

Problem description

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution best fits my needs. In theory, if I had unlimited CPUs, my results should come back instantaneously. So, any help would be appreciated. Thanks!

Here's what I have:

• 1000s of datasets
• dataset keys:
  • all datasets have the same keys
  • 1 million keys (this may later be 10 or 20 million)
• dataset columns:
  • each dataset has the same columns
  • 10 to 20 columns
  • most columns are numerical values that we need to aggregate (avg, stddev, and use R to calculate statistics)
  • a few columns are "type_id" columns, since in a particular query we may want to include only certain type_ids
• web application
  • user can choose which datasets they are interested in (anywhere from 15 to 1000)
  • application needs to present: key, and aggregated results (avg, stddev) of each column
• updates of data:
  • an entire dataset can be added, dropped, or replaced/updated
  • would be cool to be able to add columns, but if required we can just replace the entire dataset
  • rows/keys are never added to a dataset, so we don't need a system with lots of fast writes
• infrastructure:
  • currently two machines with 24 cores each
  • eventually, want the ability to also run this on amazon

I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.

• partitions are nice, since we can easily add/drop/replace partitions
• the database is nice for filtering based on type_id
• databases aren't easy for writing parallel queries
• databases are good for structured data, and my data is not structured

As a proof of concept I tried out hadoop:

• created a tab-separated file per dataset for a particular type_id
• uploaded it to hdfs
• map: retrieved a value/column for each key
• reduce: computed the average and standard deviation

From my crude proof-of-concept I can see this will scale nicely, but I can also see that hadoop/hdfs has latency; I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
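
For illustration, here is a minimal sketch of the kind of streaming mapper/reducer the proof of concept describes, written in Python for Hadoop Streaming. The file names, the key-in-column-0 layout, and the VALUE_COLUMN index are assumptions made for the sketch, not details from the original post:

    #!/usr/bin/env python
    # mapper.py - emits "key<TAB>value" for one numeric column of a
    # tab-separated dataset file.
    import sys

    VALUE_COLUMN = 3  # hypothetical index of the numeric column to aggregate

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > VALUE_COLUMN:
            print(f"{fields[0]}\t{fields[VALUE_COLUMN]}")

    #!/usr/bin/env python
    # reducer.py - reads the key-sorted "key<TAB>value" stream and prints
    # "key<TAB>avg<TAB>stddev" for each key.
    import math
    import sys
    from itertools import groupby

    def parse(stream):
        for line in stream:
            key, value = line.rstrip("\n").split("\t")
            yield key, float(value)

    for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        print(f"{key}\t{mean}\t{math.sqrt(var)}")

With Hadoop Streaming these two scripts would be wired in as the job's -mapper and -reducer; the sketch mainly shows that the per-key math is trivial, so the cost is dominated by job startup and shuffle latency.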

Any suggestions on how I should approach this? I was thinking of trying HBase next to get a feel for it. Should I instead look at Hive? Cassandra? Voldemort?

Thanks!

Solution

Hive and Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.

HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.

Check out http://en.wikipedia.org/wiki/Standard_deviation

stddev(X) = sqrt(E[X^2] - (E[X])^2)

This implies that you can get the stddev of AB (datasets A and B combined) by doing

sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2)) / (|A| + |B|)
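
To make that concrete, here is a minimal sketch in Python of the pre-computation the answer suggests: keep count, sum, and sum of squares per dataset, and merge them at query time to get avg and stddev for whatever combination of datasets the user picks, without rescanning raw rows. The Summary/summarize/merge names are made up for the example:

    import math
    from dataclasses import dataclass

    @dataclass
    class Summary:
        """Pre-computed sufficient statistics for one dataset (one column)."""
        n: int           # number of values, |A|
        total: float     # sum(A)
        total_sq: float  # sum(A^2)

    def summarize(values):
        """Computed once, whenever a dataset is added or replaced."""
        return Summary(len(values), sum(values), sum(v * v for v in values))

    def merge(summaries):
        """Combine several datasets' summaries at query time."""
        n = sum(s.n for s in summaries)
        mean = sum(s.total for s in summaries) / n
        mean_sq = sum(s.total_sq for s in summaries) / n
        # stddev(X) = sqrt(E[X^2] - (E[X])^2), as above
        return mean, math.sqrt(mean_sq - mean * mean)

    # Example: stats over two user-selected datasets
    a = summarize([1.0, 2.0, 3.0])
    b = summarize([4.0, 5.0])
    print(merge([a, b]))  # (3.0, 1.414...), same as aggregating A and B together

In the setup described in the question, such partial sums could be stored at whatever granularity lets the type_id filter still apply (for example per dataset, key, column, and type_id), so a query over 15 to 1000 datasets only has to merge small pre-computed records.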

That concludes this article on "Real-time querying/aggregating millions of records - hadoop? hbase? cassandra?". We hope the answer above is helpful, and thank you for your continued support!
