This article covers Dask: periodically updating a published dataset and fetching the data from other clients, an approach that may be a useful reference when you face a similar problem.

Problem description

I would like to append data from a queue (like redis) onto a published dask dataset. Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations. A rough sketch of what I have in mind follows the list of questions below.

  1. Would that be possible?
  2. Which append interface should be used? Should I load the data into a pd.DataFrame first, or is it better to use some text importer?
  3. What are the expected append speeds? Is it possible to append, let's say, 1k/10k rows per second?
  4. Are there other good suggestions for exchanging huge and rapidly updating datasets within a dask cluster?
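
Here is the rough sketch of the producer side I have in mind (the scheduler address, the redis list key "rows", and the dataset name "latest" are placeholders I made up): drain a redis list, batch the rows into a pandas DataFrame, and (re)publish it as a named dask dataset. Whether this publish/unpublish-per-tick pattern is a sensible approach is part of what I am asking.

```python
# Sketch only: drain a redis list, batch rows into a pandas DataFrame,
# and (re)publish it as a named dask dataset on the scheduler.
# The redis key "rows", the dataset name "latest" and the scheduler
# address are placeholders.
import json
import time

import pandas as pd
import redis
from dask import dataframe as dd
from dask.distributed import Client

r = redis.Redis()                        # hypothetical local redis
client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

while True:
    # Drain whatever is currently queued under the "rows" key.
    raw = []
    while True:
        item = r.lpop("rows")
        if item is None:
            break
        raw.append(json.loads(item))

    if raw:
        df = pd.DataFrame(raw)
        ddf = dd.from_pandas(df, npartitions=1)
        # Replace the previously published version, if any.
        if "latest" in client.list_datasets():
            client.unpublish_dataset("latest")
        client.publish_dataset(latest=ddf)

    time.sleep(1)  # e.g. once per second
```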

Thanks for any hints and suggestions.

Recommended answer

You have a couple of options here.

  • You could take a look at the streamz project.
  • You could take a look at Dask's coordination primitives (a minimal sketch using one of them follows below).
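
A minimal sketch of the second option, assuming both processes connect to the same scheduler: dask.distributed.Queue is a named queue coordinated through the scheduler, and a common pattern for anything larger than a few bytes is to scatter the data and pass the resulting future through the queue. The scheduler address and the queue name "live-rows" are placeholders.

```python
# Sketch of a producer and a consumer sharing a named distributed Queue.
# The queue name "live-rows" and the scheduler address are placeholders.
import pandas as pd
from dask.distributed import Client, Queue

# --- producer process ---
client = Client("tcp://scheduler:8786")   # hypothetical scheduler address
q = Queue("live-rows")                    # named queue, shared via the scheduler
batch = pd.DataFrame({"x": range(1000)})  # e.g. a 1k-row batch pulled from redis
q.put(client.scatter(batch))              # put a future, not the raw data

# --- consumer process (separate script, same scheduler) ---
client = Client("tcp://scheduler:8786")
q = Queue("live-rows")
future = q.get()          # blocks until a batch is available
latest = future.result()  # pandas DataFrame with the new rows
print(len(latest))
```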

Dask is just tracking remote data. The speed of your application has a lot more to do with how you choose to represent that data (like Python lists vs. pandas DataFrames) than with Dask. Dask can handle thousands of tasks a second. Each of those tasks could have a single row, or millions of rows. It's up to how you build it.
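
As a concrete illustration of the point about representation: a consumer that polls a published dataset about once per second and works on whole DataFrame batches keeps the task count low, whether each batch holds 1k or 10k rows. This sketch reuses the hypothetical dataset name "latest" from the question's sketch; the scheduler address is again a placeholder.

```python
# Sketch of a consumer polling the hypothetical published dataset "latest"
# roughly once per second and processing whole DataFrame batches.
import time

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

while True:
    if "latest" in client.list_datasets():
        ddf = client.get_dataset("latest")  # handle to the published dask DataFrame
        df = ddf.compute()                  # materialize the current batch locally
        print(len(df), "rows in the latest published batch")
    time.sleep(1)  # poll roughly once per second
```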

This concludes the article on Dask: periodically updating a published dataset and fetching the data from other clients. We hope the recommended answer above is helpful.
