I'm trying to find the right tool for the job. I've explored a few different message queues, such as Kafka and Kestrel, and I'm looking for something with pull functionality.
I have a distributed API that pushes incoming messages into the queue. Workers on separate machines would then pull messages from the queue. This ensures that the workers aren't flooded with more load than they can handle.
Kafka does work on a push-pull basis (producers push, consumers pull) and is capable of handling large-scale real-time streams. Also, as mentioned in its documentation, Kafka's performance is effectively constant with respect to data size, so retaining lots of data will not be a problem.
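To make the push-pull idea concrete, here is a minimal sketch in plain Python. It does not use any Kafka client; `ToyLog`, `append`, `fetch` and `drain` are hypothetical names for an in-memory stand-in for a single partition. It only illustrates that the worker decides how much to pull and when, and that reading does not remove messages from the log.

```python
from typing import List, Tuple


class ToyLog:
    """Hypothetical in-memory stand-in for one partition (not a Kafka API)."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def append(self, message: str) -> int:
        """Producer side: push a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def fetch(self, offset: int, max_messages: int) -> List[Tuple[int, str]]:
        """Consumer side: pull at most max_messages starting at offset."""
        batch = self._messages[offset:offset + max_messages]
        return [(offset + i, m) for i, m in enumerate(batch)]


def drain(log: ToyLog, batch_size: int) -> List[List[str]]:
    """Worker loop: pull fixed-size batches until nothing is left to read."""
    offset = 0
    batches: List[List[str]] = []
    while True:
        batch = log.fetch(offset, batch_size)
        if not batch:
            break
        batches.append([message for _, message in batch])
        # Continue from the offset after the last message we received.
        offset = batch[-1][0] + 1
    return batches


log = ToyLog()
for i in range(5):
    log.append(f"event-{i}")

# The worker chooses its own batch size, so it cannot be flooded.
batches = drain(log, batch_size=2)
print(batches)
```

Because the worker asks for at most `batch_size` messages at a time, a burst on the producer side only makes the log longer; it does not push more work onto the worker than it requested.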
For processing the stream, check out Storm. It's a free, fault-tolerant, distributed real-time computation system that is very easy to scale. It does exactly what you've mentioned (running workers on separate machines), and it also supports transactional topologies. On top of that, it integrates very nicely with Apache Kafka.
For more on Storm, check here.
So typically, you can retrieve messages from the Kafka queue using its consumer API and then feed them to a Storm cluster, which does the rest in a distributed manner. Kafka 0.8 provides two types of consumer API:
High Level or consumer group
Low level or Simple consumer API
The former provides a high-level abstraction for consuming data and takes care of a lot of things, such as threading and error handling, while the latter allows much greater control over message handling, such as reading a message multiple times or handling message transactions.
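The difference in control can be sketched roughly as follows. This is a toy model in plain Python, not the Kafka 0.8 client API: `ToyGroupConsumer`, `ToySimpleConsumer` and `MESSAGES` are hypothetical names, and the only point is who keeps track of the offset.

```python
from typing import List, Optional

# Hypothetical contents of one partition.
MESSAGES: List[str] = ["a", "b", "c"]


class ToyGroupConsumer:
    """High-level style: the consumer tracks the offset for the caller."""

    def __init__(self, messages: List[str]) -> None:
        self._messages = messages
        self._offset = 0

    def next(self) -> Optional[str]:
        """Return the next unread message, or None when caught up."""
        if self._offset >= len(self._messages):
            return None
        message = self._messages[self._offset]
        self._offset += 1  # advanced automatically on every read
        return message


class ToySimpleConsumer:
    """Low-level style: the caller supplies the offset on every fetch."""

    def __init__(self, messages: List[str]) -> None:
        self._messages = messages

    def fetch(self, offset: int) -> Optional[str]:
        """Return the message at offset, or None if it does not exist."""
        if 0 <= offset < len(self._messages):
            return self._messages[offset]
        return None


group = ToyGroupConsumer(MESSAGES)
first, second = group.next(), group.next()

simple = ToySimpleConsumer(MESSAGES)
# Re-reading is just fetching the same offset again.
reread = [simple.fetch(1), simple.fetch(1)]
print(first, second, reread)
```

With the high-level style you simply keep asking for the next message; with the low-level style you have to store and manage offsets yourself, which is what makes replaying a message possible.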