Problem description
As part of my research I'm writing a high-load TCP/IP echo server in Java. I want to serve about 3-4k clients and see the maximum possible messages per second (MPS) that I can squeeze out of it. Message size is quite small: up to 100 bytes. This work doesn't have any practical purpose; it is purely research.
According to numerous presentations that I've seen (HornetQ benchmarks, LMAX Disruptor talks, etc.), real-world high-load systems tend to serve millions of transactions per second (I believe Disruptor mentioned about 6 million and HornetQ about 8.5 million). For example, this post states that it is possible to achieve up to 40M MPS. So I took that as a rough estimate of what modern hardware should be capable of.
I wrote the simplest possible single-threaded NIO server and launched a load test. I was a little surprised that I could get only about 100k MPS on localhost and 25k with actual networking. The numbers look quite small. I was testing on Win7 x64 with a Core i7. Looking at the CPU load, only one core is busy (which is expected for a single-threaded app), while the rest sit idle. However, even if I load all 8 cores (including virtual ones) I will have no more than 800k MPS, not even close to 40 million :)
My question is: what is a typical pattern for serving massive amounts of messages to clients? Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores? Or should I look towards using multiple Selectors in my NIO code? Or maybe even distribute the load between multiple JVMs and use Chronicle to build inter-process communication between them? Will testing on a proper server-side OS like CentOS make a big difference (maybe it is Windows that slows things down)?
Below is the sample code of my server. It always answers with "ok" to any incoming data. I know that in the real world I'd need to track the size of the message and be prepared for one message to be split across multiple reads; however, I'd like to keep things super simple for now.
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.SelectorProvider;
import java.util.Iterator;

public class EchoServer {
    private static final int BUFFER_SIZE = 1024;
    private final static int DEFAULT_PORT = 9090;

    // The buffer into which we'll read data when it's available
    private ByteBuffer readBuffer = ByteBuffer.allocate(BUFFER_SIZE);

    // null means "bind to the wildcard address"
    private InetAddress hostAddress = null;
    private int port;
    private Selector selector;

    // Start of the current measurement window (initialized so the first report is meaningful)
    private long loopTime = System.currentTimeMillis();
    private long numMessages = 0;

    public EchoServer() throws IOException {
        this(DEFAULT_PORT);
    }

    public EchoServer(int port) throws IOException {
        this.port = port;
        selector = initSelector();
        loop(); // never returns: the event loop runs on the constructing thread
    }

    private void loop() {
        while (true) {
            try {
                selector.select();
                Iterator<SelectionKey> selectedKeys = selector.selectedKeys().iterator();
                while (selectedKeys.hasNext()) {
                    SelectionKey key = selectedKeys.next();
                    selectedKeys.remove();

                    if (!key.isValid()) {
                        continue;
                    }

                    // Check what event is available and deal with it
                    if (key.isAcceptable()) {
                        accept(key);
                    } else if (key.isReadable()) {
                        read(key);
                    } else if (key.isWritable()) {
                        write(key);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.exit(1);
            }
        }
    }

    private void accept(SelectionKey key) throws IOException {
        ServerSocketChannel serverSocketChannel = (ServerSocketChannel) key.channel();

        SocketChannel socketChannel = serverSocketChannel.accept();
        socketChannel.configureBlocking(false);
        socketChannel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        socketChannel.setOption(StandardSocketOptions.TCP_NODELAY, true);
        socketChannel.register(selector, SelectionKey.OP_READ);

        System.out.println("Client is connected");
    }

    private void read(SelectionKey key) throws IOException {
        SocketChannel socketChannel = (SocketChannel) key.channel();

        // Clear out our read buffer so it's ready for new data
        readBuffer.clear();

        // Attempt to read off the channel
        int numRead;
        try {
            numRead = socketChannel.read(readBuffer);
        } catch (IOException e) {
            key.cancel();
            socketChannel.close();

            System.out.println("Forceful shutdown");
            return;
        }

        if (numRead == -1) {
            System.out.println("Graceful shutdown");
            key.channel().close();
            key.cancel();

            return;
        }

        // Switch this connection's interest to OP_WRITE so the reply is sent next
        socketChannel.register(selector, SelectionKey.OP_WRITE);

        // Print the time taken for every 100,000 messages
        numMessages++;
        if (numMessages % 100000 == 0) {
            long elapsed = System.currentTimeMillis() - loopTime;
            loopTime = System.currentTimeMillis();
            System.out.println(elapsed);
        }
    }

    private void write(SelectionKey key) throws IOException {
        SocketChannel socketChannel = (SocketChannel) key.channel();
        ByteBuffer dummyResponse = ByteBuffer.wrap("ok".getBytes("UTF-8"));

        socketChannel.write(dummyResponse);
        if (dummyResponse.remaining() > 0) {
            // The socket send buffer was full; the rest of the reply is dropped here
            System.err.print("Filled UP");
        }

        key.interestOps(SelectionKey.OP_READ);
    }

    private Selector initSelector() throws IOException {
        Selector socketSelector = SelectorProvider.provider().openSelector();

        ServerSocketChannel serverChannel = ServerSocketChannel.open();
        serverChannel.configureBlocking(false);

        InetSocketAddress isa = new InetSocketAddress(hostAddress, port);
        serverChannel.socket().bind(isa);
        serverChannel.register(socketSelector, SelectionKey.OP_ACCEPT);
        return socketSelector;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Starting echo server");
        new EchoServer();
    }
}
Recommended answer
what is a typical pattern for serving massive amounts of messages to clients?
There are many possible patterns. An easy way to utilize all cores without going through multiple JVMs is:
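- Have a single thread accept connections and read using a selector.
- Once you have enough bytes to constitute a single message, pass it on to another core using a construct like a ring buffer. The Disruptor Java framework is a good match for this. This is a good pattern if the processing needed to know what a complete message is remains lightweight. For example, if you have a length-prefixed protocol you could wait until you get the expected number of bytes and then send it to another thread. If the parsing of the protocol is very heavy then you might overwhelm this single thread, preventing it from accepting connections or reading bytes off the network.
- On your worker thread(s), which receive data from the ring buffer, do the actual processing.
- Write out the responses either on your worker threads or through some other aggregator thread.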
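The four steps above can be sketched roughly as follows. This is only a minimal illustration, not the Disruptor API: a bounded `ArrayBlockingQueue` stands in for the ring buffer, and the names `HandoffPipeline`, `publish` and `shutdownAndCount` are made up for this example. The reader (selector) thread would call `publish` once it has assembled a complete message.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a bounded queue stands in for a Disruptor-style ring buffer.
final class HandoffPipeline {
    // Shutdown marker, recognized by identity rather than by content
    private static final byte[] POISON = new byte[0];

    private final BlockingQueue<byte[]> ring;
    private final AtomicLong processed = new AtomicLong();
    private final Thread worker;

    HandoffPipeline(int capacity) {
        ring = new ArrayBlockingQueue<>(capacity);
        worker = new Thread(this::drain, "worker");
        worker.start();
    }

    // Called by the reader thread once a complete message has been read.
    // Blocks while the queue is full, which pushes back on the reader.
    void publish(byte[] message) throws InterruptedException {
        ring.put(message);
    }

    // Worker loop: take messages off the queue and process them one by one.
    private void drain() {
        try {
            while (true) {
                byte[] message = ring.take();
                if (message == POISON) {
                    return;
                }
                handle(message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder for the real per-message work (parsing, business logic, reply).
    private void handle(byte[] message) {
        processed.incrementAndGet();
    }

    // Stops the worker after it has drained everything queued so far.
    long shutdownAndCount() throws InterruptedException {
        ring.put(POISON);
        worker.join();
        return processed.get();
    }
}
```

A lock-based queue like this is likely to be slower than a real ring buffer under heavy contention, so treat it as a way to get the structure right before measuring.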
That's the gist of it. There are many more possibilities here, and the answer really depends on the type of application you are writing. A few examples are:
- A CPU-heavy stateless application, say an image processing application. The amount of CPU/GPU work done per request will probably be significantly higher than the overhead generated by even a very naive inter-thread communication solution. In this case an easy solution is a bunch of worker threads pulling work from a single queue. Notice that this is a single queue instead of one queue per worker. The advantage is that this is inherently load balanced: each worker finishes its work and then just polls the single-producer multiple-consumer queue. Even though this is a source of contention, the image-processing work (seconds?) should be far more expensive than any synchronization alternative.
- A pure IO application, e.g. a stats server which just increments some counters for a request. Here you do almost no CPU-heavy work. Most of the work is just reading bytes and writing bytes. A multi-threaded application might not give you a significant benefit here. In fact it might even slow things down if the time it takes to queue items is more than the time it takes to process them. A single-threaded Java server should be able to saturate a 1G link easily.
- Stateful applications which require moderate amounts of processing, e.g. a typical business application. Here every client has some state that determines how each request is handled. Assuming we go multi-threaded since the processing is non-trivial, we could affinitize clients to certain threads. This is a variant of the actor architecture:
i) When a client first connects, hash it to a worker. You might want to do this with some client id, so that if it disconnects and reconnects it is still assigned to the same worker/actor.
ii) When the reader thread reads a complete request, put it on the ring buffer for the right worker/actor. Since the same worker always processes a particular client, all the state should be thread local, making all the processing logic simple and single-threaded.
iii) The worker thread can write the responses out. Always attempt to just do a write() first. Only if all your data could not be written out do you register for OP_WRITE. The worker thread only needs to make select calls if there is actually something outstanding. Most writes should just succeed, making this unnecessary. The trick here is balancing between select calls and polling the ring buffer for more requests. You could also employ a single writer thread whose only responsibility is to write responses out. Each worker thread can put its responses on a ring buffer connecting it to this single writer thread. The single writer thread round-robin polls each incoming ring buffer and writes out the data to clients. Again, the caveat about trying a write before select applies, as does the trick about balancing between multiple ring buffers and select calls.
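Two pieces of steps i) to iii) are small enough to sketch: picking a worker from a stable client id, and the "write first, register OP_WRITE only if needed" rule. The class and method names here (`WorkerAffinity`, `workerFor`, `writeOrDefer`) are hypothetical, and stashing the unfinished buffer as the key's attachment is just one assumed way of remembering what is still outstanding.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.WritableByteChannel;

// Hypothetical helpers illustrating client affinity and write-before-select.
final class WorkerAffinity {
    private WorkerAffinity() {
    }

    // Maps a stable client id to a worker index, so a reconnecting client
    // lands on the same worker. floorMod keeps the result non-negative
    // even when hashCode() is negative.
    static int workerFor(String clientId, int workerCount) {
        return Math.floorMod(clientId.hashCode(), workerCount);
    }

    // Attempts the write immediately. Only when the socket could not take
    // all the bytes is OP_WRITE added to the key's interest set, with the
    // remaining data kept as the key's attachment (an assumption of this sketch).
    // Returns true if the whole response was written.
    static boolean writeOrDefer(WritableByteChannel channel, SelectionKey key, ByteBuffer response)
            throws IOException {
        channel.write(response);
        if (!response.hasRemaining()) {
            return true;
        }
        key.attach(response);
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
        return false;
    }
}
```

When the key later becomes writable, the owner would write the attached buffer again and clear OP_WRITE once nothing remains. Compared with the server in the question, this avoids a full trip through select() for every reply.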
As you point out, there are many other options:
Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores?
You can do this, but IMHO this is not the best use for a load balancer. It does buy you independent JVMs that can fail on their own, but it will probably be slower than writing a single multi-threaded JVM app. The application itself might be easier to write, though, since it will be single-threaded.
Or I should look towards using multiple Selectors in my NIO code?
You can do this too. Look at the Nginx architecture for some hints on how to do this.
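As a rough idea of what "multiple Selectors" could look like inside one JVM, here is a minimal sketch: one blocking acceptor thread hands each new connection to one of several selector loops in round-robin order, and each loop owns its own Selector and read buffer. Nginx itself uses worker processes with their own event loops, so this thread-based layout is only loosely inspired by it. `SelectorLoop`, `MultiSelectorServer` and `adopt` are names invented for this example, and the echo deliberately assumes the small reply always fits in the socket send buffer.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// One event loop: owns a Selector and only ever touches its own connections.
final class SelectorLoop implements Runnable {
    private final Selector selector = Selector.open();
    private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();
    private final ByteBuffer buffer = ByteBuffer.allocate(1024);

    SelectorLoop() throws IOException {
    }

    // Called from the acceptor thread; registration itself happens on the loop thread.
    void adopt(SocketChannel channel) {
        pending.add(channel);
        selector.wakeup();
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select();
                SocketChannel adopted;
                while ((adopted = pending.poll()) != null) {
                    adopted.configureBlocking(false);
                    adopted.register(selector, SelectionKey.OP_READ);
                }
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isValid() && key.isReadable()) {
                        echo(key);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void echo(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        buffer.clear();
        try {
            if (channel.read(buffer) < 0) {
                key.cancel();
                channel.close();
                return;
            }
            buffer.flip();
            channel.write(buffer); // sketch: assumes the reply fits in the send buffer
        } catch (IOException e) {
            key.cancel();
            channel.close();
        }
    }
}

final class MultiSelectorServer {
    // Starts the loops and the acceptor; close the returned channel to stop accepting.
    static ServerSocketChannel start(int port, int loopCount) throws IOException {
        SelectorLoop[] loops = new SelectorLoop[loopCount];
        for (int i = 0; i < loopCount; i++) {
            loops[i] = new SelectorLoop();
            Thread thread = new Thread(loops[i], "selector-" + i);
            thread.setDaemon(true);
            thread.start();
        }
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port));
        Thread acceptor = new Thread(() -> {
            int next = 0;
            try {
                while (true) {
                    loops[next].adopt(server.accept()); // blocking accept
                    next = (next + 1) % loops.length;
                }
            } catch (IOException e) {
                // server channel closed: stop accepting
            }
        }, "acceptor");
        acceptor.setDaemon(true);
        acceptor.start();
        return server;
    }
}
```

Something like `MultiSelectorServer.start(9090, 4)` would give four independent event loops. Whether that actually beats the single-threaded version for tiny messages is something to measure rather than assume.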
Or maybe even distribute the load between multiple JVMs and use Chronicle to build an inter-process communication between them?
This is also an option. Chronicle gives you the advantage that memory-mapped files are more resilient to a process quitting in the middle. You still get plenty of performance since all communication is done through shared memory.
Will testing on a proper serverside OS like CentOS make a big difference (maybe it is Windows that slows things down)?
I don't know about this, but it seems unlikely. If Java uses the native Windows APIs to the fullest, it shouldn't matter as much. I am highly doubtful of the 40 million transactions/sec figure (without a user-space networking stack + UDP), but the architectures I listed should do pretty well.
These architectures tend to do well since they are single-writer architectures that use bounded, array-based data structures for inter-thread communication. First determine whether multi-threading is even the answer. In many cases it is not needed and can lead to a slowdown.
Another area to look into is memory allocation schemes. Specifically, the strategy used to allocate and reuse buffers could lead to significant benefits. The right buffer reuse strategy depends on the application. Look at schemes like buddy memory allocation, arena allocation, etc. to see if they can benefit you. The JVM GC does plenty fine for most workloads, though, so always measure before you go down this route.
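To make the buffer reuse idea concrete, here is about the simplest scheme I can think of: a fixed-size free list owned by a single thread. It is neither buddy nor arena allocation, just a hypothetical `BufferPool` to show the acquire/release shape. Using direct buffers is an assumption of the sketch, not a recommendation.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical free-list pool; not thread-safe, meant to be owned by one thread.
final class BufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;
    private final int maxPooled;

    BufferPool(int bufferSize, int maxPooled) {
        this.bufferSize = bufferSize;
        this.maxPooled = maxPooled;
    }

    // Reuses a pooled buffer when one is available, otherwise allocates a new one.
    ByteBuffer acquire() {
        ByteBuffer buffer = free.pollFirst();
        if (buffer == null) {
            return ByteBuffer.allocateDirect(bufferSize);
        }
        buffer.clear();
        return buffer;
    }

    // Returns a buffer to the pool. Buffers of the wrong size, or beyond the
    // pool limit, are simply dropped and left to the garbage collector.
    void release(ByteBuffer buffer) {
        if (buffer.capacity() == bufferSize && free.size() < maxPooled) {
            free.addFirst(buffer);
        }
    }

    int pooled() {
        return free.size();
    }
}
```

The most recently released buffer is handed out first, so a hot buffer tends to stay in cache. Whether any of this beats plain allocation is exactly the kind of thing to measure first.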
Protocol design has a big effect on performance too. I tend to prefer length-prefixed protocols because they let you allocate buffers of the right size, avoiding lists of buffers and/or buffer merging. Length-prefixed protocols also make it easy to decide when to hand over a request: just check num bytes == expected. The actual parsing can be done by the worker thread. Serialization and deserialization extend beyond length-prefixed protocols. Patterns like flyweights over buffers, instead of allocations, help here. Look at SBE for some of these principles.
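Here is a small sketch of the "num bytes == expected" check for a length-prefixed protocol. The wire format is an assumption made for the example (a 4-byte big-endian length followed by the payload), and `FrameDecoder` and `feed` are hypothetical names. The reader thread would call `feed` after every read and hand each returned payload to a worker.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical decoder for frames of the form [4-byte big-endian length][payload].
final class FrameDecoder {
    private static final int HEADER = 4;
    private final ByteBuffer acc; // bytes received so far that are not yet a full frame

    FrameDecoder(int maxPayload) {
        acc = ByteBuffer.allocate(HEADER + maxPayload);
    }

    // Consumes all bytes of 'in' and returns every payload completed by them.
    // A frame may arrive split across several calls, or several frames in one call.
    List<byte[]> feed(ByteBuffer in) {
        List<byte[]> frames = new ArrayList<>();
        while (in.hasRemaining()) {
            // Copy only as much as fits into the accumulation buffer.
            int n = Math.min(in.remaining(), acc.remaining());
            ByteBuffer chunk = in.duplicate();
            chunk.limit(in.position() + n);
            acc.put(chunk);
            in.position(in.position() + n);

            acc.flip();
            while (acc.remaining() >= HEADER) {
                int length = acc.getInt(acc.position()); // peek without consuming
                if (length < 0 || length > acc.capacity() - HEADER) {
                    throw new IllegalStateException("bad frame length " + length);
                }
                if (acc.remaining() < HEADER + length) {
                    break; // frame not complete yet
                }
                acc.position(acc.position() + HEADER);
                byte[] payload = new byte[length];
                acc.get(payload);
                frames.add(payload);
            }
            acc.compact(); // keep the partial frame, if any, for the next call
        }
        return frames;
    }
}
```

Copying each payload into a fresh `byte[]` keeps the example short. A flyweight that wraps the accumulation buffer would avoid that allocation, at the cost of more careful buffer ownership.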
As you can imagine, an entire treatise could be written here. This should set you in the right direction. Warning: always measure, and make sure you actually need more performance than the simplest option gives you. It's easy to get sucked into a never-ending black hole of performance improvements.