Question
Correct me if I'm approaching this wrong, but I have a queue server and a bunch of Java workers that I'm running in a cluster. My queue has work units that are very small, but there are many of them. So far, my benchmarks and review of the workers have shown that I get about 200 MB/second.
So I'm trying to figure out how to push more work units through my available bandwidth. Currently my CPU usage is not very high (40-50%) because the workers can process the data faster than the network can send it. I want to get more work through the queue and am willing to pay for it via expensive compression/decompression (since half of each core is idle right now).
I have tried Java LZO and gzip, but was wondering if there is anything better (even if it's more CPU-expensive)?
Update: the data is a byte[]. Basically the queue only takes it in that format, so I am using ByteArrayOutputStream to write two ints and an int[] into a byte[]. The values in the int[] are all integers between 0 and 100 (or 1000, but the vast majority of the numbers are zeros). The lists are quite large, anywhere from 1,000 to 10,000 items (again, mostly zeros; never more than 100 non-zero numbers in the int[]).
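The serialization described above can be sketched as follows. This is a minimal, hypothetical illustration (the class name `UnitSerializer` and field meanings are assumptions, not from the original post), using `DataOutputStream` over a `ByteArrayOutputStream` with a length prefix so a reader knows how many array elements follow:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of the serialization described above:
// two ints followed by an int[], packed into a byte[].
public class UnitSerializer {
    public static byte[] serialize(int a, int b, int[] values) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeInt(a);
        out.writeInt(b);
        out.writeInt(values.length); // length prefix so the reader knows when to stop
        for (int v : values) {
            out.writeInt(v);
        }
        out.flush();
        return baos.toByteArray();
    }
}
```

Each int takes 4 bytes, so a 10,000-element array costs about 40 KB per message before any compression.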
Recommended Answer
It sounds like using a custom compression mechanism that exploits the structure of the data could be very efficient.
Firstly, using a short[] (16-bit data type) instead of an int[] will halve (!) the amount of data sent. You can do this because the numbers fit easily between -2^15 (-32768) and 2^15 - 1 (32767). This is ridiculously easy to implement.
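The short-based substitution might look like the following. This is a sketch under the stated assumption that every value (and the array length, at most 10,000) fits in a signed 16-bit range; the class name `ShortPacker` is invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: write the payload as shorts instead of ints, halving its size.
// Assumes all values and the array length fit in [-32768, 32767].
public class ShortPacker {
    public static byte[] pack(int[] values) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeShort(values.length); // length <= 10000, fits in a short
        for (int v : values) {
            out.writeShort((short) v);
        }
        return baos.toByteArray();
    }

    public static int[] unpack(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int n = in.readShort();
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = in.readShort();
        }
        return values;
    }
}
```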
Secondly, you could use a scheme similar to run-length encoding: a positive number represents that number literally, while a negative number represents that many zeros (after taking the absolute value). For example:
[10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0] <=> [10, 40, -3, 30, -1, 100, -4]
This is harder to implement than just substituting short for int, but will provide ~80% compression in the very worst case (1000 numbers, 100 non-zero, none of which are consecutive).
I just did some simulations to work out the compression ratios. I tested the method I described above and the one suggested by Louis Wasserman and sbridges. Both performed very well.
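The second method stores only the non-zero entries as (index, value) pairs. A sketch of that scheme (the class name `IndexValueCodec` and the leading length field are my assumptions about how it could be implemented, not details from the original suggestion):

```java
// Sketch of the (index, value) scheme: store only non-zero entries as
// (index, value) short pairs, prefixed by the original array length so
// trailing zeros can be restored. Assumes length and values fit in a short.
public class IndexValueCodec {
    public static short[] encode(int[] values) {
        int nonzero = 0;
        for (int v : values) if (v != 0) nonzero++;
        short[] out = new short[1 + 2 * nonzero];
        out[0] = (short) values.length; // original length, to restore trailing zeros
        int pos = 1;
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0) {
                out[pos++] = (short) i;
                out[pos++] = (short) values[i];
            }
        }
        return out;
    }

    public static int[] decode(short[] encoded) {
        int[] out = new int[encoded[0]]; // zeros everywhere except stored pairs
        for (int i = 1; i < encoded.length; i += 2) {
            out[encoded[i]] = encoded[i + 1];
        }
        return out;
    }
}
```

With at most 100 non-zero entries, this costs at most 201 shorts regardless of whether the list has 1,000 or 10,000 items.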
Assuming the length of the array and the number of non-zero numbers are both uniformly distributed between their bounds, both methods save about 5400 ints (or shorts) on average, with a compressed size of about 2.5% of the original! The run-length encoding method seems to save about 1 additional int (an average compressed size that is 0.03% smaller), i.e. basically no difference, so you should use whichever is easiest to implement. Below are histograms of the compression ratios for 50000 random samples (they are very similar!).
Summary: using shorts instead of ints plus one of the compression methods, you will be able to compress the data to about 1% of its original size!
For the simulation, I used the following R script:
SIZE <- 50000
lengths <- sample(1000:10000, SIZE, replace=T)
nonzeros <- sample(1:100, SIZE, replace=T)
f.rle <- function(len, nonzero) {
indexes <- sort(c(0,sample(1:len, nonzero, F)))
steps <- diff(indexes)
sum(steps > 1) + nonzero # one short per run of zeros, and one per non-zero value
}
f.index <- function(len, nonzero) {
nonzero * 2
}
# using the [value, -1 * number of zeros,...] method
rle.comprs <- mapply(f.rle, lengths, nonzeros)
print(mean(lengths - rle.comprs)) # average number of shorts saved
rle.ratios <- rle.comprs / lengths * 100
print(mean(rle.ratios)) # average compression ratio
# using the [(index, value),...] method
index.comprs <- mapply(f.index, lengths, nonzeros)
print(mean(lengths - index.comprs)) # average number of shorts saved
index.ratios <- index.comprs / lengths * 100
print(mean(index.ratios)) # average compression ratio
par(mfrow=c(2,1))
hist(rle.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Run length encoding")
hist(index.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Store indices")