Problem description
I'm looking for a way to shorten an already short string as much as possible.
The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".
I know regular compression is pretty much out of the question on strings this short, due to the overhead needed and the lack of repetition, but I have an idea of how to do it.
Because the alphabet is limited to 39 characters ([a-z][0-9]-:.), every character could fit in 6 bits. This reduces the length by up to 25% compared to 8-bit ASCII. So my suggestion is something along these lines:
- Encode the string to a byte array using some kind of custom encoding
- Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).
And then reverse the process to get the original string.
So to my questions:
- Could this work?
- Is there a better way?
- How?
You could encode the string as base 40, which is more compact than base 64. This lets you pack 12 such tokens into a 64-bit long (40^12 is just under 2^64). The 40th token could be an end-of-string marker to give you the length (as the encoded data will no longer be a whole number of bytes).
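To make the 12-tokens-per-long idea concrete, here is a minimal sketch. The class name `Base40Long` and the methods `pack` and `unpack` are made up for illustration, and it assumes token 0 is the reserved end marker. Because 40^12 is larger than 2^63, the long has to be treated as unsigned when unpacking.

```java
class Base40Long {
    static final int BASE = 40;
    // Token 0 is reserved as the end-of-string marker; 1..39 are the 39 usable characters.
    static final String ALPHABET = "\0" + "abcdefghijklmnopqrstuvwxyz0123456789-:.";

    /** Packs up to 12 characters into one long, first character in the lowest base-40 digit. */
    static long pack(String s) {
        if (s.length() > 12) {
            throw new IllegalArgumentException("at most 12 characters fit in 64 bits");
        }
        long value = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            int code = ALPHABET.indexOf(s.charAt(i));
            if (code <= 0) {
                throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
            }
            value = value * BASE + code; // may exceed Long.MAX_VALUE, but the 64 bits stay correct
        }
        return value;
    }

    /** Reverses pack(); the value is read as an unsigned 64-bit number. */
    static String unpack(long value) {
        StringBuilder sb = new StringBuilder();
        while (value != 0) {
            int code = (int) Long.remainderUnsigned(value, BASE);
            sb.append(ALPHABET.charAt(code));
            value = Long.divideUnsigned(value, BASE);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long packed = pack("my-domain.se");
        System.out.println(Long.toUnsignedString(packed) + " -> " + unpack(packed));
    }
}
```

A longer address would need more than one long, so this only shows the packing arithmetic, not a complete format.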
If you use arithmetic encoding, it could be much smaller, but you would need a table of frequencies for each token (built from a long list of possible examples).
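This is not an arithmetic coder, only a rough way to estimate what one might achieve with a static frequency table: the ideal size of a string is the sum of -log2(p) over its tokens. The class `EntropyEstimate`, its method names, and the tiny sample list are all assumptions for illustration. A real table would come from a much longer list of addresses.

```java
import java.util.Arrays;
import java.util.List;

class EntropyEstimate {
    // Token 0 is the end-of-string marker, followed by the 39 usable characters.
    static final String ALPHABET = "\0" + "abcdefghijklmnopqrstuvwxyz0123456789-:.";

    static int tokenOf(char c) {
        int i = ALPHABET.indexOf(c);
        if (i <= 0) throw new IllegalArgumentException("unsupported character: " + c);
        return i;
    }

    /** Builds the frequency table; counts start at 1 so unseen tokens keep a nonzero probability. */
    static int[] countTokens(List<String> samples) {
        int[] counts = new int[ALPHABET.length()];
        Arrays.fill(counts, 1);
        for (String s : samples) {
            for (char c : s.toCharArray()) counts[tokenOf(c)]++;
            counts[0]++; // one end-of-string marker per sample
        }
        return counts;
    }

    /** Ideal code length in bits for s plus its end marker under the given table. */
    static double idealBits(String s, int[] counts) {
        long total = 0;
        for (int n : counts) total += n;
        double bits = -log2(counts[0] / (double) total);
        for (char c : s.toCharArray()) {
            bits += -log2(counts[tokenOf(c)] / (double) total);
        }
        return bits;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "twitter.com:2122", "123.211.80.4:2122", "my-domain.se:2121", "www.stackoverflow.com:80");
        int[] counts = countTokens(samples);
        for (String s : samples) {
            System.out.printf("%s: about %.1f bits%n", s, idealBits(s, counts));
        }
    }
}
```

The estimate ignores context (for example, that digits tend to follow the colon), so a model that tracked position could do better still.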
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class Encoder {
    public static final int BASE = 40;
    // Token table: index 0 is '\0' (end-of-string marker), then a-z, 0-9 and "-:.".
    StringBuilder chars = new StringBuilder(BASE);
    // Maps a character code (0-255) to its token index, or -1 if unsupported.
    byte[] index = new byte[256];

    {
        chars.append('\0');
        for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
        for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
        chars.append("-:.");
        Arrays.fill(index, (byte) -1);
        for (byte i = 0; i < chars.length(); i++)
            index[chars.charAt(i)] = i;
    }

    // Packs each group of three characters into one 16-bit value (40^3 = 64000 fits in a char).
    // Note: characters outside the token table are not validated here.
    public byte[] encode(String address) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(baos);
            for (int i = 0; i < address.length(); i += 3) {
                switch (Math.min(3, address.length() - i)) {
                    case 1: // last one.
                        byte b = index[address.charAt(i)];
                        dos.writeByte(b);
                        break;
                    case 2: // two left: the third base-40 digit stays 0, the end marker
                        char ch = (char) ((index[address.charAt(i + 1)]) * BASE + index[address.charAt(i)]);
                        dos.writeChar(ch);
                        break;
                    case 3:
                        char ch2 = (char) ((index[address.charAt(i + 2)] * BASE + index[address.charAt(i + 1)]) * BASE + index[address.charAt(i)]);
                        dos.writeChar(ch2);
                        break;
                }
            }
            return baos.toByteArray();
        } catch (IOException e) {
            // Cannot happen when writing to an in-memory stream.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        Encoder encoder = new Encoder();
        for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
            System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
        }
    }
}
prints
twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.
I leave decoding as an exercise. ;)
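For anyone who wants to check their answer to the exercise, here is one possible decoder, as a sketch rather than the author's solution. The class `Base40Codec` is hypothetical. Its `encode` is a compact restatement of the byte layout used above (three characters in two big-endian bytes, a lone trailing character in one byte), included only so the round trip is self-contained. Malformed input is not validated.

```java
import java.io.ByteArrayOutputStream;

class Base40Codec {
    static final int BASE = 40;
    // Token 0 is the end marker, so real characters always map to 1..39.
    static final String ALPHABET = "\0" + "abcdefghijklmnopqrstuvwxyz0123456789-:.";

    static int code(char c) {
        int i = ALPHABET.indexOf(c);
        if (i <= 0) throw new IllegalArgumentException("unsupported character: " + c);
        return i;
    }

    /** Three characters become two big-endian bytes; a lone trailing character becomes one byte. */
    static byte[] encode(String address) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < address.length(); i += 3) {
            int remaining = Math.min(3, address.length() - i);
            if (remaining == 1) {
                out.write(code(address.charAt(i)));
                break;
            }
            int value = code(address.charAt(i + 1)) * BASE + code(address.charAt(i));
            if (remaining == 3) value += code(address.charAt(i + 2)) * BASE * BASE;
            out.write(value >>> 8);
            out.write(value & 0xFF);
        }
        return out.toByteArray();
    }

    /** Reads two bytes at a time and splits each 16-bit value back into base-40 digits. */
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            int value = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            sb.append(ALPHABET.charAt(value % BASE));
            sb.append(ALPHABET.charAt(value / BASE % BASE));
            int third = value / (BASE * BASE);
            if (third != 0) sb.append(ALPHABET.charAt(third)); // 0 means a two-character final group
        }
        if (i < data.length) sb.append(ALPHABET.charAt(data[i] & 0xFF)); // lone trailing character
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
            byte[] encoded = encode(s);
            System.out.println(s + " -> " + encoded.length + " bytes -> " + decode(encoded));
        }
    }
}
```

Hostnames are case-insensitive, so lowercasing the input before encoding is a reasonable way to handle uppercase letters, which this alphabet does not cover.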