Problem Description
Recently I came across a nice problem, one that turned out to be as simple to understand as it is hard to solve: write a program that takes a text as input and produces a program that reproduces that text. Specifically:
The input text is supposed to be rather large (more than 10000 characters).
The only (and very strong) requirement is that the size of the archive (i.e. the program printed) must be strictly less than the size of the original text. This rules out obvious solutions like
std::string s;
/* read the text into s */
std::cout << "#include<iostream> int main () { std::cout<<\"" << s << "\"; }";
I believe some archiving techniques are to be used here.
Unfortunately, such a program does not exist.
To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.
Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is

2^0 + 2^1 + 2^2 + ... + 2^(n-1)
Using a bit of math, we can get that this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.
This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.
That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^-10000 any string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.
For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory" by Cover and Thomas, chapter 14.
Hope this helps!