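I wrote a C program that is supposed to cut a file into chunks using the Rabin-Karp algorithm. It is an adaptation of a C program that you can find here.
It seems to work, but one problem remains: the average chunk size is not what I expect.
The usage is as follows:
rabin Prime WindowSize BoundaryMarker File
where:
rabin is the name of the executable.
Prime is a high prime number, for instance 100007.
WindowSize is the size of the rolling window, for instance 48.
BoundaryMarker is the number of bits set to 0 in a fingerprint.
File is the file to process.
If I set BoundaryMarker to 13, I expect the average chunk size to be 8K.
In fact, none of the chunks are anywhere near 8K.
I am struggling to figure out what is wrong with my program.
Can you help me?
Thanks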
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

unsigned char* buffer;
int windowSize;
int writePointer = 0;
int readPointer = 0;
int dataSize = 0;

unsigned char PushChar(unsigned char c)
{
    if (++writePointer >= windowSize) writePointer=0;
    buffer[writePointer]=c;
    dataSize++;
    return(c);
}

unsigned char PopChar(void)
{
    if (++readPointer >= windowSize) readPointer=0;
    dataSize--;
    return(buffer[readPointer]);
}

int main(int argc, char *argv[])
{
    int fd;
    unsigned char c;
    unsigned long Q;
    unsigned long D=256;
    unsigned long pow=1;
    int i,k,boundary,boundaryMarker,index;
    unsigned char s;

    if (argc != 5)
    {
        printf("\nUsage : rabin Prime WindowSize BoundaryMarker File\n\nwhere :\n");
        printf("Prime is a high prime number. For instance 100007\n\n");
        printf("WindowSize is the size of rolling window. For instance 48\n\n");
        printf("BoundaryMarker is the number of bits set to 0 in a fingerprint\n\n");
        printf("File is the file to process\n\n");
        return(1);
    }

    sscanf(argv[1],"%lu",&Q);
    sscanf(argv[2],"%d",&windowSize);
    sscanf(argv[3],"%d",&boundaryMarker);

    for(i=1,boundary=1;i<=boundaryMarker;i++) boundary=boundary*2;
    boundary --;
    //printf("Q = %lu windowSize = %d boundary = %d\n",Q,windowSize,boundary);

    if ((buffer=(unsigned char*) malloc (sizeof(unsigned char)*windowSize))==NULL) return(1);

    for (k=1; k < windowSize; k++) pow=(pow*D)%Q;
    //printf("pow value %lu\n",pow);

    unsigned long sig=0;
    int lastIndex=0;

    if ((fd=open(argv[4],O_RDONLY))<0) exit(1);

    for (i=0; i <windowSize; i++)
    {
        read(fd,&c,1);
        PushChar(c);
        sig=(sig*D + (unsigned long)c) %Q;
    }
    //printf("sig value = %lu\n",sig);

    index=0; lastIndex=0;
    while (read(fd,&c,1))
    {
        s=PopChar();
        //printf("sig = ( %lu + %lu - %lu * %lu %% %lu ) %lu",sig,Q,pow,(unsigned long) s,Q,Q);
        sig = (sig + Q - pow*(unsigned long)s%Q)%Q;
        //printf(" = %lu\n",sig);
        s=PushChar(c);
        //printf("sig2 = ( %lu * %lu + %lu ) %% %lu",sig,D,(unsigned long) s,Q);
        sig = (sig*D + (unsigned long)s)%Q;
        //printf(" = %lu\n",sig);
        index++;
        if ((sig & boundary )==0)
        {
            if (index - lastIndex >= 2048)
            {
                printf("sig & boundary = %lu & %lu Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
                lastIndex=index;
            }
        }
        else if (index -lastIndex >=65536)
        {
            printf("sig & boundary = %lu & %lu Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
            lastIndex=index;
        }
    }
    printf("Index=%d chunk size=%d\n",index,index-lastIndex);
    close(fd);
    return 1;
}
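Best answer
Running your code with boundarymarker=13 on one megabyte of random data, I got 104 chunks with an average chunk size of 10082 bytes. That is not too far from the expected 8192.
However, smaller boundarymarker values show a more pronounced deviation: setting it to 10, for example, gave an average chunk size of 3049 bytes, far from the expected 1024 bytes, and boundarymarker=5 gave an average chunk size of 2077 bytes, nowhere near the expected 32 bytes.
Looking more closely at your code, the obvious cause of this bias is in the following code (reformatted for clarity):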
if ((sig & boundary ) == 0)
{
    if (index - lastIndex >= 2048)
    {
        printf("sig & boundary = %lu & %lu Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
        lastIndex=index;
    }
}
else if (index - lastIndex >= 65536)
{
    printf("sig & boundary = %lu & %lu Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
    lastIndex=index;
}
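The if (index - lastIndex >= 2048) check suppresses any chunk boundary that falls less than 2048 bytes after the previous one, effectively merging chunks shorter than 2048 bytes into the next chunk. Meanwhile, the else if (index - lastIndex >= 65536) check forces an artificial chunk boundary so that no chunk grows beyond 65536 bytes. If this behavior (forcing all chunks to be at least 2048 and at most 65536 bytes long) is not what you want, you can simply remove those checks, reducing the code to: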
if ((sig & boundary ) == 0)
{
    printf("sig & boundary = %lu & %lu Index=%d chunk size=%d\n",sig,boundary,index,index-lastIndex);
    lastIndex=index;
}
Indeed, for boundarymarker = n, at least for n ≤ 12 or so, making this change yields an average chunk size very close to 2^n bytes.
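For n = 13, there does appear to be a noticeable downward bias, which I suspect is because the prime 100007 is only about 12.2 times the boundary modulus 2^13. Since the signature values are more or less randomly distributed modulo the prime, that extra 0.2 biases them slightly towards smaller values (including zero) when they are further reduced modulo 2^13.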
This bias is easily fixed by using a larger prime, such as 2^31 - 1 = 2147483647. Indeed, switching to this prime brings the average chunk size much closer to 8192.
For more on c - slicing a file with the Rabin-Karp algorithm, see the similar question on Stack Overflow: https://stackoverflow.com/questions/10781832/