Full-Text Indexing and Chinese Word Segmentation (Part 2)

While using SQL Search I ran into another problem: it appears to segment Chinese text character by character. Let me explain.

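Take the sentence '博客堂成员很多是MVP' ("many members of 博客堂 are MVPs"). If every single character is indexed, the index will be much larger than one built from words such as '博客堂', '成员', and 'MVP'. That wastes space and also hurts the efficiency and accuracy of the index. If English were indexed by letter rather than by word, there would probably be no full-text indexing in the world today, and no Google either.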
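To make the size difference concrete, here is a minimal sketch that builds a toy inverted index for that one sentence in two ways. The word-level split is entered by hand (the `SEGMENTED` table is an assumption standing in for a real word breaker), and counting postings is only a rough proxy for index size.

```python
from collections import defaultdict

def build_index(docs, tokenize):
    """Build a toy inverted index: token -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(tokenize(text)):
            index[token].append((doc_id, pos))
    return index

def by_character(text):
    """Per-character tokenization: every character is its own token."""
    return list(text)

# Hypothetical hand-made segmentation, standing in for a real word breaker.
SEGMENTED = {"博客堂成员很多是MVP": ["博客堂", "成员", "很多", "是", "MVP"]}

def by_word(text):
    """Word-level tokenization looked up from the hand-made table above."""
    return SEGMENTED[text]

def posting_count(index):
    """Total number of (doc_id, position) entries stored in the index."""
    return sum(len(postings) for postings in index.values())

docs = ["博客堂成员很多是MVP"]
char_index = build_index(docs, by_character)
word_index = build_index(docs, by_word)
print(posting_count(char_index), posting_count(word_index))  # prints "11 5"
```

Even for this one short sentence the per-character index stores roughly twice as many entries.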

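Chinese, however, faces a natural barrier that English does not. English words are separated by spaces; Chinese words are not, so the computer has to apply some form of artificial intelligence to split a sentence into words. Sometimes the sentence alone is not enough, and the right split depends on the surrounding context or on everyday knowledge. For example, the same string can be read as 乒乓球拍/卖/完了 ("the ping-pong paddles are sold out") or as 乒乓球/拍卖/完了 ("the ping-pong ball auction is over"). How is the computer supposed to know which meaning is intended and segment accordingly?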
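The ambiguity can be reproduced with a few lines of code. The sketch below enumerates every way to split the string into words from a tiny dictionary; the dictionary is made up for this example, and a real segmenter would need statistics or context to choose between the results.

```python
def segmentations(text, dictionary):
    """Return every way to split text into consecutive dictionary words."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results

# Hypothetical mini-dictionary, just large enough to show the ambiguity.
DICTIONARY = {"乒乓球", "乒乓球拍", "拍卖", "卖", "完了"}

for seg in segmentations("乒乓球拍卖完了", DICTIONARY):
    print("/".join(seg))
# prints:
# 乒乓球/拍卖/完了
# 乒乓球拍/卖/完了
```

Both splits are valid according to the dictionary, which is exactly why a dictionary alone cannot settle the question.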

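Judging from the results I get, SQL Search seems to segment Chinese character by character (perhaps because the engine was originally built for English). For instance, if you search for '马克' (Mark), it will also dig up '马克思' (Marx).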
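A small sketch of why per-character indexing produces this false hit. It treats a query as a run of adjacent tokens; the one-token word-level split of '马克思' is an assumption about what a good word breaker would produce, not observed SQL Server output.

```python
def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

# Per-character tokens: the query 马/克 is found inside 马/克/思.
print(phrase_match(list("马克思"), list("马克")))  # prints "True"

# Word-level tokens (assumed): 马克思 is one token, so 马克 does not match.
print(phrase_match(["马克思"], ["马克"]))  # prints "False"
```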

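One of my databases is 123 MB and its full-text index takes 55 MB. Every full-text query on it is rather slow (admittedly, the machine is also pretty weak).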

--------------------------------------------------------------------------------------------------

On per-character segmentation:

怡红公子's account is probably the more accurate one. Take a look at this sentence:

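操作系统能否用汇骗语言改写限制它对每个端口的使用率

(Roughly: "Can the operating system be rewritten in assembly language to limit its usage of each port?" Here 汇骗 is presumably a typo for 汇编, "assembly". The sentence is kept in Chinese because the test queries are substrings of it.)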

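To test the segmentation, I deliberately queried with wrongly segmented fragments: if every such fragment still retrieves the sentence, the text must have been indexed character by character. A query for '用汇' did retrieve the sentence, so I concluded that SQL Server segments by character and did not check any further. I have now found, however, that '写限' and '统能' do not retrieve it. This shows that SQL Server does perform some simple word segmentation after all; the results are just not very good.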
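One explanation that would be consistent with these three results is a dictionary-based segmenter that falls back to single characters for anything it does not recognize. The sketch below uses greedy forward maximum matching with a made-up dictionary. Both the algorithm and the dictionary are assumptions chosen for illustration; nothing here is known to be what SQL Server actually does.

```python
# Hypothetical mini-dictionary; SQL Server's real lexicon is not known here.
DICTIONARY = {"操作系统", "能否", "语言", "改写", "限制", "每个", "端口", "使用率"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def forward_max_match(text):
    """Greedy longest-match segmentation.

    Characters that start no dictionary word become single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += length
                break
    return tokens

def contains(doc_tokens, query):
    """Segment the query the same way, then look for its tokens side by side."""
    q = forward_max_match(query)
    n = len(q)
    return any(doc_tokens[k:k + n] == q for k in range(len(doc_tokens) - n + 1))

SENTENCE = "操作系统能否用汇骗语言改写限制它对每个端口的使用率"
doc_tokens = forward_max_match(SENTENCE)
print("/".join(doc_tokens))
for probe in ("用汇", "写限", "统能"):
    print(probe, contains(doc_tokens, probe))
# prints:
# 操作系统/能否/用/汇/骗/语言/改写/限制/它/对/每个/端口/的/使用率
# 用汇 True
# 写限 False
# 统能 False
```

Under this model '用汇' matches only because 汇骗 is not a word, so 用, 汇, and 骗 are indexed as separate characters, while 写 and 限 are hidden inside the tokens 改写 and 限制.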

In addition, SQL Server's segmentation can be improved with third-party products.

--------------------------------------------------------------------------------------------------

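If you are interested in word segmentation, there is some code here worth looking at. In my experience its segmentation accuracy is quite high, though you have to register to download it: http://www.nlp.org.cn/project/project.php?proj_id=6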

Posted on March 19, 2004 11:22

Not only does it exist, we can also call it from our own programs (though I don't remember clearly whether SQL Server Fulltext uses the same index as Index Server):

using System;
using System.Runtime.InteropServices;

namespace FullTextAPI
{
    // Mirrors the native WORDREP_BREAK_TYPE enumeration.
    // The values are sequential, not bit flags, so [Flags] is not applied.
    public enum WordBreakType
    {
        Word = 0,
        Sentence = 1,
        Paragraph = 2,
        Chapter = 3
    }

    // Receives the words produced by a word breaker.
    [ComImport, Guid("CC907054-C058-101A-B554-08002B33B0E6")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IWordSink
    {
        void PutWord(
            [MarshalAs(UnmanagedType.U4)] int charCount,
            [MarshalAs(UnmanagedType.LPWStr)] string sourceBuffer,
            [MarshalAs(UnmanagedType.U4)] int sourceLength,
            [MarshalAs(UnmanagedType.U4)] int sourcePosition);

        void PutAltWord(
            [MarshalAs(UnmanagedType.U4)] int cwc,
            [MarshalAs(UnmanagedType.LPWStr)] string pwcInBuf,
            [MarshalAs(UnmanagedType.U4)] int cwcSrcLen,
            [MarshalAs(UnmanagedType.U4)] int cwcSrcPos);

        void StartAltPhrase();

        void EndAltPhrase();

        void PutBreak(WordBreakType breakType);
    }

    // Receives the phrases produced by a word breaker.
    [ComImport, Guid("CC906FF0-C058-101A-B554-08002B33B0E6")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IPhraseSink
    {
        void PutSmallPhrase(
            [MarshalAs(UnmanagedType.LPWStr)] string pwcNoun,
            [MarshalAs(UnmanagedType.U4)] int cwcNoun,
            [MarshalAs(UnmanagedType.LPWStr)] string pwcModifier,
            [MarshalAs(UnmanagedType.U4)] int cwcModifier,
            [MarshalAs(UnmanagedType.U4)] int ulAttachmentType);

        void PutPhrase(
            [MarshalAs(UnmanagedType.LPWStr)] string pwcPhrase,
            [MarshalAs(UnmanagedType.U4)] int cwcPhrase);
    }

    // Mirrors the native TEXT_SOURCE structure.
    [StructLayout(LayoutKind.Sequential)]
    public struct TextSource
    {
        [MarshalAs(UnmanagedType.FunctionPtr)]
        public TextBufferFiller TextBufferFiller;

        [MarshalAs(UnmanagedType.LPWStr)]
        public string Buffer;

        [MarshalAs(UnmanagedType.U4)]
        public int End;

        [MarshalAs(UnmanagedType.U4)]
        public int Current;
    }

    // Callback used to refill the buffer of a TEXT_SOURCE.
    public delegate uint TextBufferFiller(
        [MarshalAs(UnmanagedType.Struct)] ref TextSource textSource);

    [ComImport, Guid("D53552C8-77E3-101A-B552-08002B33B0E6")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IWordBreaker
    {
        void Init(
            [MarshalAs(UnmanagedType.Bool)] bool isQuery,
            [MarshalAs(UnmanagedType.U4)] int maxTokenSize,
            [MarshalAs(UnmanagedType.Bool)] out bool hasLicense);

        void BreakText(
            [MarshalAs(UnmanagedType.Struct)] ref TextSource textSource,
            [MarshalAs(UnmanagedType.Interface)] IWordSink wordSink,
            [MarshalAs(UnmanagedType.Interface)] IPhraseSink phraseSink);

        // The native interface declares ComposePhrase between BreakText and
        // GetLicenseToUse. It is declared here so the vtable slots line up;
        // the output buffer is passed as a raw pointer.
        void ComposePhrase(
            [MarshalAs(UnmanagedType.LPWStr)] string pwcNoun,
            [MarshalAs(UnmanagedType.U4)] int cwcNoun,
            [MarshalAs(UnmanagedType.LPWStr)] string pwcModifier,
            [MarshalAs(UnmanagedType.U4)] int cwcModifier,
            [MarshalAs(UnmanagedType.U4)] int ulAttachmentType,
            IntPtr pwcPhrase,
            [MarshalAs(UnmanagedType.U4)] ref int pcwcPhrase);

        void GetLicenseToUse(
            [MarshalAs(UnmanagedType.LPWStr)] out string license);
    }

    [ComImport, Guid("80A3E9B0-A246-11D3-BB8C-0090272FA362")] // EN-us
    public class EnglishUKWordBreaker {}

    [ComImport, Guid("9717fc70-c1bc-11d0-9692-00a0c908146e")] // ZH-chs
    public class SimplifiedChineseWordBreaker {}
}

Reposted from: http://blog.sina.com.cn/s/blog_5677bc54010000i3.html