Question
I'm a beginner with Lucene. Here's my source:
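// Not tokenized: the whole value is indexed as a single term.
FieldType ft = new FieldType(StringField.TYPE_STORED);
ft.setTokenized(false);
ft.setStored(true);
// Tokenized: the analyzer splits the value into separate terms.
FieldType ftNA = new FieldType(StringField.TYPE_STORED);
ftNA.setTokenized(true);
ftNA.setStored(true);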
Why tokenize in Lucene? For example, take the String value "my name is lee":
- tokenized: "my", "name", "is", "lee"
- not tokenized: "my name is lee"
I don't understand why text is tokenized for indexing. What is the difference between tokenized and not tokenized?
Recommended answer
Lucene works by finding tokens in documents which satisfy constraints expressed by a query.
If you search for lee, for instance, the query will find all documents that contain the token lee. If the field isn't tokenized, you'll only be able to find the document by searching for the whole value my name is lee, not by searching for just lee.
Now suppose you search for "is lee". This is a PhraseQuery, which means it'll match the token is followed by the token lee.
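To make "the token is followed by the token lee" concrete, here is a minimal plain-Java sketch of phrase matching over token positions. It does not use Lucene at all: the class `PhraseMatchSketch` and its methods are hypothetical names made up for illustration, and splitting on whitespace is an assumed stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy phrase matcher, for illustration only (not Lucene's implementation).
class PhraseMatchSketch {

    // Record, for each token, the positions at which it occurs in the value.
    // Assumption: tokens are separated by whitespace.
    static Map<String, List<Integer>> positions(String value) {
        Map<String, List<Integer>> result = new HashMap<>();
        String[] tokens = value.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            result.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        return result;
    }

    // True if `first` occurs at some position p and `second` occurs at p + 1.
    static boolean matchesPhrase(String value, String first, String second) {
        Map<String, List<Integer>> pos = positions(value);
        List<Integer> firstPos = pos.getOrDefault(first, Collections.emptyList());
        List<Integer> secondPos = pos.getOrDefault(second, Collections.emptyList());
        for (int p : firstPos) {
            if (secondPos.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase("my name is lee", "is", "lee")); // true
        System.out.println(matchesPhrase("lee is my name", "is", "lee")); // false
    }
}
```

Both sample values contain the tokens is and lee, but only the first has them adjacent and in that order, which is why a phrase search needs positions and not just the set of tokens.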
Tokenization is needed because Lucene works with an inverted index, i.e. it maps tokens to the documents that contain them.
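The idea of an inverted index can be sketched in a few lines of plain Java. This is only a toy model under stated assumptions, not how Lucene stores its index: `InvertedIndexSketch` is a hypothetical class, whitespace splitting stands in for an analyzer, and the `tokenized` flag only mimics the effect of the tokenized setting on which terms get indexed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each indexed term to the IDs of the documents
// that contain it. Illustration only; not Lucene's real data structures.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final boolean tokenized;

    InvertedIndexSketch(boolean tokenized) {
        this.tokenized = tokenized;
    }

    // Tokenized: one term per whitespace-separated word.
    // Not tokenized: the whole value becomes a single term.
    void add(int docId, String value) {
        String[] terms = tokenized ? value.trim().split("\\s+") : new String[] { value };
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // A term search is a single lookup in the term -> documents map.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch tokenizedIndex = new InvertedIndexSketch(true);
        tokenizedIndex.add(0, "my name is lee");
        tokenizedIndex.add(1, "lee is here");

        InvertedIndexSketch untokenizedIndex = new InvertedIndexSketch(false);
        untokenizedIndex.add(0, "my name is lee");
        untokenizedIndex.add(1, "lee is here");

        System.out.println(tokenizedIndex.search("lee"));              // [0, 1]
        System.out.println(untokenizedIndex.search("lee"));            // []
        System.out.println(untokenizedIndex.search("my name is lee")); // [0]
    }
}
```

With the untokenized index the only terms are the two full strings, so a search for lee finds nothing, which matches the behaviour described in the answer.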