Python: Whoosh seems to return incorrect results

Problem description

This code is straight from Whoosh's quickstart docs:

import os.path
from whoosh.index import create_in
from whoosh.fields import Schema, STORED, ID, KEYWORD, TEXT
from whoosh.index import open_dir
from whoosh.query import *
from whoosh.qparser import QueryParser

#establish schema to be used in the index
schema = Schema(title=TEXT(stored=True), content=TEXT,
                path=ID(stored=True), tags=KEYWORD, icon=STORED)

#create index directory
if not os.path.exists("index"):
    os.mkdir("index")

#create the index using the schema specified above
ix = create_in("index", schema)

#instantiate the writer object
writer = ix.writer()

#add the docs to the index
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")

#commit those changes
writer.commit()

#identify searcher
with ix.searcher() as searcher:

    #specify parser
    parser = QueryParser("content", ix.schema)

    #specify query -- try also "second"
    myquery = parser.parse("is")

    #search for results
    results = searcher.search(myquery)

    #identify the number of matching documents
    print(len(results))

I have merely passed a value--namely, the verb "is"--to the parser.parse() call. When I run this, however, I get results of length zero, rather than the expected results of length two. If I replace "is" with "second", I get one result, as expected. Why doesn't the search using "is" yield a match, though?

Edit

As @Philippe points out, the default analyzer that Whoosh uses for TEXT fields removes stop words (very common words such as "is"), hence the behavior described above. If you want to retain stop words, you can specify which analyzer to use for a given field in the schema, and you can pass that analyzer stoplist=None so that it does not strip stop words; e.g.:

from whoosh import analysis
schema = Schema(title=TEXT(stored=True, analyzer=analysis.StandardAnalyzer(stoplist=None)))

Solution

A stop word filter is applied by the default text analyzer: https://bitbucket.org/mchaput/whoosh/src/999cd5fb0d110ca955fab8377d358e98ba426527/src/whoosh/analysis/filters.py?at=default#cl-41

See also the doc: http://whoosh.readthedocs.org/en/latest/api/analysis.html#whoosh.analysis.StopFilter
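
The mechanism can be illustrated without Whoosh at all. The following is a rough pure-Python sketch of stop word filtering, not Whoosh's actual implementation: the names analyze and search, the tiny STOP_WORDS set, and the dictionary-based index are all assumptions made for this example. It shows why a query consisting only of a stop word ends up with nothing to look for.

```python
import re

# Illustrative subset only; the stop list shipped with Whoosh is longer.
STOP_WORDS = frozenset(["a", "an", "and", "is", "the", "this"])

DOCS = [
    "This is my document!",
    "This is the second example.",
    "Examples are many.",
]


def analyze(text, stoplist=STOP_WORDS):
    """Lowercase, split into word tokens, and drop stop words if a list is given."""
    tokens = re.findall(r"\w+", text.lower())
    if stoplist is None:
        return tokens
    return [t for t in tokens if t not in stoplist]


def search(term, stoplist=STOP_WORDS):
    """Build a toy inverted index and return the ids of matching documents."""
    index = {}
    for doc_id, text in enumerate(DOCS):
        for token in analyze(text, stoplist):
            index.setdefault(token, set()).add(doc_id)

    hits = set()
    # The query is analyzed the same way, so a stop word query becomes empty.
    for token in analyze(term, stoplist):
        hits |= index.get(token, set())
    return sorted(hits)


with_filter = search("is")                    # "is" is filtered out on both sides
second_hits = search("second")                # an ordinary word still matches
without_filter = search("is", stoplist=None)  # filtering disabled

print(with_filter, second_hits, without_filter)
```

Because the same filter runs on the documents and on the query, "is" never reaches the index and never reaches the lookup, which matches the zero-length result in the question.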

