Question
I am working in IR (information retrieval).
Can anyone guide me on how to implement a language model in Whoosh? I have already applied TF-IDF and BM25. I am new to IR.
For example, the simplest form of language model discards all conditioning context and estimates each term independently. Such a model is called a unigram language model:
P_{uni}(t_1t_2t_3t_4) = P(t_1)P(t_2)P(t_3)P(t_4)
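As a rough illustration of the unigram formula above, the sketch below estimates each P(t) by maximum likelihood from a tiny made-up corpus and multiplies the term probabilities. The corpus, the function names, and the lack of smoothing are assumptions made for illustration only.

```python
from collections import Counter
from math import prod


def train_unigram(tokens):
    """Return maximum-likelihood estimates P(t) = count(t) / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def unigram_probability(model, sequence):
    """P_uni(t_1 ... t_n) = P(t_1) * ... * P(t_n); unseen terms get 0."""
    return prod(model.get(term, 0.0) for term in sequence)


# Hypothetical toy corpus, only for illustration.
corpus = "the cat sat on the mat".split()
model = train_unigram(corpus)

# P(the) = 2/6 and P(cat) = 1/6, so the product is 1/18.
p = unigram_probability(model, ["the", "cat"])
```

Without smoothing, any sequence containing an unseen term gets probability zero, which is why retrieval models usually mix in a collection-wide estimate.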
There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:
P_{bi}(t_1t_2t_3t_4) = P(t_1)P(t_2\vert t_1)P(t_3\vert t_2)P(t_4\vert t_3)
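The bigram formula above can be sketched in the same way. This is a minimal maximum-likelihood version under assumed names and a made-up corpus; it estimates P(t_i | t_{i-1}) as count(t_{i-1}, t_i) divided by the number of times t_{i-1} occurs as a history, and it does no smoothing.

```python
from collections import Counter


def train_bigram(tokens):
    """Return (P(t), P(t_i | t_{i-1})) maximum-likelihood tables."""
    total = len(tokens)
    p_uni = {t: n / total for t, n in Counter(tokens).items()}
    # Count each term only where it is followed by another term.
    history_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    p_bi = {
        (prev, cur): n / history_counts[prev]
        for (prev, cur), n in pair_counts.items()
    }
    return p_uni, p_bi


def bigram_probability(p_uni, p_bi, sequence):
    """P_bi(t_1 ... t_n) = P(t_1) * product of P(t_i | t_{i-1})."""
    if not sequence:
        return 1.0
    p = p_uni.get(sequence[0], 0.0)
    for prev, cur in zip(sequence, sequence[1:]):
        p *= p_bi.get((prev, cur), 0.0)
    return p


# Hypothetical toy corpus, only for illustration.
corpus = "the cat sat on the mat".split()
p_uni, p_bi = train_bigram(corpus)

# P(the) = 1/3, P(cat | the) = 1/2, P(sat | cat) = 1, so the product is 1/6.
p = bigram_probability(p_uni, p_bi, ["the", "cat", "sat"])
```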
Recommended answer
Take a look at Whoosh's scoring module and use BM25F (lines 276 to 332) as a reference for building your own weighting and scoring models. You need to create a weighting model and a scorer. Assuming you want to call your model Unigram, the main steps would be:
- Implement your own Unigram weighting model class and inherit from scoring.WeightingModel:
class Unigram(WeightingModel)
Implement the methods required by the base class, the main one being scorer(), which returns an instance of your Scorer class (next). This class is used when you create your searcher and defines the weighting model the searcher will use.
- Implement the UnigramScorer class and inherit from scoring.WeightLengthScorer:
class UnigramScorer(WeightLengthScorer)
Implement the __init__ and _score methods. __init__ takes the field name and value and is called once for each term in your query when you call searcher.search(). _score is called for each matching document in your results. It takes a weight and length and returns a score for a given field.
- When you create your searcher at search time, specify your custom language model using the weighting parameter:
ix.searcher(weighting = Unigram)