Language models in Whoosh for information retrieval

Problem description

I am working in information retrieval (IR) and am new to the field.

Can anyone guide me on how to implement a language model in Whoosh? I have already applied TF-IDF and BM25.

For example, the simplest form of language model discards all conditioning context and estimates each term independently. Such a model is called a unigram language model:

P_{uni}(t_1t_2t_3t_4) = P(t_1)P(t_2)P(t_3)P(t_4)

There are many more complex kinds of language models, such as bigram language models, which condition each term on the previous one:

P_{bi}(t_1t_2t_3t_4) = P(t_1)P(t_2\vert t_1)P(t_3\vert t_2)P(t_4\vert t_3)
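To make the two formulas concrete, here is a minimal sketch in plain Python (no Whoosh involved). It assumes simple maximum-likelihood estimates from raw counts with no smoothing, which the question does not specify, and the function names `train`, `p_unigram` and `p_bigram` are made up for illustration.

```python
from collections import Counter


def train(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_unigram(seq, unigrams):
    """P_uni(t_1..t_n) = product of P(t_i), with P(t) = count(t) / total tokens."""
    total = sum(unigrams.values())
    p = 1.0
    for term in seq:
        p *= unigrams[term] / total
    return p


def p_bigram(seq, unigrams, bigrams):
    """P_bi(t_1..t_n) = P(t_1) * product of P(t_i | t_{i-1}).

    Assumption: P(cur | prev) is estimated as count(prev, cur) / count(prev).
    Unseen terms or pairs get probability 0 because nothing is smoothed.
    """
    if not seq:
        return 1.0
    total = sum(unigrams.values())
    p = unigrams[seq[0]] / total
    for prev, cur in zip(seq, seq[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p


# Toy corpus, purely illustrative.
unigrams, bigrams = train("a b a b a c".split())
print(p_unigram(["a", "b"], unigrams))
print(p_bigram(["a", "b"], unigrams, bigrams))
```

In a retrieval setting each document would get its own counts, and documents would be ranked by the probability their model assigns to the query.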

Recommended answer

Take a look at Whoosh's scoring module and use BM25F (lines 276 to 332) as a reference for building your own weighting and scoring models. You need to create a weighting model and a scorer. Assuming you want to call your model Unigram, the main steps would be:

Implement your own Unigram weighting model class and inherit from scoring.WeightingModel:

class Unigram(WeightingModel):

Implement the methods required by the base class. The main one is scorer(), which returns an instance of your Scorer class (described next). The weighting model is what you pass in when you create your searcher, and it determines how that searcher scores matches.

Implement a UnigramScorer class and inherit from scoring.WeightLengthScorer:

class UnigramScorer(WeightLengthScorer):

Implement the __init__ and _score methods. __init__ takes the field name and value and is called once for each term in your query when you call searcher.search(). _score is called for each matching document in your results. It takes a weight and a length and returns a score for the given field.

When you create your searcher at search time, specify your custom language model using the weighting parameter:

ix.searcher(weighting = Unigram)
