I have a question about how to evaluate whether information retrieval results are good or not, for example by calculating
the relevant document rank, recall, precision, AP, MAP, etc.
Currently, the system is able to retrieve documents from the database once the user enters a query. The problem is that I do not know how to do the evaluation.
I have some public data sets, such as the "Cranfield collection" (dataset link). It contains
1. documents 2. queries 3. relevance assessments
Cranfield 1,400 225 1.6
May I know how to do the evaluation using the "Cranfield collection" to calculate the relevant document rank, recall, precision, AP, MAP, etc.?
I just need some ideas and direction; I am not asking how to code the program.
Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework. BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). See the Wikipedia page for more details.
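For intuition, here is a minimal sketch of BM25 scoring in Python. It assumes plain whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, and one common non-negative variant of the IDF term. The function name `bm25_scores` and the toy documents are made up for illustration; a real system would normally rely on a library such as Lucene instead.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document in `docs` for `query`."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # query terms absent from the document add nothing
            # Smoothed IDF (assumed variant, always >= 0).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection, not taken from Cranfield.
docs = [
    "boundary layer flow over a flat plate",
    "heat transfer in supersonic flow",
    "structural analysis of aircraft wings",
]
scores = bm25_scores("supersonic flow", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best match first
```

Note that the score depends only on which query terms occur and how often, not on where they occur, which is the bag-of-words property described above.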
Precision measures "of all the documents we retrieved as relevant how many are actually relevant?".
Precision = No. of relevant documents retrieved / No. of total documents retrieved
Recall measures "Of all the actual relevant documents how many did we retrieve as relevant?".
Recall = No. of relevant documents retrieved / No. of total relevant documents
Suppose a query "q" is submitted to an information retrieval system (e.g., a search engine) that has 100 relevant documents w.r.t. the query "q", and the system retrieves 68 documents out of a total collection of 600 documents. Out of the 68 retrieved documents, 40 documents were relevant. So, in this case:
Precision = 40 / 68 = 58.8%
and Recall = 40 / 100 = 40%
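The same numbers can be reproduced with set operations. This is only a sketch: the helper name `precision_recall` and the integer document IDs are assumptions chosen so that 68 documents are retrieved, 100 are relevant, and 40 are in both sets.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs: documents 0..99 are relevant, documents 60..127 are retrieved.
relevant = set(range(100))
retrieved = set(range(60, 128))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints "precision = 0.588, recall = 0.400"
```

With Cranfield, `relevant` would come from the relevance assessments for one query and `retrieved` from your system's output for that query.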
F-Score / F-measure is the weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is:
F-Score = 2 * Precision * Recall / (Precision + Recall)
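As a quick check, the balanced F-score for the example above (precision 40/68, recall 40/100) can be computed as follows. The function name `f_score` and the `beta` parameter are assumptions for this sketch; `beta = 1` gives the balanced form, and other values weight recall more or less heavily than precision.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing relevant was retrieved
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_score(40 / 68, 40 / 100)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.476"
```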
You can think of it this way: you type something in Google and it shows you 10 results. It's probably best if all of them are relevant. If only some are relevant, say five of them, then it's much better if the relevant ones are shown first. It would be bad if the first five were irrelevant and the good ones only started from the sixth, wouldn't it? The AP score reflects this.
Ranking #1: (1.0 + 0.67 + 0.75 + 0.8 + 0.83 + 0.6) / 6 = 0.78
Ranking #2: (0.5 + 0.4 + 0.5 + 0.57 + 0.56 + 0.6) / 6 = 0.52
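The two rankings can be reproduced with a short sketch. The 0/1 relevance lists below are not given in the text; they are inferred from the precision values (for example, 0.67 = 2/3 means the second relevant document sits at rank 3). The function name `average_precision` and the optional `n_relevant` argument are assumptions: when some relevant documents are never retrieved, AP is normally divided by the total number of relevant documents, which `n_relevant` allows.

```python
def average_precision(relevance, n_relevant=None):
    """AP for one ranked list of 0/1 relevance flags (rank 1 first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Default: assume every relevant document appears in the list.
    denominator = n_relevant if n_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0

# Relevance flags inferred from the precision values above (assumption).
ranking_1 = [1, 0, 1, 1, 1, 1, 0, 0, 0, 1]  # relevant at ranks 1, 3, 4, 5, 6, 10
ranking_2 = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]  # relevant at ranks 2, 5, 6, 7, 9, 10

ap_1 = average_precision(ranking_1)  # about 0.775, shown as 0.78 above
ap_2 = average_precision(ranking_2)  # about 0.521, shown as 0.52 above
print(ap_1, ap_2)
```

Both rankings return the same six relevant documents in the top 10, so precision and recall at 10 are identical; only AP separates them, because Ranking #1 places the relevant documents earlier.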
MAP is the mean of the average precision across multiple queries/rankings. Here is an example for illustration.
For query 1, AvgPrec: (1.0+0.67+0.5+0.44+0.5) / 5 = 0.62
For query 2, AvgPrec: (0.5+0.4+0.43) / 3 = 0.44
So, MAP = (0.62 + 0.44) / 2 = 0.53
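For a test collection such as Cranfield, the same computation is usually organized per query: a "run" maps each query ID to the ranked document IDs your system returned, and the relevance assessments ("qrels") map each query ID to the set of relevant document IDs. The sketch below follows that shape. The document IDs, query IDs, and function names are hypothetical, and the relevant ranks are inferred from the precision values above rather than taken from any real data.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query, divided by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    """Mean of the per-query AP over all queries in `runs`."""
    aps = [average_precision(runs[q], qrels.get(q, set())) for q in runs]
    return sum(aps) / len(aps) if aps else 0.0

# Hypothetical runs: ten ranked results per query.
runs = {
    "q1": [f"a{i}" for i in range(1, 11)],
    "q2": [f"b{i}" for i in range(1, 11)],
}
# Relevant ranks inferred from the example: 1, 3, 6, 9, 10 and 2, 5, 7.
qrels = {
    "q1": {"a1", "a3", "a6", "a9", "a10"},
    "q2": {"b2", "b5", "b7"},
}

ap_q1 = average_precision(runs["q1"], qrels["q1"])  # about 0.62
ap_q2 = average_precision(runs["q2"], qrels["q2"])  # about 0.44
map_score = mean_average_precision(runs, qrels)     # about 0.53
print(ap_q1, ap_q2, map_score)
```

To evaluate on Cranfield, you would fill `runs` by sending each of the collection's queries to your system and fill `qrels` from the relevance assessments file; the parsing step is omitted here because it depends on the file format of the copy you downloaded.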
Sometimes, people use precision@k and recall@k as performance measures of a retrieval system. You should build a retrieval system for such tests. If you want to write your program in Java, you should consider Apache Lucene to build your index.
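Precision@k and recall@k simply apply the precision and recall definitions to the top k results only. A minimal sketch, with hypothetical function names and made-up document IDs:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical ranked list of ten results with six relevant documents.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d3", "d4", "d5", "d6", "d10"}

p5 = precision_at_k(ranked, relevant, 5)    # 4 of the top 5 are relevant
r5 = recall_at_k(ranked, relevant, 5)       # 4 of the 6 relevant are found
p10 = precision_at_k(ranked, relevant, 10)
r10 = recall_at_k(ranked, relevant, 10)
print(p5, r5, p10, r10)
```

Small values of k (such as 5 or 10) are a reasonable choice when users are expected to look only at the first page of results.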