


How are term frequency (TF) and inverse document frequency (IDF) affected by stop-word removal and stemming?




tf is term frequency: the number of times a term occurs in a document.
idf is inverse document frequency, which is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
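As a rough sketch of the idf definition above (the function name `idf` and the 4-document corpus size are made up for illustration, and the natural logarithm is an assumption since the base is not specified):

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log(total documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term)

# Hypothetical corpus of 4 documents (numbers chosen only for illustration).
rare = idf(4, 1)    # term found in 1 of the 4 documents
common = idf(4, 4)  # term found in every document, so the quotient is 1

print(rare, common)
```

A term that occurs in every document gets an idf of 0, while a term that occurs in few documents gets a larger value.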


The effect of stemming is to group all words that are derived from the same stem (e.g. 'played', 'play', ...). This grouping increases the number of occurrences of the stem, because frequencies are calculated using stems, not the original words.

For example, suppose you have 2 documents: the first one contains 'play' 2 times and 'played' 5 times, and the second one contains 'play' 3 times and 'played' 1 time. If you search for 'play' without stemming, the second document will rank first because it has more occurrences of the word 'play' (3 vs. 2). If you do stemming, both words become 'play', and the first document will rank first because it contains the stem 'play' 7 times while the second document contains it only 4 times.
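The two-document example can be checked with a small sketch. The `toy_stem` function below is a stand-in written only for this example (it just strips a trailing "ed"); a real stemmer such as Porter's handles far more cases.

```python
from collections import Counter

def toy_stem(word):
    """Toy stemmer for this example only: strips a trailing 'ed'."""
    return word[:-2] if word.endswith("ed") else word

# The two documents from the example, reduced to the relevant tokens.
doc1 = ["play"] * 2 + ["played"] * 5
doc2 = ["play"] * 3 + ["played"] * 1

def tf(tokens, term, stem=False):
    """Raw count of `term` in `tokens`, optionally after stemming both."""
    if stem:
        tokens = [toy_stem(t) for t in tokens]
        term = toy_stem(term)
    return Counter(tokens)[term]

raw = (tf(doc1, "play"), tf(doc2, "play"))                            # (2, 3)
stemmed = (tf(doc1, "play", stem=True), tf(doc2, "play", stem=True))  # (7, 4)
print(raw, stemmed)
```

The same merging applies to document counts: a document that contains only 'played' also counts toward the stem 'play', so the stem's document frequency can only stay the same or grow, which can lower its idf.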


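Concerning stop-word removal: stop words are found frequently in all documents and aren't considered keywords for any of them, so they would have a high frequency without carrying any meaning. Because a stop word appears in nearly every document, its idf is close to zero; removing stop words simply drops these terms from the tf and idf calculations.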
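A minimal sketch of the stop-word point, using a made-up three-document corpus and an assumed one-word stop list (`docs`, `STOP_WORDS`, and `idf_in_corpus` are illustrative names, not from any library):

```python
import math

# Hypothetical tokenized corpus; "the" appears in every document.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
STOP_WORDS = {"the"}  # assumed stop list for this example

def idf_in_corpus(term, corpus):
    """log(number of documents / number of documents containing the term)."""
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df)

idf_the = idf_in_corpus("the", docs)  # term in all 3 documents
idf_dog = idf_in_corpus("dog", docs)  # term in 1 of 3 documents

# Stop-word removal: drop the stop words before computing any frequencies.
filtered = [[t for t in d if t not in STOP_WORDS] for d in docs]
print(idf_the, idf_dog, filtered)
```

Here 'the' has the highest raw count in the corpus but an idf of 0, so it contributes nothing to distinguishing documents; filtering it out removes it from the vocabulary altogether.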


05-26 08:31