This article covers how to get the top 100 most-used three-word phrases across all documents.

Problem description

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all these texts.

Something like this:

    Hello there sir: 203
    Big bad pony: 92
    First come first: 56
    [...]

I'm new to this. I looked into term vectors, but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of some sort, but I have no idea how to go about implementing this. Any pointers will be helpful.

My current mapping and settings:

    {
      "mappings": {
        "items": {
          "properties": {
            "body": {
              "type": "string",
              "term_vector": "with_positions_offsets_payloads",
              "store": true,
              "analyzer": "fulltext_analyzer"
            }
          }
        }
      },
      "settings": {
        "index": {
          "number_of_shards": 1,
          "number_of_replicas": 0
        },
        "analysis": {
          "analyzer": {
            "fulltext_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "type_as_payload"
              ]
            }
          }
        }
      }
    }

Solution

What you're looking for are called shingles.
Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine".)

Take a look here: https://www.elastic.co/blog/searching-with-shingles

Basically, you need a field with a shingle analyzer that produces only 3-term shingles. Use the configuration from the Elastic blog post, but with:

    "filter_shingle": {
      "type": "shingle",
      "max_shingle_size": 3,
      "min_shingle_size": 3,
      "output_unigrams": false
    }

Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred 3-word phrases:

    {
      "size": 0,
      "query": {
        "match_all": {}
      },
      "aggs": {
        "three-word-phrases": {
          "terms": {
            "field": "body",
            "size": 100
          }
        }
      }
    }
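
To make it more concrete what the shingle filter and the terms aggregation compute together, here is a minimal pure-Python sketch. It does not talk to Elasticsearch at all. The function names `three_word_shingles` and `top_phrases` and the sample documents are made up for illustration, and the tokenization (split on whitespace, then lowercase) is only an approximation of the `fulltext_analyzer` from the question. One thing to be aware of: a terms aggregation reports `doc_count`, i.e. the number of documents containing a phrase, not the total number of occurrences, so the sketch counts each phrase at most once per document.

```python
from collections import Counter


def three_word_shingles(text, size=3):
    """Return every run of `size` consecutive words in `text` as one string."""
    # Rough stand-in for the question's analyzer: whitespace split + lowercase.
    tokens = text.lower().split()
    # Texts shorter than `size` words yield an empty list.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def top_phrases(documents, limit=100, size=3):
    """Return the `limit` shingles that appear in the most documents."""
    counts = Counter()
    for body in documents:
        # set(): a phrase repeated inside one document counts once,
        # which mirrors the doc_count of a terms aggregation bucket.
        counts.update(set(three_word_shingles(body, size)))
    return counts.most_common(limit)


# Hypothetical sample data, standing in for the scraped body texts.
docs = [
    "We all live in a yellow submarine",
    "We all live here",
    "in a yellow taxi",
]

print(three_word_shingles(docs[0]))
print(top_phrases(docs, limit=2))
```

If you need total occurrences instead of document counts, dropping the `set()` call in the sketch gives that; on the Elasticsearch side that would need a different approach than a plain terms aggregation.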
11-02 22:51