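High-level full-text queries are typically used to run full-text searches on full-text fields, such as the body of an email. These queries understand how the queried field is analyzed, and they apply each field's analyzer (or search_analyzer) to the query string before executing.
1. term query
term stands for an exact match: the search term is not analyzed before the search, so it must be one of the tokens produced when the document was indexed. (Strictly speaking, Elasticsearch classifies term as a term-level query rather than a full-text query; it serves here as a contrast to the analyzed queries that follow.)
For example, we can analyze "周五召开董事会会议 审议及批准更新后的一季报" with a specified analyzer.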
GET telegraph/_analyze
{
  "analyzer": "ik_max_word",
  "text": "周五召开董事会会议 审议及批准更新后的一季报"
}
The analysis produces 16 tokens (positions 0 to 15):
{
"tokens": [
{
"token": "周五",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "五",
"start_offset": 1,
"end_offset": 2,
"type": "TYPE_CNUM",
"position": 1
},
{
"token": "召开",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 2
},
{
"token": "董事会",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
},
{
"token": "董事",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 4
},
{
"token": "会会",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 5
},
{
"token": "会议",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 6
},
{
"token": "审议",
"start_offset": 10,
"end_offset": 12,
"type": "CN_WORD",
"position": 7
},
{
"token": "及",
"start_offset": 12,
"end_offset": 13,
"type": "CN_CHAR",
"position": 8
},
{
"token": "批准",
"start_offset": 13,
"end_offset": 15,
"type": "CN_WORD",
"position": 9
},
{
"token": "更新",
"start_offset": 15,
"end_offset": 17,
"type": "CN_WORD",
"position": 10
},
{
"token": "后",
"start_offset": 17,
"end_offset": 18,
"type": "CN_CHAR",
"position": 11
},
{
"token": "的",
"start_offset": 18,
"end_offset": 19,
"type": "CN_CHAR",
"position": 12
},
{
"token": "一季",
"start_offset": 19,
"end_offset": 21,
"type": "CN_WORD",
"position": 13
},
{
"token": "一",
"start_offset": 19,
"end_offset": 20,
"type": "TYPE_CNUM",
"position": 14
},
{
"token": "季报",
"start_offset": 20,
"end_offset": 22,
"type": "CN_WORD",
"position": 15
}
]
}
Now we search for "会议" with term:
GET telegraph/_search
{
  "query": {
    "term": {
      "title": {
        "value": "会议"
      }
    }
  }
}
Because the search term "会议" is one of the indexed tokens, the document is found:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "telegraph",
"_type": "msg",
"_id": "AZetp2QBW8hrYY3zGJk7",
"_score": 0.2876821,
"_source": {
"title": "周五召开董事会会议 审议及批准更新后的一季报",
"content": "以审议及批准更新后的2018年第一季度报告",
"author": "中兴通讯",
"pubdate": "2018-07-17T12:33:11"
}
}
]
}
}
If we search for "董事会会议" instead:
GET telegraph/_search
{
  "query": {
    "term": {
      "title": {
        "value": "董事会会议"
      }
    }
  }
}
Although "董事会会议" appears in the document text, it is not one of the indexed tokens, so nothing is found:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
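The behavior above can be imitated with a few lines of Python. This is only a sketch of the idea, not how Elasticsearch is implemented: `term_matches` is a hypothetical helper, and the token list is copied from the _analyze output shown earlier.

```python
# Tokens that ik_max_word produced for the title (copied from the _analyze output above).
TITLE_TOKENS = ["周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
                "及", "批准", "更新", "后", "的", "一季", "一", "季报"]


def term_matches(indexed_tokens, value):
    """Hypothetical model of a term query: the value is compared as-is, never analyzed."""
    return value in set(indexed_tokens)


print(term_matches(TITLE_TOKENS, "会议"))        # True: "会议" is an indexed token
print(term_matches(TITLE_TOKENS, "董事会会议"))  # False: present in the text, but not a token
```

The second call fails for the same reason the term query returned no hits: the lookup is against whole tokens, not against substrings of the original text.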
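2. match query
A match query first analyzes the search text and then matches each resulting token individually. Compared with the exact lookup done by term, match is a token-by-token search.
When we search for "河北会议", the search text is first split into "河北" and "会议", and any document containing either token is returned. The operator parameter controls how the tokens are combined: with operator set to and, a document is returned only if its token set contains both "河北" and "会议". The default is or, which means a single matching token is enough.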
GET telegraph/_search
{
  "query": {
    "match": {
      "title": {
        "query": "河北会议"
      }
    }
  }
}
Search result:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.99277425,
"hits": [
{
"_index": "telegraph",
"_type": "msg",
"_id": "BJetp2QBW8hrYY3zGJk7",
"_score": 0.99277425,
"_source": {
"title": "河北聚焦十大行业推进国际产能合作",
"content": "河北省政府近日出台积极参与“一带一路”建设推进国际产能合作实施方案",
"author": "财联社",
"pubdate": "2018-07-17T14:14:55"
}
},
{
"_index": "telegraph",
"_type": "msg",
"_id": "AZetp2QBW8hrYY3zGJk7",
"_score": 0.2876821,
"_source": {
"title": "周五召开董事会会议 审议及批准更新后的一季报",
"content": "以审议及批准更新后的2018年第一季度报告",
"author": "中兴通讯",
"pubdate": "2018-07-17T12:33:11"
}
}
]
}
}
If we set operator to and:
GET telegraph/_search
{
  "query": {
    "match": {
      "title": {
        "query": "河北会议",
        "operator": "and"
      }
    }
  }
}
Because no document's token set contains both "河北" and "会议", the result is empty.
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
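The difference between the two operators can be sketched in Python as follows. The `match` function is a hypothetical model, and the per-document token sets are partial and assumed for illustration (only the tokens relevant to this query are listed); relevance scoring is ignored.

```python
def match(doc_tokens, query_tokens, operator="or"):
    """Hypothetical model of match: query_tokens is the already-analyzed search text."""
    hits = [token in doc_tokens for token in query_tokens]
    return all(hits) if operator == "and" else any(hits)


# How the article says "河北会议" is analyzed.
QUERY_TOKENS = ["河北", "会议"]

# Partial token sets per title, assumed for illustration only.
DOCS = {
    "河北聚焦十大行业推进国际产能合作": {"河北", "聚焦"},
    "周五召开董事会会议 审议及批准更新后的一季报": {"召开", "董事会", "会议"},
}

or_hits = [title for title, tokens in DOCS.items() if match(tokens, QUERY_TOKENS)]
and_hits = [title for title, tokens in DOCS.items() if match(tokens, QUERY_TOKENS, "and")]

print(len(or_hits))   # 2: each title contains one of the two tokens
print(len(and_hits))  # 0: no title contains both tokens
```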
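3. match_phrase query
A match_phrase query also analyzes the query text first (the analyzer can be customized). A document is returned only if it satisfies all three of the following conditions:
- every token produced from the query appears in the field
- the tokens appear in the field in the same order as in the query
- the tokens are directly adjacent to each other
Using the same example, searching for "董事会会议" now finds the document. If the tokens were in a different order or not directly adjacent, the document would not be found.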
GET telegraph/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "董事会会议"
      }
    }
  }
}
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.1507283,
"hits": [
{
"_index": "telegraph",
"_type": "msg",
"_id": "AZetp2QBW8hrYY3zGJk7",
"_score": 1.1507283,
"_source": {
"title": "周五召开董事会会议 审议及批准更新后的一季报",
"content": "以审议及批准更新后的2018年第一季度报告",
"author": "中兴通讯",
"pubdate": "2018-07-17T12:33:11"
}
}
]
}
}
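The three conditions amount to finding the query tokens as one contiguous run in the field's token sequence. The sketch below models that in Python. `phrase_matches` is a hypothetical helper, and the query token list is an assumption: it is what the same analyzer would plausibly produce for "董事会会议", consistent with positions 3 to 6 in the _analyze output above. List indexes stand in for token positions.

```python
# Tokens of the title in position order (copied from the _analyze output above).
TITLE_TOKENS = ["周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
                "及", "批准", "更新", "后", "的", "一季", "一", "季报"]


def phrase_matches(doc_tokens, query_tokens):
    """Hypothetical model of match_phrase: all tokens, same order, directly adjacent."""
    n = len(query_tokens)
    if n == 0:
        return False
    # Slide a window of length n over the document and compare it to the query.
    return any(doc_tokens[start:start + n] == query_tokens
               for start in range(len(doc_tokens) - n + 1))


# Assumed analysis of the query "董事会会议".
print(phrase_matches(TITLE_TOKENS, ["董事会", "董事", "会会", "会议"]))  # True
print(phrase_matches(TITLE_TOKENS, ["会议", "董事会"]))                  # False: wrong order
print(phrase_matches(TITLE_TOKENS, ["召开", "会议"]))                    # False: not adjacent
```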
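4. match_phrase_prefix query
match_phrase_prefix is similar to match_phrase, except that the last token of the query only needs to match as a prefix.
In the example above, the document's token set contains the adjacent tokens "召开" and "董事会". With match_phrase_prefix, the query only needs to contain "召开" followed by a prefix of "董事会" to match.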
GET telegraph/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "召开董"
      }
    }
  }
}
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.8630463,
"hits": [
{
"_index": "telegraph",
"_type": "msg",
"_id": "AZetp2QBW8hrYY3zGJk7",
"_score": 0.8630463,
"_source": {
"title": "周五召开董事会会议 审议及批准更新后的一季报",
"content": "以审议及批准更新后的2018年第一季度报告",
"author": "中兴通讯",
"pubdate": "2018-07-17T12:33:11"
}
}
]
}
}
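As a rough Python sketch of the rule: every query token except the last must match exactly and in sequence, and the last one is treated as a prefix of the next indexed token. `phrase_prefix_matches` is a hypothetical helper, and the query tokens ["召开", "董"] are an assumed analysis of "召开董". Elasticsearch itself expands the prefix into a limited number of candidate terms, which this sketch does not model.

```python
# Tokens of the title in position order (copied from the _analyze output above).
TITLE_TOKENS = ["周五", "五", "召开", "董事会", "董事", "会会", "会议", "审议",
                "及", "批准", "更新", "后", "的", "一季", "一", "季报"]


def phrase_prefix_matches(doc_tokens, query_tokens):
    """Hypothetical model of match_phrase_prefix: exact phrase, then a prefix on the last token."""
    if not query_tokens:
        return False
    *head, last = query_tokens
    n = len(head)
    for start in range(len(doc_tokens) - n):
        # The leading tokens must match exactly; the token after them must start with `last`.
        if doc_tokens[start:start + n] == head and doc_tokens[start + n].startswith(last):
            return True
    return False


print(phrase_prefix_matches(TITLE_TOKENS, ["召开", "董"]))  # True: "董事会" starts with "董"
print(phrase_prefix_matches(TITLE_TOKENS, ["召开", "会"]))  # False: the token after "召开" is "董事会"
```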
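5. multi_match query
When we want to match against several fields and return a document as soon as any one of them contains a query token, we can use multi_match.
Here we search for "聚焦成交"; a document is returned if either the title or the content field contains one of the resulting tokens.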
GET telegraph/_search
{
  "query": {
    "multi_match": {
      "query": "聚焦成交",
      "fields": ["title", "content"]
    }
  }
}
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.0806551,
"hits": [
{
"_index": "telegraph",
"_type": "msg",
"_id": "Apetp2QBW8hrYY3zGJk7",
"_score": 1.0806551,
"_source": {
"title": "长生生物再次跌停 三机构抛售近1000万元",
"content": "长生生物再次一字跌停,报收19.89元,成交1432万元",
"author": "长生生物",
"pubdate": "2018-07-17T10:03:11"
}
},
{
"_index": "telegraph",
"_type": "msg",
"_id": "BJetp2QBW8hrYY3zGJk7",
"_score": 0.99277425,
"_source": {
"title": "河北聚焦十大行业推进国际产能合作",
"content": "河北省政府近日出台积极参与“一带一路”建设推进国际产能合作实施方案",
"author": "财联社",
"pubdate": "2018-07-17T14:14:55"
}
}
]
}
}
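The result above can be read as "run the same match against each listed field and keep the document if any field hits". A minimal Python sketch of that reading follows. `multi_match` here is a hypothetical helper, the query tokens ["聚焦", "成交"] are an assumed analysis of "聚焦成交", and the per-field token sets are partial and assumed for illustration; scoring across fields is not modeled.

```python
def multi_match(doc, query_tokens, fields):
    """Hypothetical model of multi_match: a hit in any listed field is enough."""
    return any(token in doc.get(field, set())
               for field in fields
               for token in query_tokens)


# Assumed analysis of the query "聚焦成交".
QUERY_TOKENS = ["聚焦", "成交"]

# Partial token sets per field, assumed for illustration only.
DOCS = [
    {"title": {"跌停"}, "content": {"成交"}},   # 长生生物 article
    {"title": {"聚焦"}, "content": {"出台"}},   # 河北 article
    {"title": {"会议"}, "content": {"审议"}},   # 中兴通讯 article
]

hits = [i for i, doc in enumerate(DOCS)
        if multi_match(doc, QUERY_TOKENS, ["title", "content"])]
print(hits)  # [0, 1]: one matches on content, the other on title
```

Restricting `fields` to `["title"]` in this sketch would keep only the second document, which mirrors how the fields list narrows where Elasticsearch looks.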