Problem description
Is it possible to get character positions of each highlighted fragment? I need to match the highlighted text back to the source document and having character positions would make it possible.
For example:
curl "localhost:9200/twitter/tweet/_search?pretty=true" -d '{
"query": {
"query_string": {
"query": "foo"
}
},
"highlight": {
"fields": {
"message": {"number_of_fragments": 20}
}
}
}'
returns this highlight:
"highlight" : {
"message" : [ "some <em>foo</em> text" ]
}
If the field message in the matched document were:
"Here is some foo text"
is there a way to know that the snippet begins at char 8 and ends at char 21 of the matched field?
Knowing the start/end offset of the matched token would be good for me as well - perhaps there is a way to access that information using script_fields? (This question shows how to obtain the tokens, but not the offsets).
The field "message" has:
"term_vector" : "with_positions_offsets",
"index_options" : "positions"
The client-side approach is actually standard practice.
We have discussed adding the offsets, but are afraid it would lead to more confusion. The offsets provided are specific to Java's UTF-16 String encoding. While they could technically be used to calculate the fragments from $LANG, it is far more straightforward to parse the response text for the delimiters you specified.
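The client-side approach described above could look roughly like the following Python sketch. It is an illustration only, not part of Elasticsearch or any client library: the function name `fragment_offsets` is hypothetical, and it assumes the default `<em>`/`</em>` delimiters, a highlighter that does not HTML-encode the fragment, and a fragment whose plain text appears verbatim in the source field (if it appears more than once, the first occurrence is used).

```python
import re


def fragment_offsets(fragment, source, pre_tag="<em>", post_tag="</em>"):
    """Locate a highlighted fragment inside the source field (hypothetical helper).

    Returns (fragment_start, fragment_end, term_spans), where term_spans is a
    list of (start, end) pairs for each highlighted term. All offsets are
    end-exclusive Python string indices (code points, not UTF-16 units)
    relative to `source`. Returns None if the fragment text is not found.
    """
    pattern = re.compile(re.escape(pre_tag) + "(.*?)" + re.escape(post_tag), re.S)

    plain_parts = []   # pieces of the fragment with the delimiters removed
    local_spans = []   # term offsets relative to the de-tagged fragment
    pos = 0            # read position in the tagged fragment
    plain_len = 0      # length of the de-tagged text built so far

    for match in pattern.finditer(fragment):
        before = fragment[pos:match.start()]
        plain_parts.append(before)
        plain_len += len(before)

        term = match.group(1)
        local_spans.append((plain_len, plain_len + len(term)))
        plain_parts.append(term)
        plain_len += len(term)
        pos = match.end()

    plain_parts.append(fragment[pos:])
    plain = "".join(plain_parts)

    # Assumption: the de-tagged fragment occurs verbatim in the source text.
    start = source.find(plain)
    if start == -1:
        return None
    spans = [(start + s, start + e) for s, e in local_spans]
    return start, start + len(plain), spans


result = fragment_offsets("some <em>foo</em> text", "Here is some foo text")
print(result)
```

For the example in the question this yields a fragment spanning offsets 8 to 21 and a single highlighted term at 13 to 16. If custom delimiters are set with `pre_tags`/`post_tags` in the highlight request, pass the same strings as `pre_tag` and `post_tag`; choosing delimiters that cannot occur in the field text keeps the parsing unambiguous.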