I am looking into folding non-ASCII characters to their ASCII equivalents, as this guide suggests.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
Strangely, I cannot reproduce the example from the guide's first code snippet.
When I run
GET /my_index/_analyze?analyzer=folding&text=My œsophagus caused a débâcle
the following tokens are returned:
sophagus, caused, a, d, b, cle
What I want to achieve is this:
Spelling variants of a word such as "école" (for example ecole, ècole) should be treated as the same word.
Now, if I run
GET /my_index/_analyze?analyzer=folding&text=école ecole
I get the tokens
cole, ecole
These are the settings I currently use to analyze document text:
"analysis": {
  "filter": {
    "french_stop": {
      "type": "stop",
      "stopwords": "_french_"
    },
    "french_elision": {
      "type": "elision",
      "articles": [
        "l",
        "m",
        "t",
        "qu",
        "n",
        "s",
        "j",
        "d",
        "c",
        "jusqu",
        "quoiqu",
        "lorsqu",
        "puisqu"
      ]
    },
    "french_stemmer": {
      "type": "stemmer",
      "language": "light_french"
    }
  },
  "analyzer": {
    "index_French": {
      "filter": [
        "french_elision",
        "lowercase",
        "french_stop",
        "french_stemmer"
      ],
      "char_filter": [
        "html_strip"
      ],
      "type": "custom",
      "tokenizer": "standard"
    },
    "sort_analyzer": {
      "type": "custom",
      "filter": [
        "lowercase"
      ],
      "tokenizer": "keyword"
    }
  }
}
My idea is to change the filter list of the index_French analyzer so that it reads:
"filter": ["french_elision","lowercase","asciifolding","french_stop","french_stemmer"]
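To illustrate what the extra asciifolding step is expected to do to the variants above, here is a rough Python approximation. It is not Lucene's implementation: it strips combining marks after Unicode decomposition and relies on a small hand-written table (`EXTRA_FOLDS`, a hypothetical name) for ligatures such as œ that do not decompose. The whitespace split stands in for the standard tokenizer only for these simple inputs.

```python
import unicodedata
from typing import List

# Hypothetical helper table: characters that NFKD does not decompose
# but that an ASCII-folding step is expected to map to plain letters.
EXTRA_FOLDS = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"}


def fold_ascii(token: str) -> str:
    """Approximate ASCII folding: expand ligatures, then drop accents."""
    expanded = "".join(EXTRA_FOLDS.get(ch, ch) for ch in token)
    decomposed = unicodedata.normalize("NFKD", expanded)
    # After decomposition the accents are separate combining marks; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> List[str]:
    """Lowercase, split on whitespace, and fold each token."""
    return [fold_ascii(tok) for tok in text.lower().split()]


if __name__ == "__main__":
    print(analyze("école ecole ècole"))
    print(analyze("My œsophagus caused a débâcle"))
```

Under this approximation all three spellings of "école" collapse to the same token, which is the behaviour the modified filter list is meant to produce at index time.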
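Thanks for your help.
Best answer
In Sense, you need to call the _analyze endpoint like this, with the text in the request body (the index here is named foldings), and it will work: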
POST /foldings/_analyze
{
  "text": "My œsophagus caused a débâcle",
  "analyzer": "folding"
}
You will get
{
  "tokens": [
    {
      "token": "my",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "oesophagus",
      "start_offset": 3,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "caused",
      "start_offset": 13,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "a",
      "start_offset": 20,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "debacle",
      "start_offset": 22,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
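Outside Sense, the same request can be sent from a script. The sketch below uses only the Python standard library and assumes a node at http://localhost:9200 (an assumption, not something stated in the question). It puts the text in a UTF-8 JSON body, so the accented characters never pass through the URL, and it pulls the token strings out of a response shaped like the one above. The function names are illustrative.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed node address, adjust as needed


def build_analyze_request(index, analyzer, text):
    """Build a POST request for /<index>/_analyze with a JSON body."""
    body = json.dumps(
        {"analyzer": analyzer, "text": text}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def token_strings(response):
    """Extract the token texts from a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(index, analyzer, text):
    """Send the request; this needs a reachable Elasticsearch node."""
    request = build_analyze_request(index, analyzer, text)
    with urllib.request.urlopen(request) as resp:
        return token_strings(json.load(resp))
```

Calling `analyze("foldings", "folding", "école ecole")` against an index that defines the folding analyzer is then a quick way to check whether both spellings yield the same token.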
Regarding elasticsearch - folding ASCII characters correctly in Elasticsearch, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36789764/