I am looking into folding non-standard ASCII characters, as this guide suggests.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

Strangely, I cannot reproduce the example from the first code snippet.

When I run
GET /my_index/_analyze?analyzer=folding&text=My œsophagus caused a débâcle

the following tokens are returned:
sophagus, caused, a, d, b, cle

What I want to achieve is:

Spelling variants of a word such as "école" (for example ecole, ècole) should be treated as the same word.
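
The intended behaviour can be sketched in a few lines of Python. This is only a rough approximation of what the lowercase and asciifolding filters are expected to do, not Lucene's implementation: the names `EXTRA`, `fold`, and `analyze` are hypothetical, the whitespace split stands in for the standard tokenizer, and the replacement table covers only a handful of characters.

```python
import unicodedata

# Characters with no Unicode decomposition need explicit replacements.
# Assumption: only a few are listed here; the real filter handles many more.
EXTRA = {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE", "ß": "ss"}


def fold(text: str) -> str:
    """Replace accented characters with their closest ASCII form."""
    replaced = "".join(EXTRA.get(ch, ch) for ch in text)
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> list:
    """Split on whitespace, lowercase, then fold each token."""
    return [fold(token.lower()) for token in text.split()]


print(analyze("école ecole ècole"))
print(analyze("My œsophagus caused a débâcle"))
```

With folding in place, all three spellings of "école" end up as the same token, which is the behaviour wanted from the analyzer.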

Now, if I run
GET /my_index/_analyze?analyzer=folding&text=école ecole

I get the tokens cole, ecole

These are the settings I currently use for analyzing document text:

"analysis": {
  "filter": {
    "french_stop": {
      "type": "stop",
      "stopwords": "_french_"
    },
    "french_elision": {
      "type": "elision",
      "articles": [
        "l", "m", "t", "qu", "n", "s", "j", "d", "c",
        "jusqu", "quoiqu", "lorsqu", "puisqu"
      ]
    },
    "french_stemmer": {
      "type": "stemmer",
      "language": "light_french"
    }
  },
  "analyzer": {
    "index_French": {
      "filter": [
        "french_elision",
        "lowercase",
        "french_stop",
        "french_stemmer"
      ],
      "char_filter": [
        "html_strip"
      ],
      "type": "custom",
      "tokenizer": "standard"
    },
    "sort_analyzer": {
      "type": "custom",
      "filter": [
        "lowercase"
      ],
      "tokenizer": "keyword"
    }
  }
}

My idea is to change the filters of the index_French analyzer so that the list reads as follows:
"filter": ["french_elision","lowercase","asciifolding","french_stop","french_stemmer"]

Thanks for your help.

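Best answer

In Sense, you need to call the _analyze endpoint with a request body like this (the example uses an index named foldings; substitute your own index name), and it will work: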

POST /foldings/_analyze
{
   "text": "My œsophagus caused a débâcle",
   "analyzer": "folding"
}

You will get:
{
   "tokens": [
      {
         "token": "my",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "oesophagus",
         "start_offset": 3,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "caused",
         "start_offset": 13,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "a",
         "start_offset": 20,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "debacle",
         "start_offset": 22,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
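
One plausible explanation for why the query-string form in the question loses œ, é, and â while the request body works: text placed in a URL has to be percent-encoded as UTF-8, and a client that sends it raw or in another encoding can mangle those characters before Elasticsearch sees them. This is an assumption about the cause, not something confirmed above. The Python sketch below uses only the standard library to show what a correctly encoded query string for the question's request would look like.

```python
from urllib.parse import quote, urlencode

text = "My œsophagus caused a débâcle"

# Percent-encode the text as UTF-8; with safe="" every reserved
# character is escaped and spaces become %20.
encoded_text = quote(text, safe="")

# Build the query string for the _analyze call from the question.
# quote_via=quote keeps spaces as %20 instead of "+".
query = urlencode({"analyzer": "folding", "text": text}, quote_via=quote)
url = "/my_index/_analyze?" + query

print(encoded_text)
print(url)
```

Sending the text in a JSON request body, as shown in the answer, avoids the question of URL encoding altogether.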

On the topic of correctly folding ASCII characters in Elasticsearch, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36789764/
