elasticsearch - Elasticsearch highlighting after stripping HTML tags

I am building an Elasticsearch index on ES 2.3.3 and have defined the following field, which has one sub-field:

"properties": {
    "content": {
        "type": "string",
        "index_options": "offsets",
        "store": "yes",
        "fields": {
            "base": {
                "type": "string",
                "analyzer": "base_analyzer"
            }
        }
    }
}

I have defined base_analyzer in the settings so that it strips HTML content:

"base_analyzer": {
    "tokenizer": "standard",
    "char_filter": ["html_strip"]
}

What I want to do is run a search and highlight the search terms in content.base (the content field with the HTML tags stripped out). I do it like this:

"query": {
    "match": {"content.base": {"query": "this is what I'm searching"}}
},
"highlight": {
    "fields": {
        "content.base": {}
    }
}

The problem is that when I run the query above against _search, I still get HTML tags in the highlighted field. Do you know why this happens?

Best answer

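If you want to strip the HTML completely before the content is indexed and stored, you can use the mapper attachments plugin. When you index a document through it, you can set the content_type to "html". The highlights you get back will then contain no HTML tags.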

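The mapper attachments plugin is useful for many things, especially when you handle multiple document types. Most notably, though, I believe that using it purely to strip out the HTML tags is enough here, which is something you cannot achieve with the html_strip char filter.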
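
To illustrate why html_strip alone does not help: a char filter only changes the tokens that go into the index, while the highlighter builds its fragments from the original field value, using token offsets that point back into that original text. The pure-Python sketch below is a toy model of that behaviour, not Elasticsearch's actual implementation; the function names and the sample string are made up for illustration.

```python
import re


def tokens_with_original_offsets(html):
    """Return (term, start, end) for words outside tags.

    Tags produce no tokens, but the offsets still index into the
    ORIGINAL string, tags included.
    """
    tokens = []
    # Split the input into tags and text runs; only text runs yield tokens.
    for run in re.finditer(r"<[^>]*>|[^<]+", html):
        if run.group().startswith("<"):
            continue
        for word in re.finditer(r"\w+", run.group()):
            start = run.start() + word.start()
            end = run.start() + word.end()
            tokens.append((word.group().lower(), start, end))
    return tokens


def highlight(html, term):
    """Wrap matches in <em>, slicing the original text by token offsets."""
    out, pos = [], 0
    for tok, start, end in tokens_with_original_offsets(html):
        if tok == term.lower():
            out.append(html[pos:start])
            out.append("<em>" + html[start:end] + "</em>")
            pos = end
    out.append(html[pos:])
    return "".join(out)


source = "<h1>Topic10</h1><p>Check your mailbox.</p>"
print(highlight(source, "topic10"))
```

The tag names never become searchable terms, yet the highlighted fragment still carries the markup, because it is cut from the original text rather than from the stripped token stream.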

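One warning, though: none of the HTML tags will be stored. If you do need those tags, I suggest defining another field to hold the original content. Another note: you cannot specify multi-fields for mapper attachment documents, so that field has to live outside the attachment document. See the working example below.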

You need to end up with this mapping:

{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : { /* insert your own settings here, including the "autocomplete" and "raw" analyzers referenced above */ },
    "warmers" : { }
  }
}

With that in place, this is how I assemble the content in NEST:

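// Requires: using System; using System.IO;
// Document is assumed to be a user-defined class mapped to the "document" type.
Attachment attachment = new Attachment();
// The attachment field takes the file contents as a base64 string.
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";

Document document = new Document();
document.File = attachment;
// Keep the original markup in a separate field, outside the attachment.
document.RawContent = InsertRawContentFromString(originalText);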
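
If you are not using NEST, the same document can be assembled as plain JSON. Here is a minimal sketch in Python, assuming the `_content` and `_content_type` keys that appear in the indexed document in this answer; `build_document` and the sample HTML are hypothetical names and data for illustration only.

```python
import base64
import json


def build_document(html_bytes, raw_text):
    """Build the JSON body for one document.

    The attachment goes in as base64 under "file"; the untouched markup
    is kept in the separate "raw_content" field.
    """
    return {
        "file": {
            "_content": base64.b64encode(html_bytes).decode("ascii"),
            "_content_type": "html",
        },
        "raw_content": raw_text,
    }


sample_html = "<html><body><h1>Topic10</h1></body></html>"
doc = build_document(sample_html.encode("utf-8"), "<h1>Topic10</h1>")
print(json.dumps(doc, indent=2))
```

The resulting dictionary can be sent as the request body with whatever HTTP or Elasticsearch client you already use.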

I have tested this in Sense. The results are as follows:

"file": {
    "_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
    "_content_length": 0,
    "_content_type": "html",
    "_date": "0001-01-01T00:00:00",
    "_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
    "\n    <em>Topic10</em>\n\n    Delete this text and replace it with your own content. Check your mailbox.\n\n     \n\n    asdf\n\n     \n\n    10\n\n     \n\n    Lavender.\n\n     \n\n    10/6 12:03\n\n     \n\n    5 09\n\n     \n\n    11 47\n\n     \n\n    Halloween is in October.\n\n     \n\n    jog\n\n  "
    ]
}
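
The answer does not show the request that produced this highlight. A plausible search body, sketched here in Python, would match on `file.content` and ask for highlighting on the same field. The query text "Topic10" and the helper name `build_search_body` are assumptions for illustration.

```python
import json


def build_search_body(text):
    """Build a _search body that matches and highlights file.content."""
    return {
        "query": {"match": {"file.content": text}},
        "highlight": {"fields": {"file.content": {}}},
    }


body = build_search_body("Topic10")
print(json.dumps(body, indent=2))
```

Because `file.content` holds the text extracted by the attachment plugin, the fragments returned for it contain no markup, while `raw_content` still provides the original HTML when it is needed.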

Regarding elasticsearch - Elasticsearch highlighting after stripping HTML tags, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38348242/