Problem description
I have an index mapping with two string fields, field1 and field2, both declared with copy_to into another field called all_fields. all_fields is indexed as not_analyzed. When I run a terms (bucket) aggregation on all_fields, I expected distinct buckets whose keys are field1 and field2 concatenated together. Instead, I get separate buckets keyed on the individual, unconcatenated values of field1 and field2.
Example:
mapping:
{
  "mappings": {
    "myobject": {
      "properties": {
        "field1": {
          "type": "string",
          "index": "analyzed",
          "copy_to": "all_fields"
        },
        "field2": {
          "type": "string",
          "index": "analyzed",
          "copy_to": "all_fields"
        },
        "all_fields": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
data:
{
  "field1": "dinner carrot potato broccoli",
  "field2": "something here"
}
and
{
  "field1": "fish chicken something",
  "field2": "dinner"
}
aggregation:
{
  "aggs": {
    "t": {
      "terms": {
        "field": "all_fields"
      }
    }
  }
}
results:
...
"aggregations": {
  "t": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "dinner",
        "doc_count": 1
      },
      {
        "key": "dinner carrot potato broccoli",
        "doc_count": 1
      },
      {
        "key": "fish chicken something",
        "doc_count": 1
      },
      {
        "key": "something here",
        "doc_count": 1
      }
    ]
  }
}
I was expecting only 2 buckets, "fish chicken something dinner" and "dinner carrot potato broccoli something here".
What am I doing wrong?
Solution
What you are looking for is concatenation of two strings. copy_to may look like it does this, but it does not. With copy_to you are, conceptually, creating a set of values collected from both field1 and field2, not concatenating them.
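To illustrate (this is a sketch of the effective indexed values, not output from any API), the two example documents end up with a multi-valued all_fields rather than a single concatenated string:
{ "all_fields": ["dinner carrot potato broccoli", "something here"] }   (document 1)
{ "all_fields": ["fish chicken something", "dinner"] }                  (document 2)
The terms aggregation then treats each value as its own bucket key, which is exactly the four-bucket result shown above.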
For your use case, you have two options:
- use a _source transformation
- perform a script aggregation
I would recommend the _source transformation, as I think it's more efficient than scripting: you pay a small price at indexing time instead of running a heavy scripted aggregation at query time.
For the _source transformation:
PUT /lastseen
{
  "mappings": {
    "test": {
      "transform": {
        "script": "ctx._source['all_fields'] = ctx._source['field1'] + ' ' + ctx._source['field2']"
      },
      "properties": {
        "field1": {
          "type": "string"
        },
        "field2": {
          "type": "string"
        },
        "lastseen": {
          "type": "long"
        },
        "all_fields": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
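As a quick sketch (the document IDs here are arbitrary), the two documents from the question could be indexed like this; the transform then fills in all_fields at index time:
PUT /lastseen/test/1
{
  "field1": "dinner carrot potato broccoli",
  "field2": "something here"
}
PUT /lastseen/test/2
{
  "field1": "fish chicken something",
  "field2": "dinner"
}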
And the query:
GET /lastseen/test/_search
{
  "aggs": {
    "NAME": {
      "terms": {
        "field": "all_fields",
        "size": 10
      }
    }
  }
}
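With the transform in place, the aggregation should return just the two concatenated keys the question was asking for, roughly like this (response abridged; the metadata fields are omitted for clarity):
"aggregations": {
  "NAME": {
    "buckets": [
      { "key": "dinner carrot potato broccoli something here", "doc_count": 1 },
      { "key": "fish chicken something dinner", "doc_count": 1 }
    ]
  }
}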
For the script aggregation, to make it easier (meaning you can use doc['field'].value rather than the more expensive _source.field), add .raw sub-fields to field1 and field2:
PUT /lastseen
{
  "mappings": {
    "test": {
      "properties": {
        "field1": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "field2": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "lastseen": {
          "type": "long"
        }
      }
    }
  }
}
And the script aggregation will use these .raw sub-fields:
{
  "aggs": {
    "NAME": {
      "terms": {
        "script": "doc['field1.raw'].value + ' ' + doc['field2.raw'].value",
        "size": 10,
        "lang": "groovy"
      }
    }
  }
}
Without the .raw sub-fields (which are not_analyzed on purpose), you would have needed to do something like the following, which is more expensive because it loads and parses the stored _source for every document instead of reading field data:
{
  "aggs": {
    "NAME": {
      "terms": {
        "script": "_source.field1 + ' ' + _source.field2",
        "size": 10,
        "lang": "groovy"
      }
    }
  }
}