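I want to use ES to calculate user retention:

  • 1. Event logs are written to the default index
  • 2. Transform them into an intermediate index: entity-centric data, grouped by acc
  • 3. Use an aggs filter (or adjacency_matrix) to compute the intersection for each day.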
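As a rough illustration of step 3, the sketch below builds an adjacency_matrix request body in Python. It assumes the intermediate index stores `create` and `login` as date fields (names taken from the expected documents further down); the helper names `day_filters` and `retention_request` are made up for this example, and the date-math upper bound (`||+1d`) is one possible way to cover a whole day.

```python
def day_filters(days, create_field="create", login_field="login"):
    """Build one 'created on day' and one 'logged in on day' filter per day."""
    filters = {}
    for day in days:
        # Half-open range [day, day + 1d) so the whole calendar day matches.
        day_range = {"gte": day, "lt": f"{day}||+1d", "format": "yyyy-MM-dd"}
        filters[f"create_{day}"] = {"range": {create_field: day_range}}
        filters[f"login_{day}"] = {"range": {login_field: day_range}}
    return filters


def retention_request(days):
    """Search body whose adjacency_matrix buckets hold the pairwise intersections."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "retention": {"adjacency_matrix": {"filters": day_filters(days)}}
        },
    }


body = retention_request(["2020-08-01", "2020-08-02", "2020-08-03"])
```

The body would be sent to `_search` on the intermediate index. Intersection buckets come back keyed by the two filter names joined with the separator (`&` by default), so the bucket combining `create_2020-08-01` and `login_2020-08-02` would hold the accounts created on 08-01 that logged in on 08-02.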

  • The problem is step 2: how do I write a good transform?
    Input event logs:
    POST _bulk
    {"index": {"_index": "test.u1"}}
    {"acc":1001, "event":"create", "timestamp":"2020-08-01 09:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1001, "event":"login", "timestamp":"2020-08-01 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1001, "event":"login", "timestamp":"2020-08-02 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1001, "event":"login", "timestamp":"2020-08-03 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1002, "event":"create", "timestamp":"2020-08-01 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1002, "event":"login", "timestamp":"2020-08-02 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1002, "event":"login", "timestamp":"2020-08-02 11:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1003, "event":"create", "timestamp":"2020-08-01 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1004, "event":"create", "timestamp":"2020-08-02 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1004, "event":"login", "timestamp":"2020-08-02 10:00"}
    {"index": {"_index": "test.u1"}}
    {"acc":1004, "event":"login", "timestamp":"2020-08-03 10:00"}
    
    Expected intermediate index:
    {"acc":1001, "create":"08-01", "login":["08-01", "08-02", "08-03"]}
    {"acc":1002, "create":"08-01", "login":["08-02"]}
    {"acc":1003, "create":"08-01", "login":[]}
    {"acc":1004, "create":"08-02", "login":["08-02", "08-03"]}
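To make the target shape concrete, here is a small plain-Python sketch that pivots the sample events above into the expected per-account documents. It only illustrates the grouping logic and does not call Elasticsearch; `pivot_by_account` is a hypothetical helper name.

```python
# Sample events from the question as (acc, event, timestamp) tuples.
EVENTS = [
    (1001, "create", "2020-08-01 09:00"), (1001, "login", "2020-08-01 10:00"),
    (1001, "login", "2020-08-02 10:00"), (1001, "login", "2020-08-03 10:00"),
    (1002, "create", "2020-08-01 10:00"), (1002, "login", "2020-08-02 10:00"),
    (1002, "login", "2020-08-02 11:00"), (1003, "create", "2020-08-01 10:00"),
    (1004, "create", "2020-08-02 10:00"), (1004, "login", "2020-08-02 10:00"),
    (1004, "login", "2020-08-03 10:00"),
]


def pivot_by_account(events):
    """Group events by account: first create day plus distinct login days."""
    docs = {}
    # Timestamps are "yyyy-MM-dd HH:mm", so string order is chronological.
    for acc, event, ts in sorted(events, key=lambda e: e[2]):
        doc = docs.setdefault(acc, {"acc": acc, "create": None, "login": []})
        day = ts[5:10]  # "2020-08-01 09:00" -> "08-01"
        if event == "create" and doc["create"] is None:
            doc["create"] = day
        elif event == "login" and day not in doc["login"]:
            doc["login"].append(day)
    return [docs[acc] for acc in sorted(docs)]


for doc in pivot_by_account(EVENTS):
    print(doc)
```

Note that the two logins of account 1002 on 08-02 collapse into a single array entry, which is what the expected documents show.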
    
    How do I generate the "login" array?
    Any better design is also welcome.

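    Best answer

    Got it working with aggs.scripted_metric. Note that the transform below uses different index and field names (mhlog2-*, msg.#account_id, msg.#event_name, @timestamp) from the sample data in the question: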

    PUT _transform/tr-acc2-ar2
    {
      "source": {
        "index": [
          "mhlog2-*"
        ]
      },
      "pivot": {
        "group_by": {
          "msg.#account_id": {
            "histogram": {
              "field": "msg.#account_id",
              "interval": "1"
            }
          }
        },
        "aggregations": {
          "create": {
            "filter": {
              "term": {
                "msg.#event_name.keyword": "createRole"
              }
            },
            "aggs": {
              "time": {
                "min": {
                  "field": "@timestamp"
                }
              }
            }
          },
          "login": {
            "filter": {
              "term": {
                "msg.#event_name.keyword": "login"
              }
            },
            "aggs": {
              "days": {
                "scripted_metric": {
                  "init_script": "state.days=[:];",
                  "map_script": "state.days[doc['@timestamp'].value.toString('yyyy-MM-dd')]=1; ",
                  "combine_script": "return state",
                  "reduce_script": "def days = [:]; def array =[]; for (s in states) { for (d in s.days.keySet()) { days[d]=1; } }  for (d in days.keySet()) { array.add(d);} return array; "
                }
              }
            }
          }
        }
      },
      "dest": {
        "index": "idx.tr.acc2.ar2"
      },
      "sync": {
        "time": {
          "field": "@timestamp",
          "delay": "60s"
        }
      }
    }
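For readers unfamiliar with how the four scripted_metric phases fit together, the following Python sketch mimics the scripts in the transform above. It is an emulation for illustration only, not Painless: each inner list stands in for the login documents on one shard, and `login_days` is a hypothetical driver function.

```python
def init_script():
    # Mirrors "state.days=[:]": one empty map per shard.
    return {"days": {}}


def map_script(state, timestamp):
    # Mirrors the map_script: record the calendar day of each login document.
    state["days"][timestamp[:10]] = 1


def combine_script(state):
    # Mirrors "return state": hand the per-shard state to the reduce phase.
    return state


def reduce_script(states):
    # Merge the per-shard maps, then return the distinct days as an array.
    days = {}
    for s in states:
        for d in s["days"]:
            days[d] = 1
    return list(days)


def login_days(shards):
    """Run init/map/combine per shard, then reduce across shards."""
    states = []
    for shard_docs in shards:
        state = init_script()
        for ts in shard_docs:
            map_script(state, ts)
        states.append(combine_script(state))
    return reduce_script(states)


shards = [
    ["2020-08-18T11:17:43Z", "2020-08-19T09:00:00Z"],
    ["2020-08-19T12:30:00Z", "2020-08-20T08:05:00Z"],
]
print(sorted(login_days(shards)))
```

The map acts as a set, so a day that appears on several shards (2020-08-19 here) ends up in the array only once. The emulation sorts before printing because the order of keys in the real script's map should not be relied on.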
    
    Generated intermediate index:
    _id : AAAAAAAA
    _index : acc.array
    _score : 0
    _type : _doc
    create.time : Aug 18, 2020 @ 11:17:43.000
    login.days : 2020-08-18T00:00:00.000Z, 2020-08-19T00:00:00.000Z, 2020-08-20T00:00:00.000Z
    msg.#account_id : 12333212323
    
    Finally, retention on 2020-08-19 for users created on 2020-08-18 is easy to get with a KQL filter:
    create.time: 2020-08-18 AND login.days: 2020-08-19
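The same cohort logic can be checked offline. The sketch below applies the equivalent of that filter to the question's four sample accounts in plain Python; the `retention` helper and the flat `create` / `login` keys are assumptions made for this example, not part of the transform output.

```python
# Intermediate documents for the question's sample data, with full dates.
DOCS = [
    {"acc": 1001, "create": "2020-08-01",
     "login": ["2020-08-01", "2020-08-02", "2020-08-03"]},
    {"acc": 1002, "create": "2020-08-01", "login": ["2020-08-02"]},
    {"acc": 1003, "create": "2020-08-01", "login": []},
    {"acc": 1004, "create": "2020-08-02",
     "login": ["2020-08-02", "2020-08-03"]},
]


def retention(docs, create_day, login_day):
    """Return (cohort size, retained count, rate) for one create/login day pair."""
    cohort = [d for d in docs if d["create"] == create_day]
    # Same condition as "create: <create_day> AND login: <login_day>".
    retained = [d for d in cohort if login_day in d["login"]]
    rate = len(retained) / len(cohort) if cohort else 0.0
    return len(cohort), len(retained), rate


cohort, retained, rate = retention(DOCS, "2020-08-01", "2020-08-02")
print(f"{retained}/{cohort} retained ({rate:.0%})")
```

For the sample data, three accounts were created on 2020-08-01 and two of them (1001 and 1002) logged in on 2020-08-02.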
    

    Regarding "arrays - Elasticsearch transform data into array", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63752220/

    10-09 15:53