lua - 高航空突波延迟

在Aerospike集合中，我们有四个bin userId，adId，timestamp，eventype，主键是userId：timestamp。在userId上创建二级索引，以获取特定用户的所有记录，并将所得记录传递到流udf。在我们的客户端，直到500 qps的航空峰值查询等待时间是合理的，平均等待时间以微秒为单位，但是一旦我们将qps增加到500以上，航空峰值查询等待时间就会激增（大约10毫秒）

我们在客户端看到的消息如下：

Name: Aerospike-13780
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@41aa29a4
Total blocked: 0  Total waited: 554,450

Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
com.aerospike.client.lua.LuaInputStream.read(LuaInputStream.java:38)
com.aerospike.client.lua.LuaStreamLib$read.call(LuaStreamLib.java:60)
org.luaj.vm2.LuaClosure.execute(Unknown Source)
org.luaj.vm2.LuaClosure.onInvoke(Unknown Source)
org.luaj.vm2.LuaClosure.invoke(Unknown Source)
org.luaj.vm2.LuaClosure.execute(Unknown Source)
org.luaj.vm2.LuaClosure.onInvoke(Unknown Source)
org.luaj.vm2.LuaClosure.invoke(Unknown Source)
org.luaj.vm2.LuaValue.invoke(Unknown Source)
com.aerospike.client.lua.LuaInstance.call(LuaInstance.java:128)
com.aerospike.client.query.QueryAggregateExecutor.runThreads(QueryAggregateExecutor.java:104)
com.aerospike.client.query.QueryAggregateExecutor.run(QueryAggregateExecutor.java:77)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

以下是lua文件：

function ad_count(stream)

    local function map_function(record)
        local result = map()
        result["adId"] = record["adId"]
        result["timestamp"] = record["timestamp"]
        return result
    end


    local function add_fn(aggregate, record)

        local ad_id = record["adId"]
        local map_result = aggregate[ad_id]
        local l = list()
        if map_result == null then
            map_result = l
        end
        list.append(map_result, record["timestamp"])
        aggregate[ad_id] = map_result
        return aggregate
    end

    local m = map()
    return stream:map(map_function):aggregate(m, add_fn)
end

有2个节点，并且服务器托管在T2.large实例类型的AWS中。

transaction-queues=8;transaction-threads-per-queue=8;transaction-duplicate-threads=0;transaction-pending-limit=20;migrate-threads=1;migrate-xmit-priority=40;migrate-xmit-sleep=500;migrate-read-priority=10;migrate-read-sleep=500;migrate-xmit-hwm=10;migrate-xmit-lwm=5;migrate-max-num-incoming=256;migrate-rx-lifetime-ms=60000;proto-fd-max=15000;proto-fd-idle-ms=60000;proto-slow-netio-sleep-ms=1;transaction-retry-ms=1000;transaction-max-ms=1000;transaction-repeatable-read=false;dump-message-above-size=134217728;ticker-interval=10;microbenchmarks=false;storage-benchmarks=false;ldt-benchmarks=false;scan-max-active=100;scan-max-done=100;scan-max-udf-transactions=32;scan-threads=4;batch-index-threads=4;batch-threads=4;batch-max-requests=5000;batch-max-buffers-per-queue=255;batch-max-unused-buffers=256;batch-priority=200;nsup-delete-sleep=100;nsup-period=120;nsup-startup-evict=true;paxos-retransmit-period=5;paxos-single-replica-limit=1;paxos-max-cluster-size=32;paxos-protocol=v3;paxos-recovery-policy=manual;write-duplicate-resolution-disable=false;respond-client-on-master-completion=false;replication-fire-and-forget=false;info-threads=16;allow-inline-transactions=true;use-queue-per-device=false;snub-nodes=false;fb-health-msg-per-burst=0;fb-health-msg-timeout=200;fb-health-good-pct=50;fb-health-bad-pct=0;auto-dun=false;auto-undun=false;prole-extra-ttl=0;max-msgs-per-type=-1;service-threads=40;fabric-workers=16;pidfile=/var/run/aerospike/asd.pid;memory-accounting=false;udf-runtime-gmax-memory=18446744073709551615;udf-runtime-max-memory=18446744073709551615;sindex-builder-threads=4;sindex-data-max-memory=18446744073709551615;query-threads=6;query-worker-threads=15;query-priority=10;query-in-transaction-thread=0;query-req-in-query-thread=0;query-req-max-inflight=100;query-bufpool-size=256;query-batch-size=100;query-priority-sleep-us=1;query-short-q-max-size=500;query-long-q-max-size=500;query-rec-count-bound=18446744073709551615;query-threshold=10;query-untracked-time-ms=1000;pre-reserve-qnodes=false;service-address=0.0.0.0;service-port=3000;mesh-address=10.0.1.80;mesh-port=3002;reuse-address=true;fabric-port=3001;fabric-keepalive-enabled=true;fabric-keepalive-time=1;fabric-keepalive-intvl=1;fabric-keepalive-probes=10;network-info-port=3003;enable-fastpath=true;heartbeat-mode=mesh;heartbeat-protocol=v2;heartbeat-address=10.0.1.80;heartbeat-port=3002;heartbeat-interval=150;heartbeat-timeout=10;enable-security=false;privilege-refresh-period=300;report-authentication-sinks=0;report-data-op-sinks=0;report-sys-admin-sinks=0;report-user-admin-sinks=0;report-violation-sinks=0;syslog-local=-1;enable-xdr=false;xdr-namedpipe-path=NULL;forward-xdr-writes=false;xdr-delete-shipping-enabled=true;xdr-nsup-deletes-enabled=false;stop-writes-noxdr=false;reads-hist-track-back=1800;reads-hist-track-slice=10;reads-hist-track-thresholds=1,8,64;writes_master-hist-track-back=1800;writes_master-hist-track-slice=10;writes_master-hist-track-thresholds=1,8,64;proxy-hist-track-back=1800;proxy-hist-track-slice=10;proxy-hist-track-thresholds=1,8,64;udf-hist-track-back=1800;udf-hist-track-slice=10;udf-hist-track-thresholds=1,8,64;query-hist-track-back=1800;query-hist-track-slice=10;query-hist-track-thresholds=1,8,64;query_rec_count-hist-track-back=1800;query_rec_count-hist-track-slice=10;query_rec_count-hist-track-thresholds=1,8,64

我们甚至更改了以下配置参数，但它进一步增加了延迟：

query-batch-size=1000
query-short-q-max-size=100000
query-long-q-max-size=100000
query-threads=28
query-worker-threads=400
query-req-max-inflight=1000

最佳答案

在Aerospike社区论坛上链接回相同的问题：https://discuss.aerospike.com/t/high-query-latency/2769/3

您主要是在违反该特定实例的限制。我们不建议使用t2系列。请查看Aerospike的Amazon部署指南的recommendations部分。您应该考虑m3，c3，c4，r3或i2实例系列中的实例。

关于配置的简要说明：

transaction-queues应该调整为内核数。您将其设置为8，并且t2.large具有2个核心。
对于该实例类型，transaction-threads-per-queue设置得太高。
query-threads和query-worker-threads都为此实例设置得太高。

此外，您的查询涉及流UDF，因此它们将比常规查询慢。 UDF非常有用，但是您需要考虑它们适合的上下文。

我的建议是，您可以对此建模不同，并跳过二级索引和UDF。

关于lua - 高航空突波延迟，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/36400687/