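We have a MapReduce job written in Java that reads many small files (10k+, for example), converts them into a single Avro file in the driver, and has the reducer insert a batch of reduced records into a Postgres database. This process runs every hour. However, several MapReduce jobs run at the same time, each processing a different Avro file and opening its own database connection. As a result, sometimes (quite randomly) all tasks get stuck in the reducer phase. No exception is actually thrown; the thread dump of a stuck reducer looks like this:

"C2 CompilerThread0" daemon prio=10 tid=0x00007f78701ae000 nid=0x6db5 waiting on condition [0x0000000000000000]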
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f78701ab800 nid=0x6db4 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Surrogate Locker Thread (Concurrent GC)" daemon prio=10 tid=0x00007f78701a1800 nid=0x6db3 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f787018a800 nid=0x6db2 in Object.wait() [0x00007f7847941000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000006e5d34418> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
    - locked <0x00000006e5d34418> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

"Reference Handler" daemon prio=10 tid=0x00007f7870181000 nid=0x6db1 in Object.wait() [0x00007f7847a42000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000006e5d32b50> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:503)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
    - locked <0x00000006e5d32b50> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f7870013800 nid=0x6da1 runnable [0x00007f7877a7b000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
    at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
    at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71)
    at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:269)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1700)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
    - locked <0x00000006e5d34520> (a org.postgresql.core.v3.QueryExecutorImpl)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:555)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:302)
    at ComputeReducer.setup(ComputeReducer.java:299)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:162)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

"VM Thread" prio=10 tid=0x00007f787017e800 nid=0x6db0 runnable

"Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007f7870024800 nid=0x6da2 runnable

"Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007f7870026800 nid=0x6da3 runnable

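Once this happens we have to restart the database; otherwise all the reduce tasks sit idle at around 70%, and even the next hour's jobs cannot run. At first we were running out of open connections, but after raising the connection limit to a fairly high number that is no longer the case. I should point out that I am not a database expert, so please suggest any configuration changes that might help. Just to confirm: does this look like a database configuration problem? If so, would setting up connection pooling for Postgres solve it?

Any help or suggestions are much appreciated! Thanks in advance.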

Best answer

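My first thought is that if it happens at random, it is probably a lock. There are two places to look for locks:

locks between threads on shared resources, and locks on database objects.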

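I don't see anything in the stack trace indicating that this is a database locking problem, but it could be caused by transactions that are never closed, in which case you don't have a deadlock but your inserts are left waiting.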
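To make the "unclosed transaction" case concrete, here is a minimal sketch of how a reducer could write its rows so that every transaction ends in either a commit or a rollback. It is not the poster's code: the table `reduced_records`, its columns, and the class and method names are hypothetical. The `socketTimeout` URL parameter (in seconds) is a pgjdbc connection parameter; setting it should make a read that never gets an answer, like the `socketRead0` frame in the dump above, fail with an exception instead of blocking forever.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ReducerInsertSketch {

    // Hypothetical table and column names, used only for illustration.
    static final String INSERT_SQL =
            "INSERT INTO reduced_records (record_key, record_value) VALUES (?, ?)";

    /** Appends pgjdbc's socketTimeout parameter (seconds) to a JDBC URL. */
    static String withSocketTimeout(String jdbcUrl, int seconds) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "socketTimeout=" + seconds;
    }

    /** Inserts all rows in one transaction that is always committed or rolled back. */
    static void insertAll(Connection conn, List<String[]> rows) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            // Never leave the transaction open: an open transaction keeps its locks.
            try {
                conn.rollback();
            } catch (SQLException rollbackFailure) {
                e.addSuppressed(rollbackFailure);
            }
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

In a reducer, the connection itself would typically be opened in `setup` and closed in `cleanup`, so that a failed task does not leave a session behind on the server. On the database side, sessions that still hold an open transaction can usually be spotted in the `pg_stat_activity` view.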

It is more likely that you have a deadlock in the Java code; perhaps two threads are each waiting on the other?
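One way to check the Java-deadlock hypothesis is to ask the JVM directly. The sketch below uses the standard `ThreadMXBean` API; the class and method names are made up for illustration, and it would have to run inside the stuck task's JVM (or the same check can be made over JMX). Note that in the dump posted in the question, the only application thread, `main`, is `RUNNABLE` inside `socketRead0`, which suggests it is waiting for a reply from Postgres rather than for another Java thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockCheck {

    /** Returns one description per thread caught in a Java-level deadlock; empty if none. */
    static List<String> describeDeadlocks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked on monitors or ownable synchronizers.
        long[] ids = threads.findDeadlockedThreads();
        List<String> report = new ArrayList<>();
        if (ids == null) {
            return report;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have died since the check
                report.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlocks();
        if (report.isEmpty()) {
            System.out.println("No Java-level deadlock detected");
        } else {
            for (String line : report) {
                System.out.println(line);
            }
        }
    }
}
```

If this reports nothing while the reducers are hung, the wait is most likely on the database side, and the open transactions and locks in Postgres are the next thing to inspect.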
