I have configured an Apache Hadoop 2 cluster with HA / automatic failover on CentOS 6.5 (64-bit), and I have installed Flume 1.5 (apache-flume-1.5.0-bin.tar.gz).
I want to analyze Twitter data with Flume and Hive, filtering on a set of keywords.

Here are the relevant parts of the Hadoop 2 configuration files (only the important properties):

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

hdfs-site.xml

<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value><final>true</final></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn1</name><value>nn1.mycluster1.com:50070</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn2</name><value>nn2.mycluster1.com:50070</value></property>

Here are the Flume configuration files:

flume-env.sh
JAVA_HOME=/usr/java/jdk1.7.0_60
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

twitter.conf
# Name the components on this agent
TwitterAgent.sources = Twitter
TwitterAgent.sinks = HDFS
TwitterAgent.channels = MemChannel

# Describe/configure the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = **************
TwitterAgent.sources.Twitter.consumerSecret = **********
TwitterAgent.sources.Twitter.accessToken = **************
TwitterAgent.sources.Twitter.accessTokenSecret = **************

TwitterAgent.sources.Twitter.maxBatchSize = 1000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 1000

TwitterAgent.sources.Twitter.keywords=hadoop, big data, analytics, bigdata, cloudera, data science, mapreduce, mahout, nosql

TwitterAgent.sources.Twitter.bind = localhost
TwitterAgent.sources.Twitter.port = 44444

# Describe the sink
TwitterAgent.sinks.HDFS.type = logger
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path=/user/flume/tweets/20140814/1_55
TwitterAgent.sinks.HDFS.fileType = DataStream
TwitterAgent.sinks.HDFS.writeFormat = Text
TwitterAgent.sinks.HDFS.batchSize = 100
TwitterAgent.sinks.HDFS.rollSize = 0
TwitterAgent.sinks.HDFS.rollCount = 100
TwitterAgent.sinks.HDFS.rollInterval = 100

# Use a channel which buffers events in memory
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

I am running the following command:
flume-ng agent --conf conf --conf-file conf/twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console

I have the following questions:
  • a) The keyword filtering does not seem to work. Did I set the wrong property in the configuration file?
  • b) The process does not copy any files to /user/flume/tweets/20140814/1_55 on HDFS.
  • c) The Twitter API access token has read-only access. Do I need read-write access?
  • d) Is the way hdfs.path is specified in twitter.conf correct?
  • e) The process keeps running and never stops; I am not sure what condition would make it stop.

  • It keeps printing the following output:
    14/08/14 03:58:14 INFO twitter.TwitterSource: Processed 45,000 docs
    14/08/14 03:58:14 INFO twitter.TwitterSource: Total docs indexed: 45,000, total skipped docs: 0
    14/08/14 03:58:14 INFO twitter.TwitterSource:     53 docs/second
    14/08/14 03:58:14 INFO twitter.TwitterSource: Run took 846 seconds and processed:
    14/08/14 03:58:14 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
    14/08/14 03:58:14 INFO twitter.TwitterSource:     11.111 MB text sent to index
    14/08/14 03:58:14 INFO twitter.TwitterSource: There were 0 exceptions ignored:
    14/08/14 03:58:14 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:15 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:16 INFO twitter.TwitterSource: Processed 45,100 docs
    14/08/14 03:58:16 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:17 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:18 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:18 INFO twitter.TwitterSource: Processed 45,200 docs
    14/08/14 03:58:19 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:20 INFO twitter.TwitterSource: Processed 45,300 docs
    14/08/14 03:58:20 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:21 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:22 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:22 INFO twitter.TwitterSource: Processed 45,400 docs
    14/08/14 03:58:23 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:24 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:24 INFO twitter.TwitterSource: Processed 45,500 docs
    14/08/14 03:58:25 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:26 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:26 INFO twitter.TwitterSource: Processed 45,600 docs
    14/08/14 03:58:27 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:28 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:28 INFO twitter.TwitterSource: Processed 45,700 docs
    14/08/14 03:58:29 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:30 INFO twitter.TwitterSource: Processed 45,800 docs
    14/08/14 03:58:30 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:31 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:32 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:32 INFO twitter.TwitterSource: Processed 45,900 docs
    14/08/14 03:58:33 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:34 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:34 INFO twitter.TwitterSource: Processed 46,000 docs
    14/08/14 03:58:34 INFO twitter.TwitterSource: Total docs indexed: 46,000, total skipped docs: 0
    14/08/14 03:58:34 INFO twitter.TwitterSource:     53 docs/second
    14/08/14 03:58:34 INFO twitter.TwitterSource: Run took 867 seconds and processed:
    14/08/14 03:58:34 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
    14/08/14 03:58:34 INFO twitter.TwitterSource:     11.36 MB text sent to index
    14/08/14 03:58:34 INFO twitter.TwitterSource: There were 0 exceptions ignored:
    

    Can anyone help me figure out what I am missing?

    Should I rebuild Flume with Maven before using it for this task?

    Best Answer

    There is no need to grant read-write access to the Twitter API access token.
    The way you are using hdfs.path is also correct.
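
    Since the cluster uses an HA nameservice (fs.defaultFS = hdfs://mycluster), either of the following forms should end up pointing at the same directory, provided the agent can see the cluster's core-site.xml/hdfs-site.xml on its classpath so the logical name mycluster can be resolved. This is only a sketch based on the values in the question:

    # relative path, resolved against the default filesystem from core-site.xml
    TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/20140814/1_55
    # or fully qualified with the HA nameservice
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://mycluster/user/flume/tweets/20140814/1_55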

    To fix the main issue (no files being copied to HDFS), make the following changes:

    Changes in the conf/twitter.conf file

  • a) Replace the following line:
    TwitterAgent.sinks.HDFS.type = logger

    with this line:
    TwitterAgent.sinks.HDFS.type = hdfs
  • b) Comment out the following line:
    #TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

    and use the following (the Apache class) instead:
    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    

    Changes in flume-env.sh

    Comment out the following line (there is no need to set this value):
    #FLUME_CLASSPATH=""
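
    For reference, the resulting flume-env.sh, using only the values already shown in the question, would then look roughly like this:

    JAVA_HOME=/usr/java/jdk1.7.0_60
    JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"
    # FLUME_CLASSPATH=""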
    

    Back in twitter.conf, set appropriate values for the following HDFS sink properties as needed (a consolidated sketch follows the list):
    hdfs.filePrefix
    hdfs.fileSuffix
    hdfs.inUsePrefix
    hdfs.inUseSuffix
    hdfs.rollInterval
    hdfs.rollSize
    hdfs.rollCount
    hdfs.idleTimeout
    hdfs.batchSize
    hdfs.fileType
    hdfs.maxOpenFiles
    hdfs.minBlockReplicas
    hdfs.writeFormat
    hdfs.callTimeout
    hdfs.threadsPoolSize
    hdfs.rollTimerPoolSize
    hdfs.kerberosPrincipal
    hdfs.kerberosKeytab
    hdfs.proxyUser
    hdfs.round
    hdfs.roundValue
    hdfs.roundUnit
    hdfs.timeZone
    hdfs.useLocalTimeStamp
    hdfs.closeTries
    hdfs.retryInterval
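
    As a rough, untuned sketch that ties the above together (the path and roll values are simply the ones from the question; note that the roll/format options belong under the hdfs. prefix, e.g. hdfs.fileType rather than fileType), the sink section could look like this:

    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/20140814/1_55
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
    TwitterAgent.sinks.HDFS.hdfs.rollInterval = 100

    After restarting the agent, you can check whether files are appearing with:
    hdfs dfs -ls /user/flume/tweets/20140814/1_55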
    

    For more details, see the following link:

    https://flume.apache.org/FlumeUserGuide.html
