I have configured an Apache Hadoop 2 cluster with HA / automatic failover on CentOS 6.5 (64-bit), and I have installed Flume 1.5 (apache-flume-1.5.0-bin.tar.gz).
I want to analyze Twitter data with Flume and Hive, filtering on a set of keywords.

Here are the relevant parts of the Hadoop 2 configuration files (only the important properties):

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

hdfs-site.xml

<property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value><final>true</final></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2.mycluster1.com:9000</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn1</name><value>nn1.mycluster1.com:50070</value></property>
<property><name>dfs.namenode.http-address.mycluster.nn2</name><value>nn2.mycluster1.com:50070</value></property>

Here are the Flume configuration files:

flume-env.sh
JAVA_HOME=/usr/java/jdk1.7.0_60
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

twitter.conf
# Name the components on this agent
TwitterAgent.sources = Twitter
TwitterAgent.sinks = HDFS
TwitterAgent.channels = MemChannel

# Describe/configure the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = **************
TwitterAgent.sources.Twitter.consumerSecret = **********
TwitterAgent.sources.Twitter.accessToken = **************
TwitterAgent.sources.Twitter.accessTokenSecret = **************

TwitterAgent.sources.Twitter.maxBatchSize = 1000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 1000

TwitterAgent.sources.Twitter.keywords=hadoop, big data, analytics, bigdata, cloudera, data science, mapreduce, mahout, nosql

TwitterAgent.sources.Twitter.bind = localhost
TwitterAgent.sources.Twitter.port = 44444

# Describe the sink
TwitterAgent.sinks.HDFS.type = logger
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path=/user/flume/tweets/20140814/1_55
TwitterAgent.sinks.HDFS.fileType = DataStream
TwitterAgent.sinks.HDFS.writeFormat = Text
TwitterAgent.sinks.HDFS.batchSize = 100
TwitterAgent.sinks.HDFS.rollSize = 0
TwitterAgent.sinks.HDFS.rollCount = 100
TwitterAgent.sinks.HDFS.rollInterval = 100

# Use a channel which buffers events in memory
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

I am running the following command:
flume-ng agent --conf conf --conf-file conf/twitter.conf --name TwitterAgent -Dflume.root.logger=INFO,console

I have the following questions:
  • a) The keyword filtering does not seem to work. Did I set the wrong property in the configuration file?
  • b) The process does not copy any files to /user/flume/tweets/20140814/1_55 on HDFS.
  • c) The Twitter API access token has read-only access. Do I need read-write access?
  • d) Is the way hdfs.path is specified in twitter.conf correct?
  • e) The process keeps running and never stops; I am not sure what condition would make it stop.

  • It keeps printing the following output:
    14/08/14 03:58:14 INFO twitter.TwitterSource: Processed 45,000 docs
    14/08/14 03:58:14 INFO twitter.TwitterSource: Total docs indexed: 45,000, total skipped docs: 0
    14/08/14 03:58:14 INFO twitter.TwitterSource:     53 docs/second
    14/08/14 03:58:14 INFO twitter.TwitterSource: Run took 846 seconds and processed:
    14/08/14 03:58:14 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
    14/08/14 03:58:14 INFO twitter.TwitterSource:     11.111 MB text sent to index
    14/08/14 03:58:14 INFO twitter.TwitterSource: There were 0 exceptions ignored:
    14/08/14 03:58:14 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:15 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:16 INFO twitter.TwitterSource: Processed 45,100 docs
    14/08/14 03:58:16 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:17 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:18 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:18 INFO twitter.TwitterSource: Processed 45,200 docs
    14/08/14 03:58:19 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:20 INFO twitter.TwitterSource: Processed 45,300 docs
    14/08/14 03:58:20 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:21 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:22 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:22 INFO twitter.TwitterSource: Processed 45,400 docs
    14/08/14 03:58:23 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:24 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:24 INFO twitter.TwitterSource: Processed 45,500 docs
    14/08/14 03:58:25 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:26 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:26 INFO twitter.TwitterSource: Processed 45,600 docs
    14/08/14 03:58:27 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:28 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:28 INFO twitter.TwitterSource: Processed 45,700 docs
    14/08/14 03:58:29 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:30 INFO twitter.TwitterSource: Processed 45,800 docs
    14/08/14 03:58:30 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:31 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:32 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:32 INFO twitter.TwitterSource: Processed 45,900 docs
    14/08/14 03:58:33 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:34 INFO sink.LoggerSink: Event: { headers:{} body: 4F 62 6A 01 02 16 61 76 72 6F 2E 73 63 68 65 6D Obj...avro.schem }
    14/08/14 03:58:34 INFO twitter.TwitterSource: Processed 46,000 docs
    14/08/14 03:58:34 INFO twitter.TwitterSource: Total docs indexed: 46,000, total skipped docs: 0
    14/08/14 03:58:34 INFO twitter.TwitterSource:     53 docs/second
    14/08/14 03:58:34 INFO twitter.TwitterSource: Run took 867 seconds and processed:
    14/08/14 03:58:34 INFO twitter.TwitterSource:     0.013 MB/sec sent to index
    14/08/14 03:58:34 INFO twitter.TwitterSource:     11.36 MB text sent to index
    14/08/14 03:58:34 INFO twitter.TwitterSource: There were 0 exceptions ignored:
    

    Can anyone help me figure out what I am missing?

    Should I rebuild Flume with Maven before using it for this task?

    Best Answer

    There is no need to grant read-write access to the Twitter API access token.
    The way you are using hdfs.path is also correct.
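
    Since the cluster uses an HA nameservice (fs.defaultFS = hdfs://mycluster), either of the following forms should end up pointing at the same directory, provided the agent can see the cluster's core-site.xml/hdfs-site.xml on its classpath so the logical name mycluster can be resolved. This is only a sketch based on the values in the question:

    # relative path, resolved against the default filesystem from core-site.xml
    TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/20140814/1_55
    # or fully qualified with the HA nameservice
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://mycluster/user/flume/tweets/20140814/1_55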

    To fix the main issue (no files being copied to HDFS), make the following changes:

    Changes in the conf/twitter.conf file

  • a) Replace the following line:
    TwitterAgent.sinks.HDFS.type = logger

    with this line:
    TwitterAgent.sinks.HDFS.type = hdfs
  • b) Comment out the following line:
    #TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

    and use the following (the Apache class) instead:
    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    

    Changes in flume-env.sh

    Comment out the following line (there is no need to set this value):
    #FLUME_CLASSPATH=""
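
    For reference, the resulting flume-env.sh, using only the values already shown in the question, would then look roughly like this:

    JAVA_HOME=/usr/java/jdk1.7.0_60
    JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"
    # FLUME_CLASSPATH=""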
    

    Back in twitter.conf, set appropriate values for the following HDFS sink properties as needed (a consolidated sketch follows the list):
    hdfs.filePrefix
    hdfs.fileSuffix
    hdfs.inUsePrefix
    hdfs.inUseSuffix
    hdfs.rollInterval
    hdfs.rollSize
    hdfs.rollCount
    hdfs.idleTimeout
    hdfs.batchSize
    hdfs.fileType
    hdfs.maxOpenFiles
    hdfs.minBlockReplicas
    hdfs.writeFormat
    hdfs.callTimeout
    hdfs.threadsPoolSize
    hdfs.rollTimerPoolSize
    hdfs.kerberosPrincipal
    hdfs.kerberosKeytab
    hdfs.proxyUser
    hdfs.round
    hdfs.roundValue
    hdfs.roundUnit
    hdfs.timeZone
    hdfs.useLocalTimeStamp
    hdfs.closeTries
    hdfs.retryInterval
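
    As a rough, untuned sketch that ties the above together (the path and roll values are simply the ones from the question; note that the roll/format options belong under the hdfs. prefix, e.g. hdfs.fileType rather than fileType), the sink section could look like this:

    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/20140814/1_55
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 100
    TwitterAgent.sinks.HDFS.hdfs.rollInterval = 100

    After restarting the agent, you can check whether files are appearing with:
    hdfs dfs -ls /user/flume/tweets/20140814/1_55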
    

    For more details, see the following link:

    https://flume.apache.org/FlumeUserGuide.html
