Problem description
I am very new to Hadoop, so please excuse the dumb questions.
Here is what I understand: Hadoop works best with large files, which helps MapReduce jobs run efficiently.
With that in mind, I am somewhat confused about Flume NG. Assume I am tailing a log file that gets a new line every second, and each new line is transferred to HDFS via Flume as soon as it is written.
a) Does this mean that Flume creates a new file for every line logged in the file I am tailing, or does it append to an existing HDFS file?
b) Is append allowed in HDFS in the first place?
c) If the answer to b) is yes, i.e. contents are appended constantly, how and when should I run my MapReduce application?
These questions may sound very silly, but answers to them would be highly appreciated.
PS: I have not set up Flume NG or Hadoop yet; I am just reading articles to understand them and how they could add value to my company.
Solution
Flume writes to HDFS by means of the HDFS sink. When Flume starts and begins to receive events, the sink opens a new file and writes events into it. At some point the previously opened file should be closed, and until then the data in the block currently being written is not visible to other readers.
As described in the documentation, the Flume HDFS sink has several file closing strategies:
- every N seconds (the rollInterval option)
- after writing N bytes (the rollSize option)
- after writing N received events (the rollCount option)
- after N seconds of inactivity (the idleTimeout option)
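To make the four closing strategies concrete, here is a minimal Python sketch of the decision "should the currently open file be closed now?". It is a toy model, not Flume's actual code: the names RollPolicy, OpenFile and should_roll are hypothetical, and the convention that 0 disables a criterion is an assumption of the sketch (the Flume user guide describes the same convention for these options, which appear in the sink configuration as hdfs.rollInterval, hdfs.rollSize, hdfs.rollCount and hdfs.idleTimeout).

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    """Hypothetical stand-in for the sink's roll settings; 0 disables a criterion."""
    roll_interval_s: float = 0   # mirrors rollInterval
    roll_size_bytes: int = 0     # mirrors rollSize
    roll_count: int = 0          # mirrors rollCount
    idle_timeout_s: float = 0    # mirrors idleTimeout


@dataclass
class OpenFile:
    """State of the file the sink is currently writing to (times in seconds)."""
    opened_at: float
    last_write_at: float
    bytes_written: int = 0
    events_written: int = 0


def should_roll(policy: RollPolicy, f: OpenFile, now: float) -> bool:
    """Return True if any enabled criterion says the open file should be closed."""
    if policy.roll_interval_s and now - f.opened_at >= policy.roll_interval_s:
        return True
    if policy.roll_size_bytes and f.bytes_written >= policy.roll_size_bytes:
        return True
    if policy.roll_count and f.events_written >= policy.roll_count:
        return True
    if policy.idle_timeout_s and now - f.last_write_at >= policy.idle_timeout_s:
        return True
    return False


# Example: roll every 5 minutes or at 128 MiB, whichever comes first.
policy = RollPolicy(roll_interval_s=300, roll_size_bytes=128 * 1024 * 1024)
current = OpenFile(opened_at=0, last_write_at=0, bytes_written=10, events_written=1)
print(should_roll(policy, current, now=10))   # still within both limits
print(should_roll(policy, current, now=300))  # time limit reached
```

Since Hadoop prefers large files, the size and time thresholds are the ones worth tuning: low values produce many small files, which is exactly what the question is worried about.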
So, to your questions:
a) Flume writes events to the currently open file until it is closed (and a new file is opened).
b) Append is allowed in HDFS, but Flume does not use it. After a file is closed, Flume does not append any data to it.
c) To hide the currently open file from a MapReduce application, use the inUsePrefix option, for example set to . (a dot): files whose names start with . are not visible to MR jobs.
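The effect of the prefix can be illustrated with a small Python sketch of the naming rule. It assumes the default behaviour of Hadoop's FileInputFormat, which skips input files whose names start with . or _; the function name and the file names in the listing are hypothetical and only meant to look like sink output.

```python
def visible_to_mapreduce(file_name: str) -> bool:
    """Mimic the default hidden-file rule: names starting with '_' or '.' are skipped."""
    return not file_name.startswith(("_", "."))


# Hypothetical directory listing: one file still being written (dot prefix),
# one closed file, and a job marker file.
listing = [".events.1700000000000.tmp", "events.1699999990000", "_SUCCESS"]

inputs = [name for name in listing if visible_to_mapreduce(name)]
print(inputs)  # ['events.1699999990000']
```

With this rule in place a MapReduce job can be run over the directory at any time: it only picks up files the sink has already closed.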