This post looks at best practices for log analysis with Amazon Elastic MapReduce. It should be a useful reference for anyone tackling a similar problem.

Problem description

I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregating statistics for each delivered file by date / referrer / user agent.

Tons of logs are generated every hour, and that number is likely to increase dramatically in the near future, so processing that kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.

Right now my mappers and reducers are ready to process the data, and I've tested the whole process with the following flow:

  • uploaded mappers, reducers and data to Amazon S3
  • configured an appropriate job and processed it successfully
  • downloaded the aggregated results from Amazon S3 to my server and inserted them into a MySQL database by running a CLI script (a scripted sketch of this flow follows the list)
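
For reference, a flow like the one above can be scripted end to end. Here is a minimal sketch using boto3 with a transient cluster and a Hadoop streaming step; the bucket name, key prefixes, instance types, and mapper/reducer script names are illustrative assumptions, not details from the original post.

```python
import boto3

BUCKET = "my-log-analytics"          # hypothetical bucket name
s3 = boto3.client("s3")
emr = boto3.client("emr")

# 1. Upload mapper, reducer, and raw logs to S3.
for filename, key in [("mapper.py", "code/mapper.py"),
                      ("reducer.py", "code/reducer.py"),
                      ("access.log", "input/2012-05-01/access.log")]:
    s3.upload_file(filename, BUCKET, key)

# 2. Launch a transient cluster with a single Hadoop streaming step;
#    the cluster terminates itself when the step is done.
response = emr.run_job_flow(
    Name="log-aggregation",
    ReleaseLabel="emr-6.15.0",       # assumption; pick your EMR release
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "aggregate-access-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", f"s3://{BUCKET}/code/mapper.py,s3://{BUCKET}/code/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", f"s3://{BUCKET}/input/2012-05-01/",
                "-output", f"s3://{BUCKET}/output/2012-05-01/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])

# 3. Once the step finishes, list and download the part files from the
#    output prefix, then load them into MySQL with the existing CLI script.
```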

I've done all that manually, following the countless tutorials about Amazon EMR that are googlable on the Internet.

What should I do next? What is the best approach to automating this process?

What are the common practices for:

  • Using cron to control the Amazon EMR jobTracker via the API? (a cron-driven sketch follows this list)
  • How can I make sure my logs will not be processed twice?
  • Should I control the movement/removal of processed/result files with my own custom scripts?
  • What is the best approach to handling results and inserting them into PostgreSQL/MySQL?
  • Should I create different "input"/"output" directories for each job, or use the same directories for all jobs?
  • Should I create a new job each time via the API?
  • What is the best approach to uploading raw logs to Amazon S3? I've looked into Apache Flume, but I'm not sure it's something I need, since I don't need real-time log processing.
  • How do you tell that a new batch of logs from Apache/nginx is ready to be uploaded to Amazon? (log rotation?)
  • Can anyone share their setup of the data processing flow?
  • How do you control file uploads and job completion?
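
On the cron-plus-API bullet, one common pattern is a small script run from cron that launches a fresh transient cluster per batch, keying each job's input/output prefixes off the batch date. That gives every job its own directories, and an already-populated output prefix doubles as a "processed" marker, which also guards against processing the same logs twice. A minimal sketch under those assumptions (all names hypothetical):

```python
import datetime
import boto3

BUCKET = "my-log-analytics"                      # hypothetical
s3 = boto3.client("s3")

def already_processed(batch_date: str) -> bool:
    """A batch counts as done if its output prefix already has objects."""
    resp = s3.list_objects_v2(Bucket=BUCKET,
                              Prefix=f"output/{batch_date}/")
    return resp.get("KeyCount", 0) > 0

def launch_transient_cluster(batch_date: str) -> None:
    # Placeholder: in practice this is the run_job_flow() call from the
    # earlier sketch, with -input s3://.../input/{batch_date}/ and
    # -output s3://.../output/{batch_date}/.
    print(f"would launch EMR job for {batch_date}")

def main():
    # Process yesterday's logs; cron runs this once a day, e.g.:
    # 15 0 * * * /usr/bin/python3 /opt/scripts/run_emr_batch.py
    batch_date = (datetime.date.today()
                  - datetime.timedelta(days=1)).isoformat()
    if already_processed(batch_date):
        print(f"{batch_date} already processed, skipping")
        return
    launch_transient_cluster(batch_date)

if __name__ == "__main__":
    main()
```

Note also that Hadoop streaming refuses to write to an output directory that already exists, which acts as a second safety net against double processing.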

Sure, in most cases it depends on your infrastructure and application architecture.

Sure, I could implement all of that with my own custom solution, possibly re-inventing a lot of things that are already used by others somewhere.

But there should be some common practices that I would like to become familiar with.

I think this topic can be useful for the many people who are trying to process access logs with Amazon Elastic MapReduce but weren't able to find good materials on best practices for handling it.

UPD: Just to clarify, here is the single final question:

What are the best practices for log processing powered by Amazon Elastic MapReduce?

Related question:

Getting data in and out of Elastic MapReduce HDFS: http://stackoverflow.com/questions/7701678/getting-data-in-and-out-of-elastic-mapreduce-hdfs

Recommended answer

That's a very, very wide-open question, but here are some thoughts you could consider:

  • Use Amazon SQS: it's a distributed queue and very useful for workflow management. You can have one process that writes to the queue as soon as a log is available, and another that reads from the queue, processes the log described in the message, and deletes the message when it has finished processing. This ensures a log is processed only once. (A minimal sketch follows this list.)
  • You mentioned Apache Flume, which is very useful for log aggregation. You should consider it even though you don't need real-time processing, because it at least gives you a standardized aggregation process.
  • Amazon recently released SimpleWorkFlow. I've just started looking into it, but it sounds promising for managing every step of your data pipeline.
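
To make the SQS suggestion concrete, here is a minimal producer/consumer sketch with boto3. The queue URL and the process_log helper are hypothetical; the key point is that the message is deleted only after processing succeeds, so a log whose processing failed becomes visible again after the visibility timeout.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-batches"  # hypothetical

def enqueue_log(s3_key: str) -> None:
    """Producer: called as soon as a rotated log has landed in S3."""
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"s3_key": s3_key}))

def process_log(s3_key: str) -> None:
    # Hypothetical placeholder: submit an EMR step for this log,
    # load the results into MySQL, etc.
    print(f"processing {s3_key}")

def consume_one() -> None:
    """Consumer: process a single queued log, deleting only on success."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)   # long polling
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        process_log(body["s3_key"])
        # Delete only after successful processing; if process_log raises,
        # the message becomes visible again after the visibility timeout.
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```

Since SQS is at-least-once delivery, the processing step itself should still be idempotent, for example by writing results to an output key derived from the input log's name.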

Hopefully that gives you some clues.

That concludes this look at best practices for log analysis with Amazon Elastic MapReduce; hopefully the recommended answer is helpful.
