This article covers how to compress the output of Scalding/Cascading jobs. Hopefully it is a useful reference for anyone hitting the same problem.

The problem

So people have been having problems compressing the output of Scalding jobs, myself included. After googling you get the odd whiff of an answer in some obscure forum somewhere, but nothing suitable for copy-and-paste needs.

I would like an output like Tsv, but one that writes compressed output.
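To make the goal concrete: the desired output is just ordinary tab-separated lines passed through a compression codec. A minimal plain-Scala sketch of that end state (no Scalding or Hadoop involved; `GZIPOutputStream` stands in for whatever codec the cluster is configured with, and the object and method names here are illustrative, not part of any library):

```scala
import java.io.{BufferedReader, FileInputStream, FileOutputStream, InputStreamReader, PrintWriter}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipTsvDemo {
  // Write rows as tab-separated lines through a gzip stream
  def writeTsvGz(path: String, rows: Seq[Seq[String]]): Unit = {
    val out = new PrintWriter(new GZIPOutputStream(new FileOutputStream(path)))
    try rows.foreach(r => out.println(r.mkString("\t")))
    finally out.close()
  }

  // Read the file back, decompressing and splitting on tabs
  def readTsvGz(path: String): List[List[String]] = {
    val in = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(new FileInputStream(path))))
    try Iterator.continually(in.readLine()).takeWhile(_ != null).map(_.split("\t").toList).toList
    finally in.close()
  }
}
```

The question is how to get Scalding's Tsv-style sink to do this transparently, which is what the answer below provides.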

The answer

Anyway, after much faffing about I managed to write a TsvCompressed output which seems to do the job (you still need to set the Hadoop job configuration properties, i.e. set compression to true and set the codec to something sensible, or it defaults to crappy DEFLATE).
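For reference, those job properties can be passed on the command line. The property names below are from the old `mapred` API (newer Hadoop versions use `mapreduce.output.fileoutputformat.compress` and `.compress.codec` instead), and the jar and job names are placeholders, so adjust to your setup:

```shell
# Enable output compression and pick gzip instead of the default codec
hadoop jar myjob-assembly.jar com.twitter.scalding.Tool MyJob --hdfs \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
```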

import com.twitter.scalding._
import cascading.tuple.Fields
import cascading.scheme.local
import cascading.scheme.hadoop.{TextLine, TextDelimited}
import cascading.scheme.Scheme
import org.apache.hadoop.mapred.{OutputCollector, RecordReader, JobConf}

// A Tsv-like source that enables sink compression when running on Hadoop
case class TsvCompressed(p: String) extends FixedPathSource(p) with DelimitedSchemeCompressed

trait DelimitedSchemeCompressed extends Source {
  val types: Array[Class[_]] = null

  // Local mode: plain tab-delimited text, no compression
  override def localScheme = new local.TextDelimited(Fields.ALL, false, false, "\t", types)

  // Hadoop mode: tab-delimited text with sink compression enabled;
  // the actual codec comes from the job configuration
  override def hdfsScheme = {
    val temp = new TextDelimited(Fields.ALL, false, false, "\t", types)
    temp.setSinkCompression(TextLine.Compress.ENABLE)
    temp.asInstanceOf[Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]]
  }
}
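With that in place, TsvCompressed is used exactly like the built-in Tsv sink. A sketch of what a job using it might look like (the job name, input path, and field names here are made up for illustration, and this assumes Scalding and Cascading are on the classpath):

```scala
import com.twitter.scalding._

// Hypothetical job: read a plain Tsv, write the same tuples back out
// through the compressed sink defined above
class CopyCompressedJob(args: Args) extends Job(args) {
  Tsv(args("input"), ('user, 'score))
    .read
    .write(TsvCompressed(args("output")))
}
```

On Hadoop the output part files will be run through whatever codec the job configuration specifies; in local mode the scheme falls back to uncompressed tab-delimited text.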

That wraps up this article on compressing Scalding/Cascading output; hopefully the answer above helps.
