Problem Description
I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?
I can think of 3 different ways to achieve it.
Using the Linux command line
The following command worked for me.
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz and the output gets stored in /tmp/unzipped/Links.txt.
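Since the question asks about a whole folder of .gz files, this one-file command can be wrapped in a small shell loop. Here is a minimal sketch, assuming a reasonably recent Hadoop, a placeholder source folder /tmp/gz_files, and file names without spaces:

#!/bin/bash
# Decompress every .gz file in one HDFS folder into another HDFS folder.
# SRC_DIR and DST_DIR are illustrative; adjust them to your own paths.
SRC_DIR=/tmp/gz_files
DST_DIR=/tmp/unzipped

hadoop fs -mkdir -p "$DST_DIR"
# List the source folder and keep only the paths ending in .gz (the last field).
for f in $(hadoop fs -ls "$SRC_DIR" | awk '/\.gz$/ {print $NF}'); do
    name=$(basename "$f" .gz)    # e.g. Links.txt.gz -> Links.txt
    hadoop fs -cat "$f" | gzip -d | hadoop fs -put - "$DST_DIR/$name"
done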
Using a Java program
In the book Hadoop: The Definitive Guide, there is a section on Codecs. In that section, there is a program that decompresses the output using CompressionCodecFactory. I am reproducing that code as is:

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute it as: FileDecompressor <gzipped file name>
For example, when I executed it for my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location:
/tmp/Links.txt
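For completeness, one common way to compile and launch it looks like the following; the jar name hadooptests.jar is only a placeholder:

# Compile against the Hadoop classpath; -d . creates the package directories.
javac -cp "$(hadoop classpath)" -d . FileDecompressor.java
jar cf hadooptests.jar com/myorg/hadooptests/*.class
# Run with the Hadoop runtime and configuration on the classpath.
hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor /tmp/Links.txt.gz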
It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
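As a rough, untested sketch of that two-argument variant (the class name FileDecompressorToDir and the argument layout are my own, not from the book):

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressorToDir {
    public static void main(String[] args) throws Exception {
        String inputUri = args[0];   // e.g. /tmp/Links.txt.gz
        String outputDir = args[1];  // e.g. /tmp/unzipped
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(inputUri), conf);
        Path inputPath = new Path(inputUri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + inputUri);
            System.exit(1);
        }
        // Keep the input's base name minus the codec suffix (.gz),
        // and place the result under the requested output folder.
        String baseName = CompressionCodecFactory.removeSuffix(
                inputPath.getName(), codec.getDefaultExtension());
        Path outputPath = new Path(outputDir, baseName);
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(outputPath);
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}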
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
Using a Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain:
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 file contains the unzipped data. Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
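As an illustrative sketch of that parameterization (the parameter names INPUT_FILE, TMP_DIR, and OUTPUT_FILE and the script name unzip_one.pig are placeholders I chose), the script could rely on Pig parameter substitution:

A = LOAD '$INPUT_FILE' USING PigStorage();
STORE A INTO '$TMP_DIR' USING PigStorage();
mv $TMP_DIR/part-m-00000 $OUTPUT_FILE
rm $TMP_DIR/

and be invoked once per file with -param:

pig -param INPUT_FILE=/tmp/Links.txt.gz \
    -param TMP_DIR=/tmp/tmp_unzipped \
    -param OUTPUT_FILE=/tmp/unzipped/Links.txt \
    unzip_one.pig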
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.