Problem Description
I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?
I can think of 3 different ways to achieve it.
Using the Linux command line
The following command worked for me.
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz and the output gets stored in /tmp/unzipped/Links.txt.
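Since the question asks about a whole folder of .gz files, this one-file command can be wrapped in a small shell loop. Here is a minimal sketch, assuming a reasonably recent Hadoop, a placeholder source folder /tmp/gz_files, and file names without spaces:

#!/bin/bash
# Decompress every .gz file in one HDFS folder into another HDFS folder.
# SRC_DIR and DST_DIR are illustrative; adjust them to your own paths.
SRC_DIR=/tmp/gz_files
DST_DIR=/tmp/unzipped

hadoop fs -mkdir -p "$DST_DIR"
# List the source folder and keep only the paths ending in .gz (the last field).
for f in $(hadoop fs -ls "$SRC_DIR" | awk '/\.gz$/ {print $NF}'); do
    name=$(basename "$f" .gz)    # e.g. Links.txt.gz -> Links.txt
    hadoop fs -cat "$f" | gzip -d | hadoop fs -put - "$DST_DIR/$name"
done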
Using a Java program
In the book Hadoop: The Definitive Guide, there is a section on Codecs. In that section, there is a program that decompresses the output using CompressionCodecFactory. I am reproducing that code as is:

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute it as: FileDecompressor <gzipped file name>
For example, when I executed it for my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location:
/tmp/Links.txt
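For completeness, one common way to compile and launch it looks like the following; the jar name hadooptests.jar is only a placeholder:

# Compile against the Hadoop classpath; -d . creates the package directories.
javac -cp "$(hadoop classpath)" -d . FileDecompressor.java
jar cf hadooptests.jar com/myorg/hadooptests/*.class
# Run with the Hadoop runtime and configuration on the classpath.
hadoop jar hadooptests.jar com.myorg.hadooptests.FileDecompressor /tmp/Links.txt.gz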
It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
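As a rough, untested sketch of that two-argument variant (the class name FileDecompressorToDir and the argument layout are my own, not from the book):

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressorToDir {
    public static void main(String[] args) throws Exception {
        String inputUri = args[0];   // e.g. /tmp/Links.txt.gz
        String outputDir = args[1];  // e.g. /tmp/unzipped
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(inputUri), conf);
        Path inputPath = new Path(inputUri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + inputUri);
            System.exit(1);
        }
        // Keep the input's base name minus the codec suffix (.gz),
        // and place the result under the requested output folder.
        String baseName = CompressionCodecFactory.removeSuffix(
                inputPath.getName(), codec.getDefaultExtension());
        Path outputPath = new Path(outputDir, baseName);
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(outputPath);
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}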
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
Using a Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain:
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 file contains the unzipped data. Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
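As an illustrative sketch of that parameterization (the parameter names INPUT_FILE, TMP_DIR, and OUTPUT_FILE and the script name unzip_one.pig are placeholders I chose), the script could rely on Pig parameter substitution:

A = LOAD '$INPUT_FILE' USING PigStorage();
STORE A INTO '$TMP_DIR' USING PigStorage();
mv $TMP_DIR/part-m-00000 $OUTPUT_FILE
rm $TMP_DIR/

and be invoked once per file with -param:

pig -param INPUT_FILE=/tmp/Links.txt.gz \
    -param TMP_DIR=/tmp/tmp_unzipped \
    -param OUTPUT_FILE=/tmp/unzipped/Links.txt \
    unzip_one.pig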
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.