Reading a huge ZIP file in Java: out-of-memory error

Question

I am reading a ZIP file in Java as shown below:

Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    // do stuff..
}

I am getting an out-of-memory error. The ZIP file is about 160 MB. The stack trace is as follows:

Exception in thread "Timer-0" java.lang.OutOfMemoryError: Java heap space
at java.util.zip.InflaterInputStream.<init>(InflaterInputStream.java:88)
at java.util.zip.ZipFile$1.<init>(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:229)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:197)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.zipFilePass2(DatToInsertDBBatch.java:250)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.processCompany(DatToInsertDBBatch.java:206)
at com.aesthete.csmart.batches.batchproc.DatToInsertDBBatch.run(DatToInsertDBBatch.java:114)
at java.util.TimerThread.mainLoop(Timer.java:534)
at java.util.TimerThread.run(Timer.java:484)

How do I enumerate the contents of a big ZIP file without having to increase my heap size? Also, when I don't enumerate the contents and just access a single file like this:

ZipFile zip=new ZipFile(zipFile);
ZipEntry ze=zip.getEntry("docxml.xml");

then I don't get an out-of-memory error. Why does this happen? How does a Zip file handle zip entries? The other option would be to use a ZipInputStream. Would that have a smaller memory footprint? I eventually need to run this code on a micro EC2 instance in the Amazon cloud (613 MB RAM).

Here is more information about what I do with the ZIP entries after I get them:

Enumeration<? extends ZipEntry> zes = zip.entries();
while (zes.hasMoreElements()) {
    ZipEntry ze = zes.nextElement();
    S3Object s3Object = new S3Object(bkp.getCompanyFolder() + map.get(ze.getName()).getRelativeLoc());
    s3Object.setDataInputStream(zip.getInputStream(ze));
    s3Object.setStorageClass(S3Object.STORAGE_CLASS_REDUCED_REDUNDANCY);
    s3Object.addMetadata("x-amz-server-side-encryption", "AES256");
    s3Object.setContentType(Mimetypes.getInstance().getMimetype(s3Object.getKey()));
    s3Object.setContentDisposition("attachment; filename=" + FilenameUtils.getName(s3Object.getKey()));
    s3objs.add(s3Object);
}

I get the input stream for each ZipEntry and store it in the S3Object. I collect all the S3Objects in a list and then finally upload them to Amazon S3. For those who don't know Amazon S3, it's a file storage service. You upload the file via HTTP.

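I was thinking that maybe this is happening because I collect all the individual input streams? Would it help if I batched them, say 100 input streams at a time? Or would it be better to extract the archive first and upload the extracted files instead of storing streams?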

Recommended answer

It is very unlikely that you get an out-of-memory exception merely because of processing a ZIP file. The Java classes ZipFile and ZipEntry don't contain anything that could possibly fill up 613 MB of memory.

What could exhaust your memory is keeping the decompressed files of the ZIP archive in memory or, even worse, keeping them as an XML DOM, which is very memory intensive.

Switching to another ZIP library will hardly help. Instead, you should look into changing your code so that it processes the ZIP archive and the contained files as streams and only keeps a limited part of each file in memory at a time.
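As a rough illustration of this streaming approach, the following sketch opens one entry at a time, copies it through a single reused buffer, and closes the entry's stream before moving on to the next one. The class name `StreamingZipReader`, the `copyAll` helper and the 8 KB buffer size are assumptions made for this example and are not part of the original code; the `OutputStream` merely stands in for wherever the data actually goes (for example an upload).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only.
final class StreamingZipReader {

    /**
     * Copies every non-directory entry of the archive to the sink,
     * one entry at a time, and returns the number of decompressed bytes copied.
     */
    static long copyAll(String zipPath, OutputStream sink) throws IOException {
        byte[] buffer = new byte[8192]; // one buffer, reused for every entry
        long total = 0;
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // Open the stream only when the entry is about to be consumed,
                // and close it before the next entry is touched.
                try (InputStream in = zip.getInputStream(entry)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        sink.write(buffer, 0, n);
                        total += n;
                    }
                }
            }
        }
        return total;
    }
}
```

With this shape, at most one entry stream is open at any moment, so memory use should not grow with the number of entries in the archive.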

BTW: It would be nice if you could provide more information about the huge ZIP files (do they contain many small files or a few large files?) and about what you do with each ZIP entry.

Update:

Thanks for the additional information. It looks like you keep the contents of the ZIP file in memory (although that depends to some extent on the implementation of the S3Object class, which I don't know).

It's probably best to implement some sort of batching, as you propose yourself. For example, you could add up the decompressed size of each ZIP entry and upload the collected files every time the total size exceeds 100 MB.
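The batching idea could be sketched roughly as follows. Entries are grouped so that a batch is closed as soon as its summed decompressed size exceeds a threshold; each batch would then be opened, uploaded and closed before the next one is started. `ZipBatcher`, `planBatches` and the `maxBatchBytes` parameter are hypothetical names for this sketch, and the upload step is left out because it depends on the S3 client in use. `ZipEntry.getSize()` returns -1 when the size is unknown, which the sketch counts as 0.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only.
final class ZipBatcher {

    /**
     * Groups entry names into batches. A batch is closed as soon as the sum
     * of the decompressed sizes of its entries exceeds maxBatchBytes.
     */
    static List<List<String>> planBatches(String zipPath, long maxBatchBytes) throws IOException {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                current.add(entry.getName());
                // getSize() is -1 for unknown sizes; count those as 0.
                currentBytes += Math.max(entry.getSize(), 0);
                if (currentBytes > maxBatchBytes) {
                    batches.add(current); // this is where the batch would be uploaded
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // remaining entries form the last batch
        }
        return batches;
    }
}
```

For the 100 MB limit mentioned above, the call would be `ZipBatcher.planBatches(path, 100L * 1024 * 1024)`. A single entry larger than the limit still ends up in a batch of its own, so very large entries would need to be streamed individually.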
