Reading large XLS and XLSX files

Question

I'm aware of the existing posts on this topic, and I've made several attempts to reach my objective, as I describe below:

I have a .zip/.rar archive that contains multiple xls and xlsx files.

Each Excel file contains anywhere from dozens to thousands of rows and around 90 columns, give or take (each Excel file can have more or fewer columns).

I've created a Java WindowBuilder application in which I select a .zip/.rar file, choose where to extract the files to, and write them out using FileOutputStream. After each file is saved, I read its contents.

So far so good. After several attempts to avoid OOM (OutOfMemory) errors and speed things up, I've reached the 'final version' (which is quite awful, but it will do until I figure out how to read things properly), which I will explain:

File file = new File("certainFile.xlsx"); // or xls, for example purposes
Workbook wb;
Sheet sheet;
/*
 * There is a ton of other things up to this point that I don't consider relevant,
 * as they relate to unzipping, renaming, etc. This runs inside a loop.
 *
 * In every zip file there are at least 1 or 2 files that still cause an OOM in
 * WorkbookFactory.create(), because it recognizes that the file has a bit over a
 * million rows, meaning it's a 2007-format file (according to our friend Google.com),
 * or so I believe. When I open the xlsx file, it is indeed 10-20 MB in size and has
 * thousands of empty rows. When I save it again, it is 1 MB and has a couple of
 * thousand rows. After many attempts to read it as an InputStream or a File, or to
 * re-save it automatically, I resorted to converting it to a CSV and reading that
 * instead, hence this 'solution'. If parseAsXLS is true, I apply my regular
 * per-row, per-cell logic; otherwise I parse the CSV.
 */
if (file.getName().contains("xlsx")) {
    this.parseAsXLS = false;
    OPCPackage pkg = OPCPackage.open(file);
    // This just writes the content to a CSV file that I read later on; it gets overwritten on every iteration
    FileOutputStream fo = new FileOutputStream(this.filePath + File.separator + "excel.csv");
    PrintStream ps = new PrintStream(fo);
    XLSX2CSV xlsxCsvConverter = new XLSX2CSV(pkg, ps, 90);
    try {
        xlsxCsvConverter.process();
    } catch (Exception e) {
        // I've added a counter to the XLSX2CSV class to limit the number of rows I fetch; it throws an exception on purpose
        System.out.println("Limited the file at 60k rows");
    }
} else {
    this.parseAsXLS = true;
    this.wb = WorkbookFactory.create(file);
    this.sheet = wb.getSheetAt(0);
}

What happens now is that one .xlsx (from a .zip file with several other .xls and .xlsx files) contains a certain character in a row, and XLSX2CSV treats it as the end of the row, which results in incorrect output.

Here is an example: image link
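
The symptom described above is what typically happens when a quoted CSV field contains a line break and the CSV is then read one line at a time. The following is a minimal sketch in plain Java, not the asker's code, of a quote-aware parser that keeps such line breaks inside the field. The class name QuotedCsvReader and its parse method are hypothetical, and the sketch assumes fields are wrapped in double quotes with embedded quotes doubled, which the XLSX2CSV class shown further down does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits CSV text into records without breaking on quoted line breaks. */
final class QuotedCsvReader {

    /** Parses the whole CSV text; each record is returned as a list of field values. */
    static List<List<String>> parse(String csv) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < csv.length(); i++) {
            char ch = csv.charAt(i);
            if (inQuotes) {
                if (ch == '"') {
                    if (i + 1 < csv.length() && csv.charAt(i + 1) == '"') {
                        field.append('"'); // a doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(ch);      // commas and line breaks stay in the field
                }
            } else if (ch == '"') {
                inQuotes = true;
            } else if (ch == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (ch == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (ch != '\r') {
                field.append(ch);
            }
        }
        // Flush the last record when the text does not end with a line break.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "1,\"a@x.com,\nb@x.com\",z\n2,\"c@x.com\",y\n";
        for (List<String> r : parse(csv)) {
            System.out.println(r.size() + " fields: " + r);
        }
    }
}
```

A line-based reader would see three lines in the sample input, while this parser returns two records of three fields each.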

Note: The objective is to fetch only a certain set of columns that the files have in common (or might have; they are not mandatory) from each Excel file and put them together in a new Excel file. The email column (which contains multiple emails separated by commas) has what I believe to be a line break (an 'enter') before the email, because if I erase it manually, the problem goes away. However, the objective is to avoid manually opening every Excel file and fixing it; otherwise I'd just open each one and copy-paste the columns I need. In that example, I'd need columns fieldAA, fieldAG, fieldAL and fieldAN.
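
Because each file may have more or fewer columns, selecting by header name rather than by fixed position is one way to pick out the wanted set. The sketch below assumes that fieldAA, fieldAG, fieldAL and fieldAN are header names found in the first row; the class ColumnPicker and its methods are hypothetical and work on rows that have already been read into lists of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: projects rows onto a set of columns chosen by header name. */
final class ColumnPicker {

    /** Maps each wanted header to its position in the header row, or -1 if the file lacks it. */
    static Map<String, Integer> locate(List<String> headerRow, List<String> wanted) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (String name : wanted) {
            positions.put(name, headerRow.indexOf(name));
        }
        return positions;
    }

    /** Returns the wanted cells of one row, in the order of the wanted headers. */
    static List<String> project(List<String> row, Map<String, Integer> positions) {
        List<String> picked = new ArrayList<>();
        for (int index : positions.values()) {
            // Missing columns and short rows yield an empty cell instead of failing.
            picked.add(index >= 0 && index < row.size() ? row.get(index) : "");
        }
        return picked;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("fieldA", "fieldAA", "fieldAG");
        List<String> wanted = Arrays.asList("fieldAA", "fieldAG", "fieldAL", "fieldAN");
        Map<String, Integer> positions = locate(header, wanted);
        System.out.println(project(Arrays.asList("x", "y", "z"), positions));
    }
}
```

Columns that a given file does not contain come out as empty cells, so every output row has the same width regardless of the source file.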

XLSX2CSV.java (I'm not the creator of this file; I just adapted it to my needs)

import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * A rudimentary XLSX -> CSV processor modeled on the
 * POI sample program XLS2CSVmra from the package
 * org.apache.poi.hssf.eventusermodel.examples.
 * As with the HSSF version, this tries to spot missing
 *  rows and cells, and output empty entries for them.
 * <p>
 * Data sheets are read using a SAX parser to keep the
 * memory footprint relatively small, so this should be
 * able to read enormous workbooks.  The styles table and
 * the shared-string table must be kept in memory.  The
 * standard POI styles table class is used, but a custom
 * (read-only) class is used for the shared string table
 * because the standard POI SharedStringsTable grows very
 * quickly with the number of unique strings.
 * <p>
 * For a more advanced implementation of SAX event parsing
 * of XLSX files, see {@link XSSFEventBasedExcelExtractor}
 * and {@link XSSFSheetXMLHandler}. Note that for many cases,
 * it may be possible to simply use those with a custom
 * {@link SheetContentsHandler} and no SAX code needed of
 * your own!
 */
public class XLSX2CSV {
    /**
     * Uses the XSSF Event SAX helpers to do most of the work
     *  of parsing the Sheet XML, and outputs the contents
     *  as a (basic) CSV.
     */
    private class SheetToCSV implements SheetContentsHandler {
        private boolean firstCellOfRow;
        private int currentRow = -1;
        private int currentCol = -1;
        private int maxrows = 60000;



        private void outputMissingRows(int number) {

            for (int i=0; i<number; i++) {
                for (int j=0; j<minColumns; j++) {
                    output.append(',');
                }
                output.append('\n');
            }
        }

        @Override
        public void startRow(int rowNum) {
            // If there were gaps, output the missing rows
            outputMissingRows(rowNum-currentRow-1);
            // Prepare for this row
            firstCellOfRow = true;
            currentRow = rowNum;
            currentCol = -1;

            if (rowNum == maxrows) {
                throw new RuntimeException("Force stop at maxrows");
            }
        }

        @Override
        public void endRow(int rowNum) {
            // Ensure the minimum number of columns
            for (int i=currentCol; i<minColumns; i++) {
                output.append(',');
            }
            output.append('\n');
        }

        @Override
        public void cell(String cellReference, String formattedValue,
                XSSFComment comment) {
            if (firstCellOfRow) {
                firstCellOfRow = false;
            } else {
                output.append(',');
            }

            // gracefully handle missing CellRef here in a similar way as XSSFCell does
            if(cellReference == null) {
                cellReference = new CellAddress(currentRow, currentCol).formatAsString();
            }

            // Did we miss any cells?
            int thisCol = (new CellReference(cellReference)).getCol();
            int missedCols = thisCol - currentCol - 1;
            for (int i=0; i<missedCols; i++) {
                output.append(',');
            }
            currentCol = thisCol;

            // Number or string?
            try {
                //noinspection ResultOfMethodCallIgnored
                Double.parseDouble(formattedValue);
                output.append(formattedValue);
            } catch (NumberFormatException e) {
                output.append('"');
                output.append(formattedValue);
                output.append('"');
            }
        }

        @Override
        public void headerFooter(String arg0, boolean arg1, String arg2) {
            // TODO Auto-generated method stub

        }
    }


    ///////////////////////////////////////

    private final OPCPackage xlsxPackage;

    /**
     * Number of columns to read starting with leftmost
     */
    private final int minColumns;

    /**
     * Destination for data
     */
    private final PrintStream output;

    /**
     * Creates a new XLSX -> CSV converter
     *
     * @param pkg        The XLSX package to process
     * @param output     The PrintStream to output the CSV to
     * @param minColumns The minimum number of columns to output, or -1 for no minimum
     */
    public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
        this.xlsxPackage = pkg;
        this.output = output;
        this.minColumns = minColumns;
    }

    /**
     * Parses and shows the content of one sheet
     * using the specified styles and shared-strings tables.
     *
     * @param styles The table of styles that may be referenced by cells in the sheet
     * @param strings The table of strings that may be referenced by cells in the sheet
     * @param sheetInputStream The stream to read the sheet-data from.

     * @exception java.io.IOException An IO exception from the parser,
     *            possibly from a byte stream or character stream
     *            supplied by the application.
     * @throws SAXException if parsing the XML data fails.
     */
    public void processSheet(
            StylesTable styles,
            ReadOnlySharedStringsTable strings,
            SheetContentsHandler sheetHandler,
            InputStream sheetInputStream) throws IOException, SAXException {
        DataFormatter formatter = new DataFormatter();
        InputSource sheetSource = new InputSource(sheetInputStream);
        try {
            XMLReader sheetParser = SAXHelper.newXMLReader();
            ContentHandler handler = new XSSFSheetXMLHandler(
                  styles, null, strings, sheetHandler, formatter, false);
            sheetParser.setContentHandler(handler);
            sheetParser.parse(sheetSource);
         } catch(ParserConfigurationException e) {
            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
         }
    }

    /**
     * Initiates the processing of the XLSX workbook file to CSV.
     *
     * @throws IOException If reading the data from the package fails.
     * @throws SAXException if parsing the XML data fails.
     */
    public void process() throws IOException, OpenXML4JException, SAXException {
        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
        XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            try (InputStream stream = iter.next()) {
                processSheet(styles, strings, new SheetToCSV(), stream);
            }
            ++index;
        }
    }
}

I'm looking for different (and working) approaches to reach my objective.

Thank you for your time.

Recommended answer

Okay, so I tried replicating your Excel file and completely threw XLSX2CSV out the window. I don't think converting the xlsx into a CSV is the right approach because, depending on the XLSX format, it can read all the empty rows (you probably know that, since you set a row limit of 60k). Not only that, but fields containing special characters may or may not produce incorrect output, as in your problem.
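
To illustrate the special-character point: the cell() method in the question wraps a value in quotes but does not escape quotes inside it, and it writes line breaks through unchanged. If the CSV route were kept, a conventional fix is to quote any field containing a comma, quote or line break and to double embedded quotes. The following is a minimal sketch of that rule; the class CsvEscape is hypothetical and not part of Apache POI.

```java
/** Hypothetical helper: escapes one value for use as a CSV field. */
final class CsvEscape {

    /** Quotes the value when needed and doubles any embedded double quotes. */
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.indexOf(',') >= 0
                || value.indexOf('"') >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // A value with a leading line break and a comma, similar to the email column.
        System.out.println(escape("\na@x.com,b@x.com"));
        System.out.println(escape("plain"));
    }
}
```

This only helps if whatever reads the CSV also understands quoted fields; a reader that splits on every line break will still see a broken row.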

What I did was use this library, https://github.com/davidpelfree/sjxlsx, to read and re-write the file. It's pretty straightforward, and the fields in the newly generated xlsx file are corrected.

I suggest you try this approach (maybe not with this library) of re-writing the file in order to correct it.
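
Independently of which library does the reading and writing, the correction step itself can be sketched in plain Java. The example below assumes the rows have already been read into lists of strings; it replaces line breaks inside cells with a space and drops the trailing run of empty rows. The class RowCleaner is hypothetical and does not use the sjxlsx API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: cleans rows before they are written to a new file. */
final class RowCleaner {

    /** Replaces line breaks with a single space and trims surrounding whitespace. */
    static String cleanCell(String value) {
        if (value == null) {
            return "";
        }
        return value.replaceAll("[\\r\\n]+", " ").trim();
    }

    /** A row is blank when every cell is null or contains only whitespace. */
    static boolean isBlank(List<String> row) {
        for (String cell : row) {
            if (cell != null && !cell.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** Drops trailing blank rows and cleans every remaining cell. */
    static List<List<String>> clean(List<List<String>> rows) {
        int last = rows.size();
        while (last > 0 && isBlank(rows.get(last - 1))) {
            last--; // skip the empty tail; blank rows between data rows are kept
        }
        List<List<String>> cleaned = new ArrayList<>();
        for (int i = 0; i < last; i++) {
            List<String> out = new ArrayList<>();
            for (String cell : rows.get(i)) {
                out.add(cleanCell(cell));
            }
            cleaned.add(out);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id1", "\na@x.com,b@x.com"),
                Arrays.asList("", ""),
                Arrays.asList("id2", "c@x.com"),
                Arrays.asList("", " "));
        System.out.println(clean(rows));
    }
}
```

For files with over a million mostly empty rows, the same two checks would need to run while streaming rather than on an in-memory list, since holding every row defeats the purpose of avoiding the OOM.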
