← Back to list

使用POI和SAX读取Excel

Published on: | Views: 139

最近做了一个excel导入功能,在线上使用时崩溃了,看了一下,数据量也不大,就几W条, 也没有具体的错误信息,猜测可能是内存崩溃了。 原来的代码如下:

    Workbook workbook = new XSSFWorkbook(inputStream);
         Sheet sheet = workbook.getSheetAt(0);
         int i = 0;
         for (Row row : sheet) {
        //...
    }
}

根据官网的解释, 这样写的话,会先把整个excel文件加载解析到内存中,所以会占用比较多的内存。 如果excel文件太大,可以使用SAX事件来解析,具体的文档参考 : https://poi.apache.org/components/spreadsheet/how-to.html#xssf_sax_api

然后文档里有个示例代码,搬过来就可以了。 初看示例代码有点迷茫,所以可以先把整个excel解析后的xml打印出来,然后对照阅读就没有什么问题了。 原始excel如下: image.png 解析出来的xml数据如下:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing"
    xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:etc="http://www.wps.cn/officeDocument/2017/etCustomData">
    <sheetPr/>
    <dimension ref="A1:F5"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0">
            <selection activeCell="D6" sqref="D6"/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultColWidth="9" defaultRowHeight="15" outlineLevelRow="4" outlineLevelCol="5"/>
    <cols>
        <col min="1" max="2" width="21.5583333333333" style="1" customWidth="1"/>
        <col min="3" max="3" width="12.775" style="1" customWidth="1"/>
        <col min="4" max="4" width="40.1083333333333" style="1" customWidth="1"/>
        <col min="5" max="5" width="5.55833333333333" style="1" customWidth="1"/>
        <col min="6" max="6" width="19.8833333333333" style="1" customWidth="1"/>
        <col min="7" max="16384" width="8.88333333333333" style="1"/>
    </cols>
    <sheetData>
        <row r="1" spans="1:6">
            <c r="A1" s="1" t="s">
                <v>0</v>
            </c>
            <c r="B1" s="1" t="s">
                <v>1</v>
            </c>
            <c r="C1" s="1" t="s">
                <v>2</v>
            </c>
            <c r="D1" s="1" t="s">
                <v>3</v>
            </c>
            <c r="E1" s="1" t="s">
                <v>4</v>
            </c>
            <c r="F1" s="1" t="s">
                <v>5</v>
            </c>
        </row>
        <row r="2" spans="1:6">
            <c r="A2" s="1" t="s">
                <v>0</v>
            </c>
            <c r="B2" s="1" t="s">
                <v>1</v>
            </c>
            <c r="C2" s="1" t="s">
                <v>6</v>
            </c>
            <c r="D2" s="1" t="s">
                <v>7</v>
            </c>
            <c r="E2" s="1" t="s">
                <v>4</v>
            </c>
            <c r="F2" s="1" t="s">
                <v>8</v>
            </c>
        </row>
      //....省略
    </sheetData>
    <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3"/>
    <headerFooter/>
</worksheet>

处理流程

示例数据:

 <row r="1" spans="1:6">
            <c r="A1" s="1" t="s">
                <v>0</v>
            </c>
           ...
 </row>

其中: row表示元素为行 c表示元素为列 v表示元素为值,它的数据是一个偏移量 r是列号或者行号 spans表示行数据的跨度 s是? t是数据类型,值为"s"表示字符串

事件处理流程就是这样的:

1. row元素开始
2. c元素开始
3. v元素开始
4. v元素结束
5. c元素结束
6. row元素结束

相关代码中的处理流程:

1. startElement  <row r="1" spans="1:6">
2. startElement  <c r="A1" s="1" t="s">
3. startElement <v>
4. characters 0
5. endElement </v>
6. endElement </c>
7. endElement </row>

封装

为了简便处理导出,写了一个简单的封装类:

public class ExcelImporter {
    /**
     * @param inputStream    输入流,处理结束后自动关闭
     * @param contentHandler 内容处理器
     * @param rId            Normally it's of the form rId# or rSheet#, example rId1
     */
    public static void importFromInputStream(InputStream inputStream,
                                             ExcelSheetHandler contentHandler,
                                             String rId) {
        try {
            OPCPackage pkg = OPCPackage.open(inputStream);
            XSSFReader xssfReader = new XSSFReader(pkg);
            SharedStringsTable sharedStringsTable = xssfReader.getSharedStringsTable();
            XMLReader parser = XMLHelper.newXMLReader();
            contentHandler.init(sharedStringsTable);
            parser.setContentHandler(contentHandler);
            InputStream sheet = xssfReader.getSheet(rId);
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
public abstract class ExcelSheetHandler<T> extends DefaultHandler {
    private SharedStringsTable sst;
    private StringBuilder lastContents = new StringBuilder();
    private StringBuilder lastColumnLabel = new StringBuilder();
    private T dto;
    private boolean nextIsString;
    private Integer rowIndex = 0;

    public void startElement(String uri, String localName, String name,
                             Attributes attributes) throws SAXException {
        if (rowIndex < getSkipHeaderRowCount()) {
            return;
        }
        if (name.equals("row")) {
            //行开始
            if (dto == null) {
                dto = createData();
            }
            clearData(dto);
        }
        // c => cell
        if (name.equals("c")) {
            lastColumnLabel.setLength(0);
            lastColumnLabel.append(attributes.getValue("r"));
            // Figure out if the value is an index in the SST
            String cellType = attributes.getValue("t");
            nextIsString = cellType != null && cellType.equals("s");
        }
        // Clear contents cache
        lastContents.setLength(0);
    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (rowIndex < getSkipHeaderRowCount()) {
            if (name.equals("row")) {
                rowIndex++;
            }
            return;
        }

        if (name.equals("row")) {
            onRowEnd(dto, rowIndex);
            rowIndex++;
        }

        // Process the last contents as required.
        // Do now, as characters() may be called more than once
        if (nextIsString) {
            int idx = Integer.parseInt(lastContents.toString());
            lastContents.setLength(0);
            lastContents.append(sst.getItemAt(idx).getString());
            nextIsString = false;
        }
        // v => contents of a cell
        // Output after we've seen the string contents
        if (name.equals("v")) {
            onColEnd(dto, rowIndex, lastColumnLabel.toString(), lastContents.toString());
        }
    }

    public void characters(char[] ch, int start, int length) {
        lastContents.append(ch, start, length);
    }

    public void init(SharedStringsTable sst) {
        this.sst = sst;
    }

    protected int getSkipHeaderRowCount() {
        return 0;
    }

    protected abstract T createData();

    protected abstract void clearData(T dto);

    protected abstract void onRowEnd(T dto, int row);

    protected abstract void onColEnd(T dto, int row, String label, String value);
}