Problem description
I have a large .csv file (about 300 MB) that is read from a remote host and parsed into a target file, but I don't need to copy every line to the target file. While copying, I need to read each line from the source and, if it passes some predicate, add it to the target file.
I suppose that Apache Commons CSV (apache.commons.csv) can only parse a whole file:
CSVFormat csvFileFormat = CSVFormat.EXCEL.withHeader();
CSVParser csvFileParser = new CSVParser("filePath", csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();
so I can't use a BufferedReader. Based on my code, a new CSVParser() instance would have to be created for each line, which looks inefficient.
How can I parse a single line (given the table's known header) in the case above?
Recommended answer
No matter what you do, all of the data in the file will be transferred to your local machine, because your system has to parse it to decide which rows qualify. Whether the file is streamed through the parser (so that you can examine each record as it arrives) or copied over in full before parsing, all of it still comes over to local. You need to get the data local first, then trim the excess.
Calling csvFileParser.getRecords() is already a lost battle, because the documentation explains that this method loads every row of the file into memory. To parse the records while conserving memory, iterate over them instead; the documentation implies that the following code loads one record into memory at a time:
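// Needs java.io.File and java.nio.charset.StandardCharsets.
// The charset is an assumption; use the file's actual encoding.
try (CSVParser csvFileParser = CSVParser.parse(new File("filePath"), StandardCharsets.UTF_8, csvFileFormat)) {
    for (CSVRecord csvRecord : csvFileParser) {
        // ... qualify the csvRecord; output qualified row to new file and flush as needed.
    }
}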
Since you explained that "filePath" is not local, the above solution is prone to failure from connectivity issues. To avoid them, I recommend copying the entire remote file to local storage, verifying the copy by comparing checksums, parsing the local copy to create your target file, and then deleting the local copy once you are done.
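As a rough sketch of the copy-and-verify step, the following uses only the JDK. It assumes the remote host can supply a SHA-256 checksum for the file by some separate means. The class name VerifiedCopy, both method names, and the expectedSha256 parameter are hypothetical and not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class VerifiedCopy {

    /** Returns the SHA-256 digest of a file as lowercase hex, reading it in chunks. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading through the DigestInputStream updates the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Copies the remote stream to a temporary local file and compares its
     * SHA-256 with a checksum obtained separately (an assumed input).
     * The temporary file is deleted if the checksums differ.
     */
    static Path copyAndVerify(InputStream remote, String expectedSha256)
            throws IOException, NoSuchAlgorithmException {
        Path local = Files.createTempFile("remote-copy-", ".csv");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING);
        if (!sha256Hex(local).equalsIgnoreCase(expectedSha256)) {
            Files.delete(local);
            throw new IOException("Checksum mismatch; the local copy is incomplete or corrupt");
        }
        return local;
    }
}
```

The returned path could then be passed to CSVParser.parse as shown earlier, with Files.deleteIfExists(local) called in a finally block once the target file has been written.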