Correctly detecting the line endings of files in Perl?

Question

Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file, I don't know whether it has Windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:

while (<$fh>){
    tr/\r\n//d;
    my @fields = split /,/, $_;
    # ...
}

On *nix, deleting \n is equivalent to chomping, and deleting \r (CR) additionally cleans up the line if the file was produced on Windows.

But now I want to use Text::CSV_XS, because I'm starting to get weirder data files with quoted data, potentially with embedded line breaks, etc. In order to get this module to read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above, apply tr/\n\r//d, and then parse it with Text::CSV, because that wouldn't handle embedded line breaks properly.) How do I properly detect whether an arbitrary file uses Windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?
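To illustrate why the line-by-line approach breaks down, here is a minimal sketch using only core Perl. The sample data is made up for illustration: a header plus one record whose quoted field contains a line break. Reading it with the loop from above yields three physical lines for two logical records, so a later split on commas would see a truncated record.

```perl
use strict;
use warnings;

# Made-up sample: a header and ONE record whose quoted field spans two lines.
my $data = qq{id,note\r\n1,"first line\r\nsecond line"\r\n};

# Read the string through an in-memory filehandle, line by line.
open( my $fh, '<', \$data ) or die "Cannot open in-memory file: $!";

my @physical_lines;
while ( my $line = <$fh> ) {
    $line =~ tr/\r\n//d;    # same cleanup as in the loop above
    push @physical_lines, $line;
}
close $fh;

# The quoted field has been cut in half across two array elements.
print scalar(@physical_lines), " physical lines for 2 logical records\n";
print "  [$_]\n" for @physical_lines;
```

A CSV-aware parser avoids this because it tracks quoting while it reads, rather than treating every line break as the end of a record.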

I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my data files via dos2unix, because the files are huge (hundreds of gigabytes), and spending 10+ minutes on each file to deal with something so simple seems silly. I thought about writing a function that reads the first several hundred bytes of a file and counts LFs vs. CRLFs, but I refuse to believe there isn't a better solution.

Any help?

Note: every file has either entirely Windows line endings or entirely *nix line endings, i.e., the two are never mixed in a single file.

Recommended answer

You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. This will silently map any CR/LF pairs to single line feeds, but that's presumably what you want.

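use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );

# The :crlf layer translates CRLF to LF as the file is read.
open( my $fh, '<:crlf', 'data.csv' ) or die "Cannot open data.csv: $!";

while ( my $row = $csv->getline( $fh ) ) {
    # $row is an array reference holding the fields of one record
}
close $fh;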
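If an explicit end-of-line value is still wanted, the sampling idea from the question could be sketched roughly as follows. This is only an illustration under stated assumptions: detect_eol and its 4096-byte default are hypothetical names and values, not part of any module, and the majority vote is a heuristic that relies on the file using one style throughout.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the line ending from the first bytes of a file.
# Returns "\r\n" when CRLF pairs outnumber bare LFs in the sample, else "\n".
sub detect_eol {
    my ( $path, $sample_size ) = @_;
    $sample_size //= 4096;    # assumed default; adjust as needed

    # :raw disables any newline translation so the bytes are seen as stored.
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    my $n = read( $fh, my $buf, $sample_size );
    die "Cannot read $path: $!" unless defined $n;
    close $fh;

    my $crlf = () = $buf =~ /\r\n/g;         # number of CRLF pairs
    my $lf   = () = $buf =~ /(?<!\r)\n/g;    # number of LFs not preceded by CR

    return $crlf > $lf ? "\r\n" : "\n";
}
```

The returned string could then be passed as the eol value when constructing the parser. Only the sample is read, so the cost does not grow with file size, but a sample that happens to contain mostly embedded line breaks of the other style would mislead it.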
