How to handle commas in a CSV file being read by a bash script

Problem description

I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).

The content of the CSV file looks similar to this:

Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

I have some code that can separate the fields using the comma as the delimiter, but some values actually contain commas, such as "Adygeya, Republic". These values are surrounded by quotes to indicate that the characters within should be treated as part of the field, but I don't know how to parse the file to take this into account.

Currently I have this loop:

while IFS=, read -r province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$input"

This produces the following output for the sample data given above:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]

As you can see, the third entry is parsed incorrectly. I want it to output:

[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
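
The mis-parse comes from how `read` splits its input: every character listed in IFS is a separator, quotes are ordinary characters, and whatever is left after the last variable has been filled goes into that variable unsplit. A minimal sketch that makes this visible by counting the pieces a comma split produces (the function name `count_comma_pieces` is made up for this illustration):

```shell
line='"Adygeya, Republic",RU-AD,21250,RU,Russian Federation'

# Count the pieces produced by splitting on every comma, which is what
# IFS=, does. The subshell keeps the IFS and globbing changes local.
count_comma_pieces() (
  IFS=,
  set -f        # no pathname expansion while the argument is split unquoted
  set -- $1     # deliberately unquoted so that the shell splits on IFS
  echo "$#"
)

count_comma_pieces "$line"   # prints 6, although the record has 5 fields
```

Six pieces spread over five variables is why the last variable ends up holding "RU,Russian Federation".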

Recommended answer

If you want to do it all in awk (GNU awk 4 or later is required for this script to work as intended):

awk '{
  for (i = 1; i <= NF; i++) {
    # Strip the surrounding double quotes from a quoted field.
    if (substr($i, 1, 1) == "\"")
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : ORS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile

Sample output:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{
  for (i = 1; i <= NF; i++) {
    # Strip the surrounding double quotes from a quoted field.
    if (substr($i, 1, 1) == "\"")
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : ORS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
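
One limitation of the FPAT pattern above is that it requires every field to contain at least one character, so a record with an empty field such as `a,,c` would not yield three fields. The gawk manual suggests changing the first `+` to `*` for that case. A hedged sketch of that variant follows; the wrapper name `csv_brackets` is invented here, and it assumes `gawk` is installed under that name:

```shell
# Hypothetical wrapper: print every CSV field in brackets.
# Assumes GNU awk (gawk) 4 or later; FPAT is a gawk extension.
csv_brackets() {
  gawk -v FPAT='([^,]*)|("[^"]+")' '{
    for (i = 1; i <= NF; i++) {
      field = $i
      # Strip the wrapping quotes from a quoted field.
      if (field ~ /^".*"$/)
        field = substr(field, 2, length(field) - 2)
      printf "[%s]%s", field, (i < NF ? OFS : ORS)
    }
  }' "$@"
}
```

It can be called as `csv_brackets infile`, or fed from a pipe. Doubled quotes inside a quoted field are still not handled by this pattern.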

With Perl:

perl -MText::ParseWords -lne'
 print join " ", map "[$_]",
   parse_line(",",0, $_);
  ' infile

This should work with your awk version (based on a c.u.s. (comp.unix.shell) post; it also replaces the embedded commas with spaces):

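awk '{
  n = parse_csv($0, data)
  for (i = 1; i <= n; i++) {
    # Replace each embedded comma (and any spaces after it) with one space.
    gsub(/, */, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : ORS)
  }
}
# Split str on commas, honoring double-quoted fields; returns the field count.
function parse_csv(str, array,   field, i) {
  split("", array)          # clear the array
  str = str ","             # make sure every field ends with a comma
  while (match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/)) {
    field = substr(str, 1, RLENGTH)
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)   # drop wrapping quotes and the trailing comma
    gsub(/""/, "\"", field)                   # unescape doubled quotes
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile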
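
The scripts above only print the fields. To keep working with them in the original `while read` loop, one option is to let awk re-emit each record with a delimiter that cannot occur in the data and have `read` split on that instead. A minimal sketch, assuming no field contains a `|` or a newline (the delimiter choice and the function name `to_pipe_delimited` are illustrative, not part of the original answer):

```shell
# Convert simple CSV (quoted fields may contain commas) to "|"-delimited
# text using only POSIX awk features.
# Assumption: no field contains "|" or a newline.
to_pipe_delimited() {
  awk '{
    str = $0 ","                      # make sure every field ends with a comma
    out = ""
    n = 0
    while (match(str, /^[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/)) {
      field = substr(str, 1, RLENGTH - 1)       # field without its comma
      sub(/^[ \t]+/, "", field)
      sub(/[ \t]+$/, "", field)
      if (field ~ /^".*"$/) {
        field = substr(field, 2, length(field) - 2)   # drop wrapping quotes
        gsub(/""/, "\"", field)                       # unescape doubled quotes
      }
      out = (n == 0 ? field : out "|" field)
      n++
      str = substr(str, RLENGTH + 1)
    }
    print out
  }' "$@"
}

printf '%s\n' \
  'Piaui,BR-PI,20100,BR,Brazil' \
  '"Adygeya, Republic",RU-AD,21250,RU,Russian Federation' |
to_pipe_delimited |
while IFS='|' read -r province provinceCode criteriaId countryCode country
do
  echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done
# [Piaui] [BR-PI] [20100] [BR] [Brazil]
# [Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
```

A non-whitespace delimiter is used on purpose: with a tab or space in IFS, `read` collapses runs of the delimiter, so empty fields would silently disappear. With a real file, the pipeline can start with `to_pipe_delimited "$input"` instead of `printf`.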
