Problem description
I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).
The content of the CSV file looks similar to this:
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
I have some code that can separate the fields using the comma as the delimiter, but some values actually contain commas, such as "Adygeya, Republic". These values are surrounded by quotes to indicate that the characters within should be treated as part of the field, but I don't know how to parse the line to take this into account.
At the moment I have this loop:
while IFS=, read province provinceCode criteriaId countryCode country
do
echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input
which produces this output for the sample data given above:
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]
As you can see, the third entry is parsed incorrectly. I want it to output
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
Recommended answer
If you want to do it all in awk (GNU awk 4 or later is required, because the script relies on FPAT):
awk '{
  for (i = 0; ++i <= NF;) {
    # strip the surrounding double quotes from a quoted field
    substr($i, 1, 1) == "\"" &&
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile
Sample output:
% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{
  for (i = 0; ++i <= NF;) {
    substr($i, 1, 1) == "\"" &&
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
With Perl:
perl -MText::ParseWords -lne'
print join " ", map "[$_]",
parse_line(",",0, $_);
' infile
This should work with your awk version (based on this comp.unix.shell post; it also removes the embedded commas):
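awk '{
  n = parse_csv($0, data)
  for (i = 0; ++i <= n;) {
    # drop the commas embedded in a field
    gsub(/,/, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
  }
}
function parse_csv(str, array,   field, i) {
  split( "", array )          # clear the array
  str = str ","               # make every field comma-terminated
  while ( match(str, /[ ]*("[^"]*(""[^"]*)*"|[^,]*)[ ]*,/) ) {
    field = substr(str, 1, RLENGTH)
    # strip surrounding blanks and quotes and the trailing comma
    gsub(/^[ ]*"?|"?[ ]*,$/, "", field)
    # a doubled quote inside a field stands for one literal quote
    gsub(/""/, "\"", field)
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile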
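If you would rather stay inside the bash loop from the question, the quoting rule can also be applied by hand, one character at a time. The following is only a minimal sketch, not part of the original answer: the names parse_csv_line, format_fields and CSV_FIELDS are hypothetical, and it assumes that every record sits on a single line, that fields are quoted with double quotes, and that a doubled quote inside a quoted field stands for one literal quote.

```bash
#!/usr/bin/env bash

# parse_csv_line (hypothetical helper): split one CSV line into the global
# array CSV_FIELDS. Commas inside double quotes do not split the field.
parse_csv_line() {
    local line=$1 field="" in_quotes=0 i ch next
    CSV_FIELDS=()
    for (( i = 0; i < ${#line}; i++ )); do
        ch=${line:i:1}
        next=${line:i+1:1}
        if (( in_quotes )); then
            if [[ $ch == '"' && $next == '"' ]]; then
                field+='"'          # "" inside quotes is one literal quote
                i=$(( i + 1 ))      # skip the second quote of the pair
            elif [[ $ch == '"' ]]; then
                in_quotes=0         # closing quote
            else
                field+=$ch
            fi
        elif [[ $ch == '"' ]]; then
            in_quotes=1             # opening quote
        elif [[ $ch == ',' ]]; then
            CSV_FIELDS+=("$field")  # unquoted comma ends the field
            field=""
        else
            field+=$ch
        fi
    done
    CSV_FIELDS+=("$field")          # the last field has no trailing comma
}

# format_fields (hypothetical helper): print the arguments as "[a] [b] ...",
# dropping any commas embedded in a field, as the question asks.
format_fields() {
    local out="" f
    for f in "$@"; do
        out+="[${f//,/}] "
    done
    printf '%s\n' "${out% }"
}

while IFS= read -r line; do
    parse_csv_line "$line"
    format_fields "${CSV_FIELDS[@]}"
done <<'EOF'
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
EOF
```

To process a real file, replace the here-document with a redirection such as done < "$input". A character loop in bash is slow, so for large files the awk or Perl versions are likely the better choice.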