Problem description
I have a really weird file format here, which uses any number of tabs and spaces to separate fields (including leading and trailing ones). Another peculiarity is that fields can contain spaces, in which case they are quoted and escaped CSV-style.
An example:
0 "some string" 234 23947 123 ""some escaped"string""
I am trying to parse such columns with awk, and I need every item in an array, e.g.:
foo[0] -> 0
foo[1] -> "some string"
foo[2] -> 234
foo[3] -> 23947
foo[4] -> 123
foo[5] -> ""some escaped"string""
Is this even possible? I read http://web.archive.org/web/20120531065332/http://backreference.org/2010/04/17/csv-parsing-with-awk/, which says that parsing CSV is already very hard. (To begin with, it would be enough to parse normal quoted strings with spaces; the escaped variant is very rare.)
Before I spend a long time on this: is there any way to do this in awk, or would I be better off using some other language?
Recommended answer
With GNU awk for FPAT:
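$ cat tst.awk
BEGIN { FPAT="\\S+|\"[^\"]+\"|,[^,]+," }
{
    # Protect literal "@" and "," so that "," is free to stand in for "".
    gsub(/@/,"@A")
    gsub(/,/,"@B")
    gsub(/""/,",")
    for (i=1; i<=NF; i++) {
        # Undo the substitutions in reverse order for each field.
        gsub(/,/,"\"\"",$i)
        gsub(/@B/,",",$i)
        gsub(/@A/,"@",$i)
        print i, $i
    }
}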
$ awk -f tst.awk file
1 0
2 "some string"
3 234
4 23947
5 123
6 ""some escaped"string""