Problem description
I have a CSV file that I would like to parse into nested JSON using jq. I started using jq recently and really like the tool. I understand the basic functionality, but parsing a CSV file seems a little difficult, especially when it comes to printing nested objects.
Gene, Exon,Total,Exon Bases, Total Bases, Fraction of Exon bases
PIK3CA,PIK3CA_Exon10;chr1;1000;1500,PIK3CA_Exon13;chr1;1000;1500,PIK3CA_Exon14;chr1;1000;1500,1927879,12993042,0.15
NRAS,NRAS_Exon4;chr1;1000;1500,NRAS_Amp_369;chr1;1000;1500,NRAS_Amp_371;chr1;1000;1500,NRAS_Amp_374;chr1;1000;1500,NRAS_Amp_379;chr1;1000;1500,884111,8062107,0.11
The first column always holds a single value. It is followed by one or more exon entries: the second row has 3 and the third row has 5. Exon bases is always the third-to-last column, Total bases is the second-to-last, and Fraction of exon bases is the last.
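To make the column layout concrete, here is a minimal sketch in Python (outside jq, purely as an illustration). The helper name `normalize` is hypothetical, and the sketch assumes the header row has been removed and that no field contains an embedded comma.

```python
def normalize(line):
    """Split one data row into [gene, [exon fields...], exon_bases, total_bases, fraction].

    Assumes the row has one gene column, at least one exon column,
    and exactly three trailing numeric columns.
    """
    cols = line.strip().split(",")
    # First column is the gene, the last three are the numeric columns,
    # and everything in between is an exon field.
    return [cols[0], cols[1:-3]] + cols[-3:]


row = normalize(
    "PIK3CA,PIK3CA_Exon10;chr1;1000;1500,PIK3CA_Exon13;chr1;1000;1500,"
    "PIK3CA_Exon14;chr1;1000;1500,1927879,12993042,0.15"
)
print(row[0], len(row[1]), row[2:])
```

All values stay strings at this stage; converting them to numbers is a separate step.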
I added the header row only for explanation; it can be removed or modified before processing. The desired output is:
{
"Exome regions":[
{
"metric":"PIK3CA",
"value":[
{
"metric":"Exons",
"value":[
"PIK3CA_Exon10",
{
"chromosome":"chr1",
"start":1000,
"end":1500
},
"PIK3CA_Exon13",
{
"chromosome":"chr1",
"start":1000,
"end":1500
},
"PIK3CA_Exon14",
{
"chromosome":"chr1",
"start":1000,
"end":1500
}
],
"type":"set"
},
{
"metric":"Fraction of bases",
"value":0.15,
"type":"simple"
},
{
"metric":"Total_bases",
"value":1927879,
"type":"simple"
}
],
"type":"set"
},
{
"metric":"NRAS",
"value":[
{
"metric":"Exons",
"value":[
"NRAS_Exon4",
{
"chromosome":"chr1",
"start":1000,
"end":1500
},
"NRAS_Amp_369",
{
"chromosome":"chr1",
"start":1000,
"end":1500
},
"NRAS_Amp_371",
{
"chromosome":"chr1",
"start":1000,
"end":1500
},
"NRAS_Amp_374",
{
"chromosome":"chr1",
"start":1000,
"end":1500
},
"NRAS_Amp_379",
{
"chromosome":"chr1",
"start":1000,
"end":1500
}
],
"type":"set"
},
{
"metric":"Fraction of bases",
"value":0.11,
"type":"simple"
},
{
"metric":"Total_bases",
"value":884111,
"type":"simple"
}
],
"type":"set"
}
]
}
Thanks for your help!
PS: I need to add more information. I have to edit the Exon fields and add "chromosome", "start" and "end" to each exon. Here I have given every exon the same start and end, but in the actual data they vary per exon. Can you please help me with this? Also, the parts of each exon entry could be separated by some other character; right now I separate them with ";".
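As an illustration of the exon layout described in the PS, the following Python sketch splits one exon field into its name and a location object. The function name `parse_exon` and the `sep` parameter are hypothetical; the sketch assumes each field has exactly four parts and that start and end are integers.

```python
def parse_exon(field, sep=";"):
    """Split 'name<sep>chromosome<sep>start<sep>end' into (name, location dict)."""
    name, chromosome, start, end = [part.strip() for part in field.split(sep)]
    return name, {"chromosome": chromosome, "start": int(start), "end": int(end)}


name, location = parse_exon("PIK3CA_Exon10;chr1;1000;1500")
print(name, location)

# The separator is a parameter, so another delimiter can be used instead of ";".
print(parse_exon("NRAS_Exon4|chr1|1000|1500", sep="|"))
```

Stripping each part also tolerates entries written with spaces after the separator, such as "PIK3CA_Exon10; chr1; 1000; 1500".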
Recommended answer
Here is a solution that uses functions to parse the input and assemble the output:
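# Assumes the header row has been removed from the input.
def parse:
  [
    inputs                          # read lines (requires -R and -n)
    | split(",")                    # split into columns
    | select(length>0)              # eliminate blank lines
    | .[:1] + [.[1:-3]] + .[-3:]    # normalize to [gene, [exons], c1, c2, c3]
  ]
;

# Build a numeric leaf metric and a container metric.
def simple(n;v): {metric:n, value:(v|tonumber), type:"simple"};
def set(n;v):    {metric:n, value:v, type:"set"};

# After normalization: .[0] gene, .[1] exons, .[2] Exon Bases,
# .[3] Total Bases, .[4] Fraction of Exon bases.
def region:
  set(.[0]; [
      set("Exons"; .[1]),
      simple("Fraction of bases"; .[2]),   # .[2] is Exon Bases; the fraction column is .[4]
      simple("Total_bases"; .[3])
    ]
  )
;

{
  "Exome regions": parse | map(region)
}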
Sample run (assumes the filter is in filter.jq and the data is in data.json):
$ jq -M -Rnr -f filter.jq data.json
{
"Exome regions": [
{
"metric": "PIK3CA",
"value": [
{
"metric": "Exons",
"value": [
"PIK3CA_Exon10",
"PIK3CA_Exon13",
"PIK3CA_Exon14"
],
"type": "set"
},
{
"metric": "Fraction of bases",
"value": 1927879,
"type": "simple"
},
{
"metric": "Total_bases",
"value": 12993042,
"type": "simple"
}
],
"type": "set"
},
{
"metric": "NRAS",
"value": [
{
"metric": "Exons",
"value": [
"NRAS_Exon4",
"NRAS_Amp_369",
"NRAS_Amp_371",
"NRAS_Amp_374",
"NRAS_Amp_379"
],
"type": "set"
},
{
"metric": "Fraction of bases",
"value": 884111,
"type": "simple"
},
{
"metric": "Total_bases",
"value": 8062107,
"type": "simple"
}
],
"type": "set"
}
]
}
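Note that the sample run reports 1927879 under "Fraction of bases", whereas the expected output in the question shows 0.15 there and 1927879 under "Total_bases". The sketch below, written in Python only as an illustration, shows the column choice that would reproduce the question's values. It assumes the question's expected output is the intended mapping, which the answer does not confirm; the names `simple` and `region_metrics` are hypothetical.

```python
def simple(name, value):
    """Build a leaf metric object in the shape used by the question."""
    return {"metric": name, "value": value, "type": "simple"}


def region_metrics(row):
    """row is [gene, [exons...], exon_bases, total_bases, fraction], all strings."""
    return [
        simple("Fraction of bases", float(row[4])),  # last column
        simple("Total_bases", int(row[2])),          # third-to-last column, as in the question
    ]


row = ["PIK3CA", ["PIK3CA_Exon10", "PIK3CA_Exon13", "PIK3CA_Exon14"],
       "1927879", "12993042", "0.15"]
for metric in region_metrics(row):
    print(metric)
```

In the jq filter this would correspond to passing .[4] and .[2] to simple instead of .[2] and .[3].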
Here is a solution to the revised problem:
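# Assumes the header row has been removed from the input.
def parse:
  [
    inputs                          # read lines (requires -R and -n)
    | split(",")                    # split into columns
    | select(length>0)              # eliminate blank lines
    | .[:1] + [.[1:-3]] + .[-3:]    # normalize to [gene, [exons], c1, c2, c3]
  ]
;

def simple(n;v): {metric:n, value:(v|tonumber), type:"simple"};
def set(n;v):    {metric:n, value:v, type:"set"};

# Split each exon field on ";" and emit its name followed by a location object.
# start and end are converted to numbers to match the expected output in the question.
def exons(v):
  [ v[] | split(";") | .[0], {"chromosome":.[1], "start":(.[2]|tonumber), "end":(.[3]|tonumber)} ];

def region:
  set(.[0]; [
      set("Exons"; exons(.[1])),
      simple("Fraction of bases"; .[2]),   # .[2] is Exon Bases; the fraction column is .[4]
      simple("Total_bases"; .[3])
    ]
  )
;

{ "Exome regions": parse | map(region) }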