html - Convert a tab-delimited text file into an HTML/PDF/LaTeX/knitr report

This is the tab-delimited file:

Chr Start   End Ref Alt Func.refGene    Gene.refGene    GeneDetail.refGene  ExonicFunc.refGene  AAChange.refGene    snp138  clinvar_20140929    SIFT_score  SIFT_pred   Polyphen2_HDIV_score    Polyphen2_HDIV_pred Polyphen2_HVAR_score    Polyphen2_HVAR_pred LRT_score   LRT_pred    MutationTaster_score    MutationTaster_pred MutationAssessor_score  MutationAssessor_pred   FATHMM_score    FATHMM_pred RadialSVM_score RadialSVM_pred  LR_score    LR_pred VEST3_score CADD_raw    CADD_phred  GERP++_RS   phyloP46way_placental   phyloP100way_vertebrate SiPhy_29way_logOdds
chr13   52523808    52523808    C   T   exonic  ATP7B       nonsynonymous SNV   ATP7B:NM_000053:exon12:c.2855G>A:p.R952K,ATP7B:NM_001243182:exon13:c.2522G>A:p.R841K    rs732774    CLINSIG=non-pathogenic|non-pathogenic;CLNDBN=Wilson's_disease|not_specified;CLNREVSTAT=single|single;CLNACC=RCV000029357.1|RCV000078044.1;CLNDSDB=GeneReviews:MedGen:OMIM:Orphanet:SNOMED_CT|.;CLNDSDBID=NBK1512:C0019202:277900:ORPHA905:88518009|.    0.99    T   0.04    B   0.03    B   0.000   N   0.000   P   -1.04   N   -3.73   D   -0.965  T   0.000   T   0.214   1.511   11.00   6.06    1.111   2.781   12.356
chr13   52523867    52523867    T   G   exonic  ATP7B       synonymous SNV  ATP7B:NM_000053:exon12:c.2796A>C:p.S932S,ATP7B:NM_001243182:exon13:c.2463A>C:p.S821S

I have a bash script that takes an ABI file as input and uses ANNOVAR to annotate the variants. It produces a tab-delimited text file containing the annotated variants. Every time the script is run on a different ABI file, the number of columns in the tab-delimited file stays the same, but the number of rows and the individual annotations may vary from variant to variant.

Attempts so far-->

I have tried writing a bash script that, for the first variant, extracts the different fields from the tab-delimited text file, saves each one as a text file, and combines the resulting files. An AWK script then assigns each field of the combined text file to a variable. I built the HTML page with AWK and used these variables to print the values inside the corresponding HTML tags. This works fine for a file that follows the same pattern. But when a particular field is missing from an annotated result with a different pattern, the script prints a different field from the one the variable was meant to hold.

If the first variant contains a clinically significant mutation, there will be an annotation in the "clinvar" column, so it needs to be reported in a separate section along with the other details.

The order of the fields in the combined text file is not the same for each variant, so the report generated from it is incorrect.

Expected Result-->

Since the format of the tab-delimited file is not uniform, is there a way to set multiple conditions for each row? For example: if a specific column (e.g. clinvar) has a value, print it between certain HTML tags; if it is empty, check another column (e.g. rsID) and, if that has a value, print it in some other HTML tags; and so on for the other columns.

Variant position: chr13:52523808C>T

Variant Type: Nonsynonymous-SNV

rsID: rs732774

Amino Acid Change: p.R952K

Gene Name: ATP7B

Disease: Wilsons Disease

Result: Non-pathogenic
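The values listed above could be pulled out of a data row by column name rather than by position, which avoids the field-order problem described earlier. The following is only a minimal sketch: `extract_fields`, `get`, and `kv` are hypothetical helper names, and it assumes the first line of the file is the header, that the first transcript in AAChange.refGene is the one to report, and that the first `|`-separated ClinVar value is the relevant one. It prints the raw annotation values without reformatting them.

```sh
#!/bin/sh
# Sketch only: extract_fields, get and kv are hypothetical names.
extract_fields() {
    awk -F'\t' '
    # Return the value of the named column, or "" if the header lacks it.
    function get(name) { return (name in col) ? $(col[name]) : "" }

    # Return the first "|"-separated value of KEY in a "K=V;K=V" string.
    function kv(str, key,    parts, n, i, vals) {
        n = split(str, parts, ";")
        for (i = 1; i <= n; i++) {
            if (index(parts[i], key "=") == 1) {
                split(substr(parts[i], length(key) + 2), vals, "[|]")
                return vals[1]
            }
        }
        return ""
    }

    # Header row: map each column name to its column number.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }

    $0 != "" {
        print "Variant position: " get("Chr") ":" get("Start") get("Ref") ">" get("Alt")
        print "Variant Type: " get("ExonicFunc.refGene")
        print "rsID: " get("snp138")
        # Assumption: report the p.XnnnY part of the first listed transcript.
        split(get("AAChange.refGene"), tx, ",")
        n = split(tx[1], part, ":")
        aa = ""
        for (i = 1; i <= n; i++) if (part[i] ~ /^p\./) aa = part[i]
        print "Amino Acid Change: " aa
        print "Gene Name: " get("Gene.refGene")
        clin = get("clinvar_20140929")
        print "Disease: " kv(clin, "CLNDBN")
        print "Result: " kv(clin, "CLINSIG")
    }' "$1"
}
```

It would be called as `extract_fields annotated.txt`, where the file name is a placeholder for the ANNOVAR output. Columns that are missing or empty in a row simply come out as empty values.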

The format of the HTML page and the values in it should be something like this:

<html>
<head>
<title></title>
<style type="text/css">
body {background-color:lightgray}
h1   {background-color:SlateGray}
</style>
</head>
<body bgcolor="LightGray">
<table border=1>
<tr><th>Test Code</th><th>Gene Name</th><th>Condition tested</th><th>Result</th></tr>
<tr><td width=750></td><td width=750>ATP7B(RefSeq ID: NM_000053)</td><td width=750>Wilson's_disease</td><td width=750>Non-pathogenic</td></tr>
</table>
<h1>Test Details</h1>
<table border=1>
<tr><th align=center>Genomic Location of Mutation</th><th align=center>Mutation Type</th><th align=center>dbSNP Identifier</th><th align=center>Amino Acid Change</th><th align=center>OMIM Identifier</th></tr>
<tr><td width=750>chr13:52523808C>T</td><td width=750>Nonsynonymous-SNV</td><td width=750>rs732774</td><td width=750>p.R952K</td><td width=750>http://www.omim.org/entry/277900</td></tr>
</table>
<h1>Significant Findings</h1>
<p> The identified variant is located in the <strong> exonic </strong> region of the <strong> chr13 </strong> chromosome and is a <strong> Nonsynonymous-SNV </strong> which causes an amino acid change from <strong> Arginine </strong> to <strong> Lysine </strong>. The mutation has also been reported in the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/) with an accession number of <strong> rs732774 </strong>. </p>
</body>
</html>

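Similarly, when there is a novel variant whose ExonicFunc.refGene column contains "nonsynonymous" and whose snp138 column has no value, it should print the SIFT_score and other details between HTML tags. These are just some of the conditions required, but it would be really helpful if someone could give an idea of how to go about all this!
Thank you for reading such a long post; any help with this problem would be greatly appreciated.

Best answer

The awk program I show here splits the header into its column names and every following row into its values. I think you can modify it to fit your needs. Keep in mind that all your tricky rules (when this is not present, show that instead) are better implemented by yourself than by asking for an implementation.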

#
# processor.awk  (requires gawk, because it uses gensub)
#
# Usage: gawk -f processor.awk your_file.txt
#

BEGIN   {
        IGNORECASE = 1;
        header = "";
        html_template = "<tr><td>##fieldname</td><td>##fieldvalue</td></tr>"
        }
        {
        if( header == "" && $0 != "" )
        {   # the first non-empty line is the header
            header = $0;
            # put every element of the header into an array
            # and remember how many columns there are
            nfields = split( header, fields, "\t" );
            # for debug: print the fields found
            #for( elem = 1; elem <= nfields; elem++ )
            #   print "field" elem ": " fields[elem];
        } # if
        else if( $0 != "" )
        {
            # normal (non-empty) lines
            # split the line into the elements
            split( $0, content, "\t" );
            # for every column named in the header....
            for( elem = 1; elem <= nfields; elem++ )
            {
                out_line = html_template;
                # note: "&" and "\" are special in a gensub replacement,
                # so values containing them would need escaping first
                out_line = gensub( /##fieldname/, fields[elem], "g", out_line );
                out_line = gensub( /##fieldvalue/, content[elem], "g", out_line );
                # print the result
                print out_line;
            } # for
        } # if
        }

Regarding "html - Convert a tab-delimited text file into an HTML/PDF/LaTeX/knitr report", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30950426/