Problem description
How do I extract and validate XML files using awk and xmllint in a pipeline?
Awk program that only extracts the files:
extractxml
#!/usr/bin/awk -f
/<?xml version/{ getline doctype; getline datadoc;
if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next;
}}{ print > fn }
The concatenated XML input file:
refcase.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa1234aa-20170101.XML">
<document-metatdata lang="EN" country="INTL">
<document-reference/>
</document-metatdata>
</data-document>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa2345bb-20170202.XML">
<document-metatdata lang="EN" country="LOCAL">
<document-reference/>
</document-metatdata>
</data-document>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa3456cc-20170303.XML">
<document-metatdata lang="EN" country="NA">
<document-reference/>
</document-metatdata>
</data-document>
Validation command:
xmllint --debug --dtdvalid refcase.dtd aa1234bb.xml
DTD file used by xmllint to validate the XML files:
refcase.dtd
<?xml encoding="UTF-8"?>
<!ELEMENT data-document (document-metatdata)>
<!ATTLIST data-document
xmlns CDATA #FIXED ''
date-published CDATA #REQUIRED
dtd-version CDATA #REQUIRED
file NMTOKEN #REQUIRED>
<!ELEMENT document-metatdata (document-reference)>
<!ATTLIST document-metatdata
xmlns CDATA #FIXED ''
country NMTOKEN #REQUIRED
lang NMTOKEN #REQUIRED>
<!ELEMENT document-reference EMPTY>
<!ATTLIST document-reference
xmlns CDATA #FIXED ''>
When I add this code to the awk program:
{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")
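- The awk extraction still works and creates the .xml files as before.
- The awk output is now passed to the xmllint command for XML validation, and the input to the xmllint command appears to be faulty.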
Awk program that extracts the files and sends the output to the xmllint command:
#!/usr/bin/awk -f
/<?xml version/{ getline doctype; getline datadoc;
if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next;
}}{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")
Problem output from the xmllint command when invoked from awk:
aa1234aa.xml
aa1234aa.xml:5: parser error : Premature end of data in tag document-metatdata line 4
aa1234aa.xml:5: parser error : Premature end of data in tag data-document line 3
<document-metatdata lang="EN" country="INTL">
aa1234aa.xml:6: parser error : Premature end of data in tag document-metatdata line 4
aa1234aa.xml:6: parser error : Premature end of data in tag data-document line 3
<document-reference/>
aa1234aa.xml:7: parser error : Premature end of data in tag data-document line 3
The parser errors do not occur when the command is run in the shell; they only occur when it is run from the awk program, which suggests to me that the extracted XML files are okay.
This is an extraction process for thousands of concatenated txt files, each containing thousands of XML files. I need to trace and audit every step and validate the outputs.
Expected output of the extracted XML files:
aa1234aa.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa1234aa-20170101.XML">
<document-metatdata lang="EN" country="INTL">
<document-reference/>
</document-metatdata>
</data-document>
aa2345bb.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa2345bb-20170202.XML">
<document-metatdata lang="EN" country="LOCAL">
<document-reference/>
</document-metatdata>
</data-document>
aa3456cc.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa3456cc-20170303.XML">
<document-metatdata lang="EN" country="NA">
<document-reference/>
</document-metatdata>
</data-document>
Question:
I would like awk to write the output to a file and then pass that output to a command for further processing.
I'm not sure whether awk is the best tool for the extraction, but it has worked well so far across the test data. I need to log the process and validate the output.
I'd appreciate any other approaches that would be reliable and scalable.
Recommended answer
The command you posted is:
/<?xml version/{ getline doctype; getline datadoc;
if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next;
}}{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")
Step 1 is to fix it to use sensible formatting so we can see the control flow:
/<?xml version/{
    getline doctype
    getline datadoc;
    if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
        fn=a[1]".xml"
        print $0 ORS doctype ORS datadoc > fn
        print a[1]".xml"
        next
    }
}
{ print > fn }
system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")
OK, so now at a glance we can see that the system() call sits where a condition (pattern) goes instead of inside an action, the script isn't closing output files as it goes, it isn't quoting the file names passed to xmllint, and it hard-codes a[1]".xml" in multiple places, so let's fix those:
/<?xml version/{
    getline doctype
    getline datadoc
    if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
        close(fn)
        fn=a[1]".xml"
        print $0 ORS doctype ORS datadoc > fn
        print fn
        next
    }
}
{
    print > fn
    system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
}
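To see why the placement of system() matters, here is a minimal sketch that is not part of the original post. It uses the standard `true` and `false` utilities as stand-ins for xmllint, and `pattern_demo` is a hypothetical name. An expression written outside the braces is treated by awk as the pattern of a separate rule with the default action (print the record), so it is evaluated for every input line and the line is printed whenever the command exits with a non-zero status. That would be consistent with the input lines echoed between the parser errors in the output quoted in the question.

```shell
# Hypothetical demo: system() used as a pattern rather than inside an action.
# $1 is the stand-in command ("true" exits 0, "false" exits non-zero).
pattern_demo() {
    printf 'one\ntwo\n' | awk -v cmd="$1" '
    { seen++ }      # an ordinary action: runs for every record
    system(cmd)     # a pattern with no action: prints the record when the status is non-zero
    '
}

pattern_demo false   # the command "fails", so every input line is echoed
pattern_demo true    # the command "succeeds", so nothing is printed
```

Moving the call inside the braces, as in the script above, stops awk from using the exit status as a print condition.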
Now let's get rid of the fragile and unnecessary calls to getline:
/<?xml version/{
    xmlversion = $0
    cnt = 3
}
cnt==2 {
    doctype = $0
}
cnt==1 {
    datadoc = $0
    if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
        close(fn)
        fn=a[1]".xml"
        print xmlversion ORS doctype ORS datadoc > fn
        print fn
        cnt = 0    # header fully consumed; without this the next line would be skipped
        next
    }
}
cnt { cnt--; next }
{
    print > fn
    system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
}
Now we can see that you're calling xmllint for every line that's output instead of once for every output file that's completed. Change your command to this:
/<?xml version/{
    xmlversion = $0
    cnt = 3
}
cnt==2 {
    doctype = $0
}
cnt==1 {
    if (match($0,/file="([^-]+)-[^"]+.XML"/,a)) {
        lint(fn)
        fn=a[1]".xml"
        print xmlversion ORS doctype ORS $0 > fn
        print fn
        cnt = 0    # header fully consumed; without this the next line would be skipped
        next
    }
}
cnt { cnt--; next }
{ print > fn }
END { lint(fn) }
function lint(fn) {
    if (fn != "") {
        close(fn)
        system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
        fn = ""
    }
}
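The "close, then run the command once per completed file" pattern can be tried in isolation. The following is a minimal sketch under stated assumptions, not the script from the post: `split_and_check` and the `BEGIN name` marker lines are hypothetical, and `grep -c .` (count non-empty lines) stands in for xmllint so the example runs without a DTD. It uses only POSIX awk features.

```shell
# Hypothetical demo: split a stream into files and check each file once,
# only after it has been completely written and closed.
split_and_check() (
    cd "$(mktemp -d)" || exit 1
    printf 'BEGIN a\nx\nBEGIN b\ny\n' | awk '
    /^BEGIN / { check(fn); fn = $2 ".txt" }   # new section: finish the previous file first
    { print > fn }
    END { check(fn) }                         # the last file has no following marker
    function check(f) {
        if (f != "") {
            close(f)                               # make sure the whole file is on disk
            system("grep -c . \047" f "\047")      # stand-in for the xmllint call
        }
    }'
)

split_and_check
```

Each stand-in call reports the line count of a finished file, whereas a call placed in the per-line action would run against a file that is still being written.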
Finally, given what I now know about your expected output, this is how I'd really write your script (it also escapes the regexp metacharacters ? in <?xml and . in .XML, which I hadn't spotted were unescaped earlier):
/<\?xml version/ {
    lint(fn)
    fn = ""
}
match($0,/file="([^-]+)-[^"]+\.XML"/,a) {
    fn = a[1]".xml"
    $0 = prev2 ORS prev1 ORS $0
    print fn
}
{
    if ( fn != "" ) {
        print > fn
    }
    prev2 = prev1
    prev1 = $0
}
END { lint(fn) }
function lint(fn) {
    if (fn != "") {
        close(fn)
        system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
    }
}
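Since the question asks for a traceable, auditable process, it may also help to record each validation's exit status, which awk's system() returns. The sketch below is an assumption-laden illustration rather than part of the original answer: `audit_files` is a hypothetical helper, the validator command is passed in as a string, and it relies on the validator exiting non-zero on failure (xmllint documents its exit codes in its man page).

```shell
# Hypothetical helper: run a validator on each named file, write its report to
# FILE.rpt, and print one tab-separated audit line per file.
# Usage sketch (assumed flags): audit_files 'xmllint --noout --dtdvalid refcase.dtd' *.xml
audit_files() {
    validator=$1
    shift
    printf '%s\n' "$@" | awk -v validator="$validator" '
    {
        # \047 is a single quote, so file names containing spaces survive the shell
        status = system(validator " \047" $0 "\047 > \047" $0 ".rpt\047 2>&1")
        print $0 "\t" (status == 0 ? "PASS" : "FAIL")
    }'
}
```

The same status check could be folded into the lint() function above, appending the result to a log file instead of printing it.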