Get blocks from an XML file using data from a source file

Problem description

I revamped this question since I've been reading a bit about XML.

I have a source file that contains a list of AuthNumbers:

    111222
    111333
    111444
    etc.

I need to search for the numbers in that list and find them in a corresponding XML file. In the XML file, the line is formatted like this:

    <trpcAuthCode>111222</trpcAuthCode>

This can be achieved quite painlessly using grep; however, I require the entire block containing the transaction. The block starts with

    <trans type="network sale" recalled="false">

or

    <trans type="network sale" recalled="false" rollback="true">

and/or some other variations. Actually, <trans*> would be best if something like that is possible. The block ends with </trans>.

It doesn't need to be elegant or efficient. I just need it to work.
I suspect some transactions are dropping out, and I need a quick way to vet the ones that are not being processed.

If it helps, here is a link to the original (sterilized) XML:
https://www.dropbox.com/s/cftn23tnz8uc9t8/main.xml?dl=0

And what I would like to extract:
https://www.dropbox.com/s/b2bl053nom4brkk/transaction_results.xml?dl=0

The size of each result will vary, as each transaction can vary greatly in length depending on the number of products purchased. In the results XML you can see that I extracted the XML I need based on the trpcAuthCode list 111222, 111333, 111444.

Solution

Concerning XML and awk questions, you often find comments from the gurus (the ones with a k in their reputation) that XML processing in awk is complicated or not sufficient. As I understood the question, the script is needed for personal and/or debugging purposes. For this, my solution should be sufficient but, please, keep in mind that it will not work on every legal XML file.

Based on your description, the sketch of the script is:

1. If <trans*> is matched, start recording.
2. If <trpcAuthCode> is found, get its contents and compare them with the list. In case of a match, remember the block for output.
3. If </trans> is matched, stop recording. If output has been enabled, print the recorded block; otherwise discard it.

Because I did something similar in SO: Shell scripting - split xml into multiple files, this should not be too hard to implement.

One additional feature is necessary, though: feeding the AuthNumbers array into the script. By a surprising coincidence, I learnt the answer just this morning in SO: How to access an array in an awk, which is declared in a different awk in shell? (thanks to the comment of jas).

So, putting it all together in a script filter-xml-trpcAuthCode.awk:

    BEGIN {
      record = 0   # state for recording
      buffer = ""  # buffer for recording
      found = 0    # state for found auth code
      # build temp. array from authCodes which has to be pre-defined
      split(authCodes, list, "\n")
      # build final array where values become keys
      for (i in list) authCodeList[list[i]]
      # for debugging: output of authCodeList
      print "<!-- authCodeList:"
      for (authCode in authCodeList) {
        print authCode
      }
      print "-->"
    }

    /<trans( [^>]*)?>/ {
      record = 1   # start recording
      buffer = ""  # clear buffer
      found = 0    # reset state for found auth code
    }

    record {
      buffer = buffer"\n"$0 # record line (if recording is enabled)
    }

    record && /<trpcAuthCode>/ {
      # extract auth code (gensub() is a GNU awk extension)
      authCode = gensub(/^.*>([^<]*)<\/trpcAuthCode.*$/, "\\1", "g")
      # check whether auth code is in authCodeList
      found = authCode in authCodeList
    }

    /<\/trans>/ {
      record = 0 # stop recording
      # print buffer if auth code has been found
      if (found) {
        print buffer
      }
    }

Notes:

I struggled initially when applying split() to authCodes in BEGIN. This makes an array where the split values are stored under enumerated keys. Thus, I looked for a solution to make the values themselves keys of the array. (Otherwise, the in operator cannot be used for the search.) I found an elegant solution in the accepted answer of SO: Check if array contains value.

I implemented the proposed pattern <trans*> as /<trans( [^>]*)?>/, which will even match <trans> (although <trans> seems never to occur without attributes) but not <transSet>.

The statement

    buffer = buffer"\n"$0

appends the current line to the previous contents. $0 contains the line without the newline character, so the newline has to be re-inserted. The way I did it, the buffer starts with a newline but the last line ends without one. Considering that print buffer adds a newline at the end of the text, this is fine for me. Alternatively, the above snippet could be replaced by

    buffer = buffer $0 "\n"

or even

    buffer = (buffer != "" ? buffer"\n" : "") $0

(It's a matter of taste.)

The filtered file is simply printed to the standard output channel. It might be redirected to a file.
Considering this, I formatted the additional/debug output as an XML comment.

If you are a little bit familiar with awk, you may notice that there isn't any next statement in my script. This is intentional. In other words, the order of the rules is chosen so that a line may be processed/affected consecutively by all rules. (I tested an extreme case:

    <trans><trpcAuthCode>111222</trpcAuthCode></trans>

and even this is processed correctly.)

To simplify testing, I added a wrapper bash script filter-xml-trpcAuthCode.sh:

    #!/usr/bin/bash

    # uncomment next line for debugging
    #set -x

    # check command line arguments
    if [[ $# -ne 2 ]]; then
      echo "ERROR: Illegal number of command line arguments!"
      echo ""
      echo "Usage:"
      echo "$(basename "$0") XML_FILE AUTH_CODES"
      exit 1
    fi

    # call awk script
    awk -v authCodes="$(cat <"$2")" -f filter-xml-trpcAuthCode.awk "$1"

I tested the scripts (with bash in cygwin on Windows 10) against your sample file main.xml and got four matching blocks. I was a little bit concerned about the output because your sample output transaction_results.xml contains only three matching blocks. But checking my output visually, it seems to be appropriate.
(All four hits contained a matching <trpcAuthCode> element.)

I reduced your sample input a little bit for demonstration, sample.xml:

    <?xml version="1.0"?>
    <transSet periodID="1" periodname="Shift" longId="2017-04-27" shortId="052" site="12345">
      <trans type="periodClose">
        <trHeader>
        </trHeader>
      </trans>
      <printCashier>
        <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
      </printCashier>
      <trans type="printCashier">
        <trHeader>
          <cashier sysid="7" empNum="07" posNum="101" period="11">A.Dude</cashier>
          <posNum>101</posNum>
        </trHeader>
      </trans>
      <trans type="journal">
        <trHeader>
        </trHeader>
      </trans>
      <trans type="network sale" recalled="false">
        <trHeader>
          <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
        </trHeader>
        <trPaylines>
          <trPayline type="sale" sysid="1" locale="DOLLAR">
            <trpCardInfo>
              <trpcAccount>1234567890123456</trpcAccount>
              <trpcAuthCode>532524</trpcAuthCode>
            </trpCardInfo>
          </trPayline>
        </trPaylines>
      </trans>
      <trans type="network sale" recalled="false">
        <trHeader>
          <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
        </trHeader>
        <trPaylines>
          <trPayline type="sale" sysid="1" locale="DOLLAR">
            <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
            <trpAmt>61.77</trpAmt>
            <trpCardInfo>
              <trpcAccount>2345678901234567</trpcAccount>
              <trpcAuthCode>111222</trpcAuthCode>
            </trpCardInfo>
          </trPayline>
        </trPaylines>
      </trans>
      <trans type="periodClose">
        <trHeader>
          <date>2017-04-27T23:50:17-04:00</date>
        </trHeader>
      </trans>
      <endTotals>
        <insideSales>445938.63</insideSales>
      </endTotals>
    </transSet>

For the other sample input, I simply copied the text into a file authCodes.txt:

    111222
    111333
    111444

Using both input files in a sample session:

    $ ./filter-xml-trpcAuthCode.sh
    ERROR: Illegal number of command line arguments!

    Usage:
    filter-xml-trpcAuthCode.sh XML_FILE AUTH_CODES

    $ ./filter-xml-trpcAuthCode.sh sample.xml authCodes.txt
    <!-- authCodeList:
    111222
    111333
    111444
    -->

      <trans type="network sale" recalled="false">
        <trHeader>
          <termMsgSN type="FINANCIAL" term="908">31054</termMsgSN>
        </trHeader>
        <trPaylines>
          <trPayline type="sale" sysid="1" locale="DOLLAR">
            <trpPaycode mop="3" cat="1" nacstendercode="generic" nacstendersubcode="generic">CREDIT</trpPaycode>
            <trpAmt>61.77</trpAmt>
            <trpCardInfo>
              <trpcAccount>2345678901234567</trpcAccount>
              <trpcAuthCode>111222</trpcAuthCode>
            </trpCardInfo>
          </trPayline>
        </trPaylines>
      </trans>

    $ ./filter-xml-trpcAuthCode.sh main.xml authCodes.txt >output.txt

    $

The last command redirects the output to a file output.txt, which may be inspected or processed afterwards.
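The filter in the answer uses gensub(), which is a GNU awk extension, and it hands the whole code list to awk as one multi-line -v value, which not every awk implementation is guaranteed to accept. The following is a minimal sketch of the same record-and-print idea using only POSIX awk features. It is a variation and not the answer's script: the function name filter_trans and the variable codesFile are made up for illustration, the code list is read with getline instead of split(), the debug comment is omitted, and found is deliberately made sticky so that a block with several <trpcAuthCode> lines is kept if any one of them matches (an assumption about what is wanted).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original answer.
# Usage: filter_trans XML_FILE AUTH_CODES
filter_trans() {
  awk -v codesFile="$2" '
    BEGIN {
      # read one auth code per line; the codes become the array keys
      while ((getline line < codesFile) > 0) {
        gsub(/[ \t\r]/, "", line)          # tolerate blanks and CR line ends
        if (line != "") authCodeList[line] = 1
      }
      close(codesFile)
    }
    # block start: matches <trans> and <trans ...> but not <transSet ...>
    /<trans( [^>]*)?>/ { record = 1; buffer = ""; found = 0 }
    # while recording, collect every line including the start and end line
    record { buffer = buffer $0 "\n" }
    record && /<trpcAuthCode>/ {
      authCode = $0
      sub(/^.*<trpcAuthCode>/, "", authCode)     # drop text up to the opening tag
      sub(/<\/trpcAuthCode>.*$/, "", authCode)   # drop the closing tag and the rest
      if (authCode in authCodeList) found = 1    # sticky: a later non-match cannot reset it
    }
    # block end: print the recorded block only if a listed code was seen
    /<\/trans>/ {
      if (record && found) printf "%s", buffer
      record = 0
    }
  ' "$1"
}
```

After sourcing the file that defines the function, it can be called in the same way as the wrapper script, for example filter_trans sample.xml authCodes.txt >output.txt. As in the answer, no next statement is used, so a single line can start a block, carry the auth code, and end the block.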