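I have a text file generated from HTML code that I need to manipulate to improve readability. My problem is that each file name produces two lines that are not unique, and I need to tell them apart.
EDIT
For those who asked, here is the input: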

<body>
<tbody>
<tr><td><b>Test Suite</b></td></tr>
<tr><td><a href="HAPPY/3_step_minimal_foundation_no_prefill_HAPPY">3_step_minimal_foundation_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="HAPPY/fullform_no_prefill_HAPPY">fullform_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="HAPPY/fullform_mobile_foundation_no_prefill_HAPPY">fullform_mobile_foundation_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="SAD/3_step_minimal_foundation_SAD">3_step_minimal_foundation_SAD</a></td></tr>
<tr><td><a href="SAD/fullform_SAD">fullform_SAD</a></td></tr>
<tr><td><a href="SAD/fullform_mobile_foundation_SAD">fullform_mobile_foundation_SAD</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/3_step_minimal_foundation_HAPPY_PLUS_OPTIONS">3_step_minimal_foundation_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/fullform_HAPPY_PLUS_OPTIONS">fullform_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/fullform_mobile_foundation_HAPPY_PLUS_OPTIONS">fullform_mobile_foundation_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/3_step_minimal_foundation_SAD_PLUS_OPTIONS">3_step_minimal_foundation_SAD_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_SAD_PLUS_OPTIONS">fullform_SAD_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_mobile_foundation_SAD_PLUS_OPTIONS">fullform_mobile_foundation_SAD_PLUS_OPTIONS</a></td></tr>
</tbody></table>
</body>

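The duplicated lines look like this:

3_step_minimal_foundation_no_prefill_HAPPY
3_step_minimal_foundation_no_prefill_HAPPY

For example, these need to become:

3_step_minimal_foundation_no_prefill
3_step_minimal_foundation_no_prefill_HAPPY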
Here is the code that got me to the current state of the text file, followed by its output:
$ sed -n '/ref/p' EVERYTHING | awk -F'[/"<> ]+' '{sub("", "", $6); print $6, $7, $8}' | tr -s '[[:space:]]' '\n' | awk -v n=3 '1; NR % n == 0 {print ""}' | sed '/^HAPPY/s/^/Flow Type\: /' | sed '/^SAD/s/^/Flow Type\: /' | sed '$d'

Flow Type: HAPPY
3_step_minimal_foundation_no_prefill_HAPPY
3_step_minimal_foundation_no_prefill_HAPPY

Flow Type: HAPPY
fullform_no_prefill_HAPPY
fullform_no_prefill_HAPPY

Flow Type: HAPPY
fullform_mobile_foundation_no_prefill_HAPPY
fullform_mobile_foundation_no_prefill_HAPPY

Flow Type: SAD
3_step_minimal_foundation_SAD
3_step_minimal_foundation_SAD

Flow Type: SAD
fullform_SAD
fullform_SAD

Flow Type: SAD
fullform_mobile_foundation_SAD
fullform_mobile_foundation_SAD

Flow Type: HAPPY_PLUS_OPTIONS
3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
3_step_minimal_foundation_HAPPY_PLUS_OPTIONS

Flow Type: HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS

My desired output:
Flow Type: HAPPY
Flow Name: 3_step_minimal_foundation_no_prefill
File Name: 3_step_minimal_foundation_no_prefill_HAPPY

Flow Type: HAPPY
Flow Name: fullform_no_prefill
File Name: fullform_no_prefill_HAPPY

Flow Type: HAPPY
Flow Name: fullform_mobile_foundation_no_prefill
File Name: fullform_mobile_foundation_no_prefill_HAPPY

Flow Type: SAD
Flow Name: 3_step_minimal_foundation
File Name: 3_step_minimal_foundation_SAD

Flow Type: SAD
Flow Name: fullform
File Name: fullform_SAD

Flow Type: SAD
Flow Name: fullform_mobile_foundation
File Name: fullform_mobile_foundation_SAD

Flow Type: HAPPY_PLUS_OPTIONS
Flow Name: 3_step_minimal_foundation
File Name: 3_step_minimal_foundation_HAPPY_PLUS_OPTIONS

Flow Type: HAPPY_PLUS_OPTIONS
Flow Name: fullform
File Name: fullform_HAPPY_PLUS_OPTIONS

Is there a way to delete or keep specific text on line number N? Once each line is unique, it will be easy to label each one correctly.
Best

Best answer

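OK, the basic task is simple: on every line that matches the previous line, remove everything from the last underscore to the end of the line. Here are two options, 100% untested.
In awk: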

awk '$0 == last { sub(/_[^_]+$/,""); } { last=$0; } 1' inputfile

In shell:
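# IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line; do
    if [ "$line" = "$last" ]; then
        # Strip the shortest suffix that starts with an underscore.
        line="${line%_*}"
    fi
    printf '%s\n' "$line"
    last="$line"
done < inputfile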
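The `${line%_*}` expansion in the loop above removes only the last underscore-delimited component. The following is a small sketch of how the related expansions differ, using one file name from the question; the variable names `file`, `type`, `short`, `long` and `exact` are just illustrative.

```shell
# Sample values taken from the question's data.
file='fullform_mobile_foundation_HAPPY_PLUS_OPTIONS'
type='HAPPY_PLUS_OPTIONS'

short=${file%_*}        # removes the shortest suffix matching "_*"
long=${file%%_*}        # removes the longest suffix matching "_*"
exact=${file%"_$type"}  # removes only the literal suffix "_$type"

printf '%s\n' "$short" "$long" "$exact"
```

For a multi-word flow type such as HAPPY_PLUS_OPTIONS, only the last form leaves the flow name (`fullform_mobile_foundation`) intact, which is worth keeping in mind when reading the later solutions.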

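But this changes the second of the two lines. For the other formatting you want, it looks like you need to modify the first of the two lines instead. That makes things a little more complicated...
To get from the text you have to the text you want, let's look at this differently and assume that the two duplicate lines always appear right after a line starting with "Flow Type:".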
awk '
  /^Flow Type:/ {
    print;
    getline one; getline two
    if (one == two) {
      sub(/_[^_]+$/,"",one);
      print "Flow Name: " one;
      print "File Name: " two;
    } else {
      print one; print two
    }
    next;
  }

  1
' inputfile
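As a hedged variation on the script above, the sketch below tracks state in a variable instead of calling getline (so end of input needs no special checks beyond the END rule) and strips the flow type read from the "Flow Type:" line rather than the last underscore component. The function name `label_flows` and the variables `want`, `first`, `suffix` are hypothetical, and it assumes the flow type is the third whitespace-separated field of that line.

```shell
# label_flows: hypothetical helper; reads the intermediate text format
# from the files given as arguments, or from stdin when there are none.
label_flows() {
  awk '
    /^Flow Type:/ { print; type = $3; want = 2; next }  # start of a record
    want == 2     { first = $0; want = 1; next }        # first of the pair
    want == 1     {                                     # second of the pair
      if ($0 == first) {
        name = first
        suffix = "_" type
        # Drop "_<type>" only when the line really ends with it.
        if (length(name) > length(suffix) &&
            substr(name, length(name) - length(suffix) + 1) == suffix)
          name = substr(name, 1, length(name) - length(suffix))
        print "Flow Name: " name
        print "File Name: " $0
      } else {
        print first
        print
      }
      want = 0
      next
    }
    { print }                                           # everything else
    END { if (want == 1) print first }                  # unpaired last line
  ' "$@"
}
```

It could be called as `label_flows inputfile`, or placed at the end of the pipeline from the question.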

But we can also work directly on your original HTML.
In sed, pattern recognition is a lot of fun. Here is a one-liner for GNU sed:
sed -r 's|<tr><td><a href="([^/]+)/(([^"]+)_[^_]+)".*|Flow Type: \1\nFlow Name: \3\nFile Name: \2|' input.html

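The only thing that requires GNU sed here is the newline (\n) in the replacement; structurally it is just plain sed. This solution will not work on *BSD or OS X.
EDIT: Based on the comments on potong's answer, a variant that works on OS X is:
<input.html sed -n 's/^.*"\([^"\/]*\)\/\(\([^"]*\)_\1\)".*/Flow Type: \1|Flow Name: \3|File Name: \2|/p' | tr '|' '\n'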

Or, if you prefer ERE to BRE:
<input.html sed -E 's|<tr><td><a href="([^/]+)/(([^"]+)_[^_]+)".*|Flow Type: \1#Flow Name: \3#File Name: \2#|' | tr '#' '\n'

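This works around the limitation that OS X sed cannot insert a newline via \n in the replacement string. Instead, we insert a character that is not otherwise used and convert it to a newline with tr.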
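As a hedged alternative to the placeholder character (not part of the original answer): POSIX sed lets the replacement contain a newline when it is written as a backslash followed by a literal line break, so the tr step can be dropped. The sketch below wraps that in a hypothetical function named `flow_labels` and reuses the back-reference idea from the BRE variant above.

```shell
# flow_labels: hypothetical helper; reads HTML rows from the files given
# as arguments, or from stdin when there are none.
# Each "\" at the end of a line inside the script puts a real newline
# into the replacement text.
flow_labels() {
  sed -n 's|^<tr><td><a href="\([^/"]*\)/\(\([^"]*\)_\1\)".*|Flow Type: \1\
Flow Name: \3\
File Name: \2\
|p' "$@"
}
```

Usage would be `flow_labels input.html`. Because `\1` is matched again inside the file name, multi-word types such as HAPPY_PLUS_OPTIONS are stripped as a whole.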
To achieve the same thing in awk (that is, processing the HTML directly), you could do something like this:
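awk '
  /<tr><td><a/ {

    type=$0; file=$0;
    # Flow type: drop everything up to and including the first quote, then the slash onward.
    sub(/^[^"]+"/,"",type); sub(/\/.*/,"",type);
    # File name: drop everything up to the first slash, then the closing quote onward.
    sub(/^[^\/]+\//,"",file); sub(/".*/,"",file);
    # Flow name: the file name without its last underscore-delimited component.
    name=file; sub(/_[^_]+$/,"",name);

    printf("Flow Type: %s\nFlow Name: %s\nFile Name: %s\n\n", type, name, file);

  }' input.html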

OK, here is my final update. Is this what you are looking for?
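awk '
  /<tr><td><a/ {

    # Flow type: the part of the href before the slash.
    type=$0; sub(/^[^"]+"/,"",type); sub(/\/.*/,"",type);
    # File name: the part of the href after the slash.
    file=$0; sub(/^[^\/]+\//,"",file); sub(/".*/,"",file);

    if ( index(file, type) ) {
        # Keep everything before "_<type>"; awk strings are indexed from 1.
        name=substr(file, 1, index(file, type)-2);
    } else {
        name=file; sub(/_[^_]+$/,"",name);
    }

    printf("Flow Type: %s\nFlow Name: %s\nFile Name: %s\n\n", type, name, file);

  }' input.html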
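For comparison, here is a more compact sketch of the same idea that lets awk split each row on double quotes instead of trimming with sub(). It is only an illustration: the function name `html_to_flows` is hypothetical, and it assumes the href is the first quoted attribute on the row, contains exactly one slash, and that the flow type contains no regex metacharacters.

```shell
# html_to_flows: hypothetical helper; reads HTML rows from the files
# given as arguments, or from stdin when there are none.
html_to_flows() {
  awk -F'"' '
    /<a href=/ {
      split($2, part, "/")           # $2 is the href value: TYPE/FILE
      type = part[1]
      file = part[2]
      name = file
      sub("_" type "$", "", name)    # strip the trailing "_TYPE"
      printf("Flow Type: %s\nFlow Name: %s\nFile Name: %s\n\n", type, name, file)
    }
  ' "$@"
}
```

It would be run as `html_to_flows input.html`; rows without a link, such as the "Test Suite" header, are skipped.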

This question, "bash - How to print strings before N line and delete specific string from N line with AWK or SED", comes from Stack Overflow: https://stackoverflow.com/questions/32203437/
