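I have a text file made up of HTML code that I need to manipulate to make it more readable. My problem is that each file name appears on two lines that are not unique, yet I need to tell them apart:
EDIT
I'll put the input here for those who asked for it: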
<body>
<tbody>
<tr><td><b>Test Suite</b></td></tr>
<tr><td><a href="HAPPY/3_step_minimal_foundation_no_prefill_HAPPY">3_step_minimal_foundation_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="HAPPY/fullform_no_prefill_HAPPY">fullform_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="HAPPY/fullform_mobile_foundation_no_prefill_HAPPY">fullform_mobile_foundation_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="SAD/3_step_minimal_foundation_SAD">3_step_minimal_foundation_SAD</a></td></tr>
<tr><td><a href="SAD/fullform_SAD">fullform_SAD</a></td></tr>
<tr><td><a href="SAD/fullform_mobile_foundation_SAD">fullform_mobile_foundation_SAD</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/3_step_minimal_foundation_HAPPY_PLUS_OPTIONS">3_step_minimal_foundation_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/fullform_HAPPY_PLUS_OPTIONS">fullform_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/fullform_mobile_foundation_HAPPY_PLUS_OPTIONS">fullform_mobile_foundation_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/3_step_minimal_foundation_SAD_PLUS_OPTIONS">3_step_minimal_foundation_SAD_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_SAD_PLUS_OPTIONS">fullform_SAD_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_mobile_foundation_SAD_PLUS_OPTIONS">fullform_mobile_foundation_SAD_PLUS_OPTIONS</a></td></tr>
</tbody></table>
</body>
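3_step_minimal_foundation_no_prefill_HAPPY
and
3_step_minimal_foundation_no_prefill_HAPPY
for example, need to become:
3_step_minimal_foundation_no_prefill
and
3_step_minimal_foundation_no_prefill_HAPPY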
Current state of the text file, preceded by the code that got me this far:
$ sed -n '/ref/p' EVERYTHING | awk -F'[/"<> ]+' '{sub("", "", $6); print $6, $7, $8}' | tr -s '[[:space:]]' '\n' | awk -v n=3 '1; NR % n == 0 {print ""}' | sed '/^HAPPY/s/^/Flow Type\: /' | sed '/^SAD/s/^/Flow Type\: /' | sed '$d'
Flow Type: HAPPY
3_step_minimal_foundation_no_prefill_HAPPY
3_step_minimal_foundation_no_prefill_HAPPY
Flow Type: HAPPY
fullform_no_prefill_HAPPY
fullform_no_prefill_HAPPY
Flow Type: HAPPY
fullform_mobile_foundation_no_prefill_HAPPY
fullform_mobile_foundation_no_prefill_HAPPY
Flow Type: SAD
3_step_minimal_foundation_SAD
3_step_minimal_foundation_SAD
Flow Type: SAD
fullform_SAD
fullform_SAD
Flow Type: SAD
fullform_mobile_foundation_SAD
fullform_mobile_foundation_SAD
Flow Type: HAPPY_PLUS_OPTIONS
3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
Flow Type: HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS
My desired output:
Flow Type: HAPPY
Flow Name: 3_step_minimal_foundation_no_prefill
File Name: 3_step_minimal_foundation_no_prefill_HAPPY
Flow Type: HAPPY
Flow Name: fullform_no_prefill
File Name: fullform_no_prefill_HAPPY
Flow Type: HAPPY
Flow Name: fullform_mobile_foundation_no_prefill
File Name: fullform_mobile_foundation_no_prefill_HAPPY
Flow Type: SAD
Flow Name: 3_step_minimal_foundation
File Name: 3_step_minimal_foundation_SAD
Flow Type: SAD
Flow Name: fullform
File Name: fullform_SAD
Flow Type: SAD
Flow Name: fullform_mobile_foundation
File Name: fullform_mobile_foundation_SAD
Flow Type: HAPPY_PLUS_OPTIONS
Flow Name: 3_step_minimal_foundation
File Name: 3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
Flow Type: HAPPY_PLUS_OPTIONS
Flow Name: fullform
File Name: fullform_HAPPY_PLUS_OPTIONS
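Is there a way to delete/keep specific text on line number N? Once each line is unique, labeling each line correctly will be easy.
-Best
Best answer
OK, for the basic task of stripping everything from the last underscore to the end of the line, on any line that matches the previous line, the process is pretty simple. Here are two options, 100% untested.
In awk: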
awk '$0 == last { sub(/_[^_]+$/,""); } { last=$0; } 1' inputfile
In shell:
while IFS= read -r line; do
  # If this line repeats the previous one, drop the last "_suffix"
  if [ "$line" = "$last" ]; then
    line="${line%_*}"
  fi
  printf '%s\n' "$line"
  last="$line"
done < inputfile
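But this changes the second of the two lines. For the other format you want, it looks like the first of the two lines needs to be modified. That makes things a bit more complicated...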
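A minimal sketch of that more complicated case, assuming (as in the sample above) that duplicates always come in adjacent pairs: hold each line back until the next one has been read, and strip the suffix from the held line when the two match. The helper name `strip_first_dup` is made up for illustration.

```shell
# strip_first_dup: hypothetical helper, reads lines on stdin.
strip_first_dup() {
  awk '
    NR > 1 {
      # prev is printed only once we know whether the current line repeats it
      if ($0 == prev) sub(/_[^_]+$/, "", prev)
      print prev
    }
    { prev = $0 }
    # Flush the last held line (skipped when the input was empty)
    END { if (NR) print prev }
  '
}

printf '%s\n' 'Flow Type: SAD' 'fullform_SAD' 'fullform_SAD' | strip_first_dup
```

For this sample the output is `Flow Type: SAD`, `fullform`, `fullform_SAD`, each on its own line, so it is the first of the pair that loses its suffix.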
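To get from the text you have to the text you want, let's look at this a different way and assume that the two duplicate lines always come right after a line beginning with "Flow Type:".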
awk '
  /^Flow Type:/ {
    print
    getline one; getline two
    if (one == two) {
      sub(/_[^_]+$/, "", one)
      print "Flow Name: " one
      print "File Name: " two
    } else {
      print one; print two
    }
    next
  }
  1
' inputfile
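The desired output also includes `HAPPY_PLUS_OPTIONS`, where the flow name should come out as `fullform`, so removing only the last underscore-delimited piece falls short there. A possible variant, sketched under the same assumption that two duplicate lines follow each `Flow Type:` line, removes exactly `_` plus the type read from that line. `label_flows` is a hypothetical name, and this is a sketch rather than a tested solution.

```shell
# label_flows: hypothetical helper; reads the files given as arguments,
# or stdin when there are none.
label_flows() {
  awk '
    /^Flow Type: / {
      print
      type = substr($0, 12)            # text after "Flow Type: " (11 chars)
      if ((getline one) <= 0) next     # input ended early
      if ((getline two) <= 0) { print one; next }
      suffix = "_" type
      n = length(one) - length(suffix) # length of the name without the suffix
      if (one == two && n > 0 && substr(one, n + 1) == suffix) {
        print "Flow Name: " substr(one, 1, n)
        print "File Name: " two
      } else {
        print one; print two
      }
      next
    }
    1
  ' "$@"
}

printf '%s\n' 'Flow Type: HAPPY_PLUS_OPTIONS' \
  'fullform_HAPPY_PLUS_OPTIONS' 'fullform_HAPPY_PLUS_OPTIONS' | label_flows
```

Pairs that don't end in `_` plus the type are printed unchanged, the same fallback as in the script above.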
But we can also work with your original HTML.
In sed, pattern matching is a lot of fun. Here's a one-liner for GNU sed:
sed -r 's|<tr><td><a href="([^/]+)/(([^"]+)_[^_]+)".*|Flow Type: \1\nFlow Name: \3\nFile Name: \2|' input.html
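The part that needs GNU sed here is the newline (\n) in the replacement; structurally, it's just plain sed. This solution won't work in *BSD or OSX. EDIT: Per a comment on potong's answer, a variant that works in OSX is: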
<input.html sed -n 's/^.*"\([^"\/]*\)\/\(\([^"]*\)_\1\)".*/Flow Type: \1|Flow Name: \3|File Name: \2|/p' | tr '|' '\n'
Or, if you prefer ERE over BRE:
<input.html sed -E 's|<tr><td><a href="([^/]+)/(([^"]+)_[^_]+)".*|Flow Type: \1#Flow Name: \3#File Name: \2#|' | tr '#' '\n'
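This works around the fact that OSX sed doesn't turn \n into a newline in the replacement string. Instead, we insert an otherwise unused character and use tr to convert it to a newline. To achieve the same thing in awk (i.e. processing the HTML), you could do something like this: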
awk '
  /<tr><td><a/ {
    type=$0; file=$0;
    # Strip through the opening quote of href, then everything from the slash on
    sub(/^[^"]+"/,"",type); sub(/\/.*/,"",type);
    sub(/^[^\/]+\//,"",file); sub(/".*/,"",file);
    name=file; sub(/_[^_]+$/,"",name);
    printf("Flow type: %s\nFlow name: %s\nFile name: %s\n\n", type, name, file);
  }' input.html
OK, here's my final update. Is this what you're looking for?
awk '
  /<tr><td><a/ {
    type=$0; sub(/^[^"]+"/,"",type); sub(/\/.*/,"",type);
    file=$0; sub(/^[^\/]+\//,"",file); sub(/".*/,"",file);
    if ( index(file, type) ) {
      # awk strings are 1-indexed; keep everything before "_" plus the type
      name=substr(file, 1, index(file, type)-2);
    } else {
      name=file; sub(/_[^_]+$/,"",name);
    }
    printf("Flow type: %s\nFlow name: %s\nFile name: %s\n\n", type, name, file);
  }' input.html
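Coming back to the sed variants: POSIX sed accepts a newline in the replacement when it is written as a backslash followed by a literal line break, which should make the `tr` step unnecessary on both GNU and BSD sed. Here is a sketch that reuses the backreference pattern from the OSX variant above. The function name `html_to_labels` is an assumption for illustration, and this hasn't been tried on every sed out there.

```shell
# html_to_labels: hypothetical helper; reads HTML from the files given as
# arguments, or from stdin when there are none.
html_to_labels() {
  # \1 = type, \2 = file name, \3 = file name minus "_" plus the type.
  # Each "\" at the end of a line puts a real newline into the replacement.
  sed -n 's|^.*"\([^"/]*\)/\(\([^"]*\)_\1\)".*|Flow Type: \1\
Flow Name: \3\
File Name: \2|p' "$@"
}

printf '%s\n' \
  '<tr><td><a href="SAD/fullform_SAD">fullform_SAD</a></td></tr>' |
  html_to_labels
```

Because of `-n` and the `p` flag, only rows containing a `TYPE/NAME_TYPE` link are printed; the other HTML lines are dropped.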
Regarding bash - How to use AWK or SED to print a string before line N and remove a specific string from line N, the original question is on Stack Overflow: https://stackoverflow.com/questions/32203437/