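I'll do my best to explain this. I'm trying to split a large export file (400MB) on a specific word. For this example, we'll call that unique word PYTHONEXP.

Example: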

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxx 12.34.34.34 xxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxx
xxxx 12.34.34.34 xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxx 55.44.44.44 xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxx 55.44.44.44 xxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxx 77.66.66.66 xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxx
xxxxxx 77.66.66.66 xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx 99.88.88.88xxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx 22.33.33.33 xxxxxxxxxxxxxxxx
xxxxxxxxx
xxxxxxx 22.33.33.33 xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx 99.88.88.88 xxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx


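Now, assume the x's are random words, but every block shares the unique word (PYTHONEXP) at its start. I want to be able to break out each section and remove duplicate IPs only within that section. Ideally, I'd like output like this: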

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxx 12.34.34.34 xxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxx
xxxx  xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxx 55.44.44.44 xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxx  xxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxx 77.66.66.66 xxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxx
xxxxxx  xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx xxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxx 22.33.33.33 xxxxxxxxxxxxxxxx
xxxxxxxxx
xxxxxxx  xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx

XXXXX PYTHONEXP xxxxxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxx  xxx
xxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx


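Notice that I still have 2 entries of 99.88.88.88 in the desired output instead of 4. My main goal is to remove duplicates within these segments, where each segment starts at a line containing PYTHONEXP. I'd really appreciate any help, or even just knowing whether this is possible. I hope I've explained this properly and that it makes sense.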

Best answer

Using this as the input file, with each PYTHONEXP block on a single line:

$ cat file
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxx 12.34.34.34 xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx xxxx 12.34.34.34 xxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxx 55.44.44.44 xxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxx 55.44.44.44 xxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxx 77.66.66.66 xxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx 77.66.66.66 xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxx 99.88.88.88 xxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxx 22.33.33.33 xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxxx 22.33.33.33 xxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxx 99.88.88.88 xxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx


We can select only those lines that contain PYTHONEXP and, for those lines, remove the second occurrence of an IP address as follows:

$ sed -En '/PYTHONEXP/{ s/(( ([[:digit:]]+\.){3}[[:digit:]]+).*)(\2)/\1/; p }' file
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxx 12.34.34.34 xxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx xxxx xxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxx 55.44.44.44 xxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxx xxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxx 77.66.66.66 xxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxx xxxxxx xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxx 22.33.33.33 xxxxxxxxxxxxxxxx xxxxxxxxx xxxxxxx xxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx
XXXXX PYTHONEXP xxxxxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx 99.88.88.88 xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxx xxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxx


This matches your desired output.

How it works
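
If the export keeps each block spread over several lines, as in the question's sample, the sed command above would not reach the repeated IPs on the lines below the PYTHONEXP line. Here is a minimal sketch of one way to handle that layout. It uses awk rather than the sed command from this answer and assumes a new block starts at every line containing PYTHONEXP. The file name blocks.txt, the function name dedupe_ips and the sample data are made up for illustration.

```shell
# Hypothetical sample: two multi-line blocks; the first repeats its IP.
cat > blocks.txt <<'EOF'
XXXXX PYTHONEXP xxxxxx
xxxx 12.34.34.34 xxxx
xxxx 12.34.34.34 xxxx

XXXXX PYTHONEXP xxxxxx
xxxx 12.34.34.34 xxxx
EOF

# Keep the first occurrence of each IP per block, blank out the repeats.
dedupe_ips() {
  awk '
  /PYTHONEXP/ { split("", seen) }   # new block: forget IPs seen so far
  {
      out = ""; rest = $0
      # Walk through every dotted-quad on the line, left to right.
      while (match(rest, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
          ip = substr(rest, RSTART, RLENGTH)
          out = out substr(rest, 1, RSTART - 1)
          if (!(ip in seen)) { out = out ip; seen[ip] = 1 }
          rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
  }' "$1"
}

dedupe_ips blocks.txt
```

Unlike the sed command, this keeps the surrounding spaces, so a removed IP leaves two spaces behind, as in the question's desired output. The second block keeps its 12.34.34.34 because the list of seen IPs is reset at each PYTHONEXP line.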


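-E tells sed to use the more modern extended regular expressions.
-n tells sed not to print anything unless we explicitly ask it to.
/PYTHONEXP/{ ... } tells sed to select only the lines that match the regex PYTHONEXP and, for those lines, to run the commands inside the curly braces. In our case, the braces contain two commands:


s/old/new/ is a substitute command. In our case, it removes the second occurrence of the IP address from the line.
p tells sed to print the resulting line.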
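
To see how -n and p interact, here is a small sketch on three made-up lines (the file name demo.txt and its contents are only for illustration):

```shell
# Three hypothetical lines; only the first and third contain PYTHONEXP.
printf '%s\n' 'a PYTHONEXP 1' 'b 2' 'c PYTHONEXP 3' > demo.txt

# With -n, nothing is printed automatically, so only the selected
# lines (printed by p) appear.
selected=$(sed -En '/PYTHONEXP/{ p; }' demo.txt)
printf '%s\n' "$selected"

# Without -n, every line is printed once automatically and the
# selected lines are printed a second time by p.
doubled=$(sed -E '/PYTHONEXP/{ p; }' demo.txt)
printf '%s\n' "$doubled"
```

The first command prints two lines and the second prints five, which is why the answer's command combines -n with an explicit p.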



The substitute command looks like this:

s/(( ([[:digit:]]+\.){3}[[:digit:]]+).*)(\2)/\1/
   ----------------------------------   ----
                  |                       |
        This matches a space followed     |
        by an IP address                  |
        (This is saved in group 2.)       |
                                          |
                                       This matches another
                                       occurrence of the same IP


  --------------------------------------
               |
      This matches a space and an IP
      followed by anything and this
      is saved as group 1.
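
The groups can be tried out on a short made-up line. The sketch below reuses the substitution unchanged. The variable names and sample strings are only for illustration, and the loop in the second call is an addition that is not part of the command above: it assumes you want every repeat removed when an IP appears more than twice.

```shell
# The substitution from the answer, stored once so both calls share it.
re='s/(( ([[:digit:]]+\.){3}[[:digit:]]+).*)(\2)/\1/'

# Two occurrences: the repeat and the space before it are removed,
# because group 2 includes the leading space.
once=$(printf '%s\n' 'a 1.2.3.4 b 1.2.3.4 c' | sed -E "$re")
printf '%s\n' "$once"    # a 1.2.3.4 b c

# Three occurrences: a single pass removes only one repeat, so loop
# with a label and t (branch if a substitution was made).
all=$(printf '%s\n' 'a 1.2.3.4 b 1.2.3.4 c 1.2.3.4 d' |
      sed -E -e ':again' -e "$re" -e 't again')
printf '%s\n' "$all"     # a 1.2.3.4 b c d
```

Because .* is greedy, a single pass drops the last repeat on the line, and the loop then works backwards until only the first occurrence is left.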

10-07 23:40