python - Extracting blocks from a text file with Bash or Python

I have a huge text file with the following structure:

SEPARATOR
STRING1
(arbitrary number of lines)
SEPARATOR
...
SEPARATOR
STRING2
(arbitrary number of lines)
SEPARATOR
SEPARATOR
STRING3
(arbitrary number of lines)
SEPARATOR
....

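The only things that change between the different "blocks" of the file are the STRING and the content between the STRING and the SEPARATOR. I need a script in bash or python that, given STRING_i as input, produces as output a file containing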

SEPARATOR
STRING_i
(number of lines for this string)
SEPARATOR

What is the best approach here, bash or python? Are there other options? It also has to be fast.

Thanks

Best answer

In Python 2.6 or later:

def doit(inf, ouf, thestring, separator='SEPARATOR\n'):
  thestring += '\n'
  for line in inf:
    # here we're always at the start-of-block separator
    assert line == separator
    blockid = next(inf)
    if blockid == thestring:
      # found block of interest, use enumerate to count its lines
      for c, line in enumerate(inf):
        if line == separator: break
      assert line == separator
      # emit results and terminate function
      # the count needs its own trailing newline, or it runs into the separator
      ouf.writelines((separator, thestring, '(%d)\n' % c, separator))
      inf.close()
      ouf.close()
      return
    # non-interesting block, just skip it
    for line in inf:
      if line == separator: break

In older Python versions you can do almost exactly the same thing, but change the line blockid = next(inf) to blockid = inf.next().
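The question's "(number of lines for this string)" can also be read as asking for the block's lines themselves rather than a count. Under that reading, a minimal sketch might look like the following. It assumes Python 3, and the name copy_block is hypothetical (it is not part of the answer above). Unlike doit, it skips stray lines between blocks instead of asserting, and it leaves closing the files to the caller.

```python
import io


def copy_block(inf, ouf, block_id, separator='SEPARATOR\n'):
    """Copy the block whose id line is block_id from inf to ouf.

    Hypothetical variant of doit: writes the block verbatim
    (both separators included) and returns True if it was found.
    """
    wanted = block_id + '\n'
    for line in inf:
        if line != separator:
            continue  # assumption: ignore stray lines between blocks
        header = next(inf, None)
        if header is None:
            break  # file ended right after a separator
        found = header == wanted
        if found:
            ouf.write(separator)
            ouf.write(header)
        # consume the body up to and including the closing separator
        for line in inf:
            if found:
                ouf.write(line)
            if line == separator:
                break
        if found:
            return True
    return False


if __name__ == '__main__':
    sample = io.StringIO(
        'SEPARATOR\nSTRING1\na\nb\nSEPARATOR\n'
        'SEPARATOR\nSTRING2\nc\nSEPARATOR\n'
    )
    result = io.StringIO()
    copy_block(sample, result, 'STRING2')
    print(result.getvalue(), end='')
```

Because the function stops at the first match, it never reads past the block of interest, which matters on a huge file.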

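The assumption here is that the input and output files are opened by the caller (which also passes in the thestring value of interest and, optionally, separator), but that closing them is this function's job (e.g. for maximum ease of use as a pipeline filter, with sys.stdin as inf and sys.stdout as ouf); this is easy to tweak if needed, of course.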
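One possible tweak is to leave closing the files to the caller, which is the more common convention when files are opened in a with statement. The sketch below assumes Python 3; count_block_lines is a hypothetical name, and returning the count (or None when the block is missing) is an added convenience, not something the original function does.

```python
import io


def count_block_lines(inf, ouf, block_id, separator='SEPARATOR\n'):
    """Write separator, id, '(count)' and separator for one block.

    Hypothetical variant of doit: both files are left open, and the
    line count is returned (None if the block is not found).
    """
    wanted = block_id + '\n'
    for line in inf:
        # each pass of this loop starts at a block's opening separator
        if line != separator:
            raise ValueError('expected a separator, got %r' % line)
        header = next(inf, None)
        if header is None:
            break  # file ended right after a separator
        count = 0
        for line in inf:
            if line == separator:
                break
            count += 1
        if header == wanted:
            ouf.writelines((separator, wanted, '(%d)\n' % count, separator))
            return count
    return None


if __name__ == '__main__':
    sample = io.StringIO(
        'SEPARATOR\nSTRING1\na\nb\nSEPARATOR\n'
        'SEPARATOR\nSTRING2\nc\nSEPARATOR\n'
    )
    result = io.StringIO()
    print(count_block_lines(sample, result, 'STRING1'))
    print(result.getvalue(), end='')
```

As a pipeline filter, the same function could be called with sys.stdin and sys.stdout, neither of which usually needs an explicit close.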

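Removing the asserts would speed things up microscopically, but I like their "sanity check" role (and they can also help in understanding the logic of the code's flow).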

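The key idea of this approach is that a file is an iterator (of lines), and an iterator can be advanced in multiple places (so we can have multiple for statements, or specific "advance the iterator" calls such as next(inf), and they all cooperate correctly).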
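The shared-iterator behaviour can be seen in isolation with a small in-memory file. This is only an illustrative sketch, assuming Python 3 and using io.StringIO as a stand-in for a real file; the sample data is made up.

```python
import io

lines = io.StringIO(
    'SEPARATOR\nSTRING1\na\nb\nSEPARATOR\n'
    'SEPARATOR\nSTRING2\nc\nSEPARATOR\n'
)

seen = []
for line in lines:              # outer loop consumes each opening separator
    block_id = next(lines)      # explicit advance: the block's id line
    body = []
    for line in lines:          # inner loop resumes at the same position
        if line == 'SEPARATOR\n':
            break               # closing separator consumed here
        body.append(line)
    seen.append((block_id.strip(), len(body)))

print(seen)  # [('STRING1', 2), ('STRING2', 1)]
```

Each construct picks up exactly where the previous one stopped, because all three are pulling from the same underlying iterator rather than restarting the file.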

Regarding python - Extracting blocks from a text file with Bash or Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/2050411/