Problem description
I have an output file containing thousands of lines of information. Every so often, the output file contains information of the following form:
Input Orientation:
...
content
...
Distance matrix (angstroms):
I now want to save the content to a variable for subsequent formatting. I am only interested in the last occurrence of this pattern in my file. I have a solution that does this with sed and awk, but it forces me to keep multiple files around to carry out one job. This should be doable with Python, but I have no idea where to start reading to learn how.
EDIT: I have been reading up on regular expressions, and believe it or not I have made some progress! I first read in the file line by line, then reverse the list, and then join all the strings that make up that list. I end up with one big multiline string. Next I use the re module with the regex r'Distance matrix(.*?)Input orientation', which I think means the following: my first pattern is "Distance matrix", then a subpattern that matches zero or more of any character, but lazily (stopping at the first possible match), and then my last pattern "Input orientation".
import re

with open(inputfile, "r") as input_file:
    input_file_lines = input_file.readlines()

# Reverse the line order so the last block in the file is matched first.
reverse_lines = input_file_lines[::-1]
string = ''.join(reverse_lines)
match = re.search(r'Distance matrix(.*?)Input orientation', string, re.DOTALL).group(1)
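For comparison, the lazy group described above can also be applied without reversing the lines: collect every block in file order and keep the last one. This is only a sketch, not the asker's code; the two-block text is a made-up miniature of the real output.

```python
import re

# Hypothetical miniature of the output file with two blocks.
text = (
    "Input orientation:\n old rows\nDistance matrix (angstroms):\n"
    "Input orientation:\n new rows\nDistance matrix (angstroms):\n"
)

# re.DOTALL lets '.' match newlines; '.*?' stops at the nearest end marker.
pattern = re.compile(r"Input orientation:(.*?)Distance matrix", re.DOTALL)

matches = pattern.findall(text)           # one captured string per block
last = matches[-1] if matches else None   # keep only the final block
print(last)
```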
Sample data file for testing:
Item Value Threshold Converged?
Maximum Force 0.005032 0.000450 NO
RMS Force 0.001066 0.000300 NO
Maximum Displacement 0.027438 0.001800 NO
RMS Displacement 0.007282 0.001200 NO
Predicted change in Energy=-8.909077D-05
GradGradGradGradGradGradGradGradGradGradGradGradGradGradGradGradGradGrad
Input orientation:
---------------------------------------------------------------------
Center Atomic Atomic Coordinates (Angstroms)
Number Number Type X Y Z
---------------------------------------------------------------------
1 6 0 Incorrect Incorrect Incorrect
2 1 0 Incorrect Incorrect Incorrect
3 1 0 Incorrect Incorrect Incorrect
4 1 0 Incorrect Incorrect Incorrect
5 17 0 Incorrect Incorrect Incorrect
6 9 0 Incorrect Incorrect Incorrect
---------------------------------------------------------------------
Distance matrix (angstroms):
1 2 3 4 5
1 C 0.000000
2 H 1.080163 0.000000
3 H 1.080326 1.809416 0.000000
4 H 1.080621 1.810236 1.810685 0.000000
5 Cl 1.962171 2.470702 2.468769 2.465270 0.000000
6 F 2.390537 2.343910 2.357275 2.380515 4.352568
6
6 F 0.000000
Input orientation:
---------------------------------------------------------------------
Center Atomic Atomic Coordinates (Angstroms)
Number Number Type X Y Z
---------------------------------------------------------------------
1 6 0 Correct Correct Correct
2 1 0 Correct Correct Correct
3 1 0 Correct Correct Correct
4 1 0 Correct Correct Correct
5 17 0 Correct Correct Correct
6 9 0 Correct Correct Correct
---------------------------------------------------------------------
Distance matrix (angstroms):
1 2 3 4 5
1 C 0.000000
2 H 1.080516 0.000000
3 H 1.080587 1.801890 0.000000
4 H 1.080473 1.801427 1.801478 0.000000
5 Cl 1.936014 2.458132 2.459437 2.460630 0.000000
6 F 2.414588 2.368281 2.365651 2.355690 4.350586
Recommended answer
Regex isn't necessary here. All you need is good ol' indexing. Python strings have index and rindex methods that take a substring, find it in the text, and return the index of the first character of the match: index searches from the start and rindex from the end. Reading the Python documentation on strings should get you familiar with slicing. The program could look something like this:
with open(input_file) as f:
    s = f.read()  # reads the file as one big string
last_block = s[s.rindex('Input'):s.rindex('Distance')]
The last line of that code finds the last occurrence of 'Input' in the file (rindex searches backward from the end) and records its position as an integer. It then does the same with 'Distance'. Finally, it uses those two integers to slice out only the portion of the string that lies between them. For your example file it would return:
Input orientation:
---------------------------------------------------------------------
Center Atomic Atomic Coordinates (Angstroms)
Number Number Type X Y Z
---------------------------------------------------------------------
1 6 0 Correct Correct Correct
2 1 0 Correct Correct Correct
3 1 0 Correct Correct Correct
4 1 0 Correct Correct Correct
5 17 0 Correct Correct Correct
6 9 0 Correct Correct Correct
---------------------------------------------------------------------
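To try the slicing without a real output file, here is a minimal self-contained sketch. The sample string is a shortened stand-in for the example data, and the helper name last_block and its default markers are illustrative choices, not part of any library.

```python
# Hypothetical miniature of the output file; real files are much longer.
sample = """\
 Input orientation:
 1 6 0 Incorrect Incorrect Incorrect
 Distance matrix (angstroms):
 1 C 0.000000
 Input orientation:
 1 6 0 Correct Correct Correct
 2 1 0 Correct Correct Correct
 Distance matrix (angstroms):
 1 C 0.000000
"""


def last_block(text, start="Input orientation", end="Distance matrix"):
    """Return the text between the last start marker and the last end marker."""
    begin = text.rindex(start)  # position of the last start marker
    stop = text.rindex(end)     # position of the last end marker
    return text[begin:stop]


block = last_block(sample)
print(block)
```

Note that this relies on every 'Input orientation' block being followed by a 'Distance matrix' line, as in the example file.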
If you don't want the 'Input orientation' header, you can simply add an offset to the result of rindex('Input') until you get the desired result. That could look like s[s.rindex('Input') + 19:s.rindex('Distance')], for instance, where 19 is the length of 'Input orientation:' plus the newline that follows it (assuming nothing else trails the colon).
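Rather than counting characters by hand, the offset can be computed from the header string itself, and the remaining text split into rows for later formatting. This is a sketch under assumptions: the coordinate values are invented placeholders, and the separator and column-title lines of the real file are left out for brevity.

```python
header = "Input orientation:"

# Hypothetical block; the numbers are placeholders, not real coordinates.
text = (
    "Input orientation:\n"
    " 1 6 0 0.1 0.2 0.3\n"
    " 2 1 0 0.4 0.5 0.6\n"
    " Distance matrix (angstroms):\n"
)

start = text.rindex(header) + len(header)  # skip past the header text
stop = text.rindex("Distance matrix")
content = text[start:stop]

# Split into rows of fields, dropping blank or whitespace-only lines.
rows = [line.split() for line in content.splitlines() if line.strip()]
print(rows)
```

With the real file, the dashed separator lines and column titles would also show up as rows and would need to be filtered out before formatting.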
It is also important to note that index and rindex raise a ValueError if the substring is not found. If that is not desired, you can use find and rfind, which return -1 instead.
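A sketch of the non-raising variant follows. The function name last_block_or_none and the choice to return None are illustrative assumptions; the point is only that the -1 result of rfind has to be checked before slicing, because -1 is itself a valid slice index.

```python
def last_block_or_none(text, start="Input orientation", end="Distance matrix"):
    """Return the last block, or None if the markers are missing or out of order."""
    begin = text.rfind(start)  # -1 if the start marker is absent
    stop = text.rfind(end)     # -1 if the end marker is absent
    if begin == -1 or stop < begin:
        return None            # missing marker, or no end marker after the last start
    return text[begin:stop]


print(last_block_or_none("no markers here"))
print(last_block_or_none("Input orientation:\nrow\nDistance matrix"))
```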