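I'm working on a project where I have to build an SVM classifier that predicts MeSH term assignments from the words in an article's title and abstract. We were given gzip files containing 1000 PMIDs, each identifying one article. Here is a sample file: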
PMID- 22997744
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
LR - 20120924
IS - 0042-4676 (Print)
IS - 0042-4676 (Linking)
IP - 3
DP - 2012 May-Jun
TI - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].
PG - 28-33
AB - To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology.
      Eighty patients with suspected recurrent colon tumor were examined. All the
      patients underwent irrigoscopy, colonoscopy, magnetic resonance imaging of the
      abdomen and small pelvis. The major magnetic resonance symptoms of recurrent
      colon tumors were studied; a differential diagnosis of recurrent processes and
      postoperative changes at the site of intervention was made.
FAU - Dan'ko, N A
MH - Aged
MH - Colon/pathology/surgery
MH - Colorectal Neoplasms/*diagnosis/pathology/surgery
MH - Diagnosis, Differential
MH - Female
MH - Humans
MH - Magnetic Resonance Imaging/*methods
MH - Male
MH - Middle Aged
MH - Neoplasm Recurrence, Local/*diagnosis
MH - Postoperative Complications/*diagnosis
MH - Rectum/pathology/surgery
MH - Reproducibility of Results
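I'm trying to figure out how to create a dictionary with the following contents:
{PMID: {Title (TI): Title words},
       {Abstract (AB): Abstract words},
       {MeSH (MH): MeSH terms}}.
Is there an easy way to do this? So far, I know the code below comes close, but it isn't perfect.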
import gzip

class Node:
    def __init__(self, indented_line):
        self.children = []
        self.level = len(indented_line) - len(indented_line.lstrip())
        self.text = indented_line.strip()

    def add_children(self, nodes):
        childlevel = nodes[0].level
        while nodes:
            node = nodes.pop(0)
            if node.level == childlevel: # add node as a child
                self.children.append(node)
            elif node.level > childlevel: # add nodes as grandchildren of the last child
                nodes.insert(0,node)
                self.children[-1].add_children(nodes)
            elif node.level <= self.level: # this node is a sibling, no more children
                nodes.insert(0,node)
                return

    def as_dict(self):
        if len(self.children) > 1:
            return {self.text: [node.as_dict() for node in self.children]}
        elif len(self.children) == 1:
            return {self.text: self.children[0].as_dict()}
        else:
            return self.text

# Problem A [0 points]
def read_data(filenames):
    data = None
    # Begin CODE
    data = {}
    contents = []
    for filename in filenames:
        with gzip.open(filename,'rt') as f:
            contents.append(f.read())
    root = Node('root')
    root.add_children([Node(line) for line in contents[0].splitlines() if line.strip()])
    d = root.as_dict()['root']
    print(d[:50])
    # End CODE
    return data
Best answer
Let's cut the example down to something simpler:
content = """
PMID- 22997744
OWN - NLM
TI - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].
PG - 28-33
AB - To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology.
      Eighty patients with suspected recurrent colon tumor were examined.
FAU - Dan'ko, N A
MH - Aged
MH - Colon/pathology/surgery"""
You can use regular expressions to match a pattern. Regular expressions are a deep and powerful tool:
>>> import re
>>> match = re.search('^PMID- (.*)$', content, re.MULTILINE)
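The pattern ^PMID- (.*)$ matches the start of a line (^), followed by PMID- , then any number of characters (.*), then the end of the line ($). The parentheses in (.*) mean that whatever is matched inside them is captured in a group. We need to check that there was a match:
>>> match is not None
True
We can query the match: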
>>> match.groups()
('22997744',)
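So we can see that there is one group (because we defined only one group in the pattern), and that it matched the PMID: 22997744. We can get that value by asking for match group 1. Match group 0 is the whole string that matched: PMID- 22997744.
>>> pmid = match.group(1)
>>> pmid
'22997744'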
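The check-then-extract steps above can be wrapped in a small helper. This is only a sketch: the name first_field is hypothetical, and the ` *` in the pattern is an assumption added so that tags padded with extra spaces (as raw MEDLINE exports tend to have) still match.

```python
import re

def first_field(tag, text):
    """Return the value of the first line that starts with `tag`, or None."""
    # " *" tolerates tags padded with spaces before the dash (assumption).
    pattern = r'^' + re.escape(tag) + r' *- (.*)$'
    match = re.search(pattern, text, re.MULTILINE)
    # re.search returns None when nothing matches, so guard before .group().
    return match.group(1) if match is not None else None

sample = "PMID- 22997744\nOWN - NLM\n"
print(first_field('PMID', sample))  # 22997744
print(first_field('TI', sample))    # None
```

Returning None rather than raising keeps the caller in control of what to do with records that lack a given field.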
Matching TI and AB with a pattern is much harder, because they span multiple lines. I'm not an expert, and perhaps someone else will chip in with something better. I would simply do a text replacement first, so that all of the text is on one line. For example:
>>> text = """TI - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
...       cancer]."""
>>> print(text)
TI - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal
      cancer].
>>> print(text.replace('\n      ', ' '))
TI - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer].
Then we can match TI and AB in a similar way:
>>> content = content.replace('\n      ', ' ')
>>> match = re.search('^TI - (.*)$', content, re.MULTILINE)
>>> ti = match.group(1)
>>> ti
'[Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer].'
>>> match = re.search('^AB - (.*)$', content, re.MULTILINE)
>>> ab = match.group(1)
>>> ab
'To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology. Eighty patients with suspected recurrent colon tumor were examined.'
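A fixed-string replace depends on the exact number of spaces that indent a continuation line. As a hedged alternative, re.sub can join any line that starts with whitespace onto the line before it. The helper name unfold is hypothetical, and the sketch assumes that every indented line is a continuation of the previous field.

```python
import re

def unfold(text):
    """Join indented continuation lines onto the preceding field line."""
    # A newline followed by one or more spaces or tabs marks a continuation.
    return re.sub(r'\n[ \t]+', ' ', text)

record = (
    "TI - [Value of magnetic resonance imaging in the diagnosis of recurrent colorectal\n"
    "      cancer].\n"
    "PG - 28-33\n"
)
print(unfold(record))
```

Lines that start at column zero, such as the PG line here, are left untouched because the newline before them is not followed by whitespace.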
To match MH we want to find all of the matches. re.search will only find the first match; re.findall returns all of them:
>>> mh = re.findall('^MH - (.*)$', content, re.MULTILINE)
>>> mh
['Aged', 'Colon/pathology/surgery']
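The same re.findall idea extends to every field at once. When a pattern has two groups, re.findall returns a list of (tag, value) tuples, which can be collected into a dictionary of lists. This is a sketch under the assumption that tags consist of uppercase letters only; the name all_fields is hypothetical.

```python
import re
from collections import defaultdict

def all_fields(text):
    """Map each field tag to the list of values found for it."""
    fields = defaultdict(list)
    # Two groups in the pattern make re.findall return (tag, value) tuples.
    for tag, value in re.findall(r'^([A-Z]+) *- (.*)$', text, re.MULTILINE):
        fields[tag].append(value)
    return dict(fields)

sample = "PMID- 22997744\nMH - Aged\nMH - Colon/pathology/surgery\n"
fields = all_fields(sample)
print(fields['MH'])    # ['Aged', 'Colon/pathology/surgery']
print(fields['PMID'])  # ['22997744']
```

Every value is a list, even for tags that occur once, so the caller decides which fields to treat as single-valued.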
Putting it all together:
data = {}
data[pmid] = {'Title': ti,
              'Abstract': ab,
              'MeSH': mh}
This gives (using pprint to make it look nicer):
>>> from pprint import pprint
>>> pprint(data)
{'22997744': {'Abstract': 'To diagnose recurrent colorectal cancer is an urgent problem of oncoproctology. Eighty patients with suspected recurrent colon tumor were examined.',
              'MeSH': ['Aged', 'Colon/pathology/surgery'],
              'Title': '[Value of magnetic resonance imaging in the diagnosis of recurrent colorectal cancer].'}}
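To go from one record to a whole gzip file, the steps above could be combined roughly as follows. This is a sketch, not a definitive implementation. It assumes that records are separated by one or more blank lines and that the files are UTF-8 text, neither of which is confirmed by the sample. parse_record is a hypothetical helper; read_data is the function name from the question.

```python
import gzip
import re

def parse_record(record):
    """Parse one record into (pmid, fields); return None if it has no PMID."""
    # Join indented continuation lines onto the preceding field line.
    record = re.sub(r'\n[ \t]+', ' ', record)

    def first(tag):
        # " *" tolerates tags padded with spaces before the dash (assumption).
        m = re.search(r'^' + tag + r' *- (.*)$', record, re.MULTILINE)
        return m.group(1) if m is not None else None

    pmid = first('PMID')
    if pmid is None:
        return None
    return pmid, {
        'Title': first('TI'),
        'Abstract': first('AB'),
        'MeSH': re.findall(r'^MH *- (.*)$', record, re.MULTILINE),
    }

def read_data(filenames):
    """Build {pmid: {'Title': ..., 'Abstract': ..., 'MeSH': [...]}}."""
    data = {}
    for filename in filenames:
        # Assumption: the files are UTF-8 encoded text.
        with gzip.open(filename, 'rt', encoding='utf-8') as f:
            content = f.read()
        # Assumption: records are separated by one or more blank lines.
        for record in re.split(r'\n\s*\n', content):
            parsed = parse_record(record)
            if parsed is not None:
                pmid, fields = parsed
                data[pmid] = fields
    return data
```

A record with no title or abstract gets None for that key and an empty list for MeSH, so it is worth filtering those out before building features for the classifier.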
Regarding "python - How to parse a PubMed text file?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53798457/