Problem description
I have a custom rule-based matcher in spaCy, and I am able to match some sentences in a document. I would now like to extract some numbers from the matched sentences. However, the matched sentences do not always have the same shape and form. What is the best way to do this?
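# case 1:
import spacy
from spacy.matcher import Matcher

# Assumption: the post does not say which model was loaded; any English pipeline with a lemmatizer should do.
nlp = spacy.load("en_core_web_sm")

texts = ["the surface is 31 sq",
         "the surface is sq 31",
         "the surface is square meters 31",
         "the surface is 31 square meters",
         "the surface is about 31,2 square",
         "the surface is 31 kilograms"]

# Unit first, then the number: "surface is sq 31"
pattern = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
]

# Number first, then the unit: "surface is 31 sq"
# As posted, "OP" sits inside the "TEXT" dict rather than next to it; the output below comes from this version.
pattern_1 = [
    {"LOWER": "surface"},
    {"LEMMA": "be", "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
    {"TEXT": {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$", "OP": "+"}}
]

matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this is matcher.add("Surface", [pattern, pattern_1])
matcher.add("Surface", None, pattern, pattern_1)

for index, text in enumerate(texts):
    print(f"Case {index}")
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)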
My output is:
Case 0
4898162435462687487 Surface 1 5 surface is 31 sq
Case 1
4898162435462687487 Surface 1 5 surface is sq 31
Case 2
4898162435462687487 Surface 1 6 surface is square meters 31
Case 3
4898162435462687487 Surface 1 5 surface is 31 square
Case 4
4898162435462687487 Surface 1 6 surface is about 31,2 square
Case 5
I would like to return only the number (of square meters), something like [31, 31, 31, 31, 31.2], rather than the full matched text. What is the correct way to do this in spaCy?
Recommended answer
Since each match contains exactly one LIKE_NUM token, you can walk the subtree of the matched span and return the first token whose like_num attribute is true:
value = [token for token in span.subtree if token.like_num][0]
Test:
results = []
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        results.append([token for token in span.subtree if token.like_num][0])
print(results)  # => [31, 31, 31, 31, 31,2]
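The list printed above holds spaCy Token objects rather than numbers, which is why the last entry shows as 31,2. If plain numeric values are wanted, as in the question's [31, 31, 31, 31, 31.2], each token's text can be converted afterwards. The following is a minimal sketch, not part of the original answer: `to_number` is a hypothetical helper, and it assumes that a comma in the token text is a decimal separator rather than a thousands separator.

```python
def to_number(text: str) -> float:
    """Convert a numeric token text such as "31" or "31,2" to a float.

    Assumption: a comma is a decimal separator ("31,2" means 31.2),
    not a thousands separator.
    """
    return float(text.replace(",", "."))


# Stand-in for [token.text for token in results] from the loop above.
token_texts = ["31", "31", "31", "31", "31,2"]

values = [to_number(t) for t in token_texts]
print(values)  # [31.0, 31.0, 31.0, 31.0, 31.2]
```

With the real results list, the same conversion would be applied as `[to_number(token.text) for token in results]`. Inputs that use the comma as a thousands separator (for example "1,200") would need different handling.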