How to perform entity linking to a local knowledge graph

Question

I'm building my own knowledge base from scratch, using articles found online.

I am trying to map the entities from my scraped SPO triples (the Subject and potentially the Object) to my own record of entities, which consists of listed companies scraped from another website.

I've researched most of the libraries, and the methods are focused on mapping entities to big knowledge bases like Wikipedia, YAGO, etc., but I'm not really sure how to apply those techniques to my own knowledge base.

Currently, I've found the NEL Python package, which claims to be able to do so, but I don't quite understand the documentation, and it focuses only on a Wikipedia data dump.

Are there any techniques or libraries that allow me to do this?

Answer

I assume you have something similar to the Wikidata knowledge base, that is, a giant list of concepts with aliases.

More or less, this can be represented as follows:

C1 new york
C1 nyc
C1 big apple

Now, to link spans of a sentence to the above KB: for single-word concepts it is easy; you just have to set up an index that maps each single-word alias to its concept identifier.
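As a minimal sketch of that index (not part of the original answer), assuming the knowledge base is loaded as (concept_id, alias) pairs like the C1 example above; the names kb_rows and alias_index are purely illustrative:

# Illustrative sketch: a lookup table for single-word aliases only.
# kb_rows mirrors the "C1 new york / C1 nyc / C1 big apple" example above.
kb_rows = [
    ('C1', 'new york'),
    ('C1', 'nyc'),
    ('C1', 'big apple'),
]

# Map every single-word alias directly to its concept identifier.
alias_index = {
    alias: concept_id
    for concept_id, alias in kb_rows
    if len(alias.split()) == 1
}

print(alias_index.get('nyc'))   # -> 'C1'
print(alias_index.get('york'))  # -> None, only a fragment of a multi-word alias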

The difficult part is linking multi-word or phrasal concepts like "new york" or "big apple".

To achieve that, I use an algorithm that splits a sentence into all the possible slices; I call those "spans". It then tries to match each individual span, a single word or a group of words, against a concept from the database.

For instance, here is an example of all the spans for a simple sentence. Each line is a list that stores lists of strings:

[['new'], ['york'], ['is'], ['the'], ['big'], ['apple']]
[['new'], ['york'], ['is'], ['the'], ['big', 'apple']]
[['new'], ['york'], ['is'], ['the', 'big'], ['apple']]
[['new'], ['york'], ['is'], ['the', 'big', 'apple']]
[['new'], ['york'], ['is', 'the'], ['big'], ['apple']]
[['new'], ['york'], ['is', 'the'], ['big', 'apple']]
[['new'], ['york'], ['is', 'the', 'big'], ['apple']]
[['new'], ['york'], ['is', 'the', 'big', 'apple']]
[['new'], ['york', 'is'], ['the'], ['big'], ['apple']]
[['new'], ['york', 'is'], ['the'], ['big', 'apple']]
[['new'], ['york', 'is'], ['the', 'big'], ['apple']]
[['new'], ['york', 'is'], ['the', 'big', 'apple']]
[['new'], ['york', 'is', 'the'], ['big'], ['apple']]
[['new'], ['york', 'is', 'the'], ['big', 'apple']]
[['new'], ['york', 'is', 'the', 'big'], ['apple']]
[['new'], ['york', 'is', 'the', 'big', 'apple']]
[['new', 'york'], ['is'], ['the'], ['big'], ['apple']]
[['new', 'york'], ['is'], ['the'], ['big', 'apple']]
[['new', 'york'], ['is'], ['the', 'big'], ['apple']]
[['new', 'york'], ['is'], ['the', 'big', 'apple']]
[['new', 'york'], ['is', 'the'], ['big'], ['apple']]
[['new', 'york'], ['is', 'the'], ['big', 'apple']]
[['new', 'york'], ['is', 'the', 'big'], ['apple']]
[['new', 'york'], ['is', 'the', 'big', 'apple']]
[['new', 'york', 'is'], ['the'], ['big'], ['apple']]
[['new', 'york', 'is'], ['the'], ['big', 'apple']]
[['new', 'york', 'is'], ['the', 'big'], ['apple']]
[['new', 'york', 'is'], ['the', 'big', 'apple']]
[['new', 'york', 'is', 'the'], ['big'], ['apple']]
[['new', 'york', 'is', 'the'], ['big', 'apple']]
[['new', 'york', 'is', 'the', 'big'], ['apple']]
[['new', 'york', 'is', 'the', 'big', 'apple']]

Each sublist may or may not map to a concept. To find the best mapping, you can score each of the above lines based on the number of concepts that match.

Here are the two span lists from above that have the best score according to the example knowledge base:

2  ~  [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
2  ~  [['new', 'york'], ['is', 'the'], ['big', 'apple']]

So it guessed that "new york" is a concept and that "big apple" is also a concept.

Here is the full code:

# The sentence to link, already tokenized into words.
sentence = 'new york is the big apple'.split()


def spans(lst):
    """Yield every possible segmentation of lst into contiguous slices."""
    if len(lst) == 0:
        yield None
        return
    for index in range(1, len(lst)):
        for span in spans(lst[index:]):
            if span is not None:
                yield [lst[0:index]] + span
    yield [lst]


# Toy knowledge base: each entry is the tokenized alias of a concept.
knowledgebase = [
    ['new', 'york'],
    ['big', 'apple'],
]

out = []
scores = []

# Score every segmentation by counting how many of its slices
# exactly match an alias from the knowledge base.
for span in spans(sentence):
    score = 0
    for candidate in span:
        for uid, entity in enumerate(knowledgebase):
            if candidate == entity:
                score += 1
    out.append(span)
    scores.append(score)

# Sort by ascending score, so the best segmentations are printed last.
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])

for winner in leaderboard:
    print(winner[1], ' ~ ', winner[0])

This must be improved to associate each list that matches a concept with its concept identifier, and to find a way to spell-check everything against the knowledge base.
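As a rough sketch of that improvement (an assumption, not code from the original answer), the matching step could return concept identifiers instead of a bare count, and difflib from the standard library can act as a crude spell check against the KB aliases; the names concept_aliases, alias_strings and link_span are hypothetical:

import difflib

# Hypothetical sketch: map each slice of a segmentation to a concept id,
# tolerating small spelling mistakes via difflib's fuzzy matching.
concept_aliases = {
    ('new', 'york'): 'C1',
    ('big', 'apple'): 'C1',
}

# Flat alias strings used both for exact lookup and for the fuzzy spell check.
alias_strings = {' '.join(alias): uid for alias, uid in concept_aliases.items()}


def link_span(candidate, cutoff=0.8):
    """Return the concept id for a slice of tokens, or None if nothing matches."""
    text = ' '.join(candidate)
    if text in alias_strings:  # exact match against a known alias
        return alias_strings[text]
    close = difflib.get_close_matches(text, list(alias_strings), n=1, cutoff=cutoff)
    return alias_strings[close[0]] if close else None  # fuzzy (spell-checked) match


print(link_span(['new', 'york']))  # -> 'C1'
print(link_span(['bg', 'apple']))  # -> 'C1', recovered despite the typo
print(link_span(['is', 'the']))    # -> None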
