The sense2vec documentation mentions three main files - the first one is merge_text.py. I tried several types of input - txt, csv, and bzip2-compressed files - because merge_text.py tries to open files compressed with bzip2.
The file can be found here:
https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py
Which input format does this script expect?
Also, if anyone has suggestions, how can the model be trained?
Best answer
I extended and adjusted the code samples from sense2vec.
You go from this input text:
"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money."
To this:
as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV
saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN
faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ
if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN
Here is the code. Let me know if you have any questions.
I will probably publish it on github.com/woltob soon.
import spacy
import re

# Note: written against the spaCy 1.x API ('en' shortcut model, Span.merge()
# with positional arguments); newer spaCy versions use retokenizer calls instead.
nlp = spacy.load('en')
nlp.matcher = None
LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}
pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')
def strip_meta(text):
    # Normalise the raw text before parsing.
    text = text.replace('per cent', 'percent')
    # Un-escape HTML entities left over from scraped text.
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    # Protect paragraph breaks, collapse single line breaks, then restore them.
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text
def transform_doc(doc):
    # Merge named entities into single tokens, tagged with their mapped label.
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    # Merge base noun phrases, dropping leading tokens that are not modifiers.
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''
def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3) + '|URL'
        else:
            return word.text.lower().strip() + '|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCT such as commas and DET like "the"
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #     tag = '?'
    return text + '|' + tag
corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''
corpus_stripped = strip_meta(corpus)
doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatise NOUN and PROPN.
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original first letter, use the rest of the lemma,
        # then re-append the trailing whitespace if there was any.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        # print(word.text, lemma_)
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)
result = transform_doc(nlp(''.join(corpus_)))
sense2vec_filename = 'text.txt'
file = open(sense2vec_filename,'w')
file.write(result)
file.close()
print(result)
You can visualise the model in TensorBoard using gensim with this approach:
https://github.com/ArdalanM/gensim2tensorboard
I would also adjust that code to make it work with the sense2vec approach (e.g. the words become lowercase in its preprocessing step; just comment that out in the code).
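If you just want to train a model on the generated file, a plain word2vec run over the token|TAG lines works. Below is only a minimal sketch under my own assumptions (gensim installed, pre-4.0 parameter names, illustrative hyperparameters), not the exact training setup of the sense2vec package:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# text.txt holds one sentence per line, tokens already in "word|TAG" form,
# so LineSentence can stream it straight into word2vec.
sentences = LineSentence('text.txt')

# Hyperparameters are illustrative; min_count=1 only because the demo corpus is tiny.
# In gensim >= 4.0, 'size' is called 'vector_size'.
model = Word2Vec(sentences, size=128, window=5, min_count=1, workers=4)
model.save('sense2vec_gensim.model')

# Vocabulary keys keep the |TAG suffix:
print(model.wv.most_similar('saudi_arabia|ENT'))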
Happy coding,
woltob
About python - How to train a Sense2Vec model: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37946008/