This post covers how to handle entity indices from an annotation tool when preparing training data for spaCy.

Problem description

How can I read my annotated data in spaCy?

1) The format of my annotated data:

  "annotation": [
    [
      79,
      99,
      "Nom complet"
    ],

2) The format of the annotated data that the script expects:

  "annotation": [
    {
      "label": [
        "Companies worked at"
      ],
      "points": [
        {
          "start": 1749,
          "end": 1754,
          "text": "Oracle"
        }
      ]
    },

3) How can I change this code so that it can read my annotated data?

import json

# `lines` is assumed to hold the export file's lines, one JSON object per line.
training_data = []
for line in lines:
    data = json.loads(line)
    text = data['text']
    entities = []
    for annotation in data['annotation']:
        # only a single point in text annotation.
        point = annotation['points'][0]
        labels = annotation['label']
        # handle both list of labels or a single label.
        if not isinstance(labels, list):
            labels = [labels]

        for label in labels:
            # dataturks indices are both inclusive [start, end] but spacy is not [start, end)
            entities.append((point['start'], point['end'] + 1, label))

    training_data.append((text, {"entities": entities}))
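The comment in the loop above refers to an off-by-one difference. The Dataturks sample stores `start: 1749, end: 1754` for the six-character string "Oracle", so `end` points at the last character, whereas Python slices and spaCy entity offsets exclude the end index. A minimal sketch of the conversion, using a made-up sentence and a hypothetical helper named `to_spacy_span`:

```python
def to_spacy_span(start, end_inclusive, label):
    """Convert an inclusive [start, end] span to a half-open [start, end) span."""
    return (start, end_inclusive + 1, label)


# Illustrative text, not taken from the original data.
text = "She worked at Oracle for five years."
start = text.index("Oracle")
end_inclusive = start + len("Oracle") - 1  # index of the final "e"

span = to_spacy_span(start, end_inclusive, "Companies worked at")
extracted = text[span[0]:span[1]]
print(span, repr(extracted))
```

Slicing the text with the converted offsets is a quick way to confirm that a span covers exactly the intended characters.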

Recommended answer

Training JSON:

[
  {
    "text": "This Labor-Contract ('CONTRACT'), effective as of May 12, 2017 (\"Effective Date\"), is made by and between Client-ABC, Inc. ('Client-ABC'), having its principal place of business at 1030 Client-ABC Street, Atlanta, GA 30318, USA and Supplier-ABC (\"Supplier\"), having a place of business at 100 Park Avenue, Miami, 10178, USA (hereinafter referred to individually as \"Party\" and collectively as \"Parties\").",
    "entities": [
      [50, 62, "EFFECTIVE_DATE"],
      [106, 116, "VENDOR_NAME"],
      [181, 203, "VENDOR_ADDRESS"],
      [205, 212, "VENDOR_CITY"],
      [214, 216, "VENDOR_STATE"],
      [217, 222, "VENDOR_POSTAL_CODE"],
      [224, 227, "VENDOR_COUNTRY"]
    ]
  },
  {second training data}
]
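For data shaped like the asker's first sample, where each `annotation` item is already a `[start, end, label]` list, a conversion to the training layout shown in this answer might look like the sketch below. The function name `convert_record`, the `end_inclusive` flag, and the sample record are hypothetical. Whether the end offsets need the `+ 1` adjustment depends on the annotation tool that produced them.

```python
import json


def convert_record(record, end_inclusive=False):
    """Turn {"text", "annotation": [[start, end, label], ...]} into
    {"text", "entities": [[start, end, label], ...]} with end-exclusive offsets."""
    entities = []
    for start, end, label in record["annotation"]:
        if end_inclusive:
            end += 1  # shift to the half-open convention spaCy expects
        entities.append([start, end, label])
    return {"text": record["text"], "entities": entities}


# Made-up record in the asker's format (one JSON object per line).
line = json.dumps({
    "text": "Nom complet: Jean Dupont",
    "annotation": [[13, 24, "Nom complet"]],
})
converted = convert_record(json.loads(line))
start, end, label = converted["entities"][0]
print(label, "->", converted["text"][start:end])
```

Collecting the converted records in a list and writing them with `json.dump` would produce a file in the same shape as the training JSON in this answer.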

Custom training code:

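import json
import random

import spacy
from spacy.gold import GoldParse  # spaCy v2.x API

n_iter = 20  # example value, not given in the original answer
training_pickel_file = "training_pickel_file.json"
with open(training_pickel_file) as f:
    TRAIN_DATA = json.load(f)

# Assumption: start from a blank English pipeline with a fresh NER component.
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Register every entity label that appears in the training data.
for annotations in TRAIN_DATA:
    for ent in annotations["entities"]:
        ner.add_label(ent[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for a in TRAIN_DATA:
            doc = nlp.make_doc(a["text"])
            gold = GoldParse(doc, entities=a["entities"])
            nlp.update([doc], [gold], drop=0.5, sgd=optimizer, losses=losses)
        print('Losses', losses)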
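Because training depends on the character offsets being exact, it can help to check them before running the loop. The sketch below is not part of the original answer. `check_entities` is a hypothetical helper that uses only plain Python to report spans that fall outside the text, start or end with whitespace, or overlap one another.

```python
def check_entities(text, entities):
    """Return a list of human-readable problems found in the entity offsets."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities, key=lambda e: (e[0], e[1])):
        if not 0 <= start < end <= len(text):
            problems.append(f"{label}: span [{start}, {end}) is out of range")
            continue
        span = text[start:end]
        if span != span.strip():
            problems.append(f"{label}: span {span!r} has surrounding whitespace")
        if start < previous_end:
            problems.append(f"{label}: span [{start}, {end}) overlaps an earlier one")
        previous_end = max(previous_end, end)
    return problems


# Shortened illustrative text, not the contract sentence from the answer.
text = "effective as of May 12, 2017, in Atlanta, GA"
good = [[16, 28, "EFFECTIVE_DATE"], [33, 40, "VENDOR_CITY"]]
bad = [[15, 28, "EFFECTIVE_DATE"], [27, 40, "VENDOR_CITY"], [40, 99, "VENDOR_STATE"]]

print(check_entities(text, good))
for problem in check_entities(text, bad):
    print(problem)
```

Running a check like this over every item in `TRAIN_DATA` before training can surface index mistakes early, which is cheaper than diagnosing them from a model that fails to learn an entity.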


07-26 13:15