Tag Scheme Conversion in NER

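Named entity recognition (NER) feeds into relation extraction, event extraction, knowledge graphs (both construction and application; construction still needs a lot of manual intervention), question answering, machine translation, and so on.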

Common entity categories: person names, place names, organization names, dates and times, proper nouns

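Academic classification:

Three major classes: entity, time, number
Seven subclasses: person name, place name, organization name, time, date, currency, percentage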

Loading the data

Data import

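import codecs


def load_sentences(path):
    # Read a file with one "token tag" pair per line and a blank line between sentences
    # All sentences in the dataset
    sentences = []
    # Tokens of the sentence currently being read
    sentence = []
    with codecs.open(path, 'r', encoding='utf8') as f:
        for line in f:
            # Strip surrounding whitespace, including the newline
            line = line.strip()
            if not line:  # A blank line marks the end of a sentence
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
            else:
                # Each line holds the token followed by its tag
                word = line.split()
                assert len(word) >= 2
                sentence.append(word)
    # Make sure the last sentence is kept when the file has no trailing blank line
    if len(sentence) > 0:
        sentences.append(sentence)
    return sentences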
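
To make the return shape concrete, here is a minimal sketch that applies the same blank-line splitting rule to an in-memory string instead of a file. `parse_sentences` is a hypothetical helper introduced only for this illustration; it is not part of the original code.

```python
def parse_sentences(text):
    """Split CoNLL-style text into sentences of [token, tag] pairs.

    Hypothetical helper for illustration: same rule as load_sentences,
    but reading from a string instead of a file.
    """
    sentences, sentence = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                sentences.append(sentence)
                sentence = []
        else:
            fields = line.split()
            assert len(fields) >= 2, "each line needs a token and a tag"
            sentence.append(fields)
    if sentence:
        # Keep the last sentence when the text has no trailing blank line.
        sentences.append(sentence)
    return sentences


sample = "莫 B-LOC\n斯 I-LOC\n科 I-LOC\n\n到 O\n"
parsed = parse_sentences(sample)
print(parsed)
```

Each sentence comes back as a list of `[token, tag]` lists with the tag in the last position, which is why the later steps read it with `w[-1]`.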

First check whether the data is in BIO format, for example:

开 O
始 O
修 O
建 O
莫 B-LOC
斯 I-LOC
科 I-LOC
到 O
圣 B-LOC
彼 I-LOC
得 I-LOC
堡 I-LOC
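
One way to read the example above: a `B-` tag opens an entity, the following `I-` tags of the same type extend it, and `O` closes it. The sketch below decodes the sample into entity spans under that reading. `bio_spans` is a hypothetical helper, not part of the original code, and it assumes the tags are already well-formed BIO.

```python
def bio_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from well-formed BIO tags.

    Hypothetical helper for illustration only.
    """
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current.append(token)
        else:
            # "O" (or a stray I- tag) ends the open entity, if any.
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append(("".join(current), current_type))
    return spans


tokens = list("开始修建莫斯科到圣彼得堡")
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O",
        "B-LOC", "I-LOC", "I-LOC", "I-LOC"]
spans = bio_spans(tokens, tags)
print(spans)
```

For the sample sentence this yields the two locations 莫斯科 and 圣彼得堡, both typed `LOC`.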

BIO validation

Validate the tags first; where they do not conform, repair them with the corresponding conversion.

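def check_bio(tags):
    # Check that the tags follow BIO (an I- tag must continue an entity of the
    # same type) and repair violations in place; return False on a malformed tag
    for i, tag in enumerate(tags):
        if tag == 'O':
            continue
        tag_list = tag.split("-")
        if len(tag_list) != 2 or tag_list[0] not in {'B', 'I'}:
            return False
        if tag_list[0] == 'B':
            continue
        elif i == 0 or tags[i-1] == 'O':
            # An I- tag at the start or right after O opens an entity: I-ORG becomes B-ORG
            tags[i] = 'B' + tag[1:]
        elif tags[i-1][1:] == tag[1:]:
            # Same entity type as the previous tag, so this is a valid continuation
            continue
        else:
            # The entity type differs from the previous tag, so start a new entity with B
            tags[i] = 'B' + tag[1:]
    return True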
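
The repair rules are easier to see on a small input. The sketch below is a non-mutating variant written only for illustration: `repair_bio` is a hypothetical name, and it returns a repaired copy (or `None` for a malformed tag) instead of editing the list in place and returning a boolean.

```python
def repair_bio(tags):
    """Return a repaired copy of `tags`, or None if a tag is malformed.

    Hypothetical, non-mutating variant of the repair rules described above.
    """
    fixed = []
    for i, tag in enumerate(tags):
        if tag == "O":
            fixed.append(tag)
            continue
        parts = tag.split("-")
        if len(parts) != 2 or parts[0] not in {"B", "I"}:
            return None
        prefix, label = parts
        if prefix == "I":
            # Look at the already repaired previous tag; treat "no previous tag" as O.
            prev = fixed[i - 1] if i > 0 else "O"
            # An I- tag must continue an entity of the same type,
            # otherwise it is promoted to B-.
            if prev == "O" or prev[2:] != label:
                prefix = "B"
        fixed.append(prefix + "-" + label)
    return fixed


repaired = repair_bio(["I-ORG", "I-ORG", "O", "I-LOC", "I-PER"])
print(repaired)
```

Here the leading `I-ORG`, the `I-LOC` after `O`, and the `I-PER` that follows a different type are each promoted to `B-`, while the second `I-ORG` is kept because it continues the entity before it.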

Converting to BIOES

Next, convert the BIO tags to BIOES.

def bio_to_bioes(tags):
    new_tags = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            new_tags.append(tag)
        elif tag.split('-')[0] == 'B':
            # Followed by an I tag: this B really opens a multi-token entity
            if i + 1 < len(tags) and tags[i+1].split('-')[0] == 'I':
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('B-', 'S-', 1))  # No I follows: single-token entity
        elif tag.split('-')[0] == 'I':
            if i + 1 < len(tags) and tags[i+1].split('-')[0] == 'I':  # Still inside the entity
                new_tags.append(tag)
            else:
                new_tags.append(tag.replace('I-', 'E-', 1))  # No I follows: last token of the entity
        else:
            raise Exception('Invalid tag')
    return new_tags
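
If predictions later need to be compared against the original BIO annotations, the inverse mapping is useful as well. The following is a minimal sketch, not part of the original code; `bioes_to_bio` is a hypothetical name. It relies on `S-` always corresponding to a `B-` and `E-` always corresponding to an `I-`.

```python
def bioes_to_bio(tags):
    """Map BIOES tags back to BIO: S- becomes B-, E- becomes I-."""
    bio_tags = []
    for tag in tags:
        if tag == "O":
            bio_tags.append(tag)
            continue
        prefix, sep, label = tag.partition("-")
        if not sep or not label:
            raise ValueError("invalid BIOES tag: %r" % tag)
        if prefix in ("B", "I"):
            # B- and I- mean the same thing in both schemes.
            bio_tags.append(tag)
        elif prefix == "S":
            bio_tags.append("B-" + label)
        elif prefix == "E":
            bio_tags.append("I-" + label)
        else:
            raise ValueError("invalid BIOES tag: %r" % tag)
    return bio_tags


bioes = ["O", "B-LOC", "I-LOC", "E-LOC", "O", "S-PER"]
bio = bioes_to_bio(bioes)
print(bio)
```

Because the extra `S` and `E` information can be recomputed from position alone, converting to BIOES and back loses nothing.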

Putting it together

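import data_utils  # assumed to be the module that holds check_bio and bio_to_bioes


def update_tag_scheme(sentences, tag_scheme):
    # Validate every sentence and rewrite its tags in the target scheme, in place
    for i, s in enumerate(sentences):
        tags = [w[-1] for w in s]  # The tag is the last field of each token
        if not data_utils.check_bio(tags):
            s_str = "\n".join(" ".join(w) for w in s)
            raise Exception("Input sentences must be BIO-encoded; please check sentence %i:\n%s" % (i, s_str))

        if tag_scheme == "BIO":
            # check_bio may have repaired the tags, so write them back
            for word, new_tag in zip(s, tags):
                word[-1] = new_tag
        elif tag_scheme == 'BIOES':
            new_tags = data_utils.bio_to_bioes(tags)
            for word, new_tag in zip(s, new_tags):  # Write the converted tags back
                word[-1] = new_tag
        else:
            raise Exception("Unknown target tag scheme")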
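
The choice of `tag_scheme` also changes the label set a model has to predict. As a rough illustration, the sketch below enumerates the tags for a given list of entity types. `tag_inventory` is a hypothetical helper and the entity types are just sample values.

```python
def tag_inventory(entity_types, tag_scheme):
    """List every tag the scheme can produce for the given entity types.

    Hypothetical helper for illustration only.
    """
    prefixes = {"BIO": "BI", "BIOES": "BIES"}
    if tag_scheme not in prefixes:
        raise ValueError("unknown tag scheme: %r" % tag_scheme)
    tags = ["O"]
    for entity_type in entity_types:
        for prefix in prefixes[tag_scheme]:
            tags.append(prefix + "-" + entity_type)
    return tags


types = ["LOC", "PER", "ORG"]
bio_inventory = tag_inventory(types, "BIO")
bioes_inventory = tag_inventory(types, "BIOES")
print(len(bio_inventory), len(bioes_inventory))
```

With n entity types, BIO has 2n + 1 tags and BIOES has 4n + 1, so three types give 7 and 13 tags respectively.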
