Commit 843c314b authored by Thomas Kleinbauer

Initial commit.

parent fc915477
This is the Weakly Supervised Learning Library for Text Processing of
the EU H2020 Project "COMPRISE". Please consult the file
"acknowledgment.pdf"
The code in this repository is one half of the initial version of this
software library, the other half addressing weakly supervised learning
for speech-to-text processing.
A detailed description of the library can be found in the official
project deliverable "D4.2 Initial weakly supervised learning library",
available at https://www.compriseh2020.eu
Prerequisites
=============
Before you can run this example, you have to make sure that you have
successfully installed all required dependencies. Please follow the
installation instructions in ../README.txt for this.
How to run
==========
All of the following commands need to be run from inside the 'example'
directory.
1) Create noisy annotations for the training data

In order to create noisy annotations based on part-of-speech tags and
taking the previous NER label into account, run the following command:

    python3 ../tools/sample_noise_prev_pos.py data/train_clean.tsv data/train_noisy.tsv

Only 1% of the clean data is used to estimate the probability
distributions from which to sample the noisy annotations.
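
The sampling script itself is not shown here, so the following is only
a minimal sketch of one plausible reading of the description above:
estimate P(label | POS tag, previous label) from a small clean sample,
then draw one noisy label per token from that distribution. The
function names, the fallback to "O" for unseen contexts and the toy
sentences are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def estimate_distributions(sample):
    """Estimate P(label | pos, previous label) from (word, pos, label) sentences."""
    counts = defaultdict(Counter)
    for sentence in sample:
        prev = "O"  # assumed label context at the start of a sentence
        for _word, pos, label in sentence:
            counts[(pos, prev)][label] += 1
            prev = label
    dists = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        dists[context] = {label: n / total for label, n in counter.items()}
    return dists


def sample_noisy_labels(sentence, dists, rng, fallback="O"):
    """Draw one label per token, conditioning on the previously drawn label."""
    noisy, prev = [], "O"
    for _word, pos, _label in sentence:
        dist = dists.get((pos, prev))
        if dist is None:
            label = fallback  # assumption: unseen contexts fall back to "O"
        else:
            labels, weights = zip(*sorted(dist.items()))
            label = rng.choices(labels, weights=weights, k=1)[0]
        noisy.append(label)
        prev = label
    return noisy


# Made-up, pre-tagged toy data; the first sentence stands in for the clean sample.
clean = [
    [("Dorothy", "NNP", "PER"), ("walked", "VBD", "O"),
     ("to", "TO", "O"), ("Glasgow", "NNP", "LOC")],
    [("William", "NNP", "PER"), ("rested", "VBD", "O")],
]
dists = estimate_distributions(clean[:1])
rng = random.Random(12)  # fixed seed so the sampling is repeatable
noisy = sample_noisy_labels(clean[1], dists, rng)
print(noisy)
```

With such a small sample, a proper noun following "O" is labelled PER
or LOC with equal probability, which shows how label noise can enter
the training data.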
2) Convert .tsv files to .pickle files

All data files, which are provided in TAB-separated-values format
(.tsv), need to be converted to the .pickle format expected by the
classifier. This needs to be done only once and will speed up the
run-time of the classifier.

    python3 ../tools/create_pickle.py verbmobil io data/train_clean.tsv data/train_clean.pickle
    python3 ../tools/create_pickle.py verbmobil io data/train_noisy.tsv data/train_noisy.pickle
    python3 ../tools/create_pickle.py verbmobil bio data/valid.tsv data/valid.pickle
    python3 ../tools/create_pickle.py verbmobil bio data/test.tsv data/test.pickle

Note that the training data is expected to be in IO format while
validation and test set must be in BIO format. This is reflected in
the different command line parameters passed in the first two vs. the
last two script invocations.
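
To illustrate the difference between the two label formats, here is a
minimal sketch that maps BIO labels to IO labels by dropping the "B-"
and "I-" prefixes. The helper name and the example labels are made up
for illustration.

```python
def bio_to_io(label):
    """Strip a leading 'B-' or 'I-' prefix; leave all other labels untouched."""
    if len(label) > 2 and label[:2] in ("B-", "I-"):
        return label[2:]
    return label


bio = ["B-PER", "I-PER", "O", "B-LOC", "B-LOC"]
io = [bio_to_io(label) for label in bio]
print(io)  # ['PER', 'PER', 'O', 'LOC', 'LOC']
```

Note that the two adjacent LOC entities can no longer be told apart
after the conversion. IO labels carry no entity boundaries, whereas
BIO labels do.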
3) Run the classifier

In order to train and evaluate a model on the data created in the
previous steps, run the following command:

    python3 ../main/ner.py --config-dir=. example
4) Inspect the result

The example data present a hard problem for the classifier: the label
distribution is extremely skewed, with the O-tag dominating all other
labels dramatically. Moreover, we're estimating the noisy labels
using only one percent of the clean data.

Therefore, we observe only a modest result at the end of the training,
with f-scores in the mid-20% region. Also, it takes a high number of
epochs until any training effect is visible on the validation data at
all, as can be witnessed in the regular print-outs during the training
phase.
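
The skew of a label distribution can be checked with a few lines of
Python. The sketch below uses made-up toy labels, not the example
data, and the function name is an assumption.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


labels = ["O"] * 18 + ["PER", "LOC"]  # made-up toy data
dist = label_distribution(labels)
for label, share in dist.items():
    print("{:>4}: {:.0%}".format(label, share))
```

On such data a model that always predicts "O" reaches 90% token
accuracy without finding a single entity, which is why the f-score is
the more informative number to watch.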
The data found in this directory consists of three files:
- train.tsv
- valid.tsv
- test.tsv
The contents are English sentences annotated with the following Named
Entity labels:
- PER (Persons)
- ORG (Organizations)
- LOC (Locations)
- DATE (Dates)
- TIME (Temporal expressions)
- O (Everything else)
The sentences are encoded with one word per line, followed by a TAB
character ('\t'), followed by one of the six labels listed above.
Sentence boundaries are marked by empty lines.
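
A minimal sketch of a reader for this format is given below. The
function name and the sample sentences are made up for illustration.
The library's own loading code is not shown here and may differ.

```python
import io


def read_sentences(lines):
    """Group 'word<TAB>label' lines into sentences split at empty lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        word, label = line.split("\t")
        current.append((word, label))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in the format described above.
sample = "We\tO\nleft\tO\nKeswick\tLOC\n\nWilliam\tPER\nslept\tO\n"
sentences = read_sentences(io.StringIO(sample))
print(len(sentences))   # 2
print(sentences[0][2])  # ('Keswick', 'LOC')
```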
The sentences themselves are an excerpt from
Dorothy Wordsworth: "Recollections of a tour made in Scotland A.D. 1803",
J. C. Shairp (Ed.), 1874
The three files listed above constitute a training / development /
test split containing 54305 / 15544 / 7759 words respectively, which
corresponds roughly to a 70 / 20 / 10 percent split of the total
number of words.
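
The proportions can be verified from the word counts given above:

```python
sizes = {"train": 54305, "valid": 15544, "test": 7759}
total = sum(sizes.values())
shares = {name: 100.0 * n / total for name, n in sizes.items()}

print(total)  # 77608
for name, share in shares.items():
    print("{}: {:.2f}%".format(name, share))
# train: 69.97%
# valid: 20.03%
# test: 10.00%
```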
{
"NAME": "example",
"PATH_TRAIN_CLEAN": "./data/train_clean.pickle",
"PATH_TRAIN_NOISY": "./data/train_noisy.pickle",
"PATH_DEV": "./data/valid.pickle",
"PATH_TEST": "./data/test.pickle",
"DATA_SEPARATOR": " ",
"WORD_EMBEDDING": "../data/fasttext/cc.en.300.bin",
"LABEL_FORMAT": "io",
"CONTEXT_LENGTH": 3,
"LSTM_SIZE": 300,
"DENSE_SIZE": 100,
"DENSE_ACTIVATION": "relu",
"BATCH_SIZE": 100,
"EPOCHS": 50,
"USE_CLEAN": true,
"USE_NOISY": true,
"NOISE_METHOD": "channel",
"USE_IDENTITY_MATRIX": false,
"USE_WORD_CLUSTER": "none",
"PATH_WORD_CLUSTER": "none",
"NUM_WORD_CLUSTER": 0,
"WORD_CLUSTER_SELECTION": 1.0,
"WORD_CLUSTER_INTERPOLATION": 0.0,
"SAMPLE_SEED": 12,
"TRAINING_SEED": 34,
"SAMPLE_PCT_CLEAN": 0.01,
"SAMPLE_PCT_NOISY": 1.00,
"CLEANING_DENSE_SIZE": 0,
"NUM_WORKERS": 0,
"REPORT_INTERVAL": 1
}
COMPRISE Weakly Supervised NER
==============================
The code in this directory represents the initial version of the
COMPRISE weakly-supervised learning for text processing library. It
focuses on a single task, Named Entity Recognition.
Installation
------------
How to run
----------
# Copyright 2020 Saarland University, Spoken Language Systems LSV
# Authors: Lukas Lange, Michael A. Hedderich, Dietrich Klakow, Thomas Kleinbauer
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS*, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
#
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import torch
from torch.utils import data


class CoNLLDataset(data.Dataset):
    def __init__(self, instances, onehot=False):
        self.instances = instances

        embs = np.asarray([ins.embedding for ins in instances])
        labels = np.asarray([ins.label_emb for ins in instances])
        self.words = [ins.word for ins in instances]

        # convert to torch
        self.embs = torch.from_numpy(embs).float()
        if onehot:
            self.labels = torch.from_numpy(labels).float()
        else:
            self.labels = torch.from_numpy(labels).argmax(dim=1)

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, index):
        emb = self.embs[index]
        label = self.labels[index]
        word = self.words[index]
        return emb, label, word

# A more memory-efficient (but slower?) implementation of the above class
# class CoNLLDataset(data.Dataset):
#     def __init__(self, instances, onehot=False):
#         self.instances = instances
#         self.onehot = onehot
#
#     def __len__(self):
#         return len(self.instances)
#
#     def __getitem__(self, index):
#         ins = self.instances[index]
#         emb = torch.from_numpy(ins.embedding).float()
#         label = torch.from_numpy(ins.label_emb)
#         if self.onehot:
#             label = label.float()
#         else:
#             label = label.argmax()
#         return emb, label, ins.word
# Based on the script from https://github.com/spyysalo/conlleval.py/blob/master/conlleval.py
#
# with the following license
#
# The MIT License (MIT)
#
# Copyright (c) 2016 Sampo Pyysalo
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# Small changes were made to the original version, e.g. replacing the
# sys.argv format with function calls.
# Python version of the evaluation script from CoNLL'00-
# Intentional differences:
# - accept any space as delimiter by default
# - optional file argument (default STDIN)
# - option to set boundary (-b argument)
# - LaTeX output (-l argument) not supported
# - raw tags (-r argument) not supported
import sys
import re
from collections import defaultdict, namedtuple

ANY_SPACE = '<SPACE>'


class FormatError(Exception):
    pass


Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore')


class EvalCounts(object):
    def __init__(self):
        self.correct_chunk = 0    # number of correctly identified chunks
        self.correct_tags = 0     # number of correct chunk tags
        self.found_correct = 0    # number of chunks in corpus
        self.found_guessed = 0    # number of identified chunks
        self.token_counter = 0    # token counter (ignores sentence breaks)

        # counts by type
        self.t_correct_chunk = defaultdict(int)
        self.t_found_correct = defaultdict(int)
        self.t_found_guessed = defaultdict(int)


def parse_tag(t):
    m = re.match(r'^([^-]*)-(.*)$', t)
    return m.groups() if m else (t, '')


def evaluate(iterable, tmp_options=None):
    # Fill in defaults for every option the caller did not provide.
    tmp_options = tmp_options or {}
    options = {'delimiter': tmp_options.get('delimiter', ANY_SPACE),
               'boundary': tmp_options.get('boundary', '-X-'),
               'otag': tmp_options.get('otag', 'O')}

    counts = EvalCounts()
    num_features = None       # number of features per line
    in_correct = False        # currently processed chunk is correct until now
    last_correct = 'O'        # previous chunk tag in corpus
    last_correct_type = ''    # type of previous chunk tag in corpus
    last_guessed = 'O'        # previously identified chunk tag
    last_guessed_type = ''    # type of previously identified chunk tag

    for line in iterable:
        line = line.rstrip('\r\n')

        # Skip empty and very short lines (this includes sentence breaks).
        if len(line) < 3:
            continue

        if options['delimiter'] == ANY_SPACE:
            features = line.split()
        else:
            features = line.split(options['delimiter'])

        if num_features is None:
            num_features = len(features)
        elif num_features != len(features) and len(features) != 0:
            raise FormatError('unexpected number of features: %d (%d)' %
                              (len(features), num_features))

        if len(features) == 0 or features[0] == options['boundary']:
            features = [options['boundary'], 'O', 'O']
        if len(features) < 3:
            raise FormatError('unexpected number of features in line %s' % line)

        guessed, guessed_type = parse_tag(features.pop())
        correct, correct_type = parse_tag(features.pop())
        first_item = features.pop(0)

        if first_item == options['boundary']:
            guessed = 'O'

        end_correct = end_of_chunk(last_correct, correct,
                                   last_correct_type, correct_type)
        end_guessed = end_of_chunk(last_guessed, guessed,
                                   last_guessed_type, guessed_type)
        start_correct = start_of_chunk(last_correct, correct,
                                       last_correct_type, correct_type)
        start_guessed = start_of_chunk(last_guessed, guessed,
                                       last_guessed_type, guessed_type)

        if in_correct:
            if (end_correct and end_guessed and
                    last_guessed_type == last_correct_type):
                in_correct = False
                counts.correct_chunk += 1
                counts.t_correct_chunk[last_correct_type] += 1
            elif (end_correct != end_guessed or guessed_type != correct_type):
                in_correct = False

        if start_correct and start_guessed and guessed_type == correct_type:
            in_correct = True

        if start_correct:
            counts.found_correct += 1
            counts.t_found_correct[correct_type] += 1
        if start_guessed:
            counts.found_guessed += 1
            counts.t_found_guessed[guessed_type] += 1
        if first_item != options['boundary']:
            if correct == guessed and guessed_type == correct_type:
                counts.correct_tags += 1
            counts.token_counter += 1

        last_guessed = guessed
        last_correct = correct
        last_guessed_type = guessed_type
        last_correct_type = correct_type

    # Count a chunk that is still open (and correct) at the end of the input.
    if in_correct:
        counts.correct_chunk += 1
        counts.t_correct_chunk[last_correct_type] += 1

    return counts


def uniq(iterable):
    seen = set()
    return [i for i in iterable if not (i in seen or seen.add(i))]


def calculate_metrics(correct, guessed, total):
    tp, fp, fn = correct, guessed-correct, total-correct
    p = 0 if tp + fp == 0 else 1.*tp / (tp + fp)
    r = 0 if tp + fn == 0 else 1.*tp / (tp + fn)
    f = 0 if p + r == 0 else 2 * p * r / (p + r)
    return Metrics(tp, fp, fn, p, r, f)


def metrics(counts):
    c = counts
    overall = calculate_metrics(
        c.correct_chunk, c.found_guessed, c.found_correct
    )
    by_type = {}
    for t in uniq(list(c.t_found_correct.keys()) + list(c.t_found_guessed.keys())):
        by_type[t] = calculate_metrics(
            c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t]
        )
    return overall, by_type


def report(counts, out=None):
    # Note: the 'out' argument is ignored; the report is built up and
    # returned as a string instead of being written to a stream.
    out = ''

    overall, by_type = metrics(counts)

    c = counts
    out += ('processed %d tokens with %d phrases; ' %
            (c.token_counter, c.found_correct))
    out += ('found: %d phrases; correct: %d.\n' %
            (c.found_guessed, c.correct_chunk))

    if c.token_counter > 0:
        out += ('accuracy: %6.2f%%; ' %
                (100.*c.correct_tags/c.token_counter))
        out += ('precision: %6.2f%%; ' % (100.*overall.prec))
        out += ('recall: %6.2f%%; ' % (100.*overall.rec))
        out += ('FB1: %6.2f\n' % (100.*overall.fscore))

    for i, m in sorted(by_type.items()):
        out += ('%17s: ' % i)
        out += ('precision: %6.2f%%; ' % (100.*m.prec))
        out += ('recall: %6.2f%%; ' % (100.*m.rec))
        out += ('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i]))

    return out


def end_of_chunk(prev_tag, tag, prev_type, type_):
    # check if a chunk ended between the previous and current word
    # arguments: previous and current chunk tags, previous and current types
    chunk_end = False

    if prev_tag == 'E': chunk_end = True
    if prev_tag == 'S': chunk_end = True

    if prev_tag == 'B' and tag == 'B': chunk_end = True
    if prev_tag == 'B' and tag == 'S': chunk_end = True
    if prev_tag == 'B' and tag == 'O': chunk_end = True
    if prev_tag == 'I' and tag == 'B': chunk_end = True
    if prev_tag == 'I' and tag == 'S': chunk_end = True
    if prev_tag == 'I' and tag == 'O': chunk_end = True

    if prev_tag != 'O' and prev_tag != '.' and prev_type != type_:
        chunk_end = True

    # these chunks are assumed to have length 1
    if prev_tag == ']': chunk_end = True
    if prev_tag == '[': chunk_end = True

    return chunk_end


def start_of_chunk(prev_tag, tag, prev_type, type_):
    # check if a chunk started between the previous and current word
    # arguments: previous and current chunk tags, previous and current types
    chunk_start = False

    if tag == 'B': chunk_start = True
    if tag == 'S': chunk_start = True

    if prev_tag == 'E' and tag == 'E': chunk_start = True
    if prev_tag == 'E' and tag == 'I': chunk_start = True
    if prev_tag == 'S' and tag == 'E': chunk_start = True
    if prev_tag == 'S' and tag == 'I': chunk_start = True
    if prev_tag == 'O' and tag == 'E': chunk_start = True
    if prev_tag == 'O' and tag == 'I': chunk_start = True

    if tag != 'O' and tag != '.' and prev_type != type_:
        chunk_start = True

    # these chunks are assumed to have length 1
    if tag == '[': chunk_start = True
    if tag == ']': chunk_start = True

    return chunk_start
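
The chunk-level scores reported by this script follow the usual
precision / recall / F1 definitions. As a worked example, the sketch
below recomputes them for made-up counts. The function name is
hypothetical and the code is a standalone restatement, not an import
of the module above.

```python
def precision_recall_f1(correct, guessed, total):
    """correct: matching chunks, guessed: predicted chunks, total: gold chunks."""
    tp, fp, fn = correct, guessed - correct, total - correct
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up counts: 4 chunks predicted, 6 chunks in the gold data, 3 of them match.
p, r, f = precision_recall_f1(correct=3, guessed=4, total=6)
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(p, r, f))
# precision=0.75 recall=0.50 f1=0.60
```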
# Copyright 2020 Saarland University, Spoken Language Systems LSV
# Authors: Lukas Lange, Michael A. Hedderich, Dietrich Klakow, Thomas Kleinbauer
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS*, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
#
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
import json
import os
import sys


class ExperimentalSettings:
    """
    Mimics a dictionary to hold the settings of an experiment.
    Loads the settings from a JSON config file. The settings can
    not be changed to ensure that they are consistent.

    Use:
        SETTINGS = ExperimentalSettings.load_json("experiment01")
        a = SETTINGS["IMPORTANT_HYPERPARAMETER"]

    The JSON file must contain one dictionary {...}. The dictionary
    must at least contain the value "NAME" which must be
    identical to the filename ("NAME.json").
    """

    def __init__(self, name):
        """
        name: Name of the file that stores the configuration
              once finalize() is called (.config is added).
        """
        self.name = name
        # Start empty so that the dictionary-style methods below also
        # work before load_json() has filled in the settings.
        self.settings = {}

    def __getitem__(self, key):
        return self.settings[key]

    def __setitem__(self, key, value):
        # Existing keys are read-only; only new keys may be added.
        if key in self.settings:
            raise Exception("ExperimentalSettings object can not be changed.")
        self.settings[key] = value

    def __contains__(self, key):
        return key in self.settings

    @staticmethod
    def load_json(name, override_values=None, dir_path="../config"):
        override_values = override_values or {}

        # Strip trailing slashes from the directory path.
        while dir_path and dir_path[-1] == '/':
            dir_path = dir_path[:-1]

        try:
            with open(os.path.join(dir_path, name + ".json"), 'r') as f:
                file_content = f.read()
            settings = json.loads(file_content)

            # Only keys that already exist in the file can be overridden.
            for key, value in override_values.items():
                if key in settings:
                    settings[key] = value

            if settings["NAME"] != name:
                raise ValueError("Name in json is specified as {} ".format(settings['NAME']) +
                                 "while the name is loaded from a file called {}".format(name))

            new_settings_object = ExperimentalSettings(name)
            new_settings_object.settings = settings
            return new_settings_object
        except FileNotFoundError:
            print("File not found: {}/{}.json".format(dir_path, name), file=sys.stderr)
            print("(Did you specify the correct load directory?)", file=sys.stderr)
            sys.exit(1)

    def __repr__(self):
        return self.settings.__repr__()

    def __str__(self):
        return self.settings.__str__()
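
The rule that the "NAME" entry must match the file name can be tried
out in isolation. The sketch below is a standalone stand-in written
for illustration: `load_settings` is a hypothetical helper that mirrors
only the name check, and the two JSON files are created in a temporary
directory just for the demonstration.

```python
import json
import os
import tempfile


def load_settings(name, dir_path):
    """Load <dir_path>/<name>.json and verify that its NAME entry equals name."""
    with open(os.path.join(dir_path, name + ".json"), "r") as f:
        settings = json.load(f)
    if settings["NAME"] != name:
        raise ValueError("NAME is {!r} but the file is called {!r}".format(
            settings["NAME"], name))
    return settings


with tempfile.TemporaryDirectory() as tmp:
    # A consistent config: the file name and NAME agree.
    with open(os.path.join(tmp, "example.json"), "w") as f:
        json.dump({"NAME": "example", "EPOCHS": 50}, f)
    # An inconsistent config: the file is called 'other' but NAME says 'example'.
    with open(os.path.join(tmp, "other.json"), "w") as f:
        json.dump({"NAME": "example"}, f)

    settings = load_settings("example", tmp)
    try:
        load_settings("other", tmp)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True

print(settings["EPOCHS"], mismatch_detected)
```

Tying the name to the file name makes it harder to accidentally run an
experiment with a copied config file that was never renamed inside.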
# Copyright 2020 Saarland University, Spoken Language Systems LSV
# Authors: Lukas Lange, Michael A. Hedderich, Dietrich Klakow, Thomas Kleinbauer
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS*, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
#
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
import sys

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("USAGE: python3 {} <infile.tsv> <outfile.tsv>".format(sys.argv[0]))
        sys.exit(1)

    with open(sys.argv[1], 'r') as infile:
        with open(sys.argv[2], 'w') as outfile:
            for line in infile:
                split = line.split('\t')
                if len(split) == 2:
                    # Drop the trailing newline so that the rewritten line
                    # does not end in two newlines (which would look like
                    # a sentence boundary).
                    label = split[1].rstrip('\r\n')
                    if len(label) > 2 and (label[:2] == 'B-' or label[:2] == 'I-'):
                        # BIO -> IO: strip the 'B-' / 'I-' prefix
                        outfile.write("{}\t{}\n".format(split[0], label[2:]))
                    else:
                        outfile.write(line)
                else:
                    outfile.write(line)
# Copyright 2020 Saarland University, Spoken Language Systems LSV
# Authors: Lukas Lange, Michael A. Hedderich, Dietrich Klakow, Thomas Kleinbauer
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS*, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
#
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
import sys
sys.path.append('../main')
from functools import reduce
from utils import create_pickle_data
from ner_datacode import LabelRepresentation, WordEmbedding
import os
import tempfile
LABELS = { 'verbmobil': ['O', 'PER', 'ORG', 'LOC', 'DATE', 'TIME'] }
PATH_TO_FASTTEXT = '/data/corpora/fasttext/cc.en.300.bin'