Commit 793ffdc6 authored by Lucas Terriel 🐍

new notebook serialization + upload existing notebook

parent e99b856a
This source diff could not be displayed because it is too large.
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook 2 - Pipeline CSV/TSV 2 IOB\n",
"## Pipeline CSV/TSV 2 IOB\n",
"\n",
"### Description : \n",
"\n",
......
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sérialisation d'un corpus de texte en un set d'entraînement et un set d'évaluation\n",
"\n",
"### Description : \n",
"\n",
"*Ce notebook présente deux scripts reproductibles pour sérialiser un corpus de texte pour de l'entrainement et de l'évaluation NER/NED.*\n",
"\n",
"#### Auteur : Lucas Terriel / INRIA-ALMANACH\n",
"#### Date de dernière modification : 02/01/2021\n",
"#### Version de Python : 3.X\n",
"\n",
"#### SOMMAIRE : \n",
"\n",
"- Étape préliminaire : récupérer la liste des fichiers \n",
"\n",
"\n",
"##### Méthode longue\n",
"\n",
"- Étape 1 : Mélanger les données (*randomize data*)\n",
"- Étape 2 : Découper (*split*) le dataset en un set d'entraînement et un set d'évaluation\n",
"- Étape 3 : Récupérer les set de données dans deux dossiers distincs\n",
"\n",
"##### Methode courte (sklearn)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# import packages\n",
"from os import (path, mkdir, listdir)\n",
"import random\n",
"from math import floor\n",
"import shutil\n",
"from decimal import Decimal\n",
"\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Étape préliminaire : récupérer la liste des fichiers**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[DATASET] : ['FRAN_IR_050516_to_rawtext.fr.txt', 'FRAN_IR_055325_to_rawtext.fr.txt', 'FRAN_IR_054129_to_rawtext.fr.txt', 'FRAN_IR_001488_to_rawtext.fr.txt', 'FRAN_IR_058836_to_rawtext.fr.txt', 'FRAN_IR_001631_to_rawtext.fr.txt', 'FRAN_IR_058341_to_rawtext.fr.txt', 'FRAN_IR_053754_to_rawtext.fr.txt', 'FRAN_IR_001454_to_rawtext.fr.txt', 'FRAN_IR_000061_to_rawtext.fr.txt', 'FRAN_IR_041253_to_rawtext.fr.txt', 'FRAN_IR_050370_to_rawtext.fr.txt', 'FRAN_IR_054605_to_rawtext.fr.txt', 'FRAN_IR_058292_to_rawtext.fr.txt', 'FRAN_IR_000242_to_rawtext.fr.txt', 'FRAN_IR_057246_to_rawtext.fr.txt', 'FRAN_IR_050185_to_rawtext.fr.txt']\n"
]
}
],
"source": [
"# Let's change the parameters here\n",
"input_dir = './in_notebook/exemple_serialisation/'\n",
"extension_file = '.txt'\n",
"\n",
"def get_file_list_from_dir(basedir, extension):\n",
" all_files = listdir(path.abspath(basedir))\n",
" data_files = list(filter(lambda file: file.endswith(extension), all_files))\n",
" return data_files\n",
"\n",
"data_to_serialize = get_file_list_from_dir(input_dir, extension_file)\n",
"\n",
"print(f'[DATASET] : {data_to_serialize}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Méthode longue*\n",
"\n",
"**Étape 1 : Mélanger les données (*randomize data*)**\n",
"\n",
"Il s'agit ici de mélanger le dataset initial afin d'obtenir des données très hétérogènes et non prévisible pour la pipeline d'entraînement et d'évaluation."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['FRAN_IR_041253_to_rawtext.fr.txt', 'FRAN_IR_050370_to_rawtext.fr.txt', 'FRAN_IR_050516_to_rawtext.fr.txt', 'FRAN_IR_055325_to_rawtext.fr.txt', 'FRAN_IR_054129_to_rawtext.fr.txt', 'FRAN_IR_001488_to_rawtext.fr.txt', 'FRAN_IR_001454_to_rawtext.fr.txt', 'FRAN_IR_053754_to_rawtext.fr.txt', 'FRAN_IR_050185_to_rawtext.fr.txt', 'FRAN_IR_058292_to_rawtext.fr.txt', 'FRAN_IR_000061_to_rawtext.fr.txt', 'FRAN_IR_001631_to_rawtext.fr.txt', 'FRAN_IR_000242_to_rawtext.fr.txt', 'FRAN_IR_057246_to_rawtext.fr.txt', 'FRAN_IR_058836_to_rawtext.fr.txt', 'FRAN_IR_054605_to_rawtext.fr.txt', 'FRAN_IR_058341_to_rawtext.fr.txt']\n"
]
}
],
"source": [
"def randomize_files(dataset):\n",
" shuffled = random.sample(dataset, len(dataset))\n",
" return shuffled\n",
"\n",
"random_set = randomize_files(data_to_serialize)\n",
"\n",
"print(random_set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Étape 2 : Découper (*split*) le dataset en un set d'entraînement et un set d'évaluation**\n",
"\n",
"On défini un ratio personnalisé selon les choix d'entraînement, c'est-à-dire la proportion de données qui iront dans le set d'entraînement et le set d'évaluation.\n",
"\n",
"par exemple, un ratio réglé 80:20 équivaut à récupérer 80% de données d'entraînement et 20% de données de test sur le dataset complet."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"[TRAIN SET :] ['FRAN_IR_041253_to_rawtext.fr.txt', 'FRAN_IR_050370_to_rawtext.fr.txt', 'FRAN_IR_050516_to_rawtext.fr.txt', 'FRAN_IR_055325_to_rawtext.fr.txt', 'FRAN_IR_054129_to_rawtext.fr.txt', 'FRAN_IR_001488_to_rawtext.fr.txt', 'FRAN_IR_001454_to_rawtext.fr.txt', 'FRAN_IR_053754_to_rawtext.fr.txt', 'FRAN_IR_050185_to_rawtext.fr.txt', 'FRAN_IR_058292_to_rawtext.fr.txt', 'FRAN_IR_000061_to_rawtext.fr.txt', 'FRAN_IR_001631_to_rawtext.fr.txt', 'FRAN_IR_000242_to_rawtext.fr.txt']\n",
"\n",
"[SIZE :] 13 files / ratio : 0.7647058823529411\n",
"\n",
"-------------------------\n",
"\n",
"[TEST SET :] ['FRAN_IR_057246_to_rawtext.fr.txt', 'FRAN_IR_058836_to_rawtext.fr.txt', 'FRAN_IR_054605_to_rawtext.fr.txt', 'FRAN_IR_058341_to_rawtext.fr.txt']\n",
"\n",
"[SIZE :] 4 files / ratio : 0.23529411764705882\n",
"\n",
"\n"
]
}
],
"source": [
"def get_training_and_testing_sets(file_list, train_size):\n",
" train_size = 0.8\n",
" split_index = floor(len(file_list) * train_size)\n",
" training = file_list[:split_index]\n",
" testing = file_list[split_index:]\n",
" return training, testing\n",
"\n",
"# Set the ratio of your output data with train size parameter\n",
"train_size = 0.8\n",
"\n",
"train_set, test_set = get_training_and_testing_sets(random_set, train_size)\n",
"\n",
"\n",
"print(f\"\"\"\n",
"\n",
"[TRAIN SET :] {train_set}\n",
"\n",
"[SIZE :] {len(train_set)} files / ratio : {len(train_set)/len(data_to_serialize)}\n",
"\n",
"-------------------------\n",
"\n",
"[TEST SET :] {test_set}\n",
"\n",
"[SIZE :] {len(test_set)} files / ratio : {len(test_set)/len(data_to_serialize)}\n",
"\n",
"\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Étape 3 : Récupérer les set de données dans deux dossiers distincs**\n",
"\n",
"Attention : si vous relancez le notebook, la fonction random est réactivé et cela peut ajouter des fichiers dans vos dossiers de sortie"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"output_dir = \"./out_notebook/\"\n",
"\n",
"# Create new directories for output, if not exists\n",
"if not path.isdir(f\"{output_dir}/data_train/\"):\n",
" mkdir(f\"{output_dir}/data_train/\")\n",
"if not path.isdir(f\"{output_dir}/data_test/\"):\n",
" mkdir(f\"{output_dir}/data_test/\")\n",
" \n",
"def copy_files_to_dir(list_files, input_dir, destination_dir):\n",
" for file in list_files:\n",
" shutil.copyfile(f'{input_dir}/{file}', f'{destination_dir}/{file}')\n",
"\n",
"# copy train set\n",
"copy_files_to_dir(train_set, input_dir, f'{output_dir}/data_train/')\n",
"# copy test set\n",
"copy_files_to_dir(test_set, input_dir, f'{output_dir}/data_test/')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Méthode courte*\n",
"\n",
"La méthode courte consiste à utiliser la fonction du *package* scikit-learn `train_test_split` qui reprend l'ensemble des étapes ci-dessous, ormis la copie des fichiers dans les dossiers de sortie."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"[TRAIN SET :] ['FRAN_IR_000242_to_rawtext.fr.txt', 'FRAN_IR_057246_to_rawtext.fr.txt', 'FRAN_IR_058836_to_rawtext.fr.txt', 'FRAN_IR_054129_to_rawtext.fr.txt', 'FRAN_IR_041253_to_rawtext.fr.txt', 'FRAN_IR_001631_to_rawtext.fr.txt', 'FRAN_IR_050185_to_rawtext.fr.txt', 'FRAN_IR_001454_to_rawtext.fr.txt', 'FRAN_IR_054605_to_rawtext.fr.txt', 'FRAN_IR_001488_to_rawtext.fr.txt', 'FRAN_IR_058292_to_rawtext.fr.txt', 'FRAN_IR_053754_to_rawtext.fr.txt', 'FRAN_IR_058341_to_rawtext.fr.txt']\n",
"\n",
"[SIZE :] 13 files / ratio : 0.7647058823529411\n",
"\n",
"-------------------------\n",
"\n",
"[TEST SET :] ['FRAN_IR_050516_to_rawtext.fr.txt', 'FRAN_IR_000061_to_rawtext.fr.txt', 'FRAN_IR_050370_to_rawtext.fr.txt', 'FRAN_IR_055325_to_rawtext.fr.txt']\n",
"\n",
"[SIZE :] 4 files / ratio : 0.23529411764705882\n",
"\n",
"\n"
]
}
],
"source": [
"# Set test_size parameter for split dataset with your own ratio\n",
"train, test = train_test_split(data_to_serialize, test_size = 0.2)\n",
"\n",
"print(f\"\"\"\n",
"\n",
"[TRAIN SET :] {train}\n",
"\n",
"[SIZE :] {len(train)} files / ratio : {len(train)/len(data_to_serialize)}\n",
"\n",
"-------------------------\n",
"\n",
"[TEST SET :] {test}\n",
"\n",
"[SIZE :] {len(test)} files / ratio : {len(test)/len(data_to_serialize)}\n",
"\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook 3 - Entrainer un modèle NER avec Spacy\n",
"## Entrainer un modèle NER avec Spacy\n",
"\n",
"### Description : \n",
"\n",
......
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook 2 - Pipeline CSV/TSV 2 IOB\n",
"## Pipeline CSV/TSV 2 IOB\n",
"\n",
"### Description : \n",
"\n",
......
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook 3 - Entrainer un modèle NER avec Spacy\n",
"## Entrainer un modèle NER avec Spacy\n",
"\n",
"### Description : \n",
"\n",
......
@@ -12,5 +12,6 @@ dependencies:
- lxml
- nltk
- beautifulsoup4
+  - scikit-learn
- spacy
- fr_core_news_md
\ No newline at end of file