Commit daac19c5 authored by BERNIER Fabien's avatar BERNIER Fabien

[+] README additional information

parent b060521d
FixOut addresses fairness issues of ML models based on decision outcomes, and shows how the simple idea of “feature dropout” followed by an “ensemble approach” can improve model fairness.
## Description
Originally, FixOut was conceived to tackle process fairness of ML models based on decision outcomes (see LimeOut [1]). To that end, it uses an explanation method to assess a model's reliance on salient or sensitive features, integrated into a human-centered workflow: given a classifier M, a dataset D, a set F of sensitive features and an explanation method of choice, FixOut outputs a competitive classifier M' that improves in process fairness as well as in other fairness metrics.

Classifiers available:
* Logistic Regression
* Random Forest
* Bagging
* AdaBoost
* Multilayer Perceptron
* Gaussian Mixture
* Gradient Boosting

Explainers available:
* LIME
* SHAP
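
For intuition, the "feature dropout" idea mentioned above can be sketched as follows. This is a minimal illustration only, not FixOut's actual implementation or API: the `dropout_ensemble` and `ensemble_predict_proba` helpers, the scikit-learn base classifier and the plain probability averaging are assumptions made for the example, and a purely numeric NumPy feature matrix is assumed.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def dropout_ensemble(X_train, y_train, sensitive_features, base_model=None):
    """Train one model per dropped sensitive feature, plus one with all of them dropped."""
    base_model = base_model if base_model is not None else LogisticRegression(max_iter=1000)
    subsets = [[f] for f in sensitive_features] + [list(sensitive_features)]
    models = []
    for dropped in subsets:
        # keep every column except the dropped sensitive feature(s)
        keep = [j for j in range(X_train.shape[1]) if j not in dropped]
        models.append((clone(base_model).fit(X_train[:, keep], y_train), keep))
    return models

def ensemble_predict_proba(models, X):
    """Average the class probabilities of the component models."""
    return np.mean([m.predict_proba(X[:, keep]) for m, keep in models], axis=0)
```

In FixOut itself this kind of construction is provided by `EnsembleOutTabular` and `EnsembleOutText` (see the usage examples below), while the explanation step is used to check the model's reliance on the sensitive features before and after.
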
## Command-line example

FixOut can be run from the command line, for instance:

`python runner.py --data german.data --trainsize 0.8 --algo mlp --max_features 10 --cat_features 0 2 3 5 6 8 9 11 13 14 16 18 19 --drop 8 18 19 --exp anchors`

## Installation
FixOut works on Python >= **3.6**.
There is no proper installer at the moment, since the module is under construction.
If you are on Linux, first install the `swig` package. For Debian-based distributions:
```shell
$ sudo apt install swig
```
For all operating systems, install the requirements:
```shell
$ pip install -r requirements.txt
```
## Dependencies

* Scikit-learn >= 0.20.3
* numpy 1.16.4
* pandas 0.24.2
* scipy 1.3.0
* seaborn 0.9.0

## Usage example
### For tabular data
For a more complete example, see [examples/experimenter.py](examples/experimenter.py).
```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from fixout.lime_tabular_global import TabularExplainer
from fixout.core_tabular import EnsembleOutTabular

# ct (the column preprocessor), features and the train/test split
# are built from the dataset as in examples/experimenter.py

# train the original model
lr = LogisticRegression()
model = make_pipeline(ct, lr)
model.fit(X_train, y_train)
print("Original score:", model.score(X_test, y_test))

# explain the original model
explainer_original = TabularExplainer(model.predict_proba, X_train,
                                      categorical_features=[1, 3, 5, 6, 7, 8, 9, 13])  # categorical feature indexes
explainer_original.global_explanation(n_samples=500)

# print the explanation
for i, contrib in explainer_original.get_top_k(k=10):
    print(features[i], '\t', contrib)

# make an ensemble
ensemble = EnsembleOutTabular(lr, ct, sensitive_features=(5, 8, 9))  # features whose contribution we want to lower
ensemble.fit(X_train, y_train)
print("Ensemble score:", ensemble.score(X_test, y_test))

# explain the ensemble
explainer_ensemble = TabularExplainer(ensemble.predict_proba, X_train,
                                      categorical_features=[1, 3, 5, 6, 7, 8, 9, 13])
explainer_ensemble.global_explanation(n_samples=500)
for i, contrib in explainer_ensemble.get_top_k(k=10):
    print(features[i], '\t', contrib)
```
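
To check that the ensemble actually reduced the reliance on the sensitive features, their contributions in the two global explanations can be compared. This is a small sketch reusing the objects above; it assumes `get_top_k` returns (feature index, contribution) pairs, as in the loops above, and that `k` can be set up to the number of features.

```python
sensitive = (5, 8, 9)
orig = dict(explainer_original.get_top_k(k=len(features)))
ens = dict(explainer_ensemble.get_top_k(k=len(features)))
for i in sensitive:
    # a lower contribution in the ensemble column indicates reduced reliance
    print(features[i], "original:", orig.get(i), "ensemble:", ens.get(i))
```
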
### For textual data
For a more complete example, see [examples/test_text.py](examples/test_text.py).
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from fixout.core_text import EnsembleOutText
from fixout.lime_text_global import TextExplainer

# X_train / X_test are lists of raw text documents, y_train / y_test their labels
# (see examples/test_text.py)

# creating a model pipeline and training it
vectorizer = TfidfVectorizer(lowercase=True)
lr = LogisticRegression()
model = make_pipeline(vectorizer, lr)
model.fit(X_train, y_train)

# evaluating our model
print("Accuracy:", model.score(X_test, y_test))

# explaining our model
explainer = TextExplainer(model.predict_proba)
explainer.global_explanation(X_test, n_samples=500)
for word, contrib in explainer.get_top_k(k=10):
    print(word, '\t', contrib)

# correcting fairness if necessary
ensemble = EnsembleOutText(model, sensitive_words=[["host", "symposium"], ["desks", "edu"]])
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))

# explaining the ensemble model
ensemble_explainer = TextExplainer(ensemble.predict_proba)
ensemble_explainer.global_explanation(X_test, n_samples=250)
for word, contrib in ensemble_explainer.get_top_k(k=10):
    print(word, '\t', contrib)
```
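
The same kind of check works for the sensitive words. This sketch reuses the objects above and assumes `get_top_k` returns (word, contribution) pairs, as in the loops above; a word may simply be absent from the top 10.

```python
sensitive = ["host", "symposium", "desks", "edu"]
orig = dict(explainer.get_top_k(k=10))
ens = dict(ensemble_explainer.get_top_k(k=10))
for w in sensitive:
    print(w, "original:", orig.get(w, "not in top 10"), "ensemble:", ens.get(w, "not in top 10"))
```
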
## References
[1] Vaishnavi Bhargava, Miguel Couceiro, Amedeo Napoli. LimeOut: An Ensemble Approach To Improve Process Fairness. XKDD Workshop 2020. ⟨hal-02864059v2⟩
[2] Guilherme Alves, Vaishnavi Bhargava, Miguel Couceiro, Amedeo Napoli. Making ML models fairer through explanations: the case of LimeOut. AIST 2020. ⟨hal-02864059v5⟩

examples/experimenter.py

import sys; sys.path.extend(['..'])  # make the fixout package importable from examples/

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from fixout.lime_tabular_global import TabularExplainer
from fixout.core_tabular import EnsembleOutTabular
from fixout.utils import columns_preprocessers, transform_categorical_names

# load the data and convert it to numpy (adjust the path to your local copy of the dataset)
data = pd.read_csv("/home/fabien/Documents/Orpa/fixout-pkdd/datasets/adult2.data")
features = data.columns[:-1]
X = data.drop(columns="Target").to_numpy()
y = data["Target"].to_numpy()

# preprocess the data with respect to their type (categorical/numerical)
categorical = [1, 3, 5, 6, 7, 8, 9, 13]
categories = transform_categorical_names(X, categorical, feature_names=features)
ct = columns_preprocessers(X, categorical,
                           categorical_preprocesser=OneHotEncoder())

# split the data and train a model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LogisticRegression()
model = make_pipeline(ct, lr)
model.fit(X_train, y_train)
print("Original score:", model.score(X_test, y_test))

# explain the original model
explainer_original = TabularExplainer(model.predict_proba, X_train, categorical_features=categorical)
explainer_original.global_explanation(n_samples=200)
for i, contrib in explainer_original.get_top_k(k=10):
    print(features[i], '\t', contrib)

# make an ensemble that lowers the contribution of the sensitive features
ensemble = EnsembleOutTabular(lr, ct, sensitive_features=(5, 8, 9))
ensemble.fit(X_train, y_train)
print("Ensemble score:", ensemble.score(X_test, y_test))

# explain the ensemble
explainer_ensemble = TabularExplainer(ensemble.predict_proba, X_train, categorical_features=categorical)
explainer_ensemble.global_explanation(n_samples=200)
for i, contrib in explainer_ensemble.get_top_k(k=10):
    print(features[i], '\t', contrib)

requirements.txt (excerpt):

scikit-learn~=0.23.2
scipy~=1.4.1
matplotlib~=3.3.2
nltk~=3.4.5
fairmodels~=0.1.3
lime
shap