HiPaR (Hierarchical Interpretable Pattern-aided Regression)
HiPaR is a pattern-based method for regression on tabular data. Given a dataset, HiPaR outputs a set of hybrid rules of the form p => y = f(X) that predict a target variable y. Here, p is a conjunctive pattern that characterizes a region of the dataset (e.g., property-type='house' and surface > 50), and f(X) is a linear function over the numerical features of the dataset.
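For instance, a hybrid rule learned on housing data could look as follows (an illustrative, made-up rule; the feature names, coefficients, and exact output format depend on the dataset and the fitted local regressors):
property-type = 'house' and surface > 50 => price = 25000 + 1200 * surface + 4500 * n_rooms
The pattern on the left-hand side selects a subset of the rows, and the linear function on the right-hand side is fitted only on that subset.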
Install HiPaR
A simple way to install HiPaR is to run the following command:
$ pip install hipar
This command installs the latest published version. If you want the latest development version from the repository instead, run:
$ pip install git+https://gitlab.inria.fr/lgalarra/hipar.git
How to use HiPaR
HiPaR's code is still in alpha status; nevertheless, it can be used without major issues.
from hipar import HIPAR
from data import get_simple_housing

## Configure HiPaR's enumeration: min_support sets the minimum support of a rule's pattern,
## and interclass_variance_percentile_threshold controls the pruning of refinement conditions
hipar = HIPAR(min_support=2, interclass_variance_percentile_threshold=0)

## Load the example housing dataset shipped with the code
X, y = get_simple_housing()

## Mine the hybrid rules and fit their local linear regressors
hipar.fit(X, y)

## Get all rules found during the enumeration phase
print(hipar.all_rules)

## Get the rules selected by HiPaR (used for prediction)
print(hipar.get_selected_rules())

## Predict the target variable for unseen data
X_test = ...
print(hipar.predict(X_test))
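Because HiPaR exposes the usual fit/predict interface shown above, it can be evaluated like a scikit-learn-style regressor. The following is only a sketch: it assumes predict returns one numeric prediction per row of X_test, and the split ratio and error metric are illustrative choices.
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from hipar import HIPAR
from data import get_simple_housing

X, y = get_simple_housing()
## Hold out 20% of the rows to measure generalization error
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

hipar = HIPAR(min_support=2, interclass_variance_percentile_threshold=0)
hipar.fit(X_train, y_train)

## Mean absolute error of the rule-based predictions on the held-out rows
print(mean_absolute_error(y_test, hipar.predict(X_test)))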
Experimental Results
The first implementation of HiPaR, including all the experimental evaluation and data, is available here.
Differences with the published version
- The interclass variance threshold is calculated over the entire set of refinement conditions, not over the set of discretized refinement conditions.
- We do not check whether a new rule is better than all its parents, only whether it is better than its generating parent. This just sends more rules to the selection phase, but it makes the code simpler (I am not confident in the previous implementation of this feature).
Improvements w.r.t. the published version
- Support for multiple metrics in the enumeration phase: a new rule is compared against its parent on all the metrics provided as input to the constructor (see the sketch below).
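A minimal sketch of this acceptance check, assuming each rule carries one quality score per metric; the Rule class, the improves_over_parent helper, and the metric names are hypothetical and only illustrate the multi-metric comparison against the generating parent.
from dataclasses import dataclass, field
from typing import Dict, List

## Hypothetical rule representation: a conjunctive pattern plus the quality
## scores of its local linear model, one score per metric name
@dataclass
class Rule:
    pattern: List[str]
    scores: Dict[str, float] = field(default_factory=dict)

def improves_over_parent(child: Rule, parent: Rule, metrics: List[str]) -> bool:
    """Keep a refined rule only if it beats its generating parent on every
    metric passed to the constructor (higher scores assumed to be better)."""
    return all(child.scores[m] > parent.scores[m] for m in metrics)

## Usage: a child rule refined from its generating parent
parent = Rule(pattern=["property-type='house'"], scores={"r2": 0.61, "neg_mae": -12.3})
child = Rule(pattern=["property-type='house'", "surface>50"], scores={"r2": 0.68, "neg_mae": -9.8})
print(improves_over_parent(child, parent, metrics=["r2", "neg_mae"]))  # True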
Roadmap
- Consider alternative discretization approaches for the numerical variables in the conditions. [Research in progress, error-based discretization already implemented]
- Implement feature selection when the number of input features is large, in order to improve efficiency
- Normalize the data for the local linear regressors: this would allow us to compare the coefficients of the different input features, and hence their relative importance, for the sake of interpretability
- Consider other quality criteria to prune during the enumeration such as the p-values of the linear coefficients.
- If we need to compare against all the HiPaR-based hybrid methods published in the paper, we will have to reimplement them.
- Use a proper library for logging
Publications
- Olivier Gauriau, Luis Galárraga, François Brun, Alexandre Termier, Loïc Davadan, François Joudelat. Comparing Machine-Learning Models of Different Levels of Complexity for Crop Protection: A Look into the Complexity-Accuracy Trade-off. Smart Agricultural Technology, 2024. Full text: https://doi.org/10.1016/j.atech.2023.100380
- Luis Galárraga, Olivier Pelgrin, Alexandre Termier. HiPaR: Hierarchical Pattern-aided Regression. Full paper at the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2021), Delhi. Technical report: https://arxiv.org/abs/2102.12370