`shapfire.shapfire`¶

This module contains the main implementation of the ShapFire method for feature ranking and selection.

shapfire.shapfire.DEFAULT_REPEATS : int = 1¶: The number of times, in a cross-validation, the division of a dataset into a certain number of folds should be repeated.

shapfire.shapfire.DEFAULT_SPLITS : int = 2¶: The default number of folds a dataset should be divided into in a cross-validation.

class shapfire.ShapFire(estimator_class: LGBMClassifier | LGBMRegressor | RandomForestClassifier | RandomForestRegressor, scoring: str, estimator_params: None | dict[str, Any] = None, n_splits: int = DEFAULT_SPLITS, n_repeats: int = DEFAULT_REPEATS, random_seed: int = utils.DEFAULT_RANDOM_SEED, iterations: None | int = None, reference: str = 'mean', n_samples: int = 1000, n_batches: int = 250)[source]¶

The main class used for applying SHAP feature importance rank ensembling for feature selection.

Parameters¶:

estimator_class: LGBMClassifier | LGBMRegressor | RandomForestClassifier | RandomForestRegressor¶: The scikit-learn or Microsoft LightGBM tree-based estimator to use. The estimator can either be a classifier or a regressor.
scoring: str¶: The specification of a scoring function to use for model-evaluation, i.e., a function that can be used for assessing the prediction error of a trained model given a test set.
estimator_params: None | dict[str, Any] = None¶: The estimator hyperparameters and corresponding values to search or directly use. If only a single value for each hyperparameter is provided then only cross-validation will be performed and no hyperparameter search will be performed. Defaults to None.
n_splits: int = DEFAULT_SPLITS¶: The number of folds to generate in the outer loop of a nested cross-validation. Defaults to shapfire.shapfire.DEFAULT_SPLITS.
n_repeats: int = DEFAULT_REPEATS¶: The number of new folds that should be generated in the outer loop of a nested cross-validation. Defaults to shapfire.shapfire.DEFAULT_REPEATS.
random_seed: int = utils.DEFAULT_RANDOM_SEED¶: The random seed to use for reproducibility purposes. Defaults to shapfire.utils.DEFAULT_RANDOM_SEED.
iterations: None | int = None¶: The number of feature subsets to sample and subsequently use for model-training such that SHAP feature importance values can be extracted. Defaults to None which in turns sets the number of iterations to the size of the largest cluster of highly associated features found.
reference: str = 'mean'¶: The data fusion method to use for producing a reference vector. Defaults to “mean”.
n_samples: int = 1000¶: The number of random samples of rank permutations to use in a batch. Several batches of random samples are used to estimate a “ranking distribution” which in turn is used to determine a feature importance cut-off threshold. Defaults to 1000.
n_batches: int = 250¶: The number of batches of random samples that should be used to estimate a “ranking distribution” which in turn is used to determine a feature importance cut-off threshold. Defaults to 250.

ranked_differences¶: A class attribute and pandas dataframe that specifies the final importance values associated with each of the features in the given input dataset.

selected_features¶: A ShapFire class attribute and list that specifies the final feature subset selected by ShapFire and which is expected to achieve the best possible model performance.

fit(X: ndarray | DataFrame, y: ndarray | DataFrame) → ShapFire[source]¶

Perform SHAP feature importance rank ensembling for the purpose of ranking and selecting the features that can be said to be the most important for a certain prediction task at hand.

Parameters¶:

X: ndarray | DataFrame¶: An input dataset that the ShapFire method should be applied to. The dataset is assumed to contain features (columns) and corresponding observations (rows).
y: ndarray | DataFrame¶: The samples associated with the target variable of the dataset.

Returns¶:

A ShapFire object containing the necessary data associated with the most important features of the given input dataset.

transform(X: ndarray | DataFrame) → ndarray | DataFrame[source]¶

Reduce the input dataset X containing features (columns) and corresponding observations (rows), to only the columns of features selected by ShapFire.

Parameters¶:

X: ndarray | DataFrame¶: The original input dataset that the ShapFire method was applied to. The dataset is assumed to contain features (columns) and corresponding observations (rows).

Raises¶:

ValueError – If the fit() method has not yet been called.

Returns¶:

A reduced dataset that only contains the most important features (columns).

fit_transform(X: ndarray | DataFrame, y: ndarray | DataFrame) → ndarray | DataFrame[source]¶

Perform SHAP feature importance rank ensembling for the purpose of selecting the features that are the most important. Subsequently, reduce the input dataset ‘X’ to only the columns of the selected features.

Parameters¶:

X: ndarray | DataFrame¶: An input dataset that the ShapFire method should be applied to. The dataset is assumed to contain features (columns) and corresponding observations (rows).
y: ndarray | DataFrame¶: The samples associated with the target variable of the dataset.

Returns¶:

A reduced dataset that only contains the data associated with the most important features.

plot_ranking(groupby: str = 'cluster', rcParams: None | dict[str, str] = None, figsize: None | tuple[float, float] = None, fontsize: int = 10, with_text: bool = True, with_overlay: bool = True, ax: None | Axes = None) → tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes][source]¶

Plot the feature importance scores associated with each feature. The features will be ordered in the figure from best to worst and possibly according to which cluster they each belong to.

Parameters¶:

groupby: str = 'cluster'¶: A string value indicating how the feature importance ranking should be displayed in a figure. If the option ‘cluster’ is chosen, then the features are grouped and shown in the figure based on their assigned cluster and according to the importance rank of the best feautre in the cluster. If ‘feature’ is chosen, then the features are shown in the figure purely according to their global rank without any consideration to what cluster each features are a part of.
figsize: None | tuple[float, float] = None¶: The width and height of the figure in inches. Defaults to None.
fontsize: int = 10¶: The size of the font present in the figure. Defaults to 10.
with_text: bool = True¶: If input argument groupby is set to ‘cluster’, then with_text determines whether features that have been grouped in the figure by the cluster they each belong to, should also be annotated with a text label. Defaults to True.
with_overlay: bool = True¶: Depending on whether groupby is set to ‘cluster’ or ‘feature’, groups of features or individual features are assigned a gray-scale overlay creating a visual grouping / delimitation of features. Defaults to True.
ax: None | Axes = None¶: A Matplotlib Axes object. Defaults to None.

Returns¶:

A Matplotlib Figure and Axes object.

Last update: Jun 12, 2022

shapfire.shapfire¶

`shapfire.shapfire`¶