Demo: Binary Classification

Create an initial dataset for a binary classification problem with 20 features:

  • 6 of the features are informative

  • 4 of the features are linear combinations of the informative features

  • 10 of the features are just random noise

[1]:
# Suppress all warnings (mainly to silence LightGBM messages)
import warnings
warnings.filterwarnings("ignore")

# Import necessary libraries for the demo
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from shapfire import ShapFire, RefitHelper, plot_roc_curve

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=6,
    n_redundant=4,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X = pd.DataFrame(data=X, columns=[f"feature{i}" for i in range(X.shape[1])])
X
[1]:
feature0 feature1 feature2 feature3 feature4 feature5 feature6 feature7 feature8 feature9 feature10 feature11 feature12 feature13 feature14 feature15 feature16 feature17 feature18 feature19
0 -0.056717 -0.493931 1.667702 1.397865 -0.794842 -0.682968 -0.511378 -0.992701 -0.248595 -1.676577 1.206354 2.321515 -1.423698 -0.433072 0.682042 -1.125577 -0.593864 -0.024779 1.735459 -1.113934
1 1.012914 -0.696091 0.414157 0.567214 0.729235 2.892775 -0.016728 -1.475205 3.680440 -2.095504 -0.676781 1.576879 -2.227987 1.638134 0.015417 -0.507376 1.178729 2.676799 0.344251 0.454819
2 1.690440 -1.236785 -0.507020 3.027518 -2.315026 2.067620 0.273264 1.017970 1.103050 -5.071645 -1.156792 -2.066161 -1.825515 -1.288961 0.324183 0.229208 -0.033722 0.727694 0.745335 -0.315728
3 2.035054 -1.154556 -0.012416 1.361509 -0.931541 2.933542 -0.734587 -0.335309 2.853751 -4.326141 -0.296359 1.215873 -0.278985 -1.448467 0.580039 0.156749 0.639724 0.029043 0.822496 1.151368
4 1.024961 -0.366287 2.079224 0.739798 -0.486162 0.621330 -1.945298 -1.429361 1.423769 -3.266301 2.028758 -2.972793 0.003815 -0.904622 -0.139921 -0.550930 -1.697206 0.102097 -1.056104 0.146621
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 1.410788 0.250078 0.366037 1.168022 -0.056665 -0.251516 -0.505207 0.175348 0.103474 -1.987223 -0.384352 -1.023449 0.664366 0.745393 0.758317 1.200208 -0.147041 0.158603 -1.219479 -1.394935
996 -0.613218 1.423613 1.527973 -0.245896 -2.291340 0.481899 -2.839232 1.633723 -0.621871 -3.940202 0.300710 -0.307716 -0.349721 0.175795 0.228521 0.212923 0.732526 -0.404249 -0.448734 0.305272
997 -0.231221 1.140357 -2.173634 -1.429259 -1.838736 -0.198766 -0.784890 3.750921 -2.509571 -0.128341 -0.031324 -0.850717 0.361888 -0.409298 0.709181 1.040780 0.269000 0.130427 -0.856227 0.278540
998 -3.281446 -1.185517 1.473731 0.117330 -1.795964 1.482811 0.437157 -1.201846 0.767806 -0.192885 -1.765901 -1.326832 -1.287806 -0.353538 0.996278 2.046064 -0.360873 -0.720141 0.355115 -0.995815
999 0.384327 0.535252 -0.050257 0.491075 0.507009 2.607129 0.183923 0.083624 2.764692 -2.048920 -0.513324 -0.085608 1.195993 -0.924152 -0.812521 -0.147024 0.184254 0.145577 0.432219 -0.270619

1000 rows × 20 columns
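Because `shuffle=False` is passed, scikit-learn keeps the generated columns in a fixed order: the informative features first (feature0 to feature5), then the redundant ones (feature6 to feature9), then the noise features (feature10 to feature19). The sketch below is one way to check this, under the assumption that the default `shift` and `scale` are used: each column is regressed on the six informative columns and the size of the leftover residual is reported. The helper `residual_norm` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same generator settings as in the demo cell above
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=6,
    n_redundant=4,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)

# With shuffle=False the first six columns are the informative features
informative = X[:, :6]


def residual_norm(column):
    """Norm of what remains of `column` after a least-squares fit on the informative columns."""
    coef, *_ = np.linalg.lstsq(informative, column, rcond=None)
    return float(np.linalg.norm(column - informative @ coef))


# Columns 6-9 should be (numerically) exact linear combinations,
# columns 10-19 should not be
redundant_residuals = [residual_norm(X[:, j]) for j in range(6, 10)]
noise_residuals = [residual_norm(X[:, j]) for j in range(10, 20)]

print("redundant:", np.round(redundant_residuals, 6))
print("noise:    ", np.round(noise_residuals, 2))
```

Residuals close to zero for the redundant columns and clearly non-zero for the noise columns confirm the layout that the rest of the demo relies on.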

Dataset Split

Split the initial dataset into:

  • A pre-processing dataset that we use for feature ranking and selection

  • A final dataset which we use to obtain ROC AUC performance estimates using the selected features

[2]:
X_preprocess, X_final, y_preprocess, y_final = train_test_split(X, y, test_size=0.50, random_state=0)
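As a quick sanity check on the split, the following sketch rebuilds the same two halves and confirms that they share no rows, then prints the class counts in each half. It assumes nothing beyond scikit-learn and pandas. The split above is not stratified, so the class proportions of the two halves can differ slightly; passing `stratify=y` to `train_test_split` is one option if equal proportions matter.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Recreate the demo dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=6,
    n_redundant=4,
    n_repeated=0,
    n_classes=2,
    random_state=0,
    shuffle=False,
)
X = pd.DataFrame(data=X, columns=[f"feature{i}" for i in range(X.shape[1])])

# Same split as in the demo
X_preprocess, X_final, y_preprocess, y_final = train_test_split(
    X, y, test_size=0.50, random_state=0
)

# The DataFrame index still holds the original row numbers,
# so an empty intersection means the halves are disjoint
shared_rows = X_preprocess.index.intersection(X_final.index)

# Number of samples per class in each half
counts_preprocess = np.bincount(y_preprocess)
counts_final = np.bincount(y_final)

print("rows shared by both halves:", len(shared_rows))
print("class counts (pre-processing):", counts_preprocess)
print("class counts (final):", counts_final)
```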

Feature Importance Ranking & Selection

[3]:
# Prepare input parameters for the ShapFire method
estimator_class = LGBMClassifier
estimator_params = {"objective": "binary"}
scoring = "roc_auc"
n_splits = 2
n_repeats = 5
sf = ShapFire(
    estimator_class=estimator_class,
    scoring=scoring,
    estimator_params=estimator_params,
    n_splits=n_splits,
    n_repeats=n_repeats,
)

# Perform feature importance ranking and selection
_ = sf.fit(X=X_preprocess, y=y_preprocess)

# Print the selected features
# Ideally, only the first 10 features (feature0 to feature9, which are
# informative or redundant) should appear among the selected features
for feature in sf.selected_features:
    print("Selected feature: ", feature)
Clustering progress 100%|=========================| 6.00/6.00 [00:00<00:00, 2.41kit/s]
ShapFire progress   100%|=========================| 20.0/20.0 [00:01<00:00, 13.0it/s]
Selected feature:  feature1
Selected feature:  feature7
Selected feature:  feature4
Selected feature:  feature0
Selected feature:  feature3
Selected feature:  feature9
Selected feature:  feature6
Selected feature:  feature8
Selected feature:  feature5
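The printed selection can be compared against the known structure of the dataset. The sketch below hard-codes the feature names from the output above and treats feature0 to feature9 as the relevant set, which holds here because the data was generated with `shuffle=False`. The variable names are introduced only for this illustration.

```python
# Feature names copied from the output of the cell above
selected = [
    "feature1", "feature7", "feature4", "feature0", "feature3",
    "feature9", "feature6", "feature8", "feature5",
]

# Informative (0-5) and redundant (6-9) features carry signal
relevant = {f"feature{i}" for i in range(10)}

# Relevant features that were not selected
missed = sorted(relevant - set(selected))
# Selected features that are pure noise
false_positives = sorted(set(selected) - relevant)

print("missed relevant features:", missed)             # ['feature2']
print("selected noise features:", false_positives)     # []
```

In this run, none of the ten noise features was selected, and one informative feature (feature2) was left out.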

Feature Importance Plot

For an easier overview of the ranked and selected features, we can plot them with the ShapFire method plot_ranking.

[4]:
# Plot the selected features per cluster
fig, ax = sf.plot_ranking()
[Figure: feature importance ranking, grouped by cluster]
[5]:
# Plot the selected features per feature
fig, ax = sf.plot_ranking(groupby="feature")
[Figure: feature importance ranking, grouped by feature]

Selected Features: ROC Curves

To get an unbiased estimate of the performance of a model that uses the previously selected features, we perform repeated \(k\)-fold cross-validation on the final dataset, the half of the initial dataset that was not used for feature selection.

[6]:
refit_helper_selected_features = RefitHelper(
    n_splits=n_splits,
    n_repeats=n_repeats,
    feature_names=sf.selected_features,
    estimator_class=estimator_class,
    estimator_params=estimator_params,
    scoring=scoring,
)
refit_helper_selected_features.fit(X_final, pd.Series(y_final))

_, _ = plot_roc_curve(df=refit_helper_selected_features.history)
[Figure: ROC curves for the model trained on the selected features]
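For readers who want to see what such an evaluation involves, here is a minimal sketch of repeated \(k\)-fold cross-validation with one ROC curve per fold, written with plain scikit-learn. It is not the RefitHelper implementation: `RepeatedStratifiedKFold` (stratification is an assumption) and `GradientBoostingClassifier` (a stand-in for LGBMClassifier so that the example only needs scikit-learn) are choices made for this illustration, and the selected feature names are copied from the ShapFire output above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import RepeatedStratifiedKFold, train_test_split

# Recreate the demo dataset and keep only the final half
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, n_redundant=4,
    n_repeated=0, n_classes=2, random_state=0, shuffle=False,
)
X = pd.DataFrame(data=X, columns=[f"feature{i}" for i in range(X.shape[1])])
_, X_final, _, y_final = train_test_split(X, y, test_size=0.50, random_state=0)

# Features printed by the ShapFire run above
selected = [f"feature{i}" for i in (1, 7, 4, 0, 3, 9, 6, 8, 5)]
X_sel = X_final[selected].to_numpy()

# 2 folds repeated 5 times gives 10 train/test splits
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)

fold_aucs, fold_curves = [], []
for train_idx, test_idx in cv.split(X_sel, y_final):
    # Stand-in estimator (assumption); the demo itself uses LGBMClassifier
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_sel[train_idx], y_final[train_idx])

    # Predicted probability of the positive class on the held-out fold
    scores = model.predict_proba(X_sel[test_idx])[:, 1]

    fpr, tpr, _ = roc_curve(y_final[test_idx], scores)
    fold_curves.append((fpr, tpr))
    fold_aucs.append(roc_auc_score(y_final[test_idx], scores))

print(
    f"ROC AUC: {np.mean(fold_aucs):.3f} +/- {np.std(fold_aucs):.3f} "
    f"over {len(fold_aucs)} folds"
)
```

Each `(fpr, tpr)` pair in `fold_curves` can be passed to a plotting library to draw one curve per fold, and the spread of `fold_aucs` indicates how sensitive the estimate is to the particular split.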

All Features: ROC Curves

Similarly, to get an unbiased estimate of the performance of a baseline model trained on all 20 initial features, we perform the same repeated \(k\)-fold cross-validation on the final dataset.

[7]:
refit_helper_baseline_features = RefitHelper(
    n_splits=n_splits,
    n_repeats=n_repeats,
    feature_names=X.columns.to_list(),
    estimator_class=estimator_class,
    estimator_params=estimator_params,
    scoring=scoring,
)
refit_helper_baseline_features.fit(X_final, pd.Series(y_final))

_, _ = plot_roc_curve(df=refit_helper_baseline_features.history)
[Figure: ROC curves for the baseline model trained on all features]
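One way to put the two evaluations side by side is to score both feature sets on identical cross-validation splits and look at the per-fold difference. The sketch below does this with scikit-learn's `cross_val_score`. As assumptions for this illustration, it uses `GradientBoostingClassifier` as a stand-in for LGBMClassifier, uses stratified folds, and takes the selected feature names from the ShapFire output above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

# Recreate the demo dataset and keep only the final half
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6, n_redundant=4,
    n_repeated=0, n_classes=2, random_state=0, shuffle=False,
)
X = pd.DataFrame(data=X, columns=[f"feature{i}" for i in range(X.shape[1])])
_, X_final, _, y_final = train_test_split(X, y, test_size=0.50, random_state=0)

# Features printed by the ShapFire run above
selected = [f"feature{i}" for i in (1, 7, 4, 0, 3, 9, 6, 8, 5)]

# A fixed integer random_state makes the splitter yield the same
# 10 splits on every call, so the two score arrays are paired fold by fold
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)

# Stand-in estimator (assumption); the demo itself uses LGBMClassifier
model = GradientBoostingClassifier(random_state=0)

auc_selected = cross_val_score(
    model, X_final[selected], y_final, scoring="roc_auc", cv=cv
)
auc_all = cross_val_score(model, X_final, y_final, scoring="roc_auc", cv=cv)

# Positive values mean the selected features scored higher on that fold
diff = auc_selected - auc_all

print(f"selected features: {auc_selected.mean():.3f}")
print(f"all features:      {auc_all.mean():.3f}")
print(f"mean difference:   {diff.mean():+.3f}")
```

With only ten folds the difference is a rough indication, not a significance test. The goal is typically to check that dropping the unselected features does not hurt the ROC AUC.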

Last update: Jun 12, 2022