Dirty categories: learning with non-normalized strings

Including strings that represent categories often calls for significant data preparation. In particular, categories may appear with many morphological variants when they have been manually input or assembled from diverse sources.

Including such a column in a learning pipeline as a standard categorical column leads to categories with very high cardinality and loses the information on which categories are similar.
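To make this concrete, here is a minimal sketch (with hypothetical job titles) of how one-hot encoding discards the similarity between near-duplicate categories:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Two morphological variants of the same job family:
titles = pd.DataFrame({'title': ['Police Officer II', 'Police Officer III']})
OneHotEncoder(sparse_output=False).fit_transform(titles)
# -> [[1., 0.], [0., 1.]]: orthogonal vectors; the shared substring
#    'Police Officer' carries no weight in the encoding.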

Here we look at a dataset on wages [1] in which the column “Employee Position Title” contains dirty categories.

We compare different categorical encodings for the dirty column to predict the Current Annual Salary, using gradient boosted trees. For this purpose, we use the skrub library (https://skrub-data.org).

The data

Data Importing and preprocessing

We first download the dataset:

from skrub.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
print(employee_salaries.description)
Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.

Then we load it:

import pandas as pd
df = employee_salaries.X.copy()
df
gender department department_name division assignment_category employee_position_title date_first_hired year_first_hired
0 F POL Department of Police MSB Information Mgmt and Tech Division Records... Fulltime-Regular Office Services Coordinator 09/22/1986 1986
1 M POL Department of Police ISB Major Crimes Division Fugitive Section Fulltime-Regular Master Police Officer 09/12/1988 1988
2 F HHS Department of Health and Human Services Adult Protective and Case Management Services Fulltime-Regular Social Worker IV 11/19/1989 1989
3 M COR Correction and Rehabilitation PRRS Facility and Security Fulltime-Regular Resident Supervisor II 05/05/2014 2014
4 M HCA Department of Housing and Community Affairs Affordable Housing Programs Fulltime-Regular Planning Specialist III 03/05/2007 2007
... ... ... ... ... ... ... ... ...
9223 F HHS Department of Health and Human Services School Based Health Centers Fulltime-Regular Community Health Nurse II 11/03/2015 2015
9224 F FRS Fire and Rescue Services Human Resources Division Fulltime-Regular Fire/Rescue Division Chief 11/28/1988 1988
9225 M HHS Department of Health and Human Services Child and Adolescent Mental Health Clinic Serv... Parttime-Regular Medical Doctor IV - Psychiatrist 04/30/2001 2001
9226 M CCL County Council Council Central Staff Fulltime-Regular Manager II 09/05/2006 2006
9227 M DLC Department of Liquor Control Licensure, Regulation and Education Fulltime-Regular Alcohol/Tobacco Enforcement Specialist II 01/30/2012 2012

9228 rows × 8 columns



Finally, we recover the target:

y = employee_salaries.y

A simple default as a learner

The function tabular_learner() is a simple way of creating a default learner for tabular data:

from skrub import tabular_learner
model = tabular_learner("regressor")

We can quickly compute its cross-validation score using the corresponding scikit-learn utility:

from sklearn.model_selection import cross_validate
import numpy as np

results = cross_validate(model, df, y)
print(f"Prediction score: {np.mean(results['test_score'])}")
print(f"Training time: {np.mean(results['fit_time'])}")
Prediction score: 0.9108276644460975
Training time: 0.44112162590026854

Under the hood, model is a pipeline:

model
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(high_cardinality=MinHashEncoder(),
                                 low_cardinality=ToCategorical())),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(categorical_features='from_dtype'))])


We can see that it is made of a TableVectorizer followed by a HistGradientBoostingRegressor.

Understanding the vectorizer + learner pipeline

The number one difficulty is that our input is a complex and heterogeneous dataframe:

df
(This displays the same dataframe as above: 9228 rows × 8 columns.)



The TableVectorizer is a transformer that turns this dataframe into a form suited for machine learning.

Feeding its output to a powerful learner, such as gradient boosted trees, gives a machine-learning method that can be readily applied to the dataframe.

from skrub import TableVectorizer

Assembling the pipeline

We use the TableVectorizer with a HistGradientBoostingRegressor, which is a good predictor for data with heterogeneous columns:

from sklearn.ensemble import HistGradientBoostingRegressor

We then create a pipeline chaining our encoders to a learner:

from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    TableVectorizer(),
    HistGradientBoostingRegressor()
)
pipeline
Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor())])


Note that it is almost the same model as above (can you spot the differences?): here both steps use their default parameters, whereas tabular_learner configured a MinHashEncoder for high-cardinality columns, a ToCategorical for low-cardinality ones, and native categorical support in the gradient boosting.

Let’s perform a cross-validation to see how well this model predicts:

results = cross_validate(pipeline, df, y)
print(f"Prediction score: {np.mean(results['test_score'])}")
print(f"Training time: {np.mean(results['fit_time'])}")
Prediction score: 0.9209967798378061
Training time: 2.992914152145386

The prediction performance here is pretty much as good as above, but the code is much simpler as it does not involve specifying columns manually.

Analyzing the features created

Let us perform the same workflow, but without the Pipeline, so we can analyze its mechanisms along the way.

tab_vec = TableVectorizer()

We split the data between train and test, and transform them:

from sklearn.model_selection import train_test_split
df_train, df_test, y_train, y_test = train_test_split(
    df, y, test_size=0.15, random_state=42
)

X_train_enc = tab_vec.fit_transform(df_train, y_train)
X_test_enc = tab_vec.transform(df_test)

The encoded data, X_train_enc and X_test_enc, are numerical dataframes:

X_train_enc
(Output abridged: the full display is too wide to show. The columns comprise one-hot indicators such as gender_F, gender_M, gender_nan, department_BOA, …, department_POL, …; GapEncoder topic activations with names such as “division: enforcement, engineering, mangement” and “employee_position_title: manager, management, iii”; and datetime features date_first_hired_year, date_first_hired_month, date_first_hired_day, date_first_hired_total_seconds, plus year_first_hired.)

7843 rows × 143 columns



They have more columns than the original dataframe, but not many more:

X_train_enc.shape
(7843, 143)

Inspecting the features created

The TableVectorizer assigns a transformer to each column. We can inspect this choice:

tab_vec.transformers_
{'year_first_hired': PassThrough(),
 'date_first_hired': DatetimeEncoder(),
 'gender': OneHotEncoder(drop='if_binary', dtype='float32',
                         handle_unknown='ignore', sparse_output=False),
 'department': OneHotEncoder(drop='if_binary', dtype='float32',
                             handle_unknown='ignore', sparse_output=False),
 'department_name': OneHotEncoder(drop='if_binary', dtype='float32',
                                  handle_unknown='ignore', sparse_output=False),
 'assignment_category': OneHotEncoder(drop='if_binary', dtype='float32',
                                      handle_unknown='ignore', sparse_output=False),
 'division': GapEncoder(n_components=30),
 'employee_position_title': GapEncoder(n_components=30)}

This is what is applied to transform the different columns under the hood. We can notice that the columns “gender” and “assignment_category” were classified as low-cardinality string variables: a OneHotEncoder is applied to them.

The vectorizer actually distinguishes between string variables (data types object and string) and categorical variables (data type category).
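As a minimal sketch of that distinction, using the df defined above, the pandas dtype of a column determines which family it falls into:

# 'gender' is stored as a plain string column (object dtype),
# so it is treated as a string variable:
df['gender'].dtype
# Casting it to a pandas categorical would make the vectorizer
# treat it as a categorical variable instead:
df['gender'].astype('category').dtype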

Next, we can have a look at the encoded feature names.

Before encoding:

df.columns.to_list()
['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title', 'date_first_hired', 'year_first_hired']

After encoding (we only print the first 8 feature names):

feature_names = tab_vec.get_feature_names_out()
feature_names[:8]
array(['gender_F', 'gender_M', 'gender_nan', 'department_BOA',
       'department_BOE', 'department_CAT', 'department_CCL',
       'department_CEC'], dtype='<U70')

As we can see, a new column was created for each unique value of the low-cardinality columns, which were one-hot encoded. The high-cardinality columns “division” and “employee_position_title” were instead handled by the GapEncoder, the default choice for high-cardinality string variables (see TableVectorizer’s docstring), which produced the topic-like feature names seen earlier.

In total, we have a reasonable number of encoded columns:

len(feature_names)
143

Feature importance in the statistical model

Here we consider interpretability and plot the feature importances of a regressor. We can do this because the GapEncoder leads to interpretable features, even with messy categories.

First, let’s train a RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train_enc, y_train)
RandomForestRegressor()


Retrieving the feature importances, and their variability across the trees of the forest:

importances = regressor.feature_importances_
# Variability of each importance across the individual trees:
std = np.std(
    [tree.feature_importances_ for tree in regressor.estimators_],
    axis=0,
)
# Rank features by decreasing importance:
indices = np.argsort(importances)[::-1]

Plotting the results:

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 9))
plt.title("Feature importances")
n = 20
n_indices = indices[:n]
labels = np.array(feature_names)[n_indices]
plt.barh(range(n), importances[n_indices], color="b", yerr=std[n_indices])
plt.yticks(range(n), labels, size=15)
plt.tight_layout(pad=1)
plt.show()
(Figure: horizontal bar chart of the top 20 feature importances, with error bars showing the variability across trees.)

We can deduce from this data that the three factors that influence salary the most are: having been hired a long time ago, being a manager, and having a permanent, full-time job :).

Exploring different machine-learning pipelines to encode the data

The learning pipeline

To build a learning pipeline, we need to assemble encoders for each column, and apply a supervised learning model on top.

Encoding the table

The TableVectorizer applies different transformations to the different columns to turn them into numerical values suitable for learning:

from skrub import TableVectorizer
encoder = TableVectorizer()

Pipelining an encoder with a learner

Here again, we use a pipeline with the HistGradientBoostingRegressor:

from sklearn.ensemble import HistGradientBoostingRegressor
pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

The pipeline can be readily applied to the dataframe for prediction:

pipeline.fit(df, y)
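Once fitted, it can be used for prediction straight away; a quick usage sketch:

# Predictions for the first few rows of the raw dataframe:
pipeline.predict(df.head())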

The categorical encoders

An encoder is needed to turn a categorical column into a numerical representation:

from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
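As a small illustration of what this encoder does (a sketch using the df loaded above), each distinct value gets its own binary column:

# 'gender' has three distinct values (F, M, and missing, as the
# gender_nan feature above suggests), giving three binary columns:
one_hot.fit_transform(df[['gender']])[:3]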

Dirty-category encoding

The one-hot encoder is actually not well suited to the “Employee Position Title” column, as this column contains 400 different entries.
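We can verify this, and see that many of those entries are morphological variants of one another (a quick sketch using the df above):

# Cardinality of the dirty column:
df['employee_position_title'].nunique()
# Many titles differ only by a rank or grade suffix, e.g. all the
# 'Police Officer' variants:
sorted(t for t in df['employee_position_title'].unique()
       if 'Police Officer' in t)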

We will now experiment with different encoders for dirty columns:

from skrub import SimilarityEncoder, MinHashEncoder, GapEncoder
from sklearn.preprocessing import TargetEncoder

similarity = SimilarityEncoder()
target = TargetEncoder()
minhash = MinHashEncoder(n_components=100)
gap = GapEncoder(n_components=100)

encoders = {
    'one-hot': one_hot,
    'similarity': similarity,
    'target': target,
    'minhash': minhash,
    'gap': gap}

We now loop over the different encoding methods, instantiate a new pipeline each time, fit it, and store the returned cross-validation score:

all_scores = dict()

for name, method in encoders.items():
    encoder = TableVectorizer(high_cardinality=method)

    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())
    scores = cross_validate(pipeline, df, y)
    print('{} encoding'.format(name))
    print('r2 score:  mean: {:.3f}; std: {:.3f}'.format(
        np.mean(scores['test_score']), np.std(scores['test_score'])))
    print('time:  {:.3f}\n'.format(
        np.mean(scores['fit_time'])))
    all_scores[name] = scores['test_score']
one-hot encoding
r2 score:  mean: 0.790; std: 0.035
time:  2.968

similarity encoding
r2 score:  mean: 0.930; std: 0.011
time:  4.063

target encoding
r2 score:  mean: 0.906; std: 0.017
time:  0.407

minhash encoding
r2 score:  mean: 0.924; std: 0.012
time:  1.492

gap encoding
r2 score:  mean: 0.929; std: 0.013
time:  7.753

Note that the fit time also varies a lot across encoders, not only the prediction score.
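For a quick side-by-side view, the mean scores collected above can be put in a Series (a convenience sketch using the all_scores dict):

# Mean cross-validated R² score per encoder, sorted ascending:
pd.Series({name: np.mean(scores)
           for name, scores in all_scores.items()}).sort_values()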

Plotting the results

Finally, we plot the scores on a boxplot:

import seaborn
import matplotlib.pyplot as plt
plt.figure(figsize=(4, 3))
ax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient='h')
plt.ylabel('Encoding', size=20)
plt.xlabel('Prediction accuracy     ', size=20)
plt.yticks(size=20)
plt.tight_layout()
(Figure: boxplot of the cross-validation prediction accuracy for each encoding.)

The clear trend is that encoders that use the string form of the category (similarity, minhash, and gap) perform better than those that discard it.

SimilarityEncoder is the best performer, but it is less scalable on big data than MinHashEncoder and GapEncoder. The most scalable encoder is the MinHashEncoder. GapEncoder, on the other hand, has the benefit of providing interpretable features, as shown above.
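To glimpse that interpretability once more, a sketch using the tab_vec vectorizer fitted earlier: the GapEncoder chosen for the dirty column exposes its latent topics through its feature names:

# The per-column transformer fitted inside the TableVectorizer:
gap_encoder = tab_vec.transformers_['employee_position_title']
# Each feature name summarizes a latent topic by its dominant words,
# e.g. 'employee_position_title: manager, management, iii':
gap_encoder.get_feature_names_out()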


Total running time of the script: (2 minutes 14.831 seconds)
