Dirty categories: learning with non normalized strings¶

Columns of strings that represent categories often call for substantial data preparation. In particular, categories may appear with many morphological variants when they have been entered manually or assembled from diverse sources.

Including such a column in a learning pipeline as a standard categorical column leads to categories with very high cardinality and can lose information about which categories are similar.
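To make the loss of similarity information concrete, the sketch below compares job titles by the character 3-grams they share. Two of the titles are taken from the wages dataset used in this example; "Social Worker III" is a hypothetical variant, and the Jaccard similarity is an assumption chosen for illustration, not a description of skrub's implementation.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "Social Worker III" is a hypothetical variant added for illustration.
titles = ["Social Worker IV", "Social Worker III", "Master Police Officer"]

# Treated as plain categories, the three titles are three unrelated values.
n_categories = len(set(titles))

# The n-gram view recovers that the first two titles are close and the third is not.
close = ngram_similarity(titles[0], titles[1])
far = ngram_similarity(titles[0], titles[2])
print(n_categories, round(close, 2), round(far, 2))
```

A standard categorical encoding only sees the three distinct values, whereas the n-gram comparison keeps the information that two of them are near-duplicates.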

Here we look at a dataset on wages [1] where the column Employee Position Title contains dirty categories.

We compare different categorical encodings for the dirty column to predict the Current Annual Salary, using gradient boosted trees. For this purpose, we use the skrub library (https://skrub-data.org).

The data¶

Data importing and preprocessing¶

We first download the dataset:

from skrub.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
print(employee_salaries.description)

Annual salary information including gross pay and overtime pay for all active, permanent employees of Montgomery County, MD paid in calendar year 2016. This information will be published annually each year.

Then we load it:

import pandas as pd
df = employee_salaries.X.copy()
df

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records...	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	05/05/2014	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	03/05/2007	2007
...	...	...	...	...	...	...	...	...
9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	11/03/2015	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	11/28/1988	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Serv...	Parttime-Regular	Medical Doctor IV - Psychiatrist	04/30/2001	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2012

9228 rows × 8 columns

Recover the target:

y = employee_salaries.y

Understanding the vectorizer + learner pipeline¶

The main difficulty is that our input is a complex and heterogeneous dataframe:

df

	gender	department	department_name	division	assignment_category	employee_position_title	date_first_hired	year_first_hired
0	F	POL	Department of Police	MSB Information Mgmt and Tech Division Records...	Fulltime-Regular	Office Services Coordinator	09/22/1986	1986
1	M	POL	Department of Police	ISB Major Crimes Division Fugitive Section	Fulltime-Regular	Master Police Officer	09/12/1988	1988
2	F	HHS	Department of Health and Human Services	Adult Protective and Case Management Services	Fulltime-Regular	Social Worker IV	11/19/1989	1989
3	M	COR	Correction and Rehabilitation	PRRS Facility and Security	Fulltime-Regular	Resident Supervisor II	05/05/2014	2014
4	M	HCA	Department of Housing and Community Affairs	Affordable Housing Programs	Fulltime-Regular	Planning Specialist III	03/05/2007	2007
...	...	...	...	...	...	...	...	...
9223	F	HHS	Department of Health and Human Services	School Based Health Centers	Fulltime-Regular	Community Health Nurse II	11/03/2015	2015
9224	F	FRS	Fire and Rescue Services	Human Resources Division	Fulltime-Regular	Fire/Rescue Division Chief	11/28/1988	1988
9225	M	HHS	Department of Health and Human Services	Child and Adolescent Mental Health Clinic Serv...	Parttime-Regular	Medical Doctor IV - Psychiatrist	04/30/2001	2001
9226	M	CCL	County Council	Council Central Staff	Fulltime-Regular	Manager II	09/05/2006	2006
9227	M	DLC	Department of Liquor Control	Licensure, Regulation and Education	Fulltime-Regular	Alcohol/Tobacco Enforcement Specialist II	01/30/2012	2012

9228 rows × 8 columns

The TableVectorizer is a transformer that turns this dataframe into a form suited for machine learning.

Feeding its output to a powerful learner, such as gradient boosted trees, gives a machine-learning method that can be readily applied to the dataframe.

from skrub import TableVectorizer

Analyzing the features created¶

Let us perform the same workflow, but without the Pipeline, so we can analyze its mechanisms along the way.

tab_vec = TableVectorizer()

We split the data between train and test, and transform them:

from sklearn.model_selection import train_test_split
df_train, df_test, y_train, y_test = train_test_split(
    df, y, test_size=0.15, random_state=42
)

X_train_enc = tab_vec.fit_transform(df_train, y_train)
X_test_enc = tab_vec.transform(df_test)

The encoded data, X_train_enc and X_test_enc, are numerical arrays:

X_train_enc

	gender_F	gender_M	gender_nan	department_BOA	department_BOE	department_CAT	department_CCL	department_CEC	department_CEX	department_COR	department_CUS	department_DEP	department_DGS	department_DHS	department_DLC	department_DOT	department_DPS	department_DTS	department_ECM	department_FIN	department_FRS	department_HCA	department_HHS	department_HRC	department_IGR	department_LIB	department_MPB	department_NDA	department_OAG	department_OCP	department_OHR	department_OIG	department_OLO	department_OMB	department_PIO	department_POL	department_PRO	department_REC	department_SHF	department_ZAH	...	division: enforcement, engineering, mangement	division: abandoned, sediment, budget	division: nicholson, transit, trips	division: district, 3rd, 1st	assignment_category_Parttime-Regular	employee_position_title: firefighter, rescuer, master	employee_position_title: warehouse, craftsworker, welfare	employee_position_title: maintenance, attendant, cashier	employee_position_title: manager, management, iii	employee_position_title: operator, equipment, bus	employee_position_title: police, candidate, officer	employee_position_title: purchasing, crossing, guard	employee_position_title: recreation, occupational, visual	employee_position_title: librarian, employee, libraries	employee_position_title: school, room, behavioral	employee_position_title: specialist, therapist, special	employee_position_title: coordinator, services, service	employee_position_title: correctional, corporal, correction	employee_position_title: technician, mechanic, supply	employee_position_title: liquor, clerk, store	employee_position_title: community, health, nurse	employee_position_title: legislative, principal, executive	employee_position_title: information, technology, technologist	employee_position_title: captain, rescue, mcfrs	employee_position_title: sergeant, sheriff, deputy	employee_position_title: officer, office, of	employee_position_title: income, assistance, client	employee_position_title: assistant, library, fiscal	employee_position_title: accountant, attorney, auditor	employee_position_title: safety, public, communications	employee_position_title: lieutenant, client, latent	employee_position_title: program, programs, projects	employee_position_title: permitting, planning, senior	employee_position_title: enforcement, inspector, abandoned	employee_position_title: administrative, administration, administrator	date_first_hired_year	date_first_hired_month	date_first_hired_day	date_first_hired_total_seconds	year_first_hired
4405	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.059513	0.089732	0.074701	0.064337	0.0	0.132323	0.100549	0.087582	0.062129	4.103054	0.052306	17.542875	0.095996	0.535703	0.055283	17.029894	0.115826	0.056505	0.077598	0.055508	0.053912	0.100931	0.786289	0.163986	0.064722	0.057527	0.102902	0.147806	0.191044	0.080291	0.171816	0.051521	0.255682	6.404267	0.764172	2007.0	8.0	6.0	1.186358e+09	2007.0
5694	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.051012	0.051746	0.054876	0.052139	1.0	0.052190	0.054076	0.051604	0.060680	0.050001	0.051542	0.050172	0.052076	0.052090	0.066081	0.055037	0.052079	0.054267	0.054281	0.052221	34.433380	0.052023	0.051995	0.050981	0.054219	0.056543	0.054457	0.051497	0.056217	0.068432	0.050823	0.051135	0.052789	0.052945	0.054166	2005.0	8.0	8.0	1.123459e+09	2005.0
1516	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.064292	0.064669	0.060579	0.063320	0.0	0.063445	0.063706	1.131922	0.054253	0.056917	0.055187	0.195404	0.087418	0.245737	0.111722	0.051719	0.055121	0.872968	1.688623	0.059958	0.057072	0.153164	0.121743	0.120378	0.051659	0.052370	0.065357	0.504419	0.062232	0.061198	0.063911	0.056040	15.780376	0.503370	0.052610	2009.0	4.0	27.0	1.240790e+09	2009.0
8960	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.052758	0.055110	0.053066	0.053376	0.0	0.061937	0.062429	36.499977	0.131811	0.054731	0.062183	0.050466	1.974803	0.055896	0.058158	0.065944	0.069388	16.433071	0.055760	0.068189	0.082328	0.055553	0.069851	0.053879	0.096412	0.060633	0.066237	0.058951	0.063034	0.093715	3.323277	0.073094	0.066972	0.069659	0.061665	1997.0	2.0	3.0	8.549280e+08	1997.0
6108	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	0.053168	0.056513	0.054138	31.286928	0.0	0.060541	0.058996	0.052292	0.086161	0.053352	0.413279	0.050128	0.050883	0.052755	0.051710	0.054140	0.063152	0.068843	0.053487	0.055232	0.052678	0.050882	0.052260	0.052003	0.060964	23.520971	0.053813	0.051069	0.056065	0.055291	0.052718	0.055576	0.052608	0.054621	0.053528	2006.0	1.0	17.0	1.137456e+09	2006.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5734	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.050244	0.055088	0.052427	0.052690	0.0	35.874569	0.053833	0.054900	0.068000	0.051995	0.067445	0.050342	0.053160	0.051458	0.051417	0.055237	0.052027	0.052681	0.051789	0.057678	0.050011	0.052551	0.054957	0.076519	0.051922	0.054453	0.056191	0.059744	0.053459	0.050678	0.072990	0.057609	0.055130	0.050830	0.056426	2005.0	5.0	16.0	1.116202e+09	2005.0
5191	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.057875	0.057612	0.052532	0.050780	0.0	0.063285	0.072949	0.057936	0.058596	0.052457	0.055490	0.058436	7.792153	0.081101	0.057484	0.057249	0.064313	0.144067	0.059113	0.061133	0.075646	5.794601	0.114197	0.075839	0.061050	0.059691	0.062009	0.054906	27.812473	0.067140	0.070344	0.058548	0.074940	0.072757	0.310097	2001.0	8.0	6.0	9.970560e+08	2001.0
5390	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.199933	0.072601	0.105001	0.084566	0.0	0.075516	0.074033	0.055670	1.359716	0.132262	0.059397	0.050219	0.068662	0.075191	0.105756	0.121500	0.062121	0.065201	20.057005	0.055676	0.068407	0.060083	0.093676	0.050389	0.059406	0.126675	0.066071	0.063031	0.078170	0.066022	0.052078	0.057840	4.918442	1.706641	0.115145	1990.0	5.0	31.0	6.441120e+08	1990.0
860	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.059274	0.070263	0.060939	0.054224	0.0	0.064339	50.651749	0.052650	0.054647	0.161245	0.054747	0.055111	0.068475	0.063212	0.075529	0.061818	0.054433	0.053690	0.050571	0.061195	0.096171	0.064169	0.054841	0.057005	0.062972	0.054394	0.056493	0.053470	0.057363	0.051896	0.055842	0.052151	0.068301	0.062310	0.069207	2012.0	11.0	5.0	1.352074e+09	2012.0
7270	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.050900	0.051787	0.051710	0.051527	0.0	0.061843	20.759100	0.051095	0.119294	0.053394	0.057225	0.050203	0.064544	0.055189	0.057682	0.073517	0.052451	0.068643	0.058838	0.058730	0.055270	0.053300	0.058439	0.050072	0.051530	0.075845	0.058239	0.052231	0.059433	0.052279	0.050009	0.059396	0.061271	0.057179	0.063758	2014.0	6.0	16.0	1.402877e+09	2014.0

7843 rows × 143 columns

They have more columns than the original dataframe, but not many more:

X_train_enc.shape

(7843, 143)

Inspecting the features created¶

The TableVectorizer assigns a transformer to each column. We can inspect this choice:

tab_vec.transformers_

{'year_first_hired': PassThrough(), 'date_first_hired': DatetimeEncoder(), 'gender': OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
              sparse_output=False), 'department': OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
              sparse_output=False), 'department_name': OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
              sparse_output=False), 'assignment_category': OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
              sparse_output=False), 'division': GapEncoder(n_components=30), 'employee_position_title': GapEncoder(n_components=30)}

These are the transformers applied to the different columns under the hood. Note that the columns “gender” and “assignment_category” were classified as low-cardinality string variables, so a OneHotEncoder is applied to them.

The vectorizer distinguishes between string variables (data type object and string) and categorical variables (data type category).
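The DatetimeEncoder assigned to date_first_hired is what produces the date_first_hired_year, date_first_hired_month, date_first_hired_day and date_first_hired_total_seconds columns seen in X_train_enc. The standard-library sketch below shows that kind of feature extraction. The helper datetime_features is hypothetical, and the MM/DD/YYYY format and the UTC epoch convention are assumptions that happen to reproduce the first displayed row; this is not skrub's code.

```python
from datetime import datetime, timezone


def datetime_features(date_string, fmt="%m/%d/%Y"):
    """Split a date string into numerical features (hypothetical helper)."""
    # Assumption: dates are interpreted as midnight UTC.
    parsed = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        # Seconds elapsed since 1970-01-01 00:00 UTC.
        "total_seconds": parsed.timestamp(),
    }


features = datetime_features("08/06/2007")
print(features)
```

Splitting a date this way gives a tree-based learner both the periodic parts (month, day) and a single monotonic time axis (total seconds).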

Next, we can have a look at the encoded feature names.

Before encoding:

df.columns.to_list()

['gender', 'department', 'department_name', 'division', 'assignment_category', 'employee_position_title', 'date_first_hired', 'year_first_hired']

After encoding (we only show the first 8 feature names):

feature_names = tab_vec.get_feature_names_out()
feature_names[:8]

array(['gender_F', 'gender_M', 'gender_nan', 'department_BOA',
       'department_BOE', 'department_CAT', 'department_CCL',
       'department_CEC'], dtype='<U70')

As we can see, it created a new column for each unique value of the low-cardinality columns “gender” and “department”, because they were one-hot encoded. The column “division”, in contrast, was classified as a high-cardinality string variable and encoded with a GapEncoder, as shown in the transformers above (default values, see TableVectorizer’s docstring).
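The split between low- and high-cardinality columns can be pictured as a simple rule on the number of distinct values. The sketch below is only an illustration: choose_encoder is a hypothetical helper, and the threshold of 40 and the strict comparison are assumptions that should be checked against TableVectorizer’s docstring.

```python
def choose_encoder(values, cardinality_threshold=40):
    """Pick an encoder family from the number of distinct values in a column.

    Hypothetical helper; the default threshold is an assumption.
    """
    n_unique = len(set(values))
    if n_unique < cardinality_threshold:
        return "one-hot"
    return "high-cardinality"


gender = ["F", "M", "F", "M"]
# Synthetic strings standing in for a column with many distinct entries.
many_titles = [f"title {i}" for i in range(400)]

print(choose_encoder(gender))       # few distinct values
print(choose_encoder(many_titles))  # many distinct values
```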

In total, we have a reasonable number of encoded columns:

len(feature_names)

Feature importance in the statistical model¶

Here we consider interpretability by plotting the feature importances of a regressor. We can do this because the GapEncoder leads to interpretable features even with messy categories.

First, let’s train the RandomForestRegressor:

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train_enc, y_train)

RandomForestRegressor()

Retrieving the feature importances:

import numpy as np

importances = regressor.feature_importances_
std = np.std(
    [
        tree.feature_importances_
        for tree in regressor.estimators_
    ],
    axis=0
)
indices = np.argsort(importances)[::-1]

Plotting the results:

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 9))
plt.title("Feature importances")
n = 20
n_indices = indices[:n]
labels = np.array(feature_names)[n_indices]
# The bars are horizontal, so the spread across trees goes on the x axis.
plt.barh(range(n), importances[n_indices], color="b", xerr=std[n_indices])
plt.yticks(range(n), labels, size=15)
plt.tight_layout(pad=1)
plt.show()

From this plot we can deduce that the three factors that most determine the salary are: having been hired a long time ago, being a manager, and having a permanent, full-time job :).

Exploring different machine-learning pipelines to encode the data¶

The learning pipeline¶

To build a learning pipeline, we need to assemble encoders for each column, and apply a supervised learning model on top.

Encoding the table¶

The TableVectorizer applies different transformations to the different columns to turn them into numerical values suitable for learning:

from skrub import TableVectorizer
encoder = TableVectorizer()

Pipelining an encoder with a learner¶

Here again we use a pipeline with HistGradientBoostingRegressor:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

The pipeline can be readily applied to the dataframe for prediction:

pipeline.fit(df, y)

The categorical encoders¶

An encoder is needed to turn a categorical column into a numerical representation:

from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

Dirty-category encoding¶

The one-hot encoder is actually not well suited to the ‘Employee Position Title’ column, as this column contains 400 different entries.
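As a rough illustration of why one-hot encoding struggles on such a column, the sketch below one-hot encodes a few titles by hand. The helper one_hot_rows is hypothetical and "Social Worker III" is a made-up variant; scikit-learn's OneHotEncoder is the real tool used in this example.

```python
def one_hot_rows(values):
    """One-hot encode a list of strings: one column per distinct value."""
    categories = sorted(set(values))
    column = {category: j for j, category in enumerate(categories)}
    rows = [[0] * len(categories) for _ in values]
    for i, value in enumerate(values):
        rows[i][column[value]] = 1
    return categories, rows


# "Social Worker III" is a hypothetical variant added for illustration.
titles = ["Social Worker IV", "Social Worker III", "Master Police Officer"]
categories, rows = one_hot_rows(titles)

# The width grows with the number of distinct entries in the column.
width = len(categories)

# Two different titles always get orthogonal vectors, however close the strings.
dot = sum(a * b for a, b in zip(rows[0], rows[1]))
print(width, dot)
```

With 400 distinct titles the same construction yields 400 mostly empty columns, and near-identical titles remain as unrelated as any other pair.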

We will now experiment with different encoders for dirty columns:

from skrub import GapEncoder, MinHashEncoder, SimilarityEncoder
from sklearn.preprocessing import TargetEncoder

similarity = SimilarityEncoder()
target = TargetEncoder()
minhash = MinHashEncoder(n_components=100)
gap = GapEncoder(n_components=100)

encoders = {
    'one-hot': one_hot,
    'similarity': similarity,
    'target': target,
    'minhash': minhash,
    'gap': gap}

We now loop over the different encoding methods, instantiate each time a new pipeline, fit it and store the returned cross-validation score:

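import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

all_scores = dict()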

for name, method in encoders.items():
    encoder = TableVectorizer(high_cardinality=method)

    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())
    scores = cross_validate(pipeline, df, y)
    print('{} encoding'.format(name))
    print('r2 score:  mean: {:.3f}; std: {:.3f}'.format(
        np.mean(scores['test_score']), np.std(scores['test_score'])))
    print('time:  {:.3f}\n'.format(
        np.mean(scores['fit_time'])))
    all_scores[name] = scores['test_score']

one-hot encoding
r2 score:  mean: 0.790; std: 0.035
time:  2.968

similarity encoding
r2 score:  mean: 0.930; std: 0.011
time:  4.063

target encoding
r2 score:  mean: 0.906; std: 0.017
time:  0.407

minhash encoding
r2 score:  mean: 0.924; std: 0.012
time:  1.492

gap encoding
r2 score:  mean: 0.929; std: 0.013
time:  7.753

Note that the fit time also varies a lot across encoders, not only the prediction score.

Plotting the results¶

Finally, we plot the scores on a boxplot:

import seaborn
import matplotlib.pyplot as plt
plt.figure(figsize=(4, 3))
ax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient='h')
plt.ylabel('Encoding', size=20)
plt.xlabel('Prediction accuracy     ', size=20)
plt.yticks(size=20)
plt.tight_layout()

The clear trend is that encoders that use the string form of the category (similarity, minhash, and gap) perform better than those that discard it.
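To see what "discarding the string form" means, here is a deliberately simplified target-encoding sketch. The salary figures are made-up toy numbers, "Manager III" is a hypothetical unseen title, and the plain per-category mean leaves out the smoothing and cross-fitting that scikit-learn's TargetEncoder applies.

```python
from collections import defaultdict


def fit_target_means(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Toy data: the salaries are invented for illustration only.
titles = ["Manager II", "Manager II", "Social Worker IV"]
salaries = [100.0, 120.0, 80.0]

encoding = fit_target_means(titles, salaries)
global_mean = sum(salaries) / len(salaries)

# An unseen title falls back to the global mean, even though its string
# is almost identical to a title seen during fitting.
unseen = encoding.get("Manager III", global_mean)
print(encoding["Manager II"], unseen)
```

The encoder only ever looks up exact category values, so the closeness of "Manager III" to "Manager II" cannot help it.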

SimilarityEncoder obtains the highest mean score here (0.930), although GapEncoder (0.929) and MinHashEncoder (0.924) are within one standard deviation of it. It is, however, less scalable on big data than MinHashEncoder and GapEncoder. The most scalable encoder is the MinHashEncoder. GapEncoder, on the other hand, has the benefit that it provides interpretable features, as shown above.
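One way to build intuition for the scalability of min-hash encoding is that each string can be encoded on its own, without a fitted vocabulary. The sketch below is a minimal standard-library illustration under stated assumptions: the use of salted BLAKE2b hashes, 3-grams and 64 components are choices made here, "Social Worker III" is a hypothetical variant, and none of this reflects MinHashEncoder's actual internals.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def hash_ngram(gram, seed):
    """Deterministic 64-bit hash of an n-gram, salted by the component index."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=64, n=3):
    """For each component, keep the smallest hash over the string's n-grams."""
    grams = char_ngrams(text, n)
    return [min(hash_ngram(g, seed) for g in grams) for seed in range(n_components)]


def matching_fraction(a, b):
    """Share of components on which two encodings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


base = minhash_encode("Social Worker IV")
variant = minhash_encode("Social Worker III")  # hypothetical variant
other = minhash_encode("Master Police Officer")

print(matching_fraction(base, variant), matching_fraction(base, other))
```

Strings that share many n-grams tend to agree on many components, so similarity is preserved while each row is processed independently.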

Total running time of the script: (2 minutes 14.831 seconds)
