.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gen_notes/02_dirty_categories.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_gen_notes_02_dirty_categories.py>`
        to download the full example code or to run this example in your
        browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gen_notes_02_dirty_categories.py:


========================================================
Dirty categories: learning with non-normalized strings
========================================================

Including strings that represent categories often calls for significant
data preparation. In particular, categories may appear with many
morphological variants when they have been input manually or assembled
from diverse sources.

Including such a column in a learning pipeline as a standard categorical
column leads to categories with very high cardinality, and loses the
information on which categories are similar.

Here we look at a dataset on wages [#]_ where the column *Employee
Position Title* contains dirty categories.

.. [#] https://catalog.data.gov/dataset/employee-salaries-2016

We compare different categorical encodings of the dirty column to predict
the *Current Annual Salary*, using gradient boosted trees. For this
purpose, we use the dirty-cat library ( https://dirty-cat.github.io ).


.. GENERATED FROM PYTHON SOURCE LINES 28-35

The data
========

Data Importing and preprocessing
--------------------------------

We first download the dataset:

.. GENERATED FROM PYTHON SOURCE LINES 36-40

.. code-block:: default

    from dirty_cat.datasets import fetch_employee_salaries
    employee_salaries = fetch_employee_salaries()
    print(employee_salaries['DESCR'])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Annual salary information including gross pay and overtime pay for all
    active, permanent employees of Montgomery County, MD paid in calendar
    year 2016. This information will be published annually each year.

    Downloaded from openml.org.

.. GENERATED FROM PYTHON SOURCE LINES 41-42

Then we load it:

.. GENERATED FROM PYTHON SOURCE LINES 42-46

.. code-block:: default

    import pandas as pd
    df = employee_salaries['data'].copy()
    df
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [A dataframe of 9228 rows × 13 columns, with columns: full_name,
    gender, 2016_gross_pay_received, 2016_overtime_pay, department,
    department_name, division, assignment_category,
    employee_position_title, underfilled_job_title, date_first_hired,
    year_first_hired, Current Annual Salary]
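To see why the *employee_position_title* column is "dirty", we can list a
few of the many graded variants under which a single job appears. This
quick inspection is not part of the generated example; it relies only on
standard pandas calls:

.. code-block:: default

    # Many morphological variants of the same underlying job, e.g. the
    # graded versions of "Police Officer" (exact counts will vary)
    titles = df['employee_position_title']
    print(titles[titles.str.contains('Officer')].value_counts().head(10))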



.. GENERATED FROM PYTHON SOURCE LINES 47-48

Now, let's carry out some basic preprocessing:

.. GENERATED FROM PYTHON SOURCE LINES 48-54

.. code-block:: default

    df['Date First Hired'] = pd.to_datetime(df['date_first_hired'])
    df['Year First Hired'] = df['Date First Hired'].apply(lambda x: x.year)
    # Drop rows with a missing gender
    df.dropna(subset=['gender'], inplace=True)

.. GENERATED FROM PYTHON SOURCE LINES 55-56

First, we extract the target:

.. GENERATED FROM PYTHON SOURCE LINES 56-60

.. code-block:: default

    target_column = 'Current Annual Salary'
    y = df[target_column].values.ravel()

.. GENERATED FROM PYTHON SOURCE LINES 61-69

Assembling a machine-learning pipeline that encodes the data
=============================================================

The learning pipeline
----------------------------

To build a learning pipeline, we need to assemble encoders for each
column, and apply a supervised learning model on top.

.. GENERATED FROM PYTHON SOURCE LINES 72-77

The categorical encoders
........................

An encoder is needed to turn a categorical column into a numerical
representation:

.. GENERATED FROM PYTHON SOURCE LINES 77-81

.. code-block:: default

    from sklearn.preprocessing import OneHotEncoder
    one_hot = OneHotEncoder(handle_unknown='ignore', sparse=False)

.. GENERATED FROM PYTHON SOURCE LINES 82-85

We assemble these to be applied on the relevant columns. The column
transformer is created by specifying a set of transformers along with the
column names on which to apply them:

.. GENERATED FROM PYTHON SOURCE LINES 85-95

.. code-block:: default

    from sklearn.compose import make_column_transformer
    encoder = make_column_transformer(
        (one_hot, ['gender', 'department_name', 'assignment_category']),
        ('passthrough', ['Year First Hired']),
        # Last but not least, our dirty column
        (one_hot, ['employee_position_title']),
        remainder='drop',
    )

.. GENERATED FROM PYTHON SOURCE LINES 96-102

Pipelining an encoder with a learner
....................................

We will use a HistGradientBoostingRegressor, which is a good predictor
for data with heterogeneous columns (with scikit-learn 0.24 we still need
to enable this experimental feature):

.. GENERATED FROM PYTHON SOURCE LINES 102-110

.. code-block:: default

    from sklearn.experimental import enable_hist_gradient_boosting
    # Now we can import normally from ensemble
    from sklearn.ensemble import HistGradientBoostingRegressor

    # We then create a pipeline chaining our encoders to a learner
    from sklearn.pipeline import make_pipeline
    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

.. GENERATED FROM PYTHON SOURCE LINES 111-112

The pipeline can be readily applied to the dataframe for prediction:

.. GENERATED FROM PYTHON SOURCE LINES 112-114

.. code-block:: default

    pipeline.fit(df, y)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Pipeline(steps=[('columntransformer',
                     ColumnTransformer(transformers=[('onehotencoder-1',
                                                      OneHotEncoder(handle_unknown='ignore',
                                                                    sparse=False),
                                                      ['gender', 'department_name',
                                                       'assignment_category']),
                                                     ('passthrough', 'passthrough',
                                                      ['Year First Hired']),
                                                     ('onehotencoder-2',
                                                      OneHotEncoder(handle_unknown='ignore',
                                                                    sparse=False),
                                                      ['employee_position_title'])])),
                    ('histgradientboostingregressor',
                     HistGradientBoostingRegressor())])

.. GENERATED FROM PYTHON SOURCE LINES 115-122

Dirty-category encoding
-------------------------

The one-hot encoder is actually not well suited to the
'employee_position_title' column, as this column contains around 400
different entries.
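We can check this directly. The following quick verification is not part
of the original example, and the exact numbers may vary slightly:

.. code-block:: default

    # Each distinct title becomes its own binary column under one-hot
    # encoding, so the dirty column dominates the design matrix
    print(df['employee_position_title'].nunique())
    print(encoder.fit_transform(df).shape)  # several hundred columns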
We will now experiment with encoders designed for dirty columns:

.. GENERATED FROM PYTHON SOURCE LINES 122-137

.. code-block:: default

    from dirty_cat import SimilarityEncoder, TargetEncoder, MinHashEncoder,\
        GapEncoder

    similarity = SimilarityEncoder(similarity='ngram')
    target = TargetEncoder(handle_unknown='ignore')
    minhash = MinHashEncoder(n_components=100)
    gap = GapEncoder(n_components=100)

    encoders = {
        'one-hot': one_hot,
        'similarity': similarity,
        'target': target,
        'minhash': minhash,
        'gap': gap}

.. GENERATED FROM PYTHON SOURCE LINES 138-141

We now loop over the different encoding methods, instantiating a new
pipeline each time, fitting it, and storing the returned cross-validation
score:

.. GENERATED FROM PYTHON SOURCE LINES 141-163

.. code-block:: default

    from sklearn.model_selection import cross_val_score
    import numpy as np

    all_scores = dict()

    for name, method in encoders.items():
        encoder = make_column_transformer(
            (one_hot, ['gender', 'department_name', 'assignment_category']),
            ('passthrough', ['Year First Hired']),
            # Last but not least, our dirty column
            (method, ['employee_position_title']),
            remainder='drop',
        )

        pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())
        scores = cross_val_score(pipeline, df, y)
        print('{} encoding'.format(name))
        print('r2 score: mean: {:.3f}; std: {:.3f}\n'.format(
            np.mean(scores), np.std(scores)))
        all_scores[name] = scores

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    one-hot encoding
    r2 score: mean: 0.776; std: 0.028

    similarity encoding
    r2 score: mean: 0.923; std: 0.014

    target encoding
    r2 score: mean: 0.842; std: 0.030

    minhash encoding
    r2 score: mean: 0.919; std: 0.012

    gap encoding
    r2 score: mean: 0.909; std: 0.011

.. GENERATED FROM PYTHON SOURCE LINES 164-167

Plotting the results
.....................

Finally, we plot the scores on a boxplot:

.. GENERATED FROM PYTHON SOURCE LINES 167-177

.. code-block:: default

    import seaborn
    import matplotlib.pyplot as plt
    plt.figure(figsize=(4, 3))
    ax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient='h')
    plt.ylabel('Encoding', size=20)
    plt.xlabel('Prediction accuracy', size=20)
    plt.yticks(size=20)
    plt.tight_layout()

.. image:: /gen_notes/images/sphx_glr_02_dirty_categories_001.png
    :alt: 02 dirty categories
    :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 178-189

The clear trend is that encoders which use the string form of the category
(similarity, minhash, and gap) perform better than those that discard it.

SimilarityEncoder is the best performer, but it is less scalable on big
data than MinHashEncoder and GapEncoder, the most scalable of the three
being the MinHashEncoder. GapEncoder, on the other hand, has the benefit
of providing interpretable features (see
:ref:`sphx_glr_auto_examples_04_feature_interpretation_gap_encoder.py`).

|
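To build intuition for what the SimilarityEncoder measures, here is a toy
version of n-gram string similarity: a Jaccard similarity between sets of
character 3-grams. This is only an illustrative sketch of the idea, not
dirty_cat's exact formula:

.. code-block:: default

    def ngram_similarity(a, b, n=3):
        """Jaccard similarity between the sets of character n-grams."""
        grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
        g_a, g_b = grams(a.lower()), grams(b.lower())
        return len(g_a & g_b) / len(g_a | g_b)

    # Morphological variants of a job get a high similarity...
    print(ngram_similarity('Police Officer III', 'Master Police Officer'))
    # ...while unrelated titles get a low one
    print(ngram_similarity('Police Officer III', 'Social Worker IV'))

Roughly speaking, the SimilarityEncoder fills, for each sample, one
feature per known category with such similarities, which is why it does
not scale well to very large category vocabularies.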
.. GENERATED FROM PYTHON SOURCE LINES 191-215

An easier way: automatic vectorization
=======================================

.. |SV| replace::
    :class:`~dirty_cat.SuperVectorizer`

.. |OneHotEncoder| replace::
    :class:`~sklearn.preprocessing.OneHotEncoder`

.. |ColumnTransformer| replace::
    :class:`~sklearn.compose.ColumnTransformer`

.. |RandomForestRegressor| replace::
    :class:`~sklearn.ensemble.RandomForestRegressor`

.. |SE| replace::
    :class:`~dirty_cat.SimilarityEncoder`

.. |permutation importances| replace::
    :func:`~sklearn.inspection.permutation_importance`

The code to assemble the column transformer is a bit tedious. We will
now explore a simpler, automated, way of encoding the data.

Let's start again from the raw data:

.. GENERATED FROM PYTHON SOURCE LINES 215-218

.. code-block:: default

    X = employee_salaries['data'].copy()
    y = employee_salaries['target']

.. GENERATED FROM PYTHON SOURCE LINES 219-220

We'll drop a few columns we don't want:

.. GENERATED FROM PYTHON SOURCE LINES 220-228

.. code-block:: default

    X.drop([
        'Current Annual Salary',     # Too linked with the target
        'full_name',                 # Not relevant to the analysis
        '2016_gross_pay_received',   # Too linked with the target
        '2016_overtime_pay',         # Too linked with the target
        'date_first_hired'           # Redundant with "year_first_hired"
    ], axis=1, inplace=True)

.. GENERATED FROM PYTHON SOURCE LINES 229-230

We still have a complex and heterogeneous dataframe:

.. GENERATED FROM PYTHON SOURCE LINES 230-232

.. code-block:: default

    X
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [A dataframe of 9228 rows × 8 columns, with columns: gender,
    department, department_name, division, assignment_category,
    employee_position_title, underfilled_job_title, year_first_hired]
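Before letting a vectorizer handle this automatically, a quick look at the
column data types and cardinalities shows what it is up against. This
inspection is not part of the generated example:

.. code-block:: default

    # Mixed dtypes, and cardinalities ranging from a couple of values
    # to several hundred distinct strings
    print(X.dtypes)
    print(X.nunique())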



.. GENERATED FROM PYTHON SOURCE LINES 233-235

The |SV| can turn this dataframe into a form suited for machine learning.

.. GENERATED FROM PYTHON SOURCE LINES 237-243

Using the SuperVectorizer in a supervised-learning pipeline
------------------------------------------------------------

Assembling the |SV| in a pipeline with a powerful learner, such as
gradient boosted trees, gives **a machine-learning method that can be
readily applied to the dataframe**.

.. GENERATED FROM PYTHON SOURCE LINES 243-257

.. code-block:: default

    # The SuperVectorizer requires dirty_cat 0.2.0a1. If you have an older
    # version, you can install the alpha release with
    #
    # pip install --pre dirty_cat==0.2.0a1
    #
    from dirty_cat import SuperVectorizer

    pipeline = make_pipeline(
        SuperVectorizer(auto_cast=True),
        HistGradientBoostingRegressor()
    )

.. GENERATED FROM PYTHON SOURCE LINES 258-259

Let's perform a cross-validation to see how well this model predicts:

.. GENERATED FROM PYTHON SOURCE LINES 259-268

.. code-block:: default

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(pipeline, X, y, scoring='r2')

    import numpy as np
    print(f'{scores=}')
    print(f'mean={np.mean(scores)}')
    print(f'std={np.std(scores)}')

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    scores=array([0.87651907, 0.87186239, 0.90534537, 0.9120328 , 0.92163719])
    mean=0.897479364762644
    std=0.0197627927771675

.. GENERATED FROM PYTHON SOURCE LINES 269-272

The prediction performance here is pretty much as good as above, but the
code is much simpler, as it does not involve specifying columns manually.

.. GENERATED FROM PYTHON SOURCE LINES 274-279

Analyzing the features created
-------------------------------

Let us perform the same workflow, but without the `Pipeline`, so we can
analyze its mechanisms along the way.

.. GENERATED FROM PYTHON SOURCE LINES 279-281

.. code-block:: default

    sup_vec = SuperVectorizer(auto_cast=True)

.. GENERATED FROM PYTHON SOURCE LINES 282-283

We split the data between train and test, and transform them:

.. GENERATED FROM PYTHON SOURCE LINES 283-291

.. code-block:: default

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.15, random_state=42
    )

    X_train_enc = sup_vec.fit_transform(X_train, y_train)
    X_test_enc = sup_vec.transform(X_test)

.. GENERATED FROM PYTHON SOURCE LINES 292-293

The encoded data, X_train_enc and X_test_enc, are numerical arrays:

.. GENERATED FROM PYTHON SOURCE LINES 293-295

.. code-block:: default

    X_train_enc

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    array([[1.0449171656292862, 0.7205922491346307, 0.39571982605966466, ...,
            0.07790133090254987, 0.0866859328077334, 2007],
           [0.05211638296647561, 29.96795271024839, 0.06107061378494277, ...,
            0.08273956019725892, 0.0597482324987131, 2005],
           [0.06000891970972247, 0.057149413239210564, 0.07738515680926894,
            ..., 0.08811993992371518, 0.05843022597056747, 2009],
           ...,
           [0.09535114306092987, 0.06932818336882285, 0.16664185835837403,
            ..., 0.0827813134238929, 0.06117155771841887, 1990],
           [0.08348536132941996, 0.06641634058081916, 0.07844733009530419,
            ..., 0.0836319932541453, 0.056200358797388036, 2012],
           [0.060952392787100976, 0.07522578586235484, 0.055811123562746645,
            ..., 0.08273956019725892, 0.0597482324987131, 2014]], dtype=object)
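To make this output easier to inspect, one can wrap it back into a
dataframe, labelling the columns with the vectorizer's feature names. A
small convenience sketch, not part of the original example (this version
of dirty_cat exposes ``get_feature_names()``; later releases rename it to
``get_feature_names_out()``):

.. code-block:: default

    # Hypothetical inspection helper: label the encoded columns
    X_train_enc_df = pd.DataFrame(
        X_train_enc,
        columns=sup_vec.get_feature_names(),
        index=X_train.index,
    )
    X_train_enc_df.head()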
.. GENERATED FROM PYTHON SOURCE LINES 296-297

They have more columns than the original dataframe, but not many more:

.. GENERATED FROM PYTHON SOURCE LINES 297-299

.. code-block:: default

    X_train_enc.shape

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (7843, 56)

.. GENERATED FROM PYTHON SOURCE LINES 300-305

Inspecting the features created
.................................

The |SV| assigns a transformer to each column. We can inspect this choice:

.. GENERATED FROM PYTHON SOURCE LINES 305-307

.. code-block:: default

    sup_vec.transformers_

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [('high_card_str', GapEncoder(),
      ['division', 'employee_position_title', 'underfilled_job_title']),
     ('low_card_cat', OneHotEncoder(), ['gender', 'assignment_category']),
     ('high_card_cat', GapEncoder(), ['department', 'department_name']),
     ('remainder', 'passthrough', [7])]

.. GENERATED FROM PYTHON SOURCE LINES 308-321

This is what is being passed to the |ColumnTransformer| under the hood.
If you're familiar with how the latter works, it should be very intuitive.

We can notice that it classified the columns "gender" and
"assignment_category" as low-cardinality string variables.
A |OneHotEncoder| will be applied to these columns.

The vectorizer distinguishes between string variables (data type
``object`` and ``string``) and categorical variables (data type
``category``).

Next, we can have a look at the encoded feature names.

Before encoding:

.. GENERATED FROM PYTHON SOURCE LINES 321-323

.. code-block:: default

    X.columns.to_list()

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    ['gender',
     'department',
     'department_name',
     'division',
     'assignment_category',
     'employee_position_title',
     'underfilled_job_title',
     'year_first_hired']

.. GENERATED FROM PYTHON SOURCE LINES 324-325

After encoding (we only print the first 8 feature names):

.. GENERATED FROM PYTHON SOURCE LINES 325-328

.. code-block:: default

    feature_names = sup_vec.get_feature_names()
    feature_names[:8]

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    ['division: supports, twinbrook, compliance',
     'division: health, emergency, school',
     'division: technology, traffic, safety',
     'division: gaithersburg, transit, silver',
     'division: assignment, assessment, programs',
     'division: administration, operations, battalion',
     'division: accountability, development, planning',
     'division: management, facilities, maintenance']

.. GENERATED FROM PYTHON SOURCE LINES 329-335

As we can see, a GapEncoder was applied to the column "division", which
was classified as a high-cardinality string variable (these are the
default choices; see |SV|'s docstring): each encoded feature corresponds
to one of the latent topics it extracted, described here by its most
representative words.

In total, we have a reasonable number of encoded columns:

.. GENERATED FROM PYTHON SOURCE LINES 335-338

.. code-block:: default

    len(feature_names)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    56

.. GENERATED FROM PYTHON SOURCE LINES 339-351

Feature importance in the statistical model
---------------------------------------------

In this section, we will train a regressor and plot the feature
importances.

.. topic:: Note:

   To minimize computation time, we use the feature importances computed
   by the |RandomForestRegressor|, but you should prefer |permutation
   importances| instead (which are less subject to biases).

First, let's train the |RandomForestRegressor|:

.. GENERATED FROM PYTHON SOURCE LINES 351-357

.. code-block:: default

    from sklearn.ensemble import RandomForestRegressor
    regressor = RandomForestRegressor()
    regressor.fit(X_train_enc, y_train)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    RandomForestRegressor()
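As the note above says, permutation importances are the preferred
measure. For reference, here is how they could be computed on the
held-out set. This sketch is not part of the original example, and the
cast to float is an assumption to keep scikit-learn's input validation
happy with the ``dtype=object`` encoded array:

.. code-block:: default

    from sklearn.inspection import permutation_importance

    # Permutation importances on the test set (less biased than
    # impurity-based importances, but slower to compute)
    result = permutation_importance(
        regressor,
        X_test_enc.astype(float),  # assumed cast from dtype=object
        y_test,
        n_repeats=5,
        random_state=0,
    )
    top = np.argsort(result.importances_mean)[::-1][:5]
    print(np.array(feature_names)[top])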
Retrieving the feature importances:

.. GENERATED FROM PYTHON SOURCE LINES 359-369

.. code-block:: default

    importances = regressor.feature_importances_
    std = np.std(
        [tree.feature_importances_ for tree in regressor.estimators_],
        axis=0
    )
    indices = np.argsort(importances)[::-1]

.. GENERATED FROM PYTHON SOURCE LINES 370-371

Plotting the results:

.. GENERATED FROM PYTHON SOURCE LINES 371-383

.. code-block:: default

    import matplotlib.pyplot as plt
    plt.figure(figsize=(12, 9))
    plt.title("Feature importances")
    n = 20
    n_indices = indices[:n]
    labels = np.array(feature_names)[n_indices]
    plt.barh(range(n), importances[n_indices], color="b", yerr=std[n_indices])
    plt.yticks(range(n), labels, size=15)
    plt.tight_layout(pad=1)
    plt.show()

.. image:: /gen_notes/images/sphx_glr_02_dirty_categories_002.png
    :alt: Feature importances
    :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 384-395

We can deduce from this data that the three factors that most influence
the salary are: having been hired a long time ago, being a manager, and
having a permanent, full-time job :).

.. topic:: The SuperVectorizer automates preprocessing

   As this notebook demonstrates, many preprocessing steps can be
   automated by the |SV|, and the resulting pipeline can still be
   inspected, even with non-normalized entries.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 3 minutes 14.261 seconds)


.. _sphx_glr_download_gen_notes_02_dirty_categories.py:

.. only:: html

  .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-data-science/python/gh-pages?filepath=notes/gen_notes/02_dirty_categories.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 02_dirty_categories.py <02_dirty_categories.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 02_dirty_categories.ipynb <02_dirty_categories.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_