.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gen_notes/01_missing_values.py"

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_gen_notes_01_missing_values.py>`
        to download the full example code or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gen_notes_01_missing_values.py:

=========================================
Machine learning with missing values
=========================================

Here we use simulated data to understand the fundamentals of statistical
learning with missing values. This notebook shows why a
HistGradientBoostingRegressor
(:class:`sklearn.ensemble.HistGradientBoostingRegressor`) is a good choice
for prediction with missing values.

We use simulations to control the missing-value mechanism and to inspect its
impact on predictive models. In particular, standard imputation procedures
can reconstruct missing values without distortion only if the data is
*missing at random*.

A good introduction to the mathematics behind this notebook can be found in
https://arxiv.org/abs/1902.06931

.. topic:: **Missing values in categorical data**

    If a categorical column has missing values, the simplest approach is to
    create a specific "missing" category and assign the missing values to it,
    so that missingness is represented explicitly to the classifier. Indeed,
    as we will see, imputation is not crucial for prediction. In the
    following we focus on continuous columns, where the discrete nature of a
    missing value poses more problems.
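Before moving on to continuous data, here is a minimal sketch of the
"missing"-category approach for a categorical column. The DataFrame and its
``color`` column are hypothetical, not part of this example:

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red', np.nan]})

    # Encode missingness as a category of its own
    cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
    df[['color']] = cat_imputer.fit_transform(df[['color']])

    # Equivalent pandas one-liner:
    # df['color'] = df['color'].fillna('missing')

The resulting "missing" category is then handled by the classifier like any
other category.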
The fully-observed data: a toy regression problem
==================================================

We consider a simple regression problem where X (the data) is bivariate
Gaussian, and y (the prediction target) is a linear function of the first
coordinate, with noise.

The data-generating mechanism
------------------------------

.. code-block:: default

    import numpy as np

    def generate_without_missing_values(n_samples, rng=42):
        mean = [0, 0]
        cov = [[1, 0.9], [0.9, 1]]
        if not isinstance(rng, np.random.RandomState):
            rng = np.random.RandomState(rng)
        X = rng.multivariate_normal(mean, cov, size=n_samples)

        epsilon = 0.1 * rng.randn(n_samples)
        y = X[:, 0] + epsilon

        return X, y

A quick plot reveals what the data look like:

.. code-block:: default

    import matplotlib.pyplot as plt
    plt.rcParams['figure.figsize'] = (5, 4)  # Smaller default figure size

    plt.figure()
    X_full, y_full = generate_without_missing_values(1000)
    plt.scatter(X_full[:, 0], X_full[:, 1], c=y_full)
    plt.colorbar(label='y')

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_001.png
   :alt: 01 missing values
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_001.png
   :class: sphx-glr-single-img

Missing completely at random settings
======================================

We now consider missing completely at random (MCAR) settings, a special case
of missing at random: the missingness is completely independent from the
values.

The missing-values mechanism
-----------------------------

.. code-block:: default

    def generate_mcar(n_samples, missing_rate=.5, rng=42):
        X, y = generate_without_missing_values(n_samples, rng=rng)
        if not isinstance(rng, np.random.RandomState):
            rng = np.random.RandomState(rng)

        M = rng.binomial(1, missing_rate, (n_samples, 2))
        np.putmask(X, M, np.nan)

        return X, y

A quick plot to look at the data:

.. code-block:: default

    X, y = generate_mcar(500)

    plt.figure()
    plt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',
                label='All data')
    plt.colorbar(label='y')
    plt.scatter(X[:, 0], X[:, 1], c=y, label='Fully observed')
    plt.legend()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_002.png
   :alt: 01 missing values
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_002.png
   :class: sphx-glr-single-img

We can see that the distribution of the fully-observed data is the same as
that of the original data.

Conditional Imputation with the IterativeImputer
------------------------------------------------

As the data is missing (completely) at random, an imputer can use the
conditional dependencies between the observed and the missing values to
impute the missing values.

We'll use the IterativeImputer, a good imputer, but it needs to be enabled
explicitly:

.. code-block:: default

    from sklearn.experimental import enable_iterative_imputer
    from sklearn import impute
    iterative_imputer = impute.IterativeImputer()

Let us try the imputer on the small data used for visualization.

**The imputation is learned by fitting the imputer on the data with missing
values:**

.. code-block:: default

    iterative_imputer.fit(X)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    IterativeImputer()
**The data are imputed with the transform method:**

.. code-block:: default

    X_imputed = iterative_imputer.transform(X)

We can display the imputed data as in our previous visualization:

.. code-block:: default

    plt.figure()
    plt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',
                label='All data', alpha=.5)
    plt.scatter(X_imputed[:, 0], X_imputed[:, 1], c=y, marker='X',
                label='Imputed')
    plt.colorbar(label='y')
    plt.legend()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_003.png
   :alt: 01 missing values
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_003.png
   :class: sphx-glr-single-img

We can see that the imputer did a fairly good job of recovering the data
distribution.
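To quantify this impression, here is a small check that is not part of the
original notebook. Because the data are simulated with the same default seed
(``rng=42``), regenerating the data without masking recovers the ground
truth, and we can measure the imputation error on the entries that were
masked:

.. code-block:: python

    # Sketch of a quantitative check: the regenerated draw matches the one
    # used by generate_mcar(500), so X_truth holds the values before masking
    X_truth, _ = generate_without_missing_values(500)
    mask = np.isnan(X)
    rmse = np.sqrt(np.mean((X_imputed[mask] - X_truth[mask]) ** 2))
    print(f"RMSE of the imputation on masked entries: {rmse:.3f}")

A small RMSE relative to the feature scale (the features have unit variance)
would confirm the visual impression.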
Supervised learning: imputation and a linear model
---------------------------------------------------

Given that the relationship between the fully-observed X and y is linear, it
seems natural to use a linear model for prediction. It must be adapted to
missing values using imputation.

To use imputation in a supervised setting, we pipeline it with a ridge
regression, which is a good default linear model:

.. code-block:: default

    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import RidgeCV

    iterative_and_ridge = make_pipeline(impute.IterativeImputer(), RidgeCV())

We can evaluate the model performance in a cross-validation loop (for a more
accurate evaluation, we slightly increase the number of folds to 10):

.. code-block:: default

    from sklearn import model_selection
    scores_iterative_and_ridge = model_selection.cross_val_score(
        iterative_and_ridge, X, y, cv=10)

    scores_iterative_and_ridge

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0.61639853, 0.5814862 , 0.70136887, 0.64571923, 0.58785589,
           0.79618649, 0.65278055, 0.8454113 , 0.81722841, 0.76948479])

**Computational cost**: one drawback of the IterativeImputer to keep in mind
is that its computational cost can become prohibitive on large datasets (it
scales poorly).

Mean imputation: SimpleImputer
-------------------------------

We can try a simpler imputer: imputation by the mean.

.. code-block:: default

    mean_imputer = impute.SimpleImputer()

A quick visualization reveals a larger distortion of the distribution:

.. code-block:: default

    X_imputed = mean_imputer.fit_transform(X)

    plt.figure()
    plt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',
                label='All data', alpha=.5)
    plt.scatter(X_imputed[:, 0], X_imputed[:, 1], c=y, marker='X',
                label='Imputed')
    plt.colorbar(label='y')

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_004.png
   :alt: 01 missing values
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_004.png
   :class: sphx-glr-single-img

Evaluating it in a prediction pipeline:

.. code-block:: default

    mean_and_ridge = make_pipeline(impute.SimpleImputer(), RidgeCV())
    scores_mean_and_ridge = model_selection.cross_val_score(
        mean_and_ridge, X, y, cv=10)

    scores_mean_and_ridge

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0.58596256, 0.55215184, 0.61081314, 0.55282029, 0.54053836,
           0.64325051, 0.60147921, 0.84188079, 0.68152965, 0.67441335])

Supervised learning without imputation
----------------------------------------

The HistGradientBoosting models are based on trees, which can directly handle
missing values:

.. code-block:: default

    from sklearn.ensemble import HistGradientBoostingRegressor
    score_hist_gradient_boosting = model_selection.cross_val_score(
        HistGradientBoostingRegressor(), X, y, cv=10)

    score_hist_gradient_boosting

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0.61170698, 0.56911116, 0.635685  , 0.62752111, 0.58094569,
           0.71569692, 0.62359162, 0.83829684, 0.81283438, 0.74080482])

Recap: which pipeline predicts well on our small data?
-------------------------------------------------------

Let's plot the scores to see things better:

.. code-block:: default

    import pandas as pd
    import seaborn as sns

    scores = pd.DataFrame({
        'Mean imputation + Ridge': scores_mean_and_ridge,
        'IterativeImputer + Ridge': scores_iterative_and_ridge,
        'HistGradientBoostingRegressor': score_hist_gradient_boosting,
    })
    sns.boxplot(data=scores, orient='h')
    plt.title('Prediction accuracy\n linear and small data\n'
              'Missing Completely at Random')
    plt.tight_layout()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_005.png
   :alt: Prediction accuracy linear and small data Missing Completely at Random
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_005.png
   :class: sphx-glr-single-img

There is not much difference with the more sophisticated imputer. A more
thorough analysis would be necessary, with more cross-validation runs.
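Such an analysis can be sketched with repeated K-fold cross-validation, which
gives many more runs and hence a better picture of the variability of each
pipeline. This snippet is an illustration, not part of the original notebook:

.. code-block:: python

    # Repeated 10-fold cross-validation: 100 scores per pipeline instead of 10
    cv = model_selection.RepeatedKFold(n_splits=10, n_repeats=10,
                                       random_state=0)
    more_scores = pd.DataFrame({
        'Mean imputation + Ridge': model_selection.cross_val_score(
            mean_and_ridge, X, y, cv=cv),
        'IterativeImputer + Ridge': model_selection.cross_val_score(
            iterative_and_ridge, X, y, cv=cv),
        'HistGradientBoostingRegressor': model_selection.cross_val_score(
            HistGradientBoostingRegressor(), X, y, cv=cv),
    })
    sns.boxplot(data=more_scores, orient='h')

With only 500 samples the repeated folds overlap heavily, so the resulting
boxes should be read as indicative rather than as independent measurements.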
Prediction performance with large datasets
-------------------------------------------

Let us compare the models in a regime where there is plenty of data:

.. code-block:: default

    X, y = generate_mcar(n_samples=20000)

Iterative imputation and linear model:

.. code-block:: default

    scores_iterative_and_ridge = model_selection.cross_val_score(
        iterative_and_ridge, X, y, cv=10)

Mean imputation and linear model:

.. code-block:: default

    scores_mean_and_ridge = model_selection.cross_val_score(
        mean_and_ridge, X, y, cv=10)

And now the HistGradientBoostingRegressor, which does not need imputation:

.. code-block:: default

    score_hist_gradient_boosting = model_selection.cross_val_score(
        HistGradientBoostingRegressor(), X, y, cv=10)

We plot the results:

.. code-block:: default

    scores = pd.DataFrame({
        'Mean imputation + Ridge': scores_mean_and_ridge,
        'IterativeImputer + Ridge': scores_iterative_and_ridge,
        'HistGradientBoostingRegressor': score_hist_gradient_boosting,
    })
    sns.boxplot(data=scores, orient='h')
    plt.title('Prediction accuracy\n linear and large data\n'
              'Missing Completely at Random')
    plt.tight_layout()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_006.png
   :alt: Prediction accuracy linear and large data Missing Completely at Random
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_006.png
   :class: sphx-glr-single-img

**When there is a reasonable amount of data, the HistGradientBoostingRegressor
is the best strategy**, even for a linear data-generating mechanism and in
MCAR settings, which are favorable to imputation + linear model [#]_.

.. [#] Even in the case of a linear data-generating mechanism, the optimal
   predictor on data imputed by a constant is a piecewise affine function
   with 2^d regions (http://proceedings.mlr.press/v108/morvan20a.html). The
   larger the dimensionality (number of features), the harder it is to
   compensate for an imperfect imputation with a simple model.

|

Missing not at random: censoring
======================================

We now consider missing not at random (MNAR) settings, in particular
self-masking or censoring, where large values are more likely to be missing.

The missing-values mechanism
-----------------------------

.. code-block:: default

    def generate_censored(n_samples, missing_rate=.4, rng=42):
        X, y = generate_without_missing_values(n_samples, rng=rng)
        if not isinstance(rng, np.random.RandomState):
            rng = np.random.RandomState(rng)

        B = rng.binomial(1, 2 * missing_rate, (n_samples, 2))
        M = (X > 0.5) * B
        np.putmask(X, M, np.nan)

        return X, y

A quick plot to look at the data:

.. code-block:: default

    X, y = generate_censored(500, missing_rate=.4)

    plt.figure()
    plt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',
                label='All data')
    plt.colorbar(label='y')
    plt.scatter(X[:, 0], X[:, 1], c=y, label='Fully observed')
    plt.legend()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_007.png
   :alt: 01 missing values
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_007.png
   :class: sphx-glr-single-img

Here the fully-observed data does not reflect the distribution of all the
data at all well.
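A quick numerical check, not in the original notebook, makes the censoring
visible: the mean of the observed entries is biased downwards (the true mean
of each feature is 0), and a substantial fraction of the entries is missing:

.. code-block:: python

    # Censoring removes preferentially the large values, so the mean of what
    # remains observed underestimates the true mean
    print("Mean of observed entries:  ", np.nanmean(X, axis=0))
    print("Fraction of missing values:", np.isnan(X).mean(axis=0))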
Imputation fails to recover the distribution
--------------------------------------------------------

With MNAR data, off-the-shelf imputation methods do not recover the initial
distribution:

.. code-block:: default

    iterative_imputer = impute.IterativeImputer()
    X_imputed = iterative_imputer.fit_transform(X)

    plt.figure()
    plt.scatter(X_full[:, 0], X_full[:, 1], color='.8', ec='.5',
                label='All data', alpha=.5)
    plt.scatter(X_imputed[:, 0], X_imputed[:, 1], c=y, marker='X',
                label='Imputed')
    plt.colorbar(label='y')
    plt.legend()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_008.png
   :alt: 01 missing values
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_008.png
   :class: sphx-glr-single-img

Recovering the initial data distribution would require much more mass on the
right and the top of the figure: the imputed data is shifted towards lower
values than the original data.

Note also that, as imputed values typically have lower X values than their
fully-observed counterparts, the association between X and y is also
distorted. This is visible as the imputed values appear as lighter diagonal
lines.

An important consequence is that **the link between imputed X and y is no
longer linear**, although the original data-generating mechanism is linear
[#]_. For this reason, **it is often a good idea to use non-linear learners
in the presence of missing values**.

.. [#] As mentioned above, even in the case of a linear data-generating
   mechanism, imperfect imputation leads to complex functions linking X to y
   (http://proceedings.mlr.press/v108/morvan20a.html).

Predictive pipelines
-----------------------------

Let us now evaluate predictive pipelines:

.. code-block:: default

    scores = dict()

    # Iterative imputation and linear model
    scores['IterativeImputer + Ridge'] = model_selection.cross_val_score(
        iterative_and_ridge, X, y, cv=10)

    # Mean imputation and linear model
    scores['Mean imputation + Ridge'] = model_selection.cross_val_score(
        mean_and_ridge, X, y, cv=10)

    # Iterative imputation and non-linear model
    iterative_and_gb = make_pipeline(impute.IterativeImputer(),
                                     HistGradientBoostingRegressor())
    scores['IterativeImputer\n+ HistGradientBoostingRegressor'] = \
        model_selection.cross_val_score(iterative_and_gb, X, y, cv=10)

    # Mean imputation and non-linear model
    mean_and_gb = make_pipeline(impute.SimpleImputer(),
                                HistGradientBoostingRegressor())
    scores['Mean imputation\n+ HistGradientBoostingRegressor'] = \
        model_selection.cross_val_score(mean_and_gb, X, y, cv=10)

    # And now the HistGradientBoostingRegressor, without imputation
    scores['HistGradientBoostingRegressor'] = model_selection.cross_val_score(
        HistGradientBoostingRegressor(), X, y, cv=10)

    # We plot the results
    sns.boxplot(data=pd.DataFrame(scores), orient='h')
    plt.title('Prediction accuracy\n linear and small data\n'
              'Missing not at Random')
    plt.tight_layout()

.. image-sg:: /gen_notes/images/sphx_glr_01_missing_values_009.png
   :alt: Prediction accuracy linear and small data Missing not at Random
   :srcset: /gen_notes/images/sphx_glr_01_missing_values_009.png
   :class: sphx-glr-single-img

We can see that imputation is not the most important step of the pipeline
[#]_; rather, **what matters is to use a powerful model**. Here there is
information in the missingness (if a value is missing, it is large),
information that a model can use to predict better. The sketch below shows
one way to expose this information even to a linear model.

.. [#] Note that there are fewer missing values in this example than in the
   MCAR section above, hence the absolute prediction accuracies are not
   comparable.
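One simple way to do so, a variant that is not evaluated in the original
notebook, is the ``add_indicator`` option of the SimpleImputer, which appends
a binary missingness indicator for each imputed column:

.. code-block:: python

    # A ridge that also sees, for each feature, whether the value was missing
    indicator_and_ridge = make_pipeline(
        impute.SimpleImputer(add_indicator=True), RidgeCV())
    scores_indicator_and_ridge = model_selection.cross_val_score(
        indicator_and_ridge, X, y, cv=10)

Whether this closes the gap with the tree-based model depends on the data;
here the missingness pattern is informative, so the indicator columns should
help the linear model.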
.. topic:: Prediction with missing values

    The data above are very simple: a linear data-generating mechanism,
    Gaussian features, and low dimensionality. Yet they show the importance
    of using non-linear models, in particular the
    HistGradientBoostingRegressor, which natively handles missing values.

Using a predictor for the fully-observed case
==============================================

Let us go back to the "easy" case of missing completely at random settings
with plenty of data:

.. code-block:: default

    n_samples = 20000

    X, y = generate_mcar(n_samples, missing_rate=.5)

Suppose we have been able to train a predictive model that works on
fully-observed data:

.. code-block:: default

    X_full, y_full = generate_without_missing_values(n_samples)
    full_data_predictor = HistGradientBoostingRegressor()
    full_data_predictor.fit(X_full, y_full)

    model_selection.cross_val_score(full_data_predictor, X_full, y_full)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0.98934188, 0.9897668 , 0.98932951, 0.98916616, 0.98898611])

The cross-validation reveals that the predictor achieves an excellent
explained variance: it is a near-perfect predictor on fully-observed data.

Now we turn to data with missing values. As our data is missing (completely)
at random, we use imputation to build a completed dataset that looks like the
fully-observed data:

.. code-block:: default

    iterative_imputer = impute.IterativeImputer()
    X_imputed = iterative_imputer.fit_transform(X)

The full-data predictor can be used on the imputed data:

.. code-block:: default

    from sklearn import metrics
    metrics.r2_score(y, full_data_predictor.predict(X_imputed))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.7011318399077224

This prediction is not as good as on the fully-observed data, but this is
expected, as missing values lead to a loss of information. We can compare it
to a model trained to predict directly on data with missing values:

.. code-block:: default

    X_train, y_train = generate_mcar(n_samples, missing_rate=.5)
    na_predictor = HistGradientBoostingRegressor()
    na_predictor.fit(X_train, y_train)

    metrics.r2_score(y, na_predictor.predict(X))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.7036144866135492

Applying a model valid on the fully-observed data to imputed data works
almost as well as a model trained on data with missing values. The small loss
in performance is because the imputation is imperfect.

When the data-generating mechanism is non-linear
-------------------------------------------------

We now modify the example above slightly so that y is a non-linear function
of X:

.. code-block:: default

    X, y = generate_mcar(n_samples, missing_rate=.5)
    y = y ** 2

    # Train a predictive model that works on fully-observed data:
    X_full, y_full = generate_without_missing_values(n_samples)
    y_full = y_full ** 2

    full_data_predictor = HistGradientBoostingRegressor()
    full_data_predictor.fit(X_full, y_full)

    model_selection.cross_val_score(full_data_predictor, X_full, y_full)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([0.97018219, 0.96754963, 0.96600898, 0.96971664, 0.96795887])
Once again, we have a near-perfect predictor on fully-observed data. Now let
us apply it to data with missing values, after imputation:

.. code-block:: default

    iterative_imputer = impute.IterativeImputer()
    X_imputed = iterative_imputer.fit_transform(X)

    from sklearn import metrics
    metrics.r2_score(y, full_data_predictor.predict(X_imputed))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.5330009632809711

The full-data predictor now performs markedly worse. Let us compare it to a
model trained to predict directly on data with missing values:

.. code-block:: default

    X_train, y_train = generate_mcar(n_samples, missing_rate=.5)
    y_train = y_train ** 2

    na_predictor = HistGradientBoostingRegressor()
    na_predictor.fit(X_train, y_train)

    metrics.r2_score(y, na_predictor.predict(X))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0.657124508956336

The model trained on data with missing values performs significantly better
than the one that was optimal for the fully-observed data.

**Only for a linear data-generating mechanism is the model fit on the full
data also optimal on perfectly imputed data.** When the function linking X to
y has curvature, this curvature turns the uncertainty resulting from
missingness into bias [#]_.

.. [#] The detailed mathematical analysis of prediction after imputation can
   be found in https://arxiv.org/abs/2106.00311

|

________

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 10.502 seconds)


.. _sphx_glr_download_gen_notes_01_missing_values.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/dirty-data-science/python/gh-pages?filepath=notes/gen_notes/01_missing_values.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/retro/notebooks/?path=gen_notes/01_missing_values.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 01_missing_values.py <01_missing_values.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 01_missing_values.ipynb <01_missing_values.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_