Anscombe’s Quartet

In 1973, Francis Anscombe introduced four small datasets with nearly identical summary statistics but very different shapes. This notebook recreates that idea with plotnine, polars, and scikit-learn.

import polars as pl
from plotnine import ggplot, element_blank, element_line, element_rect, element_text
from plotnine.data import anscombe_quartet

Convert DataFrame to Polars

anscombe_quartet = pl.from_pandas(anscombe_quartet)
anscombe_quartet

shape: (44, 3)

dataset	x	y
str	i64	f64
"I"	10	8.04
"I"	8	6.95
"I"	13	7.58
"I"	9	8.81
"I"	11	8.33
…	…	…
"IV"	8	5.25
"IV"	19	12.5
"IV"	8	5.56
"IV"	8	7.91
"IV"	8	6.89

Compute Descriptive Statistics

pl.Config.set_float_precision(2)

anscombe_quartet.group_by("dataset", maintain_order=True).agg(
    pl.col("x", "y").mean().name.prefix("mean_"),
    pl.col("x", "y").var().name.prefix("variance_"),
    pl.corr("x", "y").alias("correlation_xy"),
)

shape: (4, 6)

dataset	mean_x	mean_y	variance_x	variance_y	correlation_xy
str	f64	f64	f64	f64	f64
"I"	9.00	7.50	11.00	4.13	0.82
"II"	9.00	7.50	11.00	4.13	0.82
"III"	9.00	7.50	11.00	4.12	0.82
"IV"	9.00	7.50	11.00	4.12	0.82

Exploratory Data Visualization

(
    ggplot(anscombe_quartet)
    .aes("x", "y")
    .geom_point()
)

(
    ggplot(anscombe_quartet)
    .aes("x", "y", color="dataset")
    .geom_point()
)

(
    ggplot(anscombe_quartet)
    .aes("x", "y", color="dataset")
    .facet_wrap("dataset")
    .geom_point()
)

(
    ggplot(anscombe_quartet)
    .aes("x", "y")
    .geom_point()
    .geom_smooth(method="lm", se=False, fullrange=True, color="blue")
    .facet_wrap("dataset")
)

A Fine-Tuned Data Visualization

(
    ggplot(anscombe_quartet)
    .aes("x", "y")
    .geom_point(color="sienna", fill="darkorange", size=3)
    .geom_smooth(method="lm", se=False, fullrange=True, color="steelblue", size=1)
    .facet_wrap("dataset")
    .scale_y_continuous(breaks=(4, 8, 12))
    .coord_fixed(xlim=(3, 22), ylim=(2, 14))
    .labs(title="Anscombe's Quartet")
    .theme_tufte(base_family="Futura")
    .add_theme(
        axis_line=element_line(color="#4d4d4d"),
        axis_ticks_major=element_line(color="#00000000"),
        axis_title=element_blank(),
        plot_background=element_rect(fill="#ffffff", color="#ffffff"),
        dpi=144,
        panel_spacing=0.09,
        strip_text=element_text(size=12),
        title=element_text(size=16, margin={"b": 20}),
    )
)

Bonus: Apply Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def fit_lr(s):
    lr = LinearRegression()
    X = s.struct.field("x").to_numpy().reshape(-1, 1)
    y = s.struct.field("y").to_numpy()

    lr.fit(X, y)
    intercept = lr.intercept_
    slope = lr.coef_[0]
    r2 = r2_score(y, intercept + slope * X)

    return {"intercept": intercept, "slope": slope, "r2": r2}


(
    anscombe_quartet
    .group_by("dataset", maintain_order=True)
    .agg(
        pl.col("x", "y").mean().name.prefix("mean_"),
        pl.col("x", "y").var().name.prefix("variance_"),
        pl.corr("x", "y").alias("correlation_xy"),
        (
            pl.struct("x", "y")
            .implode()
            .map_elements(
                fit_lr,
                return_dtype=pl.Struct(
                    {"intercept": pl.Float64, "slope": pl.Float64, "r2": pl.Float64}
                ),
            )
            .alias("lr")
        ),
    )
    .unnest("lr")
)

shape: (4, 9)

dataset	mean_x	mean_y	variance_x	variance_y	correlation_xy	intercept	slope	r2
str	f64	f64	f64	f64	f64	f64	f64	f64
"I"	9.00	7.50	11.00	4.13	0.82	3.00	0.50	0.67
"II"	9.00	7.50	11.00	4.13	0.82	3.00	0.50	0.67
"III"	9.00	7.50	11.00	4.12	0.82	3.00	0.50	0.67
"IV"	9.00	7.50	11.00	4.12	0.82	3.00	0.50	0.67