Interaction Estimation

In this example we aim to estimate the average interaction effect of two, potentially correlated, treatment variables T1 and T2 on an outcome Y.
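For two binary treatments, a standard formalization of this estimand (and, to our understanding, the contrast targeted by the AIE estimand used below) is the additive interaction of counterfactual means:

$AIE = \mathbb{E}[Y(1, 1)] - \mathbb{E}[Y(1, 0)] - \mathbb{E}[Y(0, 1)] + \mathbb{E}[Y(0, 0)]$

where $Y(t_1, t_2)$ denotes the counterfactual outcome under the joint treatment assignment $(T1, T2) = (t_1, t_2)$.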

Data Generating Process

Let's consider the following structural causal model where the shaded nodes represent the observed variables.

[Figure: interaction-graph — structural causal model; shaded nodes are the observed variables]

In other words, only one confounding variable, W1, is observed. This would be a major problem if we wanted to estimate the average treatment effect of T1 or T2 on Y separately. However, since we are interested in interactions, W1 turns out to be a sufficient adjustment set. While artificial, this situation is pervasive in genetics, where two main sources of confounding exist: ancestry (here W1), which can be estimated, and linkage disequilibrium (here W2), which is usually more challenging to address.

Let us first define some helper functions and import some libraries.

using Distributions
using Random
using DataFrames
using Statistics
using CategoricalArrays
using TMLE
using CairoMakie
using MLJXGBoostInterface
using MLJ
using MLJLinearModels
Random.seed!(123)

μT(w) = [sum(w), sum(w)] # mean of the latent bivariate normal generating (T1, T2)

μY(t, w) = 1 + 10t[1] - 3t[2] * t[1] * w # outcome mean with a T1 × T2 × W1 interaction term
μY (generic function with 1 method)

We assume that W1 and W2, the confounding variables, follow a uniform distribution.

generate_W(n) = rand(Uniform(0, 1), 2, n)
generate_W (generic function with 1 method)

T1 and T2 are generated via a Gaussian copula: a bivariate normal induces statistical dependence between them (controlled by σ), and thresholding its draws yields binary treatments.

function generate_T(W, n; σ=0.5, threshold=0)
    covariance = [
        1. σ
        σ 1.
    ]
    T = zeros(Bool, 2, n)
    for i in 1:n
        # Draw the latent bivariate normal and threshold it to obtain binary treatments
        dTi = MultivariateNormal(μT(W[:, i]), covariance)
        T[:, i] = rand(dTi) .> threshold
    end
    return T
end
generate_T (generic function with 1 method)

Finally, Y is generated through a simple linear model with an interaction term.

function generate_Y(T, W1, n; σY=1)
    Y = zeros(Float64, n)
    for i in 1:n
        dY = Normal(μY(T[:, i], W1[i]), σY)
        Y[i] = rand(dY)
    end
    return Y
end
generate_Y (generic function with 1 method)

Importantly, the average interaction effect between T1 and T2 is thus $-3 \mathbb{E}[W_1] = -3 \times 0.5 = -1.5$, since $W_1 \sim \mathcal{U}(0, 1)$.
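As a quick sanity check, this value can be recovered by Monte Carlo, averaging the interaction contrast of the μY helper defined above over draws of W1 (a sketch; the sample size of 10^6 is arbitrary):

# Monte Carlo approximation of the true AIE: average the interaction
# contrast of μY over draws of W1 ~ Uniform(0, 1).
w1s = rand(Uniform(0, 1), 1_000_000)
mean(μY([1, 1], w) - μY([1, 0], w) - μY([0, 1], w) + μY([0, 0], w) for w in w1s)
# ≈ -1.5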

We will generate a full dataset with the following function.

function generate_dataset(;n=1000, σ=0.5, threshold=0., σY=1)

    W = generate_W(n)
    T = generate_T(W, n; σ=σ, threshold=threshold)

    W = permutedims(W) # reshape to n × 2: one row per observation
    W1 = W[:, 1]
    W2 = W[:, 2]

    Y = generate_Y(T, W1, n; σY=σY)

    T = permutedims(T) # reshape to n × 2: one row per observation
    T1 = categorical(T[:, 1])
    T2 = categorical(T[:, 2])

    return DataFrame(W1=W1, W2=W2, T1=T1, T2=T2, Y=Y)
end

dataset = generate_dataset()

first(dataset, 5)
5×5 DataFrame
 Row │ W1         W2        T1    T2     Y
     │ Float64    Float64   Cat…  Cat…   Float64
─────┼──────────────────────────────────────────────
   1 │ 0.9063     0.443494  true  true    7.34663
   2 │ 0.745673   0.512083  true  true    6.30484
   3 │ 0.253849   0.334152  true  true    9.73191
   4 │ 0.427328   0.867547  true  false   9.49304
   5 │ 0.0991336  0.125287  true  true   11.2495

Let's verify that each joint treatment level is sufficiently represented in the dataset (an empirical check of the positivity assumption).

combine(groupby(dataset, [:T1, :T2]), proprow => :JOINT_TREATMENT_FREQ)
4×3 DataFrame
 Row │ T1     T2     JOINT_TREATMENT_FREQ
     │ Cat…   Cat…   Float64
─────┼───────────────────────────────────
   1 │ false  false                 0.073
   2 │ false  true                  0.098
   3 │ true   false                 0.107
   4 │ true   true                  0.722

And that T1 and T2 are indeed correlated.

treatment_correlation(dataset) = cor(unwrap.(dataset.T1), unwrap.(dataset.T2))
treatment_correlation(dataset)
0.2918769031970086

Estimation

We can now proceed to estimation using TMLE and default (linear) models.

Interactions are defined via the AIE function (note that we only set W1 as a confounder).

Ψ = AIE(
    outcome=:Y,
    treatment_values=(
        T1=(case=1, control=0),
        T2=(case=1, control=0)
    ),
    treatment_confounders=[:W1]
)
linear_models = default_models(G=LogisticClassifier(lambda=0), Q_continuous=LinearRegressor())
estimator = TMLEE(models=linear_models, weighted=true)
result, _ = estimator(Ψ, dataset; verbosity=0)
significance_test(result)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -1.61436
    95% confidence interval: (-2.232, -0.997)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-06

Details:
    number of observations:   1000
    t-statistic:              -5.131324598616492
    degrees of freedom:       999
    empirical standard error: 0.3146097888636996

The true effect size (-1.5) is thus covered by our 95% confidence interval.
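This check can also be written programmatically (a short sketch reusing TMLE.estimate, significance_test and confint, the same calls used further below):

# Extract the point estimate and the bounds of the 95% confidence interval.
Ψ̂ = TMLE.estimate(result)
lb, ub = confint(significance_test(result))
# The interval covers the ground truth:
lb <= -1.5 <= ub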

Varying levels of correlation

We now vary the correlation level between T1 and T2 to observe how it affects the estimation results. First, let's see how the parameter σ affects the correlation between T1 and T2.

function plot_correlations(;σs = 0.1:0.1:1, n=1000, threshold=0., σY=1.)
    fig = Figure()
    ax = Axis(fig[1, 1], xlabel="σ", ylabel="Correlation(T1, T2)")
    correlations = map(σs) do σ
        dataset = generate_dataset(;n=n, σ=σ, threshold=threshold, σY=σY)
        return treatment_correlation(dataset)
    end
    scatter!(ax, σs, correlations, color=:blue)
    return fig
end

σs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
plot_correlations(;σs=σs, n=10_000)
[Figure: correlation between T1 and T2 as a function of σ]

As expected, the correlation between T1 and T2 increases with σ. Let's see how this affects estimation; to do so, we will vary both the dataset size and the correlation level.

function estimate_across_correlation_levels(σs; n=1000, estimator=TMLEE(weighted=true))
    results = []
    for σ in σs
        dataset = generate_dataset(n=n, σ=σ)
        result, _ = estimator(Ψ, dataset; verbosity=0)
        push!(results, result)
    end
    Ψ̂s = TMLE.estimate.(results)
    # Half-widths of the 95% confidence intervals (upper bound minus point estimate)
    errors = last.(confint.(significance_test.(results))) .- Ψ̂s
    return Ψ̂s, errors
end

function estimate_across_sample_sizes_and_correlation_levels(ns, σs; estimator=TMLEE(models=linear_models, weighted=true))
    results = []
    for n in ns
        Ψ̂s, errors = estimate_across_correlation_levels(σs; n=n, estimator=estimator)
        push!(results, (Ψ̂s, errors))
    end
    return results
end

function plot_across_sample_sizes_and_correlation_levels(results, ns, σs; title="Estimation via TMLE (GLMs)")
    fig = Figure(size=(800, 800))
    for (index, n) in enumerate(ns)
        Ψ̂s, errors = results[index]
        ax = if n == last(ns)
            Axis(fig[index, 1], xlabel="σ", ylabel="AIE\n(n=$n)")
        else
            Axis(fig[index, 1], ylabel="AIE\n(n=$n)", xticklabelsvisible=false)
        end
        errorbars!(ax, σs, Ψ̂s, errors, color = :blue, whiskerwidth = 10)
        scatter!(ax, σs, Ψ̂s, color=:red, markersize=10)
        hlines!(ax, [-1.5], color=:green, linestyle=:dash)
    end
    Label(fig[0, :], title; tellwidth=false, fontsize=24)
    return fig
end

ns = [100, 1000, 10_000, 100_000, 1_000_000]
σs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.999]
results = estimate_across_sample_sizes_and_correlation_levels(ns, σs; estimator=TMLEE(models=linear_models, weighted=true))
plot_across_sample_sizes_and_correlation_levels(results, ns, σs; title="Estimation via TMLE (GLMs)")
[Figure: Estimation via TMLE (GLMs) — AIE estimates and 95% confidence intervals across σ, one panel per sample size; dashed line at the true value -1.5]

First, notice that only extreme correlations (> 0.9) substantially inflate the confidence intervals. This implies that statistical power may be limited in such circumstances.
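To make this concrete, we can print the confidence interval half-widths at a fixed sample size (a hypothetical snippet; it assumes the (Ψ̂s, errors) tuples stored in results above, where errors holds the CI half-widths):

# CI half-widths at n = 10_000 (third entry of ns) across correlation levels.
Ψ̂s, errors = results[3]
for (σ, err) in zip(σs, errors)
    println("σ = $σ → CI half-width ≈ $(round(err, digits=3))")
end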

Furthermore, and perhaps unexpectedly, coverage degrades as the sample size grows for larger correlations. Since we have used simple linear models so far, this could be due to model misspecification. We can verify this by using a more flexible modelling strategy. Here we will use XGBoost (with tree_method="hist" to speed things up a little). Because this model is prone to overfitting, we will also use cross-validation (this will take a few minutes).

xgboost_estimator = TMLEE(
    models=default_models(G=XGBoostClassifier(tree_method="hist"), Q_continuous=XGBoostRegressor(tree_method="hist")),
    weighted=true,
    resampling=StratifiedCV(nfolds=3)
)
xgboost_results = estimate_across_sample_sizes_and_correlation_levels(ns, σs, estimator=xgboost_estimator)
plot_across_sample_sizes_and_correlation_levels(xgboost_results, ns, σs; title="Estimation via TMLE (XGBoost)")
[Figure: Estimation via TMLE (XGBoost) — AIE estimates and 95% confidence intervals across σ, one panel per sample size; dashed line at the true value -1.5]

As expected, XGBoost improves estimation performance in the asymptotic regime. Furthermore, the correlation between T1 and T2 appears harmless (except when σ > 0.9, as before).


This page was generated using Literate.jl.