Walk Through

The goal of this section is to provide a broad (but non-exhaustive) illustration of the estimation process implemented in TMLE.jl. For an in-depth explanation, please refer to the User Guide.

The Dataset

TMLE.jl is compatible with any dataset respecting the Tables.jl interface, that is, for instance, a NamedTuple, a DataFrame, an Arrow.Table, etc. Throughout this section, we will be working with the same dataset.

⚠️ One thing to note is that treatment variables as well as binary outcomes must be encoded as categorical variables in the dataset (see MLJ Working with categorical data).
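
If your data comes with raw Bool or Int columns, they can be converted with CategoricalArrays. A minimal sketch on a hypothetical raw table (the column names here are purely illustrative):

using DataFrames
using CategoricalArrays

# Hypothetical raw data: treatment stored as Bool, binary outcome stored as Int
raw = DataFrame(T = [true, false, true], Ybin = [1, 0, 1])

# Encode both as categorical before handing the dataset to TMLE.jl
raw.T    = categorical(raw.T)
raw.Ybin = categorical(raw.Ybin)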

The dataset is generated as follows:

using TMLE
using Random
using Distributions
using DataFrames
using StableRNGs
using CategoricalArrays
using LogExpFunctions
using MLJLinearModels

function make_dataset(;n=1000)
    rng = StableRNG(123)
    # Confounders
    W₁₁ = rand(rng, Uniform(), n)
    W₁₂ = rand(rng, Uniform(), n)
    W₂₁ = rand(rng, Uniform(), n)
    W₂₂ = rand(rng, Uniform(), n)
    # Covariates
    C = rand(rng, Uniform(), n)
    # Treatment | Confounders
    T₁ = rand(rng, Uniform(), n) .< logistic.(0.5sin.(W₁₁) .- 1.5W₁₂)
    T₂ = rand(rng, Uniform(), n) .< logistic.(-3W₂₁ .- 1.5W₂₂)
    # Target | Confounders, Covariates, Treatments
    Y = 1 .+ 2W₂₁ .+ 3W₂₂ .+ W₁₁ .- 4C.*T₁ .- 2T₂.*T₁.*W₁₂ .+ rand(rng, Normal(0, 0.1), n)
    return DataFrame(
        W₁₁ = W₁₁,
        W₁₂ = W₁₂,
        W₂₁ = W₂₁,
        W₂₂ = W₂₂,
        C   = C,
        T₁  = categorical(T₁),
        T₂  = categorical(T₂),
        Y   = Y
        )
end
dataset = make_dataset()
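
Since TMLE.jl expects the treatments to be categorical, we can quickly check the encoding of the generated columns (a small sanity check, not required by the package; CategoricalValue comes from CategoricalArrays, loaded above):

# Treatments are categorical, the outcome is continuous
eltype(dataset.T₁) <: CategoricalValue   # true
eltype(dataset.Y)  <: AbstractFloat      # true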

Even though the role of a variable (treatment, outcome, confounder, ...) is relative to the problem setting, this dataset can intuitively be decomposed into:

  • 1 Outcome variable ($Y$).
  • 2 Treatment variables $(T₁, T₂)$ with confounders $(W₁₁, W₁₂)$ and $(W₂₁, W₂₂)$ respectively.
  • 1 Outcome extra covariate variable ($C$).

The Structural Causal Model

The modeling stage starts from the definition of a Structural Causal Model (SCM). This is simply a list of relationships between the random variables in our dataset. See Structural Causal Models for an in-depth explanation. For our purposes, because we know the data generating process, we can define it as follows:

scm = SCM([
    :Y  => [:T₁, :T₂, :W₁₁, :W₁₂, :W₂₁, :W₂₂, :C],
    :T₁ => [:W₁₁, :W₁₂],
    :T₂ => [:W₂₁, :W₂₂]
])
SCM
---
T₁ = f₂(W₁₂, W₁₁)
T₂ = f₃(W₂₂, W₂₁)
Y = f₁(C, W₁₂, W₂₂, W₂₁, W₁₁, T₁, T₂)

The Causal Estimands

From the previous causal model we can ask multiple causal questions, each represented by a distinct causal estimand. The set of available estimand types can be listed as follows:

AVAILABLE_ESTIMANDS
3-element Vector{Symbol}:
 :CM
 :AIE
 :ATE

At the moment, there are 3 main causal estimand types in TMLE.jl. We provide a few examples below, followed by a sketch of their counterfactual interpretation.

  • The Counterfactual Mean:
cm = CM(
    outcome = :Y,
    treatment_values = (T₁=true,)
)
TMLE.CausalCM(:Y, OrderedCollections.OrderedDict{Symbol, Bool}(:T₁ => 1))
  • The Average Treatment Effect:
total_ate = ATE(
    outcome = :Y,
    treatment_values = (
        T₁=(case=1, control=0),
        T₂=(case=1, control=0)
    )
)
marginal_ate_t1 = ATE(
    outcome = :Y,
    treatment_values = (T₁=(case=1, control=0),)
)
TMLE.CausalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Int64, case::Int64}}(:T₁ => (control = 0, case = 1)))
  • The Average Interaction Effect:
aie = AIE(
    outcome = :Y,
    treatment_values = (
        T₁=(case=1, control=0),
        T₂=(case=1, control=0)
    )
)
TMLE.CausalAIE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Int64, case::Int64}}(:T₁ => (control = 0, case = 1), :T₂ => (control = 0, case = 1)))
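
Informally, writing $Y_{t₁,t₂}$ for the counterfactual outcome that would be observed had the treatments been set to $(t₁, t₂)$, the examples above can be read as follows (a sketch; see Estimands for the precise definitions):

  • cm: $CM = E[Y_{T₁=true}]$, the mean outcome had everyone received $T₁ = true$.
  • total_ate: $ATE = E[Y_{1,1}] - E[Y_{0,0}]$, the mean difference in outcome between setting both treatments to their case values and setting both to their control values.
  • aie: $AIE = E[Y_{1,1}] - E[Y_{1,0}] - E[Y_{0,1}] + E[Y_{0,0}]$, the two-way interaction contrast between the treatments.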

Identification

Identification is the process by which a Causal Estimand is turned into a Statistical Estimand, that is, a quantity we may estimate from data. This is done via the identify function, which also takes the $SCM$ as input:

statistical_aie = identify(aie, scm)
StatisticalAIE
- Outcome: Y
- Treatment: T₁ => (control = 0, case = 1) & T₂ => (control = 0, case = 1)

Alternatively, you can also directly define the statistical parameters (see Estimands).
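
For instance, the AIE above could be written down directly as a statistical estimand by providing the adjustment sets explicitly. A sketch, assuming the `treatment_confounders` and `outcome_extra_covariates` keyword arguments described in the Estimands section:

statistical_aie_direct = AIE(
    outcome = :Y,
    treatment_values = (
        T₁=(case=1, control=0),
        T₂=(case=1, control=0)
    ),
    treatment_confounders = (
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂]
    ),
    outcome_extra_covariates = [:C]
)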

Estimation

Each estimand can then be estimated by building an estimator (which is simply a function) and evaluating it on data. For illustration, we will keep the models simple. We define a Targeted Maximum Likelihood Estimator:

tmle = TMLEE()
TMLEE(Dict{Symbol, MLJBase.ProbabilisticPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, 1.0e-8, false, nothing, 1, false)
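
The default models shown above are MLJ pipelines; they can be swapped for any MLJ model. A sketch, assuming the `default_models` helper and the `models` keyword documented in the User Guide:

using MLJLinearModels

# Explicitly choose linear/logistic models for Q (outcome) and G (propensity score)
models = default_models(
    Q_continuous = LinearRegressor(),
    Q_binary     = LogisticClassifier(lambda=0.),
    G            = LogisticClassifier(lambda=0.)
)
tmle_custom = TMLEE(models=models)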

Because we haven't identified the cm causal estimand yet, we also need to provide the scm to the estimator:

result, cache = tmle(cm, scm, dataset);
result
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          1.81806
    95% confidence interval: (1.644, 1.992)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-77

Details:
    number of observations:   1000
    t-statistic:              20.45890634199551
    degrees of freedom:       999
    empirical standard error: 0.08886416319776659
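
The other causal estimands defined earlier can be estimated in exactly the same way; for instance, for the total average treatment effect (output omitted):

ate_result, cache = tmle(total_ate, scm, dataset);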

Statistical estimands can be estimated without an $SCM$; this time, let's use the One-Step estimator:

ose = OSE()
result, cache = ose(statistical_aie, dataset)
result
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -0.559897
    95% confidence interval: (-1.103, -0.01683)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0433

Details:
    number of observations:   1000
    t-statistic:              -2.0231481754236276
    degrees of freedom:       999
    empirical standard error: 0.27674525293126967

Hypothesis Testing

Both the TMLE and OSE estimators asymptotically follow a Normal distribution, which means we can perform standard T/Z tests of the null hypothesis. TMLE.jl extends the methods provided by the HypothesisTests.jl package, which can be used as follows.

OneSampleTTest(result)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -0.559897
    95% confidence interval: (-1.103, -0.01683)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0433

Details:
    number of observations:   1000
    t-statistic:              -2.0231481754236276
    degrees of freedom:       999
    empirical standard error: 0.27674525293126967

If the estimate is high-dimensional, a OneSampleHotellingT2Test should be performed instead. Alternatively, the significance_test function will automatically select the appropriate test for the estimate and return its result.
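
For example, a minimal usage sketch (pvalue and confint are the standard HypothesisTests.jl accessors):

using HypothesisTests

test_result = significance_test(result)  # here a one-sample t-test, since the estimate is scalar
pvalue(test_result)
confint(test_result)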