Walk Through

The goal of this section is to provide a comprehensive (but non-exhaustive) illustration of the estimation process provided in TMLE.jl. For an in-depth explanation, please refer to the User Guide.

The Dataset

TMLE.jl is compatible with any dataset wrapped in a DataFrame. Other tabular sources, for instance an Arrow Table, can be wrapped in a DataFrame object first. In this section, we will work with the same dataset throughout.
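As a minimal sketch of the Arrow case, assuming the Arrow.jl package is installed, the snippet below writes a small table to a temporary file, reads it back as an Arrow Table and wraps it in a DataFrame. The column names and values are made up for illustration and are unrelated to the dataset used in the rest of this page.

```julia
using Arrow
using DataFrames

# Write a small illustrative table to a temporary Arrow file
path = tempname() * ".arrow"
Arrow.write(path, (W = [0.1, 0.7, 0.4], Y = [1.2, 3.4, 2.1]))

# Read the file back as an Arrow Table, then wrap it in a DataFrame
table = Arrow.Table(path)
df = DataFrame(table)
```

The resulting `df` can then be passed wherever this walk-through uses `dataset`, provided its columns are encoded as described next.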

⚠️ One thing to note is that treatment variables as well as binary outcomes must be encoded as categorical variables in the dataset (see MLJ Working with categorical data).
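The encoding requirement can be met with `categorical` from CategoricalArrays.jl. The sketch below uses a hypothetical three-row table (`raw`, `T`, `Ybin` and `W` are illustrative names only) and converts the treatment and the binary outcome while leaving the continuous column untouched.

```julia
using CategoricalArrays
using DataFrames

# Hypothetical raw data: a Bool treatment, a 0/1 outcome and a continuous covariate
raw = DataFrame(T = [true, false, true], Ybin = [0, 1, 1], W = [0.2, 0.5, 0.9])

# Encode the treatment and the binary outcome as categorical columns
raw.T = categorical(raw.T)
raw.Ybin = categorical(raw.Ybin)

levels(raw.T)     # [false, true]
levels(raw.Ybin)  # [0, 1]
```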

The dataset is generated as follows:

using TMLE
using Random
using Distributions
using DataFrames
using StableRNGs
using CategoricalArrays
using LogExpFunctions
using MLJLinearModels

function make_dataset(;n=1000)
    rng = StableRNG(123)
    # Confounders
    W₁₁ = rand(rng, Uniform(), n)
    W₁₂ = rand(rng, Uniform(), n)
    W₂₁ = rand(rng, Uniform(), n)
    W₂₂ = rand(rng, Uniform(), n)
    # Covariates
    C = rand(rng, Uniform(), n)
    # Treatment | Confounders
    T₁ = rand(rng, Uniform(), n) .< logistic.(0.5sin.(W₁₁) .- 1.5W₁₂)
    T₂ = rand(rng, Uniform(), n) .< logistic.(-3W₂₁ .- 1.5W₂₂)
    # Outcome | Confounders, Covariates, Treatments
    Y = 1 .+ 2W₂₁ .+ 3W₂₂ .+ W₁₁ .- 4C.*T₁ .- 2T₂.*T₁.*W₁₂ .+ rand(rng, Normal(0, 0.1), n)
    return DataFrame(
        W₁₁ = W₁₁,
        W₁₂ = W₁₂,
        W₂₁ = W₂₁,
        W₂₂ = W₂₂,
        C   = C,
        T₁  = categorical(T₁),
        T₂  = categorical(T₂),
        Y   = Y
        )
end
dataset = make_dataset()

Even though the role of a variable (treatment, outcome, confounder, ...) is relative to the problem setting, this dataset can intuitively be decomposed into:

One outcome variable ($Y$).
Two treatment variables $(T₁, T₂)$ with confounders $(W₁₁, W₁₂)$ and $(W₂₁, W₂₂)$ respectively.
One extra outcome covariate ($C$).

The Structural Causal Model

The modeling stage starts from the definition of a Structural Causal Model (SCM). This is simply a list of relationships between the random variables in our dataset. See Structural Causal Models for an in-depth explanation. For our purposes, because we know the data generating process, we can define it as follows:

scm = SCM([
    :Y  => [:T₁, :T₂, :W₁₁, :W₁₂, :W₂₁, :W₂₂, :C],
    :T₁ => [:W₁₁, :W₁₂],
    :T₂ => [:W₂₁, :W₂₂]
])

SCM
---
T₁ = f₂(W₁₂, W₁₁)
T₂ = f₃(W₂₂, W₂₁)
Y = f₁(C, W₁₂, W₂₂, W₂₁, W₁₁, T₁, T₂)

The Causal Estimands

From the previous causal model we can ask multiple causal questions, each represented by a distinct causal estimand. The available estimand types can be listed as follows:

AVAILABLE_ESTIMANDS

3-element Vector{Symbol}:
 :CM
 :AIE
 :ATE

At the moment there are 3 main causal estimands in TMLE.jl. A few examples are given below.

The Counterfactual Mean:

cm = CM(
    outcome = :Y,
    treatment_values = (T₁=true,)
)

CausalCM
	- Outcome: Y
	- Treatment: T₁ => true

The Average Treatment Effect:

total_ate = ATE(
    outcome = :Y,
    treatment_values = (
        T₁=(case=1, control=0),
        T₂=(case=1, control=0)
    )
)
marginal_ate_t1 = ATE(
    outcome = :Y,
    treatment_values = (T₁=(case=1, control=0),)
)

CausalATE
	- Outcome: Y
	- Treatment: T₁ => (control = 0, case = 1)

The Average Interaction Effect:

aie = AIE(
    outcome = :Y,
    treatment_values = (
        T₁=(case=1, control=0),
        T₂=(case=1, control=0)
    )
)

CausalAIE
	- Outcome: Y
	- Treatment: T₁ => (control = 0, case = 1) & T₂ => (control = 0, case = 1)
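For reference, the three estimands constructed above correspond to the following textbook quantities. They are written here in do-notation with binary treatments, which is a notational assumption of this sketch rather than something printed by TMLE.jl.

```latex
\begin{aligned}
CM_{t_1}(Y) &= \mathbb{E}[Y \mid do(T_1 = t_1)] \\
ATE_{T_1: 0 \to 1}(Y) &= \mathbb{E}[Y \mid do(T_1 = 1)] - \mathbb{E}[Y \mid do(T_1 = 0)] \\
AIE_{T_1, T_2}(Y) &= \mathbb{E}[Y \mid do(T_1 = 1, T_2 = 1)] - \mathbb{E}[Y \mid do(T_1 = 1, T_2 = 0)] \\
&\quad - \mathbb{E}[Y \mid do(T_1 = 0, T_2 = 1)] + \mathbb{E}[Y \mid do(T_1 = 0, T_2 = 0)]
\end{aligned}
```

The joint ATE over $(T₁, T₂)$ follows the same pattern as the marginal one, contrasting $do(T_1 = 1, T_2 = 1)$ with $do(T_1 = 0, T_2 = 0)$.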

Identification

Identification is the process by which a Causal Estimand is turned into a Statistical Estimand, that is, a quantity we may estimate from data. This is done via the identify function, which also takes the SCM as an argument:

statistical_aie = identify(aie, scm)

StatisticalAIE
	- Outcome: Y
	- Treatment: T₁ => (control = 0, case = 1) & T₂ => (control = 0, case = 1)
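To make the step concrete, and assuming identification proceeds by back-door adjustment on the treatment confounders (the printed output above does not show the adjustment set), each interventional mean is replaced by a functional of the observed data distribution:

```latex
\mathbb{E}[Y \mid do(T = t)] \;=\; \mathbb{E}_{W}\big[\, \mathbb{E}[Y \mid T = t, W] \,\big]
```

Under the SCM defined earlier, $W$ would be $(W₁₁, W₁₂, W₂₁, W₂₂)$ for the joint treatment $(T₁, T₂)$. The right-hand side involves only conditional expectations and marginal distributions of observed variables, which is why it can be estimated from the dataset.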

Alternatively, you can also define the statistical estimands directly (see Estimands).

Estimation

Each estimand can then be estimated by building an estimator (which is simply a function) and evaluating it on data. For illustration, we will keep the models simple. We first define a Targeted Maximum Likelihood Estimator:

tmle = Tmle()

Tmle(Dict{Symbol, MLJBase.ProbabilisticPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, nothing, 1.0e-8, true, nothing, 1, false, nothing)

Because we haven't identified the cm causal estimand yet, we also need to provide the scm to the estimator:

result, cache = tmle(cm, scm, dataset);
result

Targeted Minimum Loss Based Estimator
-------------------------------------
- point estimate         : 1.8180
- 95% confidence interval: [1.6436, 1.9924]
- p-value                : 5.80e-78
- mean influence curve   : -2.95e-17

Full test results can be obtained with `significance_test`

Statistical Estimands can be estimated without an SCM. Let's use the One-Step estimator:

ose = Ose()
result, cache = ose(statistical_aie, dataset)
result

One Step Estimator
------------------
- point estimate         : -0.6018
- 95% confidence interval: [-1.2989, 0.0953]
- p-value                : 9.06e-02
- mean influence curve   : -5.77e-18

Full test results can be obtained with `significance_test`
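Because the data generating process is known, the estimates can be sanity-checked against a Monte Carlo approximation of the true values. The sketch below does not use TMLE.jl at all: it re-implements the noise-free outcome equation from `make_dataset` and averages potential outcomes over a large simulated population. The names `μY` and `ground_truth`, the sample size and the `MersenneTwister` seed are choices made for this illustration only.

```julia
using Random
using Statistics

logistic(x) = 1 / (1 + exp(-x))

# Noise-free structural equation for Y, copied from make_dataset
μY(t₁, t₂, W₁₁, W₁₂, W₂₁, W₂₂, C) =
    1 + 2W₂₁ + 3W₂₂ + W₁₁ - 4C * t₁ - 2t₂ * t₁ * W₁₂

function ground_truth(; n=1_000_000, rng=MersenneTwister(0))
    W₁₁, W₁₂ = rand(rng, n), rand(rng, n)
    W₂₁, W₂₂ = rand(rng, n), rand(rng, n)
    C = rand(rng, n)
    # T₂ keeps its natural distribution when only T₁ is intervened on
    T₂ = rand(rng, n) .< logistic.(-3 .* W₂₁ .- 1.5 .* W₂₂)
    # Potential outcome means under fixed treatment values
    y(t₁, t₂) = μY.(t₁, t₂, W₁₁, W₁₂, W₂₁, W₂₂, C)
    cm = mean(y(1, T₂))                                   # E[Y | do(T₁ = 1)]
    aie = mean(y(1, 1) .- y(1, 0) .- y(0, 1) .+ y(0, 0))  # interaction contrast
    return (cm = cm, aie = aie)
end

truth = ground_truth()
```

In this data generating process the interaction contrast simplifies to $-2W₁₂$ for every individual, so the true interaction effect is $-1$, which lies inside the One-Step confidence interval reported above.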

Hypothesis Testing

Both the TMLE and the OSE are asymptotically normally distributed, so standard T/Z tests of a null hypothesis apply. TMLE.jl extends the methods provided by the HypothesisTests.jl package, which can be used as follows.

OneSampleTTest(result)

One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -0.601797
    95% confidence interval: (-1.299, 0.09533)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.0906

Details:
    number of observations:   1000
    t-statistic:              -1.6939973061679232
    degrees of freedom:       999
    empirical standard error: 0.35525264646833477
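The figures in this report follow from the point estimate and the empirical standard error alone. As a minimal sketch using only Distributions.jl, with the values copied from the report above and variable names chosen for illustration, the t-statistic, two-sided p-value and 95% confidence interval can be recomputed by hand:

```julia
using Distributions

# Values copied from the test report above
estimate = -0.601797
se = 0.35525264646833477
n = 1000

dist = TDist(n - 1)                        # Student t with n - 1 degrees of freedom
tstat = estimate / se                      # test statistic under h_0: mean = 0
pvalue = 2 * cdf(dist, -abs(tstat))        # two-sided p-value
halfwidth = quantile(dist, 0.975) * se     # half-width of the 95% interval
ci = (estimate - halfwidth, estimate + halfwidth)
```

With 999 degrees of freedom the t distribution is very close to a standard Normal, so a Z test would give nearly the same answer here.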

If the estimate is high-dimensional, a OneSampleHotellingT2Test should be performed instead. Alternatively, the significance_test function will automatically select the appropriate test for the estimate and return its result.