Walk Through
The goal of this section is to provide a comprehensive (but non-exhaustive) illustration of the estimation process provided in TMLE.jl. For an in-depth explanation, please refer to the User Guide.
The Dataset
TMLE.jl is compatible with any dataset respecting the Tables.jl interface, that is for instance, a NamedTuple
, a DataFrame
, an Arrow.Table
etc... In this section, we will be working with the same dataset all along.
⚠️ One thing to note is that treatment variables as well as binary outcomes must be encoded as categorical
variables in the dataset (see MLJ Working with categorical data).
The dataset is generated as follows:
using TMLE
using Random
using Distributions
using DataFrames
using StableRNGs
using CategoricalArrays
using TMLE
using LogExpFunctions
using MLJLinearModels
function make_dataset(;n=1000)
rng = StableRNG(123)
# Confounders
W₁₁= rand(rng, Uniform(), n)
W₁₂ = rand(rng, Uniform(), n)
W₂₁= rand(rng, Uniform(), n)
W₂₂ = rand(rng, Uniform(), n)
# Covariates
C = rand(rng, Uniform(), n)
# Treatment | Confounders
T₁ = rand(rng, Uniform(), n) .< logistic.(0.5sin.(W₁₁) .- 1.5W₁₂)
T₂ = rand(rng, Uniform(), n) .< logistic.(-3W₂₁ - 1.5W₂₂)
# Target | Confounders, Covariates, Treatments
Y = 1 .+ 2W₂₁ .+ 3W₂₂ .+ W₁₁ .- 4C.*T₁ .- 2T₂.*T₁.*W₁₂ .+ rand(rng, Normal(0, 0.1), n)
return DataFrame(
W₁₁ = W₁₁,
W₁₂ = W₁₂,
W₂₁ = W₂₁,
W₂₂ = W₂₂,
C = C,
T₁ = categorical(T₁),
T₂ = categorical(T₂),
Y = Y
)
end
dataset = make_dataset()
Even though the role of a variable (treatment, outcome, confounder, ...) is relative to the problem setting, this dataset can intuitively be decomposed into:
- 1 Outcome variable ($Y$).
- 2 Treatment variables $(T₁, T₂)$ with confounders $(W₁₁, W₁₂)$ and $(W₂₁, W₂₂)$ respectively.
- 1 Outcome extra covariate variable ($C$).
The Structural Causal Model
The modeling stage starts from the definition of a Structural Causal Model (SCM
). This is simply a list of relationships between the random variables in our dataset. See Structural Causal Models for an in-depth explanation. For our purposes, because we know the data generating process, we can define it as follows:
scm = SCM([
:Y => [:T₁, :T₂, :W₁₁, :W₁₂, :W₂₁, :W₂₂, :C],
:T₁ => [:W₁₁, :W₁₂],
:T₂ => [:W₂₁, :W₂₂]
]
)
SCM
---
T₁ = f₂(W₁₂, W₁₁)
T₂ = f₃(W₂₂, W₂₁)
Y = f₁(C, W₁₂, W₂₂, W₂₁, W₁₁, T₁, T₂)
The Causal Estimands
From the previous causal model we can ask multiple causal questions, all represented by distinct causal estimands. The set of available estimands types can be listed as follow:
AVAILABLE_ESTIMANDS
3-element Vector{Symbol}:
:CM
:AIE
:ATE
At the moment there are 3 main causal estimands in TMLE.jl, we provide below a few examples.
- The Counterfactual Mean:
cm = CM(
outcome = :Y,
treatment_values = (T₁=true,)
)
TMLE.CausalCM(:Y, OrderedCollections.OrderedDict{Symbol, Bool}(:T₁ => 1))
- The Average Treatment Effect:
total_ate = ATE(
outcome = :Y,
treatment_values = (
T₁=(case=1, control=0),
T₂=(case=1, control=0)
)
)
marginal_ate_t1 = ATE(
outcome = :Y,
treatment_values = (T₁=(case=1, control=0),)
)
TMLE.CausalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Int64, case::Int64}}(:T₁ => (control = 0, case = 1)))
- The Average Interaction Effect:
aie = AIE(
outcome = :Y,
treatment_values = (
T₁=(case=1, control=0),
T₂=(case=1, control=0)
)
)
TMLE.CausalAIE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Int64, case::Int64}}(:T₁ => (control = 0, case = 1), :T₂ => (control = 0, case = 1)))
Identification
Identification is the process by which a Causal Estimand is turned into a Statistical Estimand, that is, a quantity we may estimate from data. This is done via the identify
function which also takes in the $SCM$:
statistical_aie = identify(aie, scm)
StatisticalAIE
- Outcome: Y
- Treatment: T₁ => (control = 0, case = 1) & T₂ => (control = 0, case = 1)
Alternatively, you can also directly define the statistical parameters (see Estimands).
Estimation
Then each parameter can be estimated by building an estimator (which is simply a function) and evaluating it on data. For illustration, we will keep the models simple. We define a Targeted Maximum Likelihood Estimator:
tmle = TMLEE()
TMLEE(Dict{Symbol, MLJBase.ProbabilisticPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, 1.0e-8, false, nothing, 1, false)
Because we haven't identified the cm
causal estimand yet, we need to provide the scm
as well to the estimator:
result, cache = tmle(cm, scm, dataset);
result
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: 1.81806
95% confidence interval: (1.644, 1.992)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: <1e-77
Details:
number of observations: 1000
t-statistic: 20.45890634199551
degrees of freedom: 999
empirical standard error: 0.08886416319776659
Statistical Estimands can be estimated without a $SCM$, let's use the One-Step estimator:
ose = OSE()
result, cache = ose(statistical_aie, dataset)
result
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: -0.559897
95% confidence interval: (-1.103, -0.01683)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0433
Details:
number of observations: 1000
t-statistic: -2.0231481754236276
degrees of freedom: 999
empirical standard error: 0.27674525293126967
Hypothesis Testing
Both TMLE and OSE asymptotically follow a Normal distribution. It means we can perform standard T/Z tests of null hypothesis. TMLE.jl extends the method provided by the HypothesisTests.jl package that can be used as follows.
OneSampleTTest(result)
One sample t-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: -0.559897
95% confidence interval: (-1.103, -0.01683)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0433
Details:
number of observations: 1000
t-statistic: -2.0231481754236276
degrees of freedom: 999
empirical standard error: 0.27674525293126967
If the estimate is high-dimensional, a OneSampleHotellingT2Test
should be performed instead. Alternatively, the significance_test
function will automatically select the appropriate test for the estimate and return its result.