Estimation

Constructing and Using Estimators

Once a statistical estimand has been defined, we can proceed with estimation. There are two semi-parametric efficient estimators in TMLE.jl:

  • The Targeted Maximum-Likelihood Estimator (TMLEE)
  • The One-Step Estimator (OSE)

While they have similar asymptotic properties, their finite-sample performance may differ. One distinguishing feature is that the TMLE is a plug-in estimator: it respects the natural bounds of the estimand of interest (for example, a probability stays within [0, 1]). In contrast, the OSE may, in theory, report values outside these bounds. In practice this is rarely an issue, and the estimand of interest may not impose any restriction on its domain anyway.
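
Both estimators are built with a simple constructor and share the same calling interface, as sketched below (a minimal sketch; the estimand Ψ and dataset come from the Walk Through section):

using TMLE
tmle = TMLEE() # Targeted Maximum-Likelihood Estimator
ose  = OSE()   # One-Step Estimator
# Both are then called as: result, cache = estimator(Ψ, dataset)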

Drawing from the example dataset and SCM from the Walk Through section, we can estimate the ATE for T₁. Let's use TMLE:

Ψ₁ = ATE(
    outcome=:Y,
    treatment_values=(T₁=(case=true, control=false),),
    treatment_confounders=(T₁=[:W₁₁, :W₁₂],),
    outcome_extra_covariates=[:C]
)
tmle = TMLEE()
result₁, cache = tmle(Ψ₁, dataset);
result₁
┌ Info: Required Relevant Factors:
│ - P₀(Y | C, T₁, W₁₁, W₁₂)
└ - P₀(T₁ | W₁₁, W₁₂)
[ Info: Estimating: P₀(T₁ | W₁₁, W₁₂)
[ Info: Estimating: P₀(Y | C, T₁, W₁₁, W₁₂)
[ Info: Performing TMLE...
[ Info: Estimating: P₀(Y | C, T₁, W₁₁, W₁₂)
[ Info: TMLE step: 1.
[ Info: Convergence criterion reached.
[ Info: Done.

The cache (see below) contains estimates for the nuisance functions that were necessary to estimate the ATE.

The result₁ structure holds the estimation result and, when displayed, shows the result of a T-Test including:

  • A point estimate.
  • A 95% confidence interval.
  • A p-value (corresponding to the test that the estimand is different from 0).
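
The point estimate itself can also be extracted programmatically with the estimate function (used again at the end of this section), for example:

estimate(result₁) # point estimate of the ATE for T₁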

Because both the TMLE and OSE are asymptotically linear estimators, standard Z/T tests from HypothesisTests.jl can be performed, and the confint and pvalue methods can be used.

tmle_test_result₁ = pvalue(OneSampleTTest(result₁))
0.0
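
A confidence interval can be obtained similarly via confint from HypothesisTests.jl, for example:

confint(OneSampleTTest(result₁)) # 95% confidence interval by default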

Let us now turn to the Average Treatment Effect of T₂, which we will estimate with the OSE:

Ψ₂ = ATE(
    outcome=:Y,
    treatment_values=(T₂=(case=true, control=false),),
    treatment_confounders=(T₂=[:W₂₁, :W₂₂],),
    outcome_extra_covariates=[:C]
)
ose = OSE()
result₂, cache = ose(Ψ₂, dataset;cache=cache);
result₂
┌ Info: Required Relevant Factors:
│ - P₀(Y | C, T₂, W₂₁, W₂₂)
└ - P₀(T₂ | W₂₁, W₂₂)
[ Info: Estimating: P₀(T₂ | W₂₁, W₂₂)
[ Info: Estimating: P₀(Y | C, T₂, W₂₁, W₂₂)
[ Info: Done.

Again, required nuisance functions are fitted and stored in the cache.

Specifying Models

By default, TMLE.jl uses generalized linear models to estimate the relevant nuisance factors, such as the outcome mean and the propensity score. However, this is not the recommended usage, since the estimators' performance depends closely on how well these factors are estimated. More sophisticated models can be provided via the models keyword argument of each estimator, which is a Dict{Symbol, Model} mapping variable names to their respective models.

Rather than specifying a model for each variable individually, it may be easier to override the default models using the default_models function. For example, one can override all default models with XGBoost models from MLJXGBoostInterface:

using MLJXGBoostInterface
xgboost_regressor = XGBoostRegressor()
xgboost_classifier = XGBoostClassifier()
models = default_models(
    Q_binary     = xgboost_classifier,
    Q_continuous = xgboost_regressor,
    G            = xgboost_classifier
)
tmle_gboost = TMLEE(models=models)
TMLEE(Dict{Symbol, MLJModelInterface.Supervised}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => DeterministicPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, 1.0e-8, false, nothing, 1, false)

The advantage of using default_models is that it automatically prepends a ContinuousEncoder to each model, to make sure the correct types are passed to the downstream models.
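
Concretely, each entry of the returned dictionary is roughly equivalent to an MLJ pipeline of the following form (a sketch, assuming MLJ is loaded for ContinuousEncoder and the |> pipeline syntax):

ContinuousEncoder() |> xgboost_classifier # convert all features to Continuous (one-hot, etc.) before the classifier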

Super Learning (Stack) and variable-specific models can also be used. Here is a more customized version:

using MLJ             # provides Stack (if not already loaded)
using MLJLinearModels # provides LogisticClassifier

lr = LogisticClassifier(lambda=0.)
stack_binary = Stack(
    metalearner=lr,
    xgboost=xgboost_classifier,
    lr=lr
)

models = default_models( # For all non-specified variables use the following defaults
        Q_binary     = stack_binary, # A Super Learner
        Q_continuous = xgboost_regressor, # An XGBoost
        # T₁ with XGBoost prepended with a Continuous Encoder
        T₁           = xgboost_classifier
        # Unspecified G defaults to Logistic Regression
)

tmle_custom = TMLEE(models=models)
TMLEE(Dict{Symbol, MLJModelInterface.Supervised}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :T₁ => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => DeterministicPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, 1.0e-8, false, nothing, 1, false)

Notice that with_encoder is simply a shorthand to construct a pipeline with a ContinuousEncoder, and that the resulting models object is simply a Dict.
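
Since models is a plain Dict, entries can also be added or overridden after the fact; a minimal sketch, assuming with_encoder accepts a single model:

models[:T₂] = with_encoder(xgboost_classifier) # wrap T₂'s model in a ContinuousEncoder pipeline, like the defaults
keys(models)                                   # the *_default entries plus any variable-specific ones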

CV-Estimation

Canonical TMLE/OSE essentially use the dataset twice: once to estimate the nuisance functions and once to estimate the parameter of interest. This means there is a risk of over-fitting and residual bias (see here for some discussion). One way to address this limitation is a technique called sample-splitting / cross-validation. To activate the sample-splitting mode, simply provide an MLJ.ResamplingStrategy via the resampling keyword argument:

TMLEE(resampling=StratifiedCV());
TMLEE(Dict{Symbol, MLJBase.ProbabilisticPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), StratifiedCV(nfolds = 6, …), 1.0e-8, false, nothing, 1, false)

or

OSE(resampling=StratifiedCV(nfolds=3));

There are some practical considerations:

  • Choice of resampling strategy: The theory behind sample-splitting requires the nuisance functions to be sufficiently well estimated on each and every fold. A practical consequence is that each fold should contain a sample representative of the dataset. In particular, when the treatment and outcome variables are categorical, it is important to make sure the proportions are preserved. This is typically done using StratifiedCV (see the sketch after this list).
  • Computational Complexity: Sample-splitting results in $K$ fits of the nuisance functions, drastically increasing computational complexity. In particular, if the nuisance functions are estimated using (P-fold) Super-Learning, this will result in two nested cross-validation loops and $K \times P$ fits.
  • Caching of Nuisance Functions: Because the resampling strategy typically needs to preserve the outcome and treatment proportions, very little reuse of cached models is possible (see Using the Cache).
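
Putting these considerations together, a CV-TMLE combining stratified 3-fold sample-splitting with the XGBoost models defined above might look like this (a minimal sketch):

cv_tmle = TMLEE(models=models, resampling=StratifiedCV(nfolds=3))
result_cv, cache_cv = cv_tmle(Ψ₁, dataset);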

Using the Cache

TMLE and OSE are expensive procedures; it may therefore be useful to store some information for further reuse. This is the purpose of the cache object, which is produced as a by-product of the estimation process.

Reusing Models

The cache contains, in particular, the machine-learning models that were fitted in the process, which can sometimes be reused to estimate other quantities of interest. For example, say we are now interested in the Joint Average Treatment Effect of both T₁ and T₂. We can provide the cache to the next round of estimation as follows.

Ψ₃ = ATE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=true, control=false),
        T₂=(case=true, control=false)
    ),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
    outcome_extra_covariates=[:C]
)
result₃, cache = tmle(Ψ₃, dataset; cache=cache);
result₃
┌ Info: Required Relevant Factors:
│ - P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
│ - P₀(T₁ | W₁₁, W₁₂)
└ - P₀(T₂ | W₂₁, W₂₂)
[ Info: Reusing estimate for: P₀(T₁ | W₁₁, W₁₂)
[ Info: Reusing estimate for: P₀(T₂ | W₂₁, W₂₂)
[ Info: Estimating: P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
[ Info: Performing TMLE...
[ Info: Estimating: P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
[ Info: TMLE step: 1.
[ Info: Convergence criterion reached.
[ Info: Done.

Only the conditional distribution of Y given T₁ and T₂ is fitted as it is absent from the cache. However, the propensity scores corresponding to T₁ and T₂ have been reused. Finally, let's see what happens if we estimate the interaction effect between T₁ and T₂ on Y.

Ψ₄ = AIE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=true, control=false),
        T₂=(case=true, control=false)
    ),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
    outcome_extra_covariates=[:C]
)
result₄, cache = tmle(Ψ₄, dataset; cache=cache);
result₄
┌ Info: Reusing estimate for: Relevant Factors:
│ - P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
│ - P₀(T₁ | W₁₁, W₁₂)
└ - P₀(T₂ | W₂₁, W₂₂)
[ Info: Performing TMLE...
[ Info: Estimating: P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
[ Info: TMLE step: 1.
[ Info: Convergence criterion reached.
[ Info: Done.

All nuisance functions have been reused; only the fluctuation is fitted!

Accessing Fluctuations' Reports (Advanced)

If TMLE was used, the cache also holds the last targeted factor that was estimated. Some key information related to the targeting steps can be accessed, for example:

gradients(cache);
estimates(cache);
epsilons(cache)
1-element Vector{Any}:
 [-0.014339232249731193]

correspond to the gradients, point estimates, and epsilons obtained after each targeting step that was performed (usually only one).

One can, for instance, check that the mean of the gradient is close to zero.

mean(last(gradients(cache)))
-5.115907697472721e-17

Joint Estimands and Composition

As explained in Joint And Composed Estimands, a joint estimand is simply a collection of estimands. Here, we will illustrate that an Average Interaction Effect can also be expressed as a difference of partial Average Treatment Effects.

More precisely, we would like to see if the left-hand side of this equation is equal to the right-hand side:

\[AIE_{T_1=0 \rightarrow 1, T_2=0 \rightarrow 1} = ATE_{T_1=0 \rightarrow 1, T_2=0 \rightarrow 1} - ATE_{T_1=0, T_2=0 \rightarrow 1} - ATE_{T_1=0 \rightarrow 1, T_2=0}\]
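
To see why, expand each term into counterfactual means, writing $E[Y(t_1, t_2)]$ for the mean outcome under the intervention $T_1 = t_1, T_2 = t_2$ (notation used here only for illustration):

\[\begin{aligned}
ATE_{T_1=0 \rightarrow 1, T_2=0 \rightarrow 1} &= E[Y(1,1)] - E[Y(0,0)], \\
ATE_{T_1=0, T_2=0 \rightarrow 1} &= E[Y(0,1)] - E[Y(0,0)], \\
ATE_{T_1=0 \rightarrow 1, T_2=0} &= E[Y(1,0)] - E[Y(0,0)],
\end{aligned}\]

so the right-hand side reduces to $E[Y(1,1)] - E[Y(1,0)] - E[Y(0,1)] + E[Y(0,0)]$, which is the Average Interaction Effect.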

For that, we need to define a joint estimand of three components:

ATE₁ = ATE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=true, control=false),
        T₂=(case=false, control=false)),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
)
ATE₂ = ATE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=false, control=false),
        T₂=(case=true, control=false)),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
)
joint_estimand = JointEstimand(Ψ₃, ATE₁, ATE₂)
Joint Estimand:
--------------
- TMLE.StatisticalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Bool, case::Bool}}(:T₁ => (control = 0, case = 1), :T₂ => (control = 0, case = 1)), OrderedCollections.OrderedDict(:T₁ => (:W₁₁, :W₁₂), :T₂ => (:W₂₁, :W₂₂)), (:C,))
- TMLE.StatisticalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Bool, case::Bool}}(:T₁ => (control = 0, case = 1), :T₂ => (control = 0, case = 0)), OrderedCollections.OrderedDict(:T₁ => (:W₁₁, :W₁₂), :T₂ => (:W₂₁, :W₂₂)), ())
- TMLE.StatisticalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Bool, case::Bool}}(:T₁ => (control = 0, case = 0), :T₂ => (control = 0, case = 1)), OrderedCollections.OrderedDict(:T₁ => (:W₁₁, :W₁₂), :T₂ => (:W₂₁, :W₂₂)), ())

where Ψ₃, the Average Treatment Effect of both T₁ and T₂, was defined earlier. This joint estimand can be estimated like any other estimand using our estimator of choice:

joint_estimate, cache = tmle(joint_estimand, dataset, cache=cache, verbosity=0);
joint_estimate
One sample Hotelling's T² test
------------------------------
Population details:
    parameter of interest:   Mean vector
    value under h_0:         [0.0, 0.0, 0.0]
    point estimate:          [-3.10896, -1.97862, 0.00385808]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           <1e-99

Details:
    number of observations: 10000
    number of variables:    3
    T² statistic:           12003.7
    transformed statistic:  4000.44
    degrees of freedom:     (3, 9997)
    covariance matrix:
        31.2489       0.000187586   0.111722
         0.000187586  4.39377       0.0241609
         0.111722     0.0241609    12.591

The printed output is the result of a Hotelling's T² test, which is the multivariate counterpart of the Student's T-Test. It tells us whether any component of this joint estimand is different from 0.

Then we can formally test our hypothesis by leveraging the multivariate Central Limit Theorem and Julia's automatic differentiation.

composed_result = compose(x -> x[1] - x[2] - x[3], joint_estimate)
isapprox(
    estimate(result₄),
    first(estimate(composed_result)),
    atol=0.1
)
true

By default, TMLE.jl will use Zygote, but since we rely on DifferentiationInterface.jl, you can change the backend to your favorite AD system.