Estimation

Constructing and Using Estimators

Once a statistical estimand has been defined, we can proceed with estimation. There are two semi-parametric efficient estimators in TMLE.jl:

  • The Targeted Maximum-Likelihood Estimator (TMLEE)
  • The One-Step Estimator (OSE)

While they have similar asymptotic properties, their finite-sample performance may differ. One distinguishing feature is that the TMLE is a plug-in estimator: it respects the natural bounds of the estimand of interest (for example, a probability stays within [0, 1]). In contrast, the OSE may, in theory, report values outside these bounds. In practice this is rarely an issue, and the estimand of interest may not impose any restriction on its domain anyway.
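
Both estimators are built with a simple constructor and share the same calling interface, as sketched below (a minimal sketch; the estimand Ψ and dataset come from the Walk Through section):

using TMLE
tmle = TMLEE() # Targeted Maximum-Likelihood Estimator
ose  = OSE()   # One-Step Estimator
# Both are then called as: result, cache = estimator(Ψ, dataset)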

Drawing from the example dataset and SCM from the Walk Through section, we can estimate the ATE for T₁. Let's use TMLE:

Ψ₁ = ATE(
    outcome=:Y,
    treatment_values=(T₁=(case=true, control=false),),
    treatment_confounders=(T₁=[:W₁₁, :W₁₂],),
    outcome_extra_covariates=[:C]
)
tmle = TMLEE()
result₁, cache = tmle(Ψ₁, dataset);
result₁
┌ Info: Required Relevant Factors:
│ - P₀(Y | C, T₁, W₁₁, W₁₂)
└ - P₀(T₁ | W₁₁, W₁₂)
[ Info: Estimating: P₀(T₁ | W₁₁, W₁₂)
[ Info: Estimating: P₀(Y | C, T₁, W₁₁, W₁₂)
[ Info: Performing TMLE...
[ Info: Estimating: P₀(Y | C, T₁, W₁₁, W₁₂)
[ Info: TMLE step: 1.
[ Info: Convergence criterion reached.
[ Info: Done.

The cache (see below) contains estimates for the nuisance functions that were necessary to estimate the ATE.

The result₁ structure holds the estimation result and, when displayed, shows the result of a T-Test including:

  • A point estimate.
  • A 95% confidence interval.
  • A p-value (corresponding to the test that the estimand is different from 0).
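
The point estimate itself can also be extracted programmatically with the estimate function (used again at the end of this section), for example:

estimate(result₁) # point estimate of the ATE for T₁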

Because both the TMLE and OSE are asymptotically linear estimators, standard Z/T tests from HypothesisTests.jl can be performed, and the confint and pvalue methods can be used.

tmle_test_result₁ = pvalue(OneSampleTTest(result₁))
0.0
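
A confidence interval can be obtained similarly via confint from HypothesisTests.jl, for example:

confint(OneSampleTTest(result₁)) # 95% confidence interval by default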

Let us now turn to the Average Treatment Effect of T₂, which we will estimate with the OSE:

Ψ₂ = ATE(
    outcome=:Y,
    treatment_values=(T₂=(case=true, control=false),),
    treatment_confounders=(T₂=[:W₂₁, :W₂₂],),
    outcome_extra_covariates=[:C]
)
ose = OSE()
result₂, cache = ose(Ψ₂, dataset;cache=cache);
result₂
┌ Info: Required Relevant Factors:
│ - P₀(Y | C, T₂, W₂₁, W₂₂)
└ - P₀(T₂ | W₂₁, W₂₂)
[ Info: Estimating: P₀(T₂ | W₂₁, W₂₂)
[ Info: Estimating: P₀(Y | C, T₂, W₂₁, W₂₂)
[ Info: Done.

Again, required nuisance functions are fitted and stored in the cache.

Specifying Models

By default, TMLE.jl uses generalized linear models to estimate the relevant nuisance factors, such as the outcome mean and the propensity score. However, this is not the recommended usage, since the estimators' performance depends closely on how well these factors are estimated. More sophisticated models can be provided via the models keyword argument of each estimator, which is a Dict{Symbol, Model} mapping variable names to their respective models.

Rather than specifying a model for each variable individually, it may be easier to override the default models using the default_models function. For example, one can override all default models with XGBoost models from MLJXGBoostInterface:

using MLJXGBoostInterface
xgboost_regressor = XGBoostRegressor()
xgboost_classifier = XGBoostClassifier()
models = default_models(
    Q_binary     = xgboost_classifier,
    Q_continuous = xgboost_regressor,
    G            = xgboost_classifier
)
tmle_gboost = TMLEE(models=models)
TMLEE(Dict{Symbol, MLJModelInterface.Supervised}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => DeterministicPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, 1.0e-8, false, nothing, 1, false)

The advantage of using default_models is that it automatically prepends a ContinuousEncoder to each model, to make sure the correct types are passed to the downstream models.
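
Concretely, each entry of the returned dictionary is roughly equivalent to an MLJ pipeline of the following form (a sketch, assuming MLJ is loaded for ContinuousEncoder and the |> pipeline syntax):

ContinuousEncoder() |> xgboost_classifier # convert all features to Continuous (one-hot, etc.) before the classifier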

Super Learning (Stack) and variable-specific models can also be used. Here is a more customized version:

using MLJ             # provides Stack (if not already loaded)
using MLJLinearModels # provides LogisticClassifier

lr = LogisticClassifier(lambda=0.)
stack_binary = Stack(
    metalearner=lr,
    xgboost=xgboost_classifier,
    lr=lr
)

models = default_models( # For all non-specified variables use the following defaults
        Q_binary     = stack_binary, # A Super Learner
        Q_continuous = xgboost_regressor, # An XGBoost
        # T₁ with XGBoost prepended with a Continuous Encoder
        T₁           = xgboost_classifier
        # Unspecified G defaults to Logistic Regression
)

tmle_custom = TMLEE(models=models)
TMLEE(Dict{Symbol, MLJModelInterface.Supervised}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :T₁ => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => DeterministicPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), nothing, 1.0e-8, false, nothing, 1, false)

Notice that with_encoder is simply a shorthand to construct a pipeline with a ContinuousEncoder, and that the resulting models object is simply a Dict.
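
Since models is a plain Dict, entries can also be added or overridden after the fact; a minimal sketch, assuming with_encoder accepts a single model:

models[:T₂] = with_encoder(xgboost_classifier) # wrap T₂'s model in a ContinuousEncoder pipeline, like the defaults
keys(models)                                   # the *_default entries plus any variable-specific ones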

CV-Estimation

Canonical TMLE/OSE essentially use the dataset twice: once to estimate the nuisance functions and once to estimate the parameter of interest. This means there is a risk of over-fitting and residual bias (see here for some discussion). One way to address this limitation is a technique called sample-splitting / cross-validation. To activate the sample-splitting mode, simply provide an MLJ.ResamplingStrategy via the resampling keyword argument:

TMLEE(resampling=StratifiedCV());
TMLEE(Dict{Symbol, MLJBase.ProbabilisticPipeline{N, MLJModelInterface.predict} where N<:NamedTuple}(:Q_binary_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :G_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …), :Q_continuous_default => ProbabilisticPipeline(continuous_encoder = ContinuousEncoder(drop_last = true, …), …)), StratifiedCV(nfolds = 6, …), 1.0e-8, false, nothing, 1, false)

or

OSE(resampling=StratifiedCV(nfolds=3));

There are some practical considerations:

  • Choice of resampling strategy: The theory behind sample-splitting requires the nuisance functions to be sufficiently well estimated on each and every fold. A practical consequence is that each fold should contain a sample representative of the dataset. In particular, when the treatment and outcome variables are categorical, it is important to make sure the proportions are preserved. This is typically done using StratifiedCV (see the sketch after this list).
  • Computational Complexity: Sample-splitting results in $K$ fits of the nuisance functions, drastically increasing computational complexity. In particular, if the nuisance functions are estimated using (P-fold) Super-Learning, this will result in two nested cross-validation loops and $K \times P$ fits.
  • Caching of Nuisance Functions: Because the resampling strategy typically needs to preserve the outcome and treatment proportions, very little reuse of cached models is possible (see Using the Cache).
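
Putting these considerations together, a CV-TMLE combining stratified 3-fold sample-splitting with the XGBoost models defined above might look like this (a minimal sketch):

cv_tmle = TMLEE(models=models, resampling=StratifiedCV(nfolds=3))
result_cv, cache_cv = cv_tmle(Ψ₁, dataset);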

Using the Cache

TMLE and OSE are expensive procedures; it may therefore be useful to store some information for further reuse. This is the purpose of the cache object, which is produced as a by-product of the estimation process.

Reusing Models

The cache contains, in particular, the machine-learning models that were fitted in the process, which can sometimes be reused to estimate other quantities of interest. For example, say we are now interested in the Joint Average Treatment Effect of both T₁ and T₂. We can provide the cache to the next round of estimation as follows.

Ψ₃ = ATE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=true, control=false),
        T₂=(case=true, control=false)
    ),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
    outcome_extra_covariates=[:C]
)
result₃, cache = tmle(Ψ₃, dataset; cache=cache);
result₃
┌ Info: Required Relevant Factors:
│ - P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
│ - P₀(T₁ | W₁₁, W₁₂)
└ - P₀(T₂ | W₂₁, W₂₂)
[ Info: Reusing estimate for: P₀(T₁ | W₁₁, W₁₂)
[ Info: Reusing estimate for: P₀(T₂ | W₂₁, W₂₂)
[ Info: Estimating: P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
[ Info: Performing TMLE...
[ Info: Estimating: P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
[ Info: TMLE step: 1.
[ Info: Convergence criterion reached.
[ Info: Done.

Only the conditional distribution of Y given T₁ and T₂ is fitted as it is absent from the cache. However, the propensity scores corresponding to T₁ and T₂ have been reused. Finally, let's see what happens if we estimate the interaction effect between T₁ and T₂ on Y.

Ψ₄ = AIE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=true, control=false),
        T₂=(case=true, control=false)
    ),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
    outcome_extra_covariates=[:C]
)
result₄, cache = tmle(Ψ₄, dataset; cache=cache);
result₄
┌ Info: Reusing estimate for: Relevant Factors:
│ - P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
│ - P₀(T₁ | W₁₁, W₁₂)
└ - P₀(T₂ | W₂₁, W₂₂)
[ Info: Performing TMLE...
[ Info: Estimating: P₀(Y | C, T₁, T₂, W₁₁, W₁₂, W₂₁, W₂₂)
[ Info: TMLE step: 1.
[ Info: Convergence criterion reached.
[ Info: Done.

All nuisance functions have been reused; only the fluctuation is fitted!

Accessing Fluctuations' Reports (Advanced)

If TMLE was used, the cache also holds the last targeted factor that was estimated. Some key information related to the targeting steps can be accessed, for example:

gradients(cache);
estimates(cache);
epsilons(cache)
1-element Vector{Any}:
 [-0.014339232249731193]

correspond to the gradients, point estimates, and epsilons obtained after each targeting step that was performed (usually only one).

One can, for instance, check that the mean of the gradient is close to zero.

mean(last(gradients(cache)))
-5.115907697472721e-17

Joint Estimands and Composition

As explained in Joint And Composed Estimands, a joint estimand is simply a collection of estimands. Here, we will illustrate that an Average Interaction Effect can also be expressed as a difference of partial Average Treatment Effects.

More precisely, we would like to see if the left-hand side of this equation is equal to the right-hand side:

\[AIE_{T_1=0 \rightarrow 1, T_2=0 \rightarrow 1} = ATE_{T_1=0 \rightarrow 1, T_2=0 \rightarrow 1} - ATE_{T_1=0, T_2=0 \rightarrow 1} - ATE_{T_1=0 \rightarrow 1, T_2=0}\]
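
To see why, expand each term into counterfactual means, writing $E[Y(t_1, t_2)]$ for the mean outcome under the intervention $T_1 = t_1, T_2 = t_2$ (notation used here only for illustration):

\[\begin{aligned}
ATE_{T_1=0 \rightarrow 1, T_2=0 \rightarrow 1} &= E[Y(1,1)] - E[Y(0,0)], \\
ATE_{T_1=0, T_2=0 \rightarrow 1} &= E[Y(0,1)] - E[Y(0,0)], \\
ATE_{T_1=0 \rightarrow 1, T_2=0} &= E[Y(1,0)] - E[Y(0,0)],
\end{aligned}\]

so the right-hand side reduces to $E[Y(1,1)] - E[Y(1,0)] - E[Y(0,1)] + E[Y(0,0)]$, which is the Average Interaction Effect.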

For that, we need to define a joint estimand of three components:

ATE₁ = ATE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=true, control=false),
        T₂=(case=false, control=false)),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
)
ATE₂ = ATE(
    outcome=:Y,
    treatment_values=(
        T₁=(case=false, control=false),
        T₂=(case=true, control=false)),
    treatment_confounders=(
        T₁=[:W₁₁, :W₁₂],
        T₂=[:W₂₁, :W₂₂],
    ),
)
joint_estimand = JointEstimand(Ψ₃, ATE₁, ATE₂)
Joint Estimand:
--------------
- TMLE.StatisticalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Bool, case::Bool}}(:T₁ => (control = 0, case = 1), :T₂ => (control = 0, case = 1)), OrderedCollections.OrderedDict(:T₁ => (:W₁₁, :W₁₂), :T₂ => (:W₂₁, :W₂₂)), (:C,))
- TMLE.StatisticalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Bool, case::Bool}}(:T₁ => (control = 0, case = 1), :T₂ => (control = 0, case = 0)), OrderedCollections.OrderedDict(:T₁ => (:W₁₁, :W₁₂), :T₂ => (:W₂₁, :W₂₂)), ())
- TMLE.StatisticalATE(:Y, OrderedCollections.OrderedDict{Symbol, @NamedTuple{control::Bool, case::Bool}}(:T₁ => (control = 0, case = 0), :T₂ => (control = 0, case = 1)), OrderedCollections.OrderedDict(:T₁ => (:W₁₁, :W₁₂), :T₂ => (:W₂₁, :W₂₂)), ())

where Ψ₃, the Average Treatment Effect of both T₁ and T₂, was defined earlier. This joint estimand can be estimated like any other estimand using our estimator of choice:

joint_estimate, cache = tmle(joint_estimand, dataset, cache=cache, verbosity=0);
joint_estimate
One sample Hotelling's T² test
------------------------------
Population details:
    parameter of interest:   Mean vector
    value under h_0:         [0.0, 0.0, 0.0]
    point estimate:          [-3.10896, -1.97862, 0.00385808]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           <1e-99

Details:
    number of observations: 10000
    number of variables:    3
    T² statistic:           12003.7
    transformed statistic:  4000.44
    degrees of freedom:     (3, 9997)
    covariance matrix:
        31.2489       0.000187586   0.111722
         0.000187586  4.39377       0.0241609
         0.111722     0.0241609    12.591

The printed output is the result of a Hotelling's T² test, which is the multivariate counterpart of the Student's T-Test. It tells us whether any component of this joint estimand is different from 0.

Then we can formally test our hypothesis by leveraging the multivariate Central Limit Theorem and Julia's automatic differentiation.

composed_result = compose(x -> x[1] - x[2] - x[3], joint_estimate)
isapprox(
    estimate(result₄),
    first(estimate(composed_result)),
    atol=0.1
)
true

By default, TMLE.jl will use Zygote, but since we rely on DifferentiationInterface.jl, you can change the backend to your favorite AD system.