Targeted Minimum Loss-Based Estimation
This is the main script in this package. It provides a command-line interface for the estimation of statistical parameters using Targeted Learning.
Usage
tmle tmle --help
TMLECLI.tmle — Function

tmle(dataset;
    estimands="factorialATE",
    estimators="glmnet",
    verbosity=0,
    outputs=Outputs(),
    chunksize=100,
    rng=123,
    cache_strategy="release-unusable",
    sort_estimands=false
)

TMLE CLI.
Args
dataset
: Data file (either .csv or .arrow)
Options
--estimands
: A string ("factorialATE") or a serialized TMLE.Configuration (accepted formats: .json | .yaml | .jls).

--estimators
: A Julia file containing the estimators to use, or an estimator configuration string (see Specifying Estimators below).

-v, --verbosity
: Verbosity level.

-o, --outputs
: Outputs to be generated.

--chunksize
: Results are written in batches of size chunksize.

-r, --rng
: Random seed (only used for estimands ordering at the moment).

-c, --cache-strategy
: Caching strategy for the nuisance functions, any of ("release-unusable", "no-cache", "max-size").
Flags
-s, --sort_estimands
: Sort estimands to minimize cache usage (a brute-force approach is used, so sorting time grows exponentially with the number of estimands).
Specifying Estimands
The easiest way to create an estimands file is to use the companion Julia package TMLE.jl to build a Configuration structure. This structure can then be serialized to a file using any of serialize (Julia serialization format), write_json (JSON) or write_yaml (YAML).
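For illustration, here is a minimal sketch of such a script. The constructor keyword names are assumed to mirror the fields of the YAML example below, and the exact serialization signatures may vary between TMLE.jl versions.

```julia
using TMLE

# Hedged sketch: keyword names assumed to mirror the YAML fields below
# (outcome, treatment_values, treatment_confounders, outcome_extra_covariates).
ate = ATE(
    outcome = :Y3,
    treatment_values = (
        T1 = (case = 1, control = 0),
        T3 = (case = "AC", control = "CC")
    ),
    treatment_confounders = (T1 = [:W], T3 = [:W])
)

configuration = Configuration(estimands = [ate])

# Any of the serialization helpers mentioned above can be used, e.g. YAML:
TMLE.write_yaml("estimands.yaml", configuration)
```

The resulting estimands.yaml file can then be passed to the CLI via --estimands=estimands.yaml.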
Alternatively you can write this file manually. The following example illustrates the creation of three estimands in YAML format: an Average Treatment Effect (ATE), an Average Interaction Effect (AIE) and a Counterfactual Mean (CM).
type: "Configuration"
estimands:
- outcome_extra_covariates:
- C1
type: "AIE"
treatment_values:
T1:
control: 0
case: 1
T2:
control: 0
case: 1
outcome: Y1
treatment_confounders:
T2:
- W21
- W22
T1:
- W11
- W12
- outcome_extra_covariates: []
type: "ATE"
treatment_values:
T1:
control: 0
case: 1
T3:
control: "CC"
case: "AC"
outcome: Y3
treatment_confounders:
T1:
- W
T3:
- W
- outcome_extra_covariates: []
type: "CM"
treatment_values:
T1: "CC"
T3: "AC"
outcome: Y3
treatment_confounders:
T1:
- W
T3:
- W
Specifying Estimators
Estimators can be specified in two ways: either via a configuration string or via a plain Julia file.
Estimators From A String
An estimator is characterised along three main axes:
- Whether they use cross-validation (sample-splitting) or not.
- The semi-parametric estimator type: TMLE, wTMLE, OSE.
- The models used to learn the nuisance functions.
The estimator type and cross-validation scheme are described at once by any of the following:

| Estimator's Short Name | Estimator's Description |
|---|---|
| tmle | Canonical Targeted Minimum-Loss Estimator |
| wtmle | Canonical Targeted Minimum-Loss Estimator with weighted Fluctuation |
| ose | Canonical One-Step Estimator |
| cvtmle | Cross-Validated Targeted Minimum-Loss Estimator |
| cvwtmle | Cross-Validated Targeted Minimum-Loss Estimator with weighted Fluctuation |
| cvose | Cross-Validated One-Step Estimator |
And the available models are:

| Model's Short Name | Model's Description |
|---|---|
| glm | A Generalised Linear Model |
| glmnet | A Cross-Validated Generalised Linear Model |
| xgboost | The default XGBoost model using the hist strategy. |
| tunedxgboost | A cross-validated grid of XGBoost models across (max_depth, eta) hyperparameters. |
| sl | A Super Learning strategy using a glmnet, a glm and a grid of xgboost models as in tunedxgboost. |
Then, a configuration string describes the estimators and models as ESTIMATORS--Q_MODEL--G_MODEL, where the three parts are separated by double dashes.
- The ESTIMATORS substring comprises one or more estimators separated by a single dash, e.g. cvtmle-ose. If multiple estimators are specified they will be used sequentially and an estimation result will provide key-value pairs of ESTIMATOR => ESTIMATE.
- The optional Q_MODEL substring corresponds to the model used to learn the outcome models; it defaults to glmnet.
- The optional G_MODEL substring corresponds to the model used to learn the propensity score models. If it is not provided, it defaults to the model provided for Q_MODEL.
It is probably easier to understand with some examples.
Examples
tmle--sl--glm
: A single estimator (TMLE) using a Super Learner for the outcome models and a GLM for the propensity score models.

cvtmle-ose--xgboost
: Two estimators (CV-TMLE and OSE) using XGBoost for the outcome models and, since no G_MODEL is given, also for the propensity score models.

cvwtmle-cvose
: Two estimators (CV-wTMLE and CV-OSE) using the default glmnet for both the outcome and propensity score models.
Note on Cross-Validation
Some of the aforementioned estimators and models use cross-validation under the hood. In that case, a stratified 3-fold cross-validation is used, where stratification occurs across both the outcome and treatment variables.
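For intuition only, the sketch below shows one way such a joint stratification could be expressed with MLJ's StratifiedCV; this is an assumption-laden illustration, not the package's actual implementation.

```julia
using MLJBase
using CategoricalArrays

# Conceptual sketch: build a joint stratification key from outcome and treatment
# levels, then request a stratified 3-fold split on that key.
Y  = categorical(["a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b"])
T1 = categorical([0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1])

joint = categorical(string.(Y, "_", T1))  # one level per (outcome, treatment) combination

cv = StratifiedCV(nfolds = 3, shuffle = true, rng = 123)
MLJBase.train_test_pairs(cv, 1:length(joint), joint)  # 3 (train, test) index pairs
```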
Note on GLM and GLMNet
Linear models typically do not involve any interaction terms. Here, to add extra flexibility, both GLM and GLMNet include pairwise interaction terms between treatment variables and all other covariates, as sketched below.
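As a purely conceptual illustration (not the package's internal code), the following sketch shows what adding pairwise treatment-covariate interaction terms amounts to, assuming numerically encoded columns.

```julia
using DataFrames

# Conceptual sketch only: add a product column for every (treatment, covariate)
# pair, assuming numerically encoded columns.
function with_treatment_interactions(df::DataFrame, treatments::Vector{Symbol})
    out = copy(df)
    covariates = setdiff(propertynames(df), treatments)
    for t in treatments, c in covariates
        out[!, Symbol(t, :_x_, c)] = df[!, t] .* df[!, c]
    end
    return out
end

df = DataFrame(T1 = [0, 1, 0, 1], W11 = [0.5, 1.2, 0.3, 0.8], W12 = [1.0, 0.0, 1.0, 0.0])
with_treatment_interactions(df, [:T1])  # adds T1_x_W11 and T1_x_W12 columns
```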
Estimators Via A Julia File
Building an estimator via a configuration string is quite flexible and should cover most use cases. However, in some cases you may want full control over the estimation procedure. This is possible by instead providing a Julia configuration file describing the estimators to be used. The file should define an ESTIMATORS NamedTuple, and some examples can be found here.
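Below is a hedged sketch of what such a file could look like; the TMLEE, OSE and default_models constructors are assumed from the companion TMLE.jl package and their keyword arguments may differ across versions.

```julia
# estimators.jl: passed to the CLI via --estimators=estimators.jl
using TMLE
using MLJLinearModels

# Assumed TMLE.jl helpers: `default_models` bundles the nuisance models,
# `TMLEE`/`OSE` build the semi-parametric estimators.
models = TMLE.default_models(
    Q_continuous = LinearRegressor(),    # outcome model, continuous outcomes
    Q_binary     = LogisticClassifier(), # outcome model, binary outcomes
    G            = LogisticClassifier()  # propensity score model
)

# The CLI expects the file to define an `ESTIMATORS` NamedTuple.
ESTIMATORS = (
    TMLE = TMLEE(models = models),
    OSE  = OSE(models = models)
)
```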
For further information, we recommend you have a look at both:
- TMLE.jl: The Julia package on which this command line interface is built.
- MLJ: The Julia package used for machine-learning throughout.
Note on Outputs
We can output results in three different formats: HDF5, JSON and JLS. By default no output is written, so you need to specify at least one. An output can be generated by specifying an output filename for it. For instance --outputs.json.filename=output.json
will output a JSON file. Note that you can generate multiple formats at once, e.g. --outputs.json.filename=output.json --outputs.hdf5.filename=output.hdf5
will output both JSON and HDF5 result files. Another important output option is the pval_threshold
. Each estimation result is accompanied by an influence curve vector and by default these vectors are erased before saving the results because they typically take up too much space and are not usually needed. In some occasions you might want to keep them and this can be achieved by specifiying the output's pval_threhsold
. For instance --outputs.hdf5.pval_threshold=1.
will keep all such vectors because all p-values lie in between 0 and 1.