All of Us

Analyses within the All of Us (AoU) Researcher Workbench using genetic data must be run within the Controlled Tier Access (See Data Access). Workspaces launched within this tier will automatically have nextflow installed and can use TarGene immediately.

Each Workspace will be assigned a bucket for storage on Google Cloud, that can be found on the right hand panel of the About page of your workspace. Each Workspace will also contain Google Cloud-specific credentials in order to submit jobs via the Google Lifesciences API. These can be found in your Workspace-specific nextflow profile, located at ~/.nextflow/config, and available by using the flag -profile gls when you run nextflow. TarGene requires some additional configuration to run on the AoU Researcher workbench that is built into the allofus profile, which can be combined with your Workspace-specific gls configuration to batch out jobs when running TarGene on this platform.

We reccommend running this by first entering a Cloud Analysis Terminal on your current Workspace, creating a configuration for the analysis you would like to run, and running TarGene in a screen session. See Workflows in the All of Us Researched Workbench for more information.

A minimalist run configuration to run a flat config run on the AoU Researcher Workbench using TarGene looks like the following:

params {
    COHORT = "ALLOFUS"
    ESTIMANDS_CONFIG = "allofus_config.yaml"

    // All of Us srWGS data
    BGEN_FILE = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/bgen/clinvar.chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}.{bgen,sample,bgen.bgi}"
    BED_FILES = "gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/clinvar_v7.1/plink_bed/clinvar.chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}.{bed,bim,fam}"
    TRAITS_DATASET = "allofus_traits.csv"
}

Apart from the data related parameters, there are two main parameters here: ESTIMANDS_CONFIG and the ESTIMATORS_CONFIG. These parameters describe the estimands (questions of interest) and how to estimate them respectively.

The ESTIMANDS_CONFIG here follows the same format as the flat configuration detailed in the PheWAS section. Here we are estimating the ATE of the FTO variant (encoded as chr16:53767042:T:C in the BGEN data for the AoU cohort) on the traits present in our allofus_traits.csv file. We have also added Sex at birth as a covariate.

type: flat

estimands:
  - type: ATE

variants:
  - chr16:53767042:T:C

outcome_extra_covariates:
  - "Sex at birth"

The optional outcome_extra_covariates are variables to be used as extra predictors of the outcome (but not as confounders). This information must be contained in the allofus_traits.csv file, along with your outcomes-of-interest.

The allofus_traits.csv might look as follows:

SAMPLE_IDSex at birthHeight (cm)
100000Male180
100002Female165
100004Male175
100010Female160
.........

Here we have not specified any value for ESTIMATORS_CONFIG, and so the default, ESTIMATORS_CONFIG = "wtmle-ose--tunedxgboost", will be used. This defines the estimation strategy, and more specifically, that we will be using Targeted Minimum-Loss Estimator as well as a One Step Estimator with a tuned XGBoost model to learn the outcome models ($Q_Y$) and propensity scores ($G$).

Then TarGene can then be run on the AoU Researcher Workbench as follows:

nextflow run https://github.com/TARGENE/targene-pipeline -r v0.11.1 -profile gls,allofus

By default, this will generate results in the results/ directory in your Cloud Analysis Terminal. Once complete, you can upload these to your Workspace bucket using the following command:

gsutil -m -u $GOOGLE_PROJECT cp -r results/* gs://path/to/workspace/bucket