Setup (Read First)
A population genetics analysis via TarGene is a 4-step process in which steps 1 and 2 are done only once.
1. Installation
Install TarGene's dependencies as per the Installation section. Here it is assumed that both Nextflow and Singularity are installed. To check that this is the case, you can for instance run
nextflow -v
and
singularity --version
2. Obtaining the Data
Make sure the data is available. Here we will rely on TarGene's test data located here. The easiest way to obtain it is probably to clone the entire project like so
git clone https://github.com/TARGENE/targene-pipeline
Then open the repository's folder using your favourite editor. All examples hereafter assume that you have a terminal opened in the repository's root folder.
Even though we have cloned the repository, we do not need to look at the code at all!
3. Writing a Run Configuration File
Write a run configuration file for your project. This can be decomposed further into 2 sub-steps:
- Write the part of your configuration file specific to your platform (HPC, laptop, ...). Here we will simply use the local profile, which runs every process using Singularity and your local CPU. However, you could change this to use Docker instead, or resort to a scheduler if you are working on an HPC. It might be worth having a look at this page if you have never worked with Nextflow before. All of TarGene's existing configuration files are in the conf folder.
- Write the part of your configuration file specific to your run. This is the topic of the following sections: each example requires you to write this file. Feel free to experiment by changing the workflow parameters (see Index of Workflows Parameters).
4. General Input Parameter Description
- COHORT: Here we assume that the cohort is the UK-Biobank dataset (default). If you have already extracted your covariates and phenotypes, you can just use COHORT=CUSTOM and the UKB_CONFIG below won't be used.
- UKB_CONFIG: In the examples, we assume that the dataset is a raw UK Biobank dataset, which is not readily interpretable by machine-learning models. How does TarGene know how to extract covariates and phenotypes from the UK Biobank main dataset? This is thanks to the UKB_CONFIG file, which maps UK Biobank data fields to traits (see The UKB_CONFIG Configuration File). In our case, this file (test/assets/ukbconfig_gwas.yaml) contains both the outcome of interest (BMI) and the extra predictors we need (Number of vehicles in household and Cheese intake).
traits:
  - fields:
      - "21001"
    phenotypes:
      - name: "Body mass index (BMI)"
  - fields:
      - "728"
    phenotypes:
      - name: "Number of vehicles in household"
  - fields:
      - "1408"
    phenotypes:
      - name: "Cheese intake"
Note that the outcomes are defined implicitly: any trait in the UKB_CONFIG file that is not listed in the outcome_extra_covariates of the ESTIMANDS_FILE (see later) will be considered an outcome. So you can run multiple GWAS at once by simply adding another trait definition to the above file.
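The implicit rule for outcomes can be illustrated with a short Python sketch. This only illustrates the set logic and is not TarGene's actual implementation. The dictionary mirrors the UKB_CONFIG shown above, the content of `outcome_extra_covariates` is assumed to be the two extra predictors named in this section, and the helper name `implicit_outcomes` is hypothetical.

```python
# Illustrative sketch only: the UKB_CONFIG above, written as a plain dict.
ukb_config = {
    "traits": [
        {"fields": ["21001"], "phenotypes": [{"name": "Body mass index (BMI)"}]},
        {"fields": ["728"], "phenotypes": [{"name": "Number of vehicles in household"}]},
        {"fields": ["1408"], "phenotypes": [{"name": "Cheese intake"}]},
    ]
}

# Assumption: the ESTIMANDS_FILE lists these two traits as extra predictors.
outcome_extra_covariates = ["Number of vehicles in household", "Cheese intake"]


def implicit_outcomes(config, extra_covariates):
    """Return the trait names that are not declared as extra covariates.

    Hypothetical helper, not part of TarGene.
    """
    excluded = set(extra_covariates)
    names = [
        phenotype["name"]
        for trait in config["traits"]
        for phenotype in trait["phenotypes"]
    ]
    # Keep the order of the configuration file.
    return [name for name in names if name not in excluded]


outcomes = implicit_outcomes(ukb_config, outcome_extra_covariates)
print(outcomes)  # ['Body mass index (BMI)']
```

Under these assumptions, adding a fourth trait to the dictionary without listing it in `outcome_extra_covariates` would make it a second outcome, which corresponds to running an additional GWAS.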
- TRAITS_DATASET: This is a CSV file containing both covariates and phenotypes of interest. It should also contain an eid column that uniquely identifies individuals and matches the IID of the genotyped individuals. In the examples, this file contains raw UK Biobank field information (e.g. 1160-2.0) that will be interpreted by the UKB_CONFIG.
- UKB_WITHDRAWAL_LIST: List of individuals to be removed from the analysis (one IID per line and no header).
- BED_FILES: This parameter points to the PLINK bed files containing information on typed genetic variants. You will notice the "{1,2,3}.{bed,bim,fam}" suffix of the parameter, which tells Nextflow to group chromosome files together.
- BGEN_FILES: These files contain information on imputed variants in BGEN format. You will notice the "{1,2,3}.{bgen,bgen.bgi,sample}" suffix of the parameter, which tells Nextflow to group chromosome files together.
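To make the "{1,2,3}.{bed,bim,fam}" suffix more concrete, here is a minimal Python sketch of which file names such a pattern denotes and how they fall into one group per chromosome. It does not reproduce how Nextflow resolves the pattern internally. The prefix `data/ukb_chr` and both helper names are made up for illustration.

```python
from itertools import product


def expand_braces(prefix, chromosomes, extensions):
    """List every file name denoted by 'prefix{...}.{...}' (hypothetical helper)."""
    return [
        f"{prefix}{chrom}.{ext}"
        for chrom, ext in product(chromosomes, extensions)
    ]


def group_by_chromosome(prefix, chromosomes, extensions):
    """Group the same file names per chromosome (hypothetical helper)."""
    return {
        chrom: [f"{prefix}{chrom}.{ext}" for ext in extensions]
        for chrom in chromosomes
    }


# Assumption: "data/ukb_chr" is an invented prefix, not a path from the test data.
prefix = "data/ukb_chr"

bed_groups = group_by_chromosome(prefix, ["1", "2", "3"], ["bed", "bim", "fam"])
for chrom, files in bed_groups.items():
    print(chrom, files)

bgen_files = expand_braces(prefix, ["1", "2", "3"], ["bgen", "bgen.bgi", "sample"])
print(len(bgen_files))  # 9 files: 3 chromosomes x 3 extensions
```

Each chromosome therefore ends up with its own set of three files, which is the grouping that the suffix asks Nextflow to perform.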
5. Analyse the Results
Hopefully, this is where you'll find something new or interesting in some way. This section is up to you, but we recommend reading the section Understanding TarGene's Outputs, or you will likely be a little lost...