Beginner’s guide
This guide will walk you through conducting a comprehensive GWAS analysis on 370 Escherichia coli strains using the microGWAS pipeline.
In the study by Galardini et al. (2020) , a mouse model of sepsis was used to characterize the virulence phenotype of the strains.
By using microGWAS in this tutorial, you will uncover genetic variants (unitigs, gene presence/absence, rare variants, gene cluster specific k-mers) associated with this virulence phenotype.
Prerequisites
Basic command-line knowledge
Familiarity with genomic data
A computer with at least 10 GB RAM with 8 cores with a Linux operating system like Ubuntu
The whole tutorial will take at least 24 hours to complete.
1. Before you begin
i. Install Conda (if not already installed):
If you don’t have Conda installed, you can install it via Miniconda. Miniconda is a minimal installer for Conda.
Download the Miniconda installer:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
Install Miniconda:
bash miniconda.sh -b -p $HOME/miniconda
Initialize Conda:
eval "$($HOME/miniconda/bin/conda shell.bash hook)"
Verify the installation:
conda --versionYou should see the Conda version printed to the console.
ii. Install mamba (if not already installed):
After installing Conda, we recommend installing Mamba, a faster alternative to Conda for package management. Mamba is the recommended way of using Snakemake’s conda integration.
Install mamba in your base conda environment
conda install -n base -c conda-forge mamba
Verify the mamba installation
mamba --versionYou should see the mamba version printed to the console.
iii. Installing microGWAS:
We recommend obtaining microGWAS by downloading the latest release from GitHub:
Visit the releases page on GitHub.
Download the
microGWAS.tar.gzfile from the latest available release.Unpack the downloaded file (
tar -xvf microGWAS.tar.gz).Navigate to the unpacked directory (
cd microGWAS)
Additional install methods are listed in the Usage page.
iv. Set up the microGWAS conda environment (if not present already):
Add the following channels:
conda config --add channels bioconda conda config --add channels conda-forgeNow, create and activate the microGWAS conda environment:
conda env create -f environment.yml conda activate microGWAS
v. Prepare your input data:
Create a directory structure for your input files:
mkdir -p data/fastasDownload sample genomes in
FASTAformat:wget -O data/genomes.tgz https://figshare.com/ndownloader/files/21781689Extract genome FASTA files:
tar -xzvf data/genomes.tgz -C data/fastas/Download and modify the phenotype data:
wget https://raw.githubusercontent.com/mgalardini/2018_ecoli_pathogenicity/master/data/phenotypes/phenotypes.tsv -O data/data.tsvThe phenotype file contains two reference strains, “ED1a” and “IAI39”. These strains should not be included in the phenotype file as they will cause conflicts within the pipeline. To remove these strains from you phenotype file, do the following:
sed -i '/^ED1a/d; /^IAI39/d' data/data.tsv
The following command will update your
data/data.tsvfile, adding the paths for fasta files.awk 'BEGIN {OFS="\t"} NR==1 {print "strain", "fasta", "phenotype"} NR>1 {print $1, "data/fastas/" $1 ".fasta", $3}' data/data.tsv > temp_file && mv temp_file data/data.tsv
Note
To provide pre-computed GFF annotations instead of executing de novo assembly gene calling, add an optional gff column to the command and supply the paths (e.g. "data/gffs/" $1 ".gff").
Verify the updated phenotype file:
head -n 5 data/data.tsvYou should see an output similar to the example below. The first column lists the sample IDs, the next column is the relative path to the assemblies in fasta format. The last column represents the phenotype: where 1 indicates that the strain is virulent, while 0 indicates the strain is avirulent.
strain fasta phenotype ECOR-01 data/fastas/ECOR-01.fasta 0 ECOR-02 data/fastas/ECOR-02.fasta 1 ECOR-03 data/fastas/ECOR-03.fasta 0 ECOR-04 data/fastas/ECOR-04.fasta 0
Clean up:
Remove the compressed files, you do not need them anymore:
rm data/genomes.tgzVerify your directory structure:
After executing the aforementioned steps, your directory structure should look something like this:
data/ ├── data.tsv ├──fastas/ │ ├── genome1.fasta │ ├── genome2.fasta │ └── ...
You can confirm by doing:
ls data/
vi. Set up the environment and configure the pipeline:
Set up the eggnog-mapper database:
The microGWAS pipeline requires the eggnog database for functional annotation.
You have two options:
i. If you have an existing eggnog database: Create a symbolic link to your actual eggnog data directory.
ln -s /storage/miniconda3/envs/eggnog-mapper/lib/python3.9/site-packages/data/ data/eggnog-mapper
Remember to replace /storage/miniconda3/envs/eggnog-mapper/lib/python3.9/site-packages/data/ with the actual path on your system.
ii. If you do not have the eggnog database:
Simply proceed to run the microGWAS pipeline.
The pipeline will automatically download and setup the required eggnog database during its execution.
Note
Creating a symbolic link is only necessary if you’re using an existing eggNOG database.
This might be preferred as the final database size is more than 50Gb;
it therefore makes sense to
setup this database once and link it in each microGWAS analysis you are carrying out.
Configure the pipeline:
Ensure that the
##### params #####section of theconfig/config.yamlfile matches the print out below.targets: [ "phenotype", ] # MLST scheme mlst_scheme: ecoli # references for association summaries and annotation summary_references: "--reference 536 --reference CFT073 --reference ED1a --reference IAI1 --reference IAI39 --reference K-12_substr._MG1655 --reference UMN026 --reference UTI89" annotation_references: "--focus-strain 536 --focus-strain CFT073 --focus-strain ED1a --focus-strain IAI1 --focus-strain IAI39 --focus-strain K-12_substr._MG1655 --focus-strain UMN026 --focus-strain UTI89" enrichment_reference: "IAI39" # species to be used for AMR and virulence predictions species_amr: "Escherichia"
2. Running the microGWAS pipeline
Run the bootsrapping script.
bash bootstrap.sh Escherichia coli IAI39 GCF_000013305.1,GCF_000007445.1,GCF_000026305.1,GCF_000026265.1,GCF_000026345.1,GCF_000005845.2,GCF_000026325.1,GCF_000013265.1
This script populates the input files used for the analysis and downloads the relevant reference genomes necessary for annotating the hits for Escherichia coli and analyse the variants associated to the phenotype.
The syntax for this script is bash bootstrap.sh GENUS SPECIES MAINREFERENCE REFSEQID1,[...],REFSEQIDn, where REFSEQID indicates the NCBI assembly ID of the various reference genomes to be used in the analysis, and MAINREFERENCE the name of the main reference genome.
To run the full analysis, use the following command.
snakemake -p annotate_summary find_amr_vag manhattan_plots heritability enrichment_plots qq_plots tree wg_metrics --cores 24 --use-conda --conda-frontend mamba
This will:
Run the GWAS analysis to identify and annotate genetic variants that are associated with the virulence phenotype (
annotate_summary)Generate a phylogenetic tree of all isolates and the selected references (
tree)Identify antimicrobial resistant and virulence associated genes (
find_amr_vag)Perform an enrichment analysis for the genes with the associated variants, and plot the results (
enrichment_plots)Compute the heritability of the phenotype, determined by the genetic variants and the lineage (
heritability)Generate QQ-plots to diagnose the association analysis (
qq_plots)Generate manhattan plots on the unitigs mapped to a reference (
manhattan_plots)
The analysis will be parallelized using up to 24 cores (--cores 24), and the
necessary tools will be installed in separated conda environments (--use-conda --conda-frontend mamba).
The whole process will likely take more than 24 hours.
Note
Change the --cores argument to indicate what is the maximum number
of cores that can be used to parallelize the analysis
Customizing your analysis
You can specify which Rules you want the pipeline to run. For example, to run the pipeline without generating a phylogenetic tree:
snakemake -p annotate_summary manhattan_plots heritability enrichment_plots qq_plots wg_metrics --cores 24 --use-conda --conda-frontend mamba
This command runs all the same analyses as before, except for generating the phylogenetic tree and identifying AMR and virulence associated genes.
3. Understanding the results
microGWAS generates multiple output files and figures which can be accessed from within the out/ directory.
For a detailed descripition of all the outputs, refer to Outputs section of this documentation.
For the purpose of this tutorial, we will focus on key results replicated from the Galardini et al. (2020) study
a. Unitig-based association analysis
Unitigs are unique DNA sequences that serve as markers for genetic variation.
microGWAS uses unitigs with a minimum allele frequency (MAF) of > 1%.
This Manhattan plot shows unitigs associated with virulence. Peaks above the red dashed line represent genomic regions strongly associated with the virulence phenotype. These unitigs are related to three iron-uptake systems: the high-pathogenecity island (HPI), aerobactin, and the sitABCD operon. For a closer look at specific genomic regions of interest related to virulence factors in E. coli, you can generate zoomed-in Manhattan plots. The focus will be on three key areas: the high pathogencity island (HPI), the aerobactin siderophore system, and the sitABCD iron transport operon. To created these detailed plots, run the following command first:
python3 workflow/scripts/count_patterns.py out/associations/phenotype/unitigs_patterns.txt --threshold
This command uses the number of unique unitigs’ presence/absence patterns to derive an appropriate p-value threshold.
You should obtain a value of 2.16E-08. You can then run the script to generate the manhattan plots for the three regions of interest:
python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 HPI.png --threshold 2.16E-08 --zoom HPI 1.05 1.25 30 python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 sitABCD.png --threshold 2.16E-08 --zoom sitABCD 1.95 2.05 15 python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 aerobactin.png --threshold 2.16E-08 --zoom aerobactin 3.4 3.8 15
Note
Run python3 workflow/scripts/manhattan_plot.py -h for an explanation of the syntax used by this script
The plot was generated for the “IAI39” reference genome, and the zoomed-in views were based on the genomic positions of the regions of interest.
You can also generate volcano plots to visualise the statistical significance and magnitute of the effect for the tested genetic variants.
The following code will generate a volcano plots using the annotate_summary.tsv file, which contains the summary statistics and gene annotation for the unitigs association analysis.
python3 workflow/scripts/volcano_plot.py out/associations/phenotype/annotated_summary.tsv volcano.png --threshold 2.16E-08 --genes fyuA sitA iucC
This plot represents associations using unitigs as the genetic markers.
Each point represents a specific gene. The highlighted genes belong to the high pathogenecity island, the aerobactin, and the sitABCD operon.
The x-axis represents the average beta value (effect size), which indicates the magnitude and direction of the association between the unitigs and the virulence phenotye.
Points on the right indicate positive associations and those on the left indicate negative associations.
The y-axis shows the statistical significance.
The red dashed horizontal line indicates the signficance threshold computed using the unitigs_patterns.txt file.
Similar plots can be created using annotated_gpa_summary.tsv or annotated_panfeed_summary.tsv.
b. Gene cluster-specific k-mer association analysis
This analysis links specific k-mers to their source genes, using panfeed.
These plots represent association for gene cluster specific k-mers for fyuA, iucC, and sitA genes. The y-axis represents each isolate and the x-axis the k-mer positions relative to the gene start codon for each strain. The colors correspond to the -log10 of the association p-value. The dark gray regions imply that the isolates do not encode for the k-mers, while the light gray regions represent k-mers under the association threshold.
c. Functional enrichment analysis
This analysis identified overrepresented functional categories among genes with associated variants.
The plot shows enrichment of clusters of orthologous groups (COG) categories. The y-axis of the plot represents each COG catergory, and x-axis the number of gene hits belonging to each category. The bars are colored based on the -log10 of the enrichment corrected p-value. Bars colored in grey do not have a significant enrichment.
d. Other outputs
More information about the results of the association analysis can be found within the out directory.
The content of each folder/file is reported in the Outputs section.
4. Troubleshooting
If you have persistent issues, please consult the Troubleshooting Usage guide or seek help in the project’s issue tracker.