Beginner’s guide

This guide will walk you through conducting a comprehensive GWAS analysis on 370 Escherichia coli strains using the microGWAS pipeline. In the study by Galardini et al. (2020) , a mouse model of sepsis was used to characterize the virulence phenotype of the strains. By using microGWAS in this tutorial, you will uncover genetic variants (unitigs, gene presence/absence, rare variants, gene cluster specific k-mers) associated with this virulence phenotype.

Prerequisites

  • Basic command-line knowledge

  • Familiarity with genomic data

  • A computer with at least 10 GB RAM with 8 cores with a Linux operating system like Ubuntu

The whole tutorial will take at least 24 hours to complete.

1. Before you begin

i. Install Conda (if not already installed):

If you don’t have Conda installed, you can install it via Miniconda. Miniconda is a minimal installer for Conda.

  1. Download the Miniconda installer:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  1. Install Miniconda:

bash miniconda.sh -b -p $HOME/miniconda
  1. Initialize Conda:

eval "$($HOME/miniconda/bin/conda shell.bash hook)"
  1. Verify the installation:

conda --version

You should see the Conda version printed to the console.

ii. Install mamba (if not already installed):

After installing Conda, we recommend installing Mamba, a faster alternative to Conda for package management. Mamba is the recommended way of using Snakemake’s conda integration.

  1. Install mamba in your base conda environment

conda install -n base -c conda-forge mamba
  1. Verify the mamba installation

mamba --version

You should see the mamba version printed to the console.

iii. Installing microGWAS:

We recommend obtaining microGWAS by downloading the latest release from GitHub:

  • Visit the releases page on GitHub.

  • Download the microGWAS.tar.gz file from the latest available release.

  • Unpack the downloaded file (tar -xvf microGWAS.tar.gz).

  • Navigate to the unpacked directory (cd microGWAS)

Additional install methods are listed in the Usage page.

iv. Set up the microGWAS conda environment (if not present already):

Add the following channels:

conda config --add channels bioconda
conda config --add channels conda-forge

Now, create and activate the microGWAS conda environment:

conda env create -f environment.yml
conda activate microGWAS

v. Prepare your input data:

  1. Create a directory structure for your input files:

    mkdir -p data/fastas
    
  2. Download sample genomes in FASTA format:

    wget -O data/genomes.tgz https://figshare.com/ndownloader/files/21781689
    

    Extract genome FASTA files:

    tar -xzvf data/genomes.tgz -C data/fastas/
    
  3. Download and modify the phenotype data:

    wget https://raw.githubusercontent.com/mgalardini/2018_ecoli_pathogenicity/master/data/phenotypes/phenotypes.tsv -O data/data.tsv
    

    The phenotype file contains two reference strains, “ED1a” and “IAI39”. These strains should not be included in the phenotype file as they will cause conflicts within the pipeline. To remove these strains from you phenotype file, do the following:

    sed -i '/^ED1a/d;
    /^IAI39/d' data/data.tsv
    

    The following command will update your data/data.tsv file, adding the paths for fasta files.

    awk 'BEGIN {OFS="\t"}
    
    NR==1 {print "strain", "fasta", "phenotype"}
    
    NR>1 {print $1, "data/fastas/" $1 ".fasta", $3}' data/data.tsv > temp_file &&
    
    mv temp_file data/data.tsv
    

Note

To provide pre-computed GFF annotations instead of executing de novo assembly gene calling, add an optional gff column to the command and supply the paths (e.g. "data/gffs/" $1 ".gff").

  1. Verify the updated phenotype file:

    head -n 5 data/data.tsv
    

    You should see an output similar to the example below. The first column lists the sample IDs, the next column is the relative path to the assemblies in fasta format. The last column represents the phenotype: where 1 indicates that the strain is virulent, while 0 indicates the strain is avirulent.

    strain  fasta   phenotype
    ECOR-01 data/fastas/ECOR-01.fasta       0
    ECOR-02 data/fastas/ECOR-02.fasta       1
    ECOR-03 data/fastas/ECOR-03.fasta       0
    ECOR-04 data/fastas/ECOR-04.fasta       0
    
  2. Clean up:

    Remove the compressed files, you do not need them anymore:

    rm data/genomes.tgz
    
  3. Verify your directory structure:

After executing the aforementioned steps, your directory structure should look something like this:

data/
├── data.tsv
├──fastas/
│   ├── genome1.fasta
│   ├── genome2.fasta
│   └── ...

You can confirm by doing:

ls data/

vi. Set up the environment and configure the pipeline:

  1. Set up the eggnog-mapper database:

The microGWAS pipeline requires the eggnog database for functional annotation. You have two options:

i. If you have an existing eggnog database: Create a symbolic link to your actual eggnog data directory.

ln -s /storage/miniconda3/envs/eggnog-mapper/lib/python3.9/site-packages/data/ data/eggnog-mapper

Remember to replace /storage/miniconda3/envs/eggnog-mapper/lib/python3.9/site-packages/data/ with the actual path on your system. ii. If you do not have the eggnog database:

Simply proceed to run the microGWAS pipeline. The pipeline will automatically download and setup the required eggnog database during its execution.

Note

Creating a symbolic link is only necessary if you’re using an existing eggNOG database. This might be preferred as the final database size is more than 50Gb; it therefore makes sense to setup this database once and link it in each microGWAS analysis you are carrying out.

  1. Configure the pipeline:

    Ensure that the ##### params ##### section of the config/config.yaml file matches the print out below.

    targets: [
       "phenotype",
    ]
    
    # MLST scheme
    mlst_scheme: ecoli
    
    # references for association summaries and annotation
    summary_references: "--reference 536 --reference CFT073 --reference ED1a --reference IAI1 --reference IAI39 --reference K-12_substr._MG1655 --reference UMN026 --reference UTI89"
    annotation_references: "--focus-strain 536 --focus-strain CFT073 --focus-strain ED1a --focus-strain IAI1 --focus-strain IAI39 --focus-strain K-12_substr._MG1655 --focus-strain UMN026 --focus-strain UTI89"
    enrichment_reference: "IAI39"
    
    # species to be used for AMR and virulence predictions
    species_amr: "Escherichia"
    

2. Running the microGWAS pipeline

Run the bootsrapping script.

bash bootstrap.sh Escherichia coli IAI39 GCF_000013305.1,GCF_000007445.1,GCF_000026305.1,GCF_000026265.1,GCF_000026345.1,GCF_000005845.2,GCF_000026325.1,GCF_000013265.1

This script populates the input files used for the analysis and downloads the relevant reference genomes necessary for annotating the hits for Escherichia coli and analyse the variants associated to the phenotype. The syntax for this script is bash bootstrap.sh GENUS SPECIES MAINREFERENCE REFSEQID1,[...],REFSEQIDn, where REFSEQID indicates the NCBI assembly ID of the various reference genomes to be used in the analysis, and MAINREFERENCE the name of the main reference genome. To run the full analysis, use the following command.

snakemake -p annotate_summary find_amr_vag manhattan_plots heritability enrichment_plots qq_plots tree wg_metrics --cores 24 --use-conda --conda-frontend mamba

This will:

  • Run the GWAS analysis to identify and annotate genetic variants that are associated with the virulence phenotype (annotate_summary)

  • Generate a phylogenetic tree of all isolates and the selected references (tree)

  • Identify antimicrobial resistant and virulence associated genes (find_amr_vag)

  • Perform an enrichment analysis for the genes with the associated variants, and plot the results (enrichment_plots)

  • Compute the heritability of the phenotype, determined by the genetic variants and the lineage (heritability)

  • Generate QQ-plots to diagnose the association analysis (qq_plots)

  • Generate manhattan plots on the unitigs mapped to a reference (manhattan_plots)

The analysis will be parallelized using up to 24 cores (--cores 24), and the necessary tools will be installed in separated conda environments (--use-conda --conda-frontend mamba). The whole process will likely take more than 24 hours.

Note

Change the --cores argument to indicate what is the maximum number of cores that can be used to parallelize the analysis

Customizing your analysis

You can specify which Rules you want the pipeline to run. For example, to run the pipeline without generating a phylogenetic tree:

snakemake -p annotate_summary manhattan_plots heritability enrichment_plots qq_plots wg_metrics --cores 24 --use-conda --conda-frontend mamba

This command runs all the same analyses as before, except for generating the phylogenetic tree and identifying AMR and virulence associated genes.

3. Understanding the results

microGWAS generates multiple output files and figures which can be accessed from within the out/ directory. For a detailed descripition of all the outputs, refer to Outputs section of this documentation. For the purpose of this tutorial, we will focus on key results replicated from the Galardini et al. (2020) study

a. Unitig-based association analysis

Unitigs are unique DNA sequences that serve as markers for genetic variation. microGWAS uses unitigs with a minimum allele frequency (MAF) of > 1%.

Manhattan plot of the associated variants

This Manhattan plot shows unitigs associated with virulence. Peaks above the red dashed line represent genomic regions strongly associated with the virulence phenotype. These unitigs are related to three iron-uptake systems: the high-pathogenecity island (HPI), aerobactin, and the sitABCD operon. For a closer look at specific genomic regions of interest related to virulence factors in E. coli, you can generate zoomed-in Manhattan plots. The focus will be on three key areas: the high pathogencity island (HPI), the aerobactin siderophore system, and the sitABCD iron transport operon. To created these detailed plots, run the following command first:

python3 workflow/scripts/count_patterns.py out/associations/phenotype/unitigs_patterns.txt --threshold

This command uses the number of unique unitigs’ presence/absence patterns to derive an appropriate p-value threshold. You should obtain a value of 2.16E-08. You can then run the script to generate the manhattan plots for the three regions of interest:

python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 HPI.png --threshold 2.16E-08 --zoom HPI 1.05 1.25 30
python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 sitABCD.png --threshold 2.16E-08 --zoom sitABCD 1.95 2.05 15
python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 aerobactin.png --threshold 2.16E-08 --zoom aerobactin 3.4 3.8 15

Note

Run python3 workflow/scripts/manhattan_plot.py -h for an explanation of the syntax used by this script

A zoom-in on the associated areas of the Manhattan plot for the HPI, aerobacting and *sitABCD* operon regions.

The plot was generated for the “IAI39” reference genome, and the zoomed-in views were based on the genomic positions of the regions of interest. You can also generate volcano plots to visualise the statistical significance and magnitute of the effect for the tested genetic variants. The following code will generate a volcano plots using the annotate_summary.tsv file, which contains the summary statistics and gene annotation for the unitigs association analysis.

python3 workflow/scripts/volcano_plot.py out/associations/phenotype/annotated_summary.tsv volcano.png --threshold 2.16E-08 --genes fyuA sitA iucC

This plot represents associations using unitigs as the genetic markers.

A volcano plot of the associated variants.

Each point represents a specific gene. The highlighted genes belong to the high pathogenecity island, the aerobactin, and the sitABCD operon. The x-axis represents the average beta value (effect size), which indicates the magnitude and direction of the association between the unitigs and the virulence phenotye. Points on the right indicate positive associations and those on the left indicate negative associations. The y-axis shows the statistical significance. The red dashed horizontal line indicates the signficance threshold computed using the unitigs_patterns.txt file. Similar plots can be created using annotated_gpa_summary.tsv or annotated_panfeed_summary.tsv.

b. Gene cluster-specific k-mer association analysis

This analysis links specific k-mers to their source genes, using panfeed.

Associations plots for gene cluster specific k-mers.

These plots represent association for gene cluster specific k-mers for fyuA, iucC, and sitA genes. The y-axis represents each isolate and the x-axis the k-mer positions relative to the gene start codon for each strain. The colors correspond to the -log10 of the association p-value. The dark gray regions imply that the isolates do not encode for the k-mers, while the light gray regions represent k-mers under the association threshold.

c. Functional enrichment analysis

This analysis identified overrepresented functional categories among genes with associated variants.

Enrichment analysis of the associated unitigs for different COG categories.

The plot shows enrichment of clusters of orthologous groups (COG) categories. The y-axis of the plot represents each COG catergory, and x-axis the number of gene hits belonging to each category. The bars are colored based on the -log10 of the enrichment corrected p-value. Bars colored in grey do not have a significant enrichment.

d. Other outputs

More information about the results of the association analysis can be found within the out directory. The content of each folder/file is reported in the Outputs section.

4. Troubleshooting

If you have persistent issues, please consult the Troubleshooting Usage guide or seek help in the project’s issue tracker.