Beginner's guide ================ This guide will walk you through conducting a comprehensive GWAS analysis on 370 *Escherichia coli* strains using the ``microGWAS`` pipeline. In the `study by Galardini et al. (2020) `_ , a mouse model of sepsis was used to characterize the virulence phenotype of the strains. By using ``microGWAS`` in this tutorial, you will uncover genetic variants (unitigs, gene presence/absence, rare variants, gene cluster specific k-mers) associated with this virulence phenotype. Prerequisites ------------- - Basic command-line knowledge - Familiarity with genomic data - A computer with at least 10 GB RAM with 8 cores with a Linux operating system like Ubuntu The whole tutorial will take at least 24 hours to complete. 1. Before you begin -------------------- i. Install Conda (if not already installed): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If you don't have Conda installed, you can install it via Miniconda. Miniconda is a minimal installer for Conda. a. Download the Miniconda installer: .. code-block:: console wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh b. Install Miniconda: .. code-block:: console bash miniconda.sh -b -p $HOME/miniconda c. Initialize Conda: .. code-block:: console eval "$($HOME/miniconda/bin/conda shell.bash hook)" d. Verify the installation: .. code-block:: console conda --version You should see the Conda version printed to the console. ii. Install mamba (if not already installed): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ After installing Conda, we recommend installing Mamba, a faster alternative to Conda for package management. Mamba is the recommended way of using Snakemake's conda integration. a. Install mamba in your base conda environment .. code-block:: console conda install -n base -c conda-forge mamba b. Verify the mamba installation .. code-block:: console mamba --version You should see the mamba version printed to the console. iii. Installing ``microGWAS``: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We recommend obtaining ``microGWAS`` by downloading the latest release from GitHub: * Visit the `releases page `_ on GitHub. * Download the ``microGWAS.tar.gz`` file from the latest available release. * Unpack the downloaded file (``tar -xvf microGWAS.tar.gz``). * Navigate to the unpacked directory (``cd microGWAS``) Additional install methods are listed in the :doc:`Usage page `. iv. Set up the ``microGWAS`` conda environment (if not present already): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Add the following channels: .. code-block:: console conda config --add channels bioconda conda config --add channels conda-forge Now, create and activate the microGWAS conda environment: .. code-block:: console conda env create -f environment.yml conda activate microGWAS v. Prepare your input data: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ a. Create a directory structure for your input files: .. code-block:: console mkdir -p data/fastas b. Download sample genomes in ``FASTA`` format: .. code-block:: console wget -O data/genomes.tgz https://figshare.com/ndownloader/files/21781689 Extract genome FASTA files: .. code-block:: console tar -xzvf data/genomes.tgz -C data/fastas/ c. Download and modify the phenotype data: .. code-block:: console wget https://raw.githubusercontent.com/mgalardini/2018_ecoli_pathogenicity/master/data/phenotypes/phenotypes.tsv -O data/data.tsv The phenotype file contains two reference strains, "ED1a" and "IAI39". These strains should not be included in the phenotype file as they will cause conflicts within the pipeline. To remove these strains from you phenotype file, do the following: .. code-block:: console sed -i '/^ED1a/d; /^IAI39/d' data/data.tsv The following command will update your ``data/data.tsv`` file, adding the paths for fasta files. .. code-block:: console awk 'BEGIN {OFS="\t"} NR==1 {print "strain", "fasta", "phenotype"} NR>1 {print $1, "data/fastas/" $1 ".fasta", $3}' data/data.tsv > temp_file && mv temp_file data/data.tsv .. note:: To provide pre-computed GFF annotations instead of executing *de novo* assembly gene calling, add an optional ``gff`` column to the command and supply the paths (e.g. ``"data/gffs/" $1 ".gff"``). d. Verify the updated phenotype file: .. code-block:: console head -n 5 data/data.tsv You should see an output similar to the example below. The first column lists the sample IDs, the next column is the relative path to the assemblies in fasta format. The last column represents the phenotype: where 1 indicates that the strain is virulent, while 0 indicates the strain is avirulent. .. code-block:: none strain fasta phenotype ECOR-01 data/fastas/ECOR-01.fasta 0 ECOR-02 data/fastas/ECOR-02.fasta 1 ECOR-03 data/fastas/ECOR-03.fasta 0 ECOR-04 data/fastas/ECOR-04.fasta 0 e. Clean up: Remove the compressed files, you do not need them anymore: .. code-block:: console rm data/genomes.tgz f. Verify your directory structure: After executing the aforementioned steps, your directory structure should look something like this: .. code-block:: none data/ ├── data.tsv ├──fastas/ │ ├── genome1.fasta │ ├── genome2.fasta │ └── ... You can confirm by doing: .. code-block:: console ls data/ vi. Set up the environment and configure the pipeline: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ a. Set up the eggnog-mapper database: The ``microGWAS`` pipeline requires the eggnog database for functional annotation. You have two options: i. If you have an existing eggnog database: Create a symbolic link to your actual eggnog data directory. .. code-block:: console ln -s /storage/miniconda3/envs/eggnog-mapper/lib/python3.9/site-packages/data/ data/eggnog-mapper Remember to replace ``/storage/miniconda3/envs/eggnog-mapper/lib/python3.9/site-packages/data/`` with the actual path on your system. ii. If you do not have the eggnog database: Simply proceed to run the ``microGWAS`` pipeline. The pipeline will automatically download and setup the required eggnog database during its execution. .. note:: Creating a symbolic link is only necessary if you're using an existing eggNOG database. This might be preferred as the final database size is more than 50Gb; it therefore makes sense to setup this database once and link it in each ``microGWAS`` analysis you are carrying out. b. Configure the pipeline: Ensure that the ``##### params #####`` section of the ``config/config.yaml`` file matches the print out below. .. code-block:: yaml targets: [ "phenotype", ] # MLST scheme mlst_scheme: ecoli # references for association summaries and annotation summary_references: "--reference 536 --reference CFT073 --reference ED1a --reference IAI1 --reference IAI39 --reference K-12_substr._MG1655 --reference UMN026 --reference UTI89" annotation_references: "--focus-strain 536 --focus-strain CFT073 --focus-strain ED1a --focus-strain IAI1 --focus-strain IAI39 --focus-strain K-12_substr._MG1655 --focus-strain UMN026 --focus-strain UTI89" enrichment_reference: "IAI39" # species to be used for AMR and virulence predictions species_amr: "Escherichia" 2. Running the ``microGWAS`` pipeline ------------------------------------- Run the bootsrapping script. .. code-block:: console bash bootstrap.sh Escherichia coli IAI39 GCF_000013305.1,GCF_000007445.1,GCF_000026305.1,GCF_000026265.1,GCF_000026345.1,GCF_000005845.2,GCF_000026325.1,GCF_000013265.1 This script populates the input files used for the analysis and downloads the relevant reference genomes necessary for annotating the hits for *Escherichia coli* and analyse the variants associated to the phenotype. The syntax for this script is ``bash bootstrap.sh GENUS SPECIES MAINREFERENCE REFSEQID1,[...],REFSEQIDn``, where ``REFSEQID`` indicates the NCBI assembly ID of the various reference genomes to be used in the analysis, and ``MAINREFERENCE`` the name of the main reference genome. To run the full analysis, use the following command. .. code-block:: console snakemake -p annotate_summary find_amr_vag manhattan_plots heritability enrichment_plots qq_plots tree wg_metrics --cores 24 --use-conda --conda-frontend mamba This will: - Run the GWAS analysis to identify and annotate genetic variants that are associated with the virulence phenotype (``annotate_summary``) - Generate a phylogenetic tree of all isolates and the selected references (``tree``) - Identify antimicrobial resistant and virulence associated genes (``find_amr_vag``) - Perform an enrichment analysis for the genes with the associated variants, and plot the results (``enrichment_plots``) - Compute the heritability of the phenotype, determined by the genetic variants and the lineage (``heritability``) - Generate QQ-plots to diagnose the association analysis (``qq_plots``) - Generate manhattan plots on the unitigs mapped to a reference (``manhattan_plots``) The analysis will be parallelized using up to 24 cores (``--cores 24``), and the necessary tools will be installed in separated conda environments (``--use-conda --conda-frontend mamba``). The whole process will likely take more than 24 hours. .. note:: Change the ``--cores`` argument to indicate what is the maximum number of cores that can be used to parallelize the analysis Customizing your analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^ You can specify which :doc:`rules` you want the pipeline to run. For example, to run the pipeline without generating a phylogenetic tree: .. code-block:: console snakemake -p annotate_summary manhattan_plots heritability enrichment_plots qq_plots wg_metrics --cores 24 --use-conda --conda-frontend mamba This command runs all the same analyses as before, except for generating the phylogenetic tree and identifying AMR and virulence associated genes. 3. Understanding the results ---------------------------- ``microGWAS`` generates multiple output files and figures which can be accessed from within the ``out/`` directory. For a detailed descripition of all the outputs, refer to :doc:`outputs` section of this documentation. For the purpose of this tutorial, we will focus on key results replicated from the `Galardini et al. (2020) study `_ a. Unitig-based association analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Unitigs are unique DNA sequences that serve as markers for genetic variation. ``microGWAS`` uses unitigs with a minimum allele frequency (MAF) of > 1%. .. image:: ../images/manhattan.png :alt: Manhattan plot of the associated variants :align: center This Manhattan plot shows unitigs associated with virulence. Peaks above the red dashed line represent genomic regions strongly associated with the virulence phenotype. These unitigs are related to three iron-uptake systems: the high-pathogenecity island (HPI), aerobactin, and the *sitABCD* operon. For a closer look at specific genomic regions of interest related to virulence factors in *E. coli*, you can generate zoomed-in Manhattan plots. The focus will be on three key areas: the high pathogencity island (HPI), the aerobactin siderophore system, and the *sitABCD* iron transport operon. To created these detailed plots, run the following command first: .. code-block:: console python3 workflow/scripts/count_patterns.py out/associations/phenotype/unitigs_patterns.txt --threshold This command uses the number of unique unitigs' presence/absence patterns to derive an appropriate p-value threshold. You should obtain a value of ``2.16E-08``. You can then run the script to generate the manhattan plots for the three regions of interest: .. code-block:: console python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 HPI.png --threshold 2.16E-08 --zoom HPI 1.05 1.25 30 python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 sitABCD.png --threshold 2.16E-08 --zoom sitABCD 1.95 2.05 15 python3 workflow/scripts/manhattan_plot.py out/associations/phenotype/mapped_all.tsv IAI39 aerobactin.png --threshold 2.16E-08 --zoom aerobactin 3.4 3.8 15 .. note:: Run ``python3 workflow/scripts/manhattan_plot.py -h`` for an explanation of the syntax used by this script .. image:: ../images/zoom.png :alt: A zoom-in on the associated areas of the Manhattan plot for the HPI, aerobacting and *sitABCD* operon regions. :align: center The plot was generated for the "IAI39" reference genome, and the zoomed-in views were based on the genomic positions of the regions of interest. You can also generate volcano plots to visualise the statistical significance and magnitute of the effect for the tested genetic variants. The following code will generate a volcano plots using the ``annotate_summary.tsv`` file, which contains the summary statistics and gene annotation for the unitigs association analysis. .. code-block:: console python3 workflow/scripts/volcano_plot.py out/associations/phenotype/annotated_summary.tsv volcano.png --threshold 2.16E-08 --genes fyuA sitA iucC This plot represents associations using unitigs as the genetic markers. .. image:: ../images/volcano.png :alt: A volcano plot of the associated variants. :align: center Each point represents a specific gene. The highlighted genes belong to the high pathogenecity island, the aerobactin, and the *sitABCD* operon. The x-axis represents the average beta value (effect size), which indicates the magnitude and direction of the association between the unitigs and the virulence phenotye. Points on the right indicate positive associations and those on the left indicate negative associations. The y-axis shows the statistical significance. The red dashed horizontal line indicates the signficance threshold computed using the ``unitigs_patterns.txt`` file. Similar plots can be created using ``annotated_gpa_summary.tsv`` or ``annotated_panfeed_summary.tsv``. b. Gene cluster-specific k-mer association analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This analysis links specific k-mers to their source genes, using ``panfeed``. .. image:: ../images/panfeed.png :alt: Associations plots for gene cluster specific k-mers. :align: center These plots represent association for gene cluster specific k-mers for *fyuA*, *iucC*, and *sitA* genes. The y-axis represents each isolate and the x-axis the k-mer positions relative to the gene start codon for each strain. The colors correspond to the -log10 of the association p-value. The dark gray regions imply that the isolates do not encode for the k-mers, while the light gray regions represent k-mers under the association threshold. c. Functional enrichment analysis ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This analysis identified overrepresented functional categories among genes with associated variants. .. image:: ../images/enrich_cog.png :alt: Enrichment analysis of the associated unitigs for different COG categories. :align: center The plot shows enrichment of clusters of orthologous groups (COG) categories. The y-axis of the plot represents each COG catergory, and x-axis the number of gene hits belonging to each category. The bars are colored based on the -log10 of the enrichment corrected p-value. Bars colored in grey do not have a significant enrichment. d. Other outputs ^^^^^^^^^^^^^^^^ More information about the results of the association analysis can be found within the ``out`` directory. The content of each folder/file is reported in the :doc:`outputs` section. 4. Troubleshooting ------------------- If you have persistent issues, please consult the ``Troubleshooting`` :doc:`usage` guide or seek help in the `project's issue tracker `_.