Outputs

General output folder structure

Running the suggested pipeline rules (annotate_summary find_amr_vag map_back manhattan_plots heritability enrichment_plots qq_plots tree) will yield the following directory structure in the out folder:

out
|____associations
| |____inputs
| | |____phenotype
| |____phenotype
| | |____panfeed_plots
|____wg
| |____inputs
| | |____phenotype
| |____phenotype
|____panfeed
|____snps
| |____unet
|____panaroo
|____unitigs
|____abritamr
|____logs

  • associations: contains the inputs and outputs for the locus and lineage associations, with one subfolder for each target phenotype

    • inputs: contains the inputs for the associations

    • phenotype: contains the association outputs, annotated summaries and functional enrichments

  • wg: contains the inputs and outputs for the whole genome associations, with one subfolder for each target phenotype

    • inputs: contains the inputs for the whole genome associations

    • phenotype: contains the association outputs, annotated summaries and functional enrichments (for both lasso and ridge models)

  • panfeed: contains the input for the gene cluster specific k-mers associations

  • snps: contains the input for the rare variants associations

    • unet: contains the estimated impact of all possible non-synonymous variants across the reference genome’s proteome

  • panaroo: contains the pangenome of all samples and references, as well as the core genome phylogenetic tree

  • unitigs: contains the unitig variant set based on a “global” de Bruijn graph

  • abritamr: contains the predicted virulence associated genes (VAGs) and antimicrobial resistance genes (ARGs) for each sample

  • logs: contains the log files generated during the execution of each rule by snakemake, which can be used to inspect errors

Note

If multiple phenotypes are defined in the config file, there will be multiple folders in associations and wg.
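
As an illustration of the layout described above, the following minimal sketch lists the per-phenotype folders one would expect for a set of phenotypes. The function name and the phenotype names ("resistance", "virulence") are made up for this example; only the folder names come from the structure shown in this section.

```python
from pathlib import Path


def expected_phenotype_dirs(out_dir, phenotypes):
    """Return the per-phenotype folders described in this section."""
    out = Path(out_dir)
    dirs = {}
    for name in phenotypes:
        dirs[name] = [
            out / "associations" / "inputs" / name,  # association inputs
            out / "associations" / name,             # association outputs
            out / "wg" / "inputs" / name,            # whole genome inputs
            out / "wg" / name,                       # whole genome outputs
        ]
    return dirs


# Hypothetical phenotype names, standing in for those in the config file
layout = expected_phenotype_dirs("out", ["resistance", "virulence"])
for name, folders in layout.items():
    for folder in folders:
        print(folder.as_posix())
```

Checking which of these folders exist after a run (for example with Path.is_dir()) is a quick way to see which phenotypes have completed.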

Output files

The above directories will contain the following files:

out
|____abritamr
| |____summary_virulence.txt
| |____summary_matches.txt

The virulence associated genes (VAGs) will be listed in the summary_virulence.txt file (under the column Virulence), while the antimicrobial resistance genes (ARGs) will be listed in the summary_matches.txt file, with one column per antimicrobial “class”.
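
The sketch below shows one way such a summary table could be read into a per-sample dictionary. It assumes the files are tab-separated, that the first column holds the sample identifier, and that multiple genes in a cell are separated by commas; these are assumptions about the file layout, and the sample and gene names in the example are invented.

```python
import csv
import io


def genes_per_sample(handle):
    """Map each sample to {column name: [genes]} for a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    sample_col = reader.fieldnames[0]  # assumed: first column is the sample ID
    result = {}
    for row in reader:
        calls = {}
        for column, cell in row.items():
            if column == sample_col or not cell:
                continue  # skip the ID column and empty cells
            # assumed: several genes in one cell are comma-separated
            calls[column] = [g.strip() for g in cell.split(",") if g.strip()]
        result[row[sample_col]] = calls
    return result


# Invented example mimicking summary_virulence.txt
example = "Isolate\tVirulence\nsample1\tgeneA,geneB\nsample2\t\n"
vags = genes_per_sample(io.StringIO(example))
print(vags)
```

The same function can be applied to summary_matches.txt, in which case each key of the inner dictionary would be an antimicrobial class.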

out
|____associations
| |____inputs
| | |____phenotype
| | | |____distances.tsv
| | | |____lineages.tsv
| | | |____lineages_covariance.tsv
| | | |____phenotypes.tsv
| | | |____similarity.tsv
| |____phenotype
| | |____annotated_summary.tsv
| | |____annotated_gpa_summary.tsv
| | |____annotated_panfeed_summary.tsv
| | |____annotated_rare_summary.tsv
| | |____annotated_vcf.tsv
| | |____heritability_all.tsv
| | |____unitigs_lineage.txt
| | |____mapped.tsv
| | |____mapped_all.tsv
| | |____panfeed.tsv
| | |____panfeed_filtered.tsv
| | |____rare.tsv
| | |____rare_filtered.tsv
| | |____struct.tsv
| | |____struct_filtered.tsv
| | |____unitigs.tsv
| | |____unitigs_filtered.tsv
| | |____unitigs_patterns.txt
| | |____vcf.tsv
| | |____vcf_filtered.tsv
| | |____vcf_patterns.txt
| | |____gpa.tsv
| | |____gpa_filtered.tsv
| | |____manhattan.png
| | |____qq_gpa.png
| | |____qq_rare.png
| | |____qq_unitigs.png
| | |____COG.png
| | |____COG.tsv
| | |____COG_gpa.png
| | |____COG_gpa.tsv
| | |____COG_panfeed.png
| | |____COG_panfeed.tsv
| | |____COG_rare.png
| | |____COG_rare.tsv
| | |____GO.png
| | |____GO.tsv
| | |____GO_gpa.png
| | |____GO_gpa.tsv
| | |____GO_panfeed.png
| | |____GO_panfeed.tsv
| | |____GO_rare.png
| | |____GO_rare.tsv
| | |____KEGG.png
| | |____KEGG.tsv
| | |____KEGG_gpa.png
| | |____KEGG_gpa.tsv
| | |____KEGG_panfeed.png
| | |____KEGG_panfeed.tsv
| | |____KEGG_rare.png
| | |____KEGG_rare.tsv
| | |____panfeed_annotated_kmers.tsv.gz
| | |____panfeed_plots
| | | |____hybrid_GENE.png
| | | |____sequence_GENE.png
| | | |____significance_GENE.png
| | | |____sequence_legend.png

  • inputs folder: the distances.tsv, lineages.tsv, lineages_covariance.tsv, phenotypes.tsv, and similarity.tsv files contain the association inputs for each target phenotype, so that they only contain the samples for which the phenotypic data is available

  • annotated_*.tsv: contains the annotations of the genes to which variants passing the association threshold map. Each row contains a gene, followed by the average summary statistics of its associations, the frequency of the gene in the pangenome, the locus tag and gene name if the gene is encoded in the chosen reference(s), and finally the annotations given by eggnog-mapper, including COGs, GO terms and KEGG annotations. annotated_vcf.tsv has a different format, since it reports individual short variants against the chosen reference along with their predicted effect

  • heritability_all.tsv: contains information about what proportion of the phenotypic variation can be explained by either the lineage membership or the genetic variants. The genetics column indicates which of these genetic components was used for the heritability estimation, lik the likelihood model used for the estimation, and h2 the proportion of phenotypic variance explained by the genetic effects

  • unitigs_lineage.txt: lineage associations output; for each lineage the association p-value is reported. The name is misleading, as the unitig presence/absence patterns have not been used for these association tests

  • mapped.tsv: mapping information on the unitigs passing the association threshold, across all samples and reference(s)

  • mapped_all.tsv: mapping information for all tested unitigs to the reference genome(s)

  • panfeed.tsv, rare.tsv, vcf.tsv, struct.tsv, unitigs.tsv, and gpa.tsv: contain the raw association results as given by pyseer, with one file per variant set

  • panfeed_filtered.tsv, rare_filtered.tsv, vcf_filtered.tsv, struct_filtered.tsv, unitigs_filtered.tsv, and gpa_filtered.tsv: contain the variants passing the association threshold

  • manhattan.png: Manhattan plot for all unitigs mapping to the main reference genome

  • qq_*.png: QQ plots comparing the distribution of observed p-values against the distribution expected under the null hypothesis

  • COG_*.tsv, GO_*.tsv, and KEGG_*.tsv: functional enrichment tests results for each variant set

  • COG_*.png, GO_*.png, and KEGG_*.png: plots to visualise the results of the functional enrichment tests

  • panfeed_annotated_kmers.tsv.gz: detailed annotation of all k-mers mapping to associated gene clusters, as given by panfeed

  • panfeed_plots: visualization of the gene cluster specific k-mers, with 3 files for each associated gene cluster, as given by panfeed
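
To illustrate the relationship between the raw and the *_filtered.tsv files listed above, here is a minimal sketch that keeps only the rows of a pyseer-style table whose p-value falls below a threshold. It assumes a tab-separated table with an lrt-pvalue column, as in standard pyseer output; the threshold and the example rows are invented, and this is not the pipeline's own filtering code.

```python
import csv
import io


def passing_variants(handle, threshold, pvalue_column="lrt-pvalue"):
    """Return the rows whose p-value is below the given threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        try:
            pvalue = float(row[pvalue_column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric p-value
        if pvalue < threshold:
            kept.append(row)
    return kept


# Invented example rows; the threshold of 1e-6 is arbitrary
example = (
    "variant\taf\tlrt-pvalue\tbeta\n"
    "v1\t0.2\t1e-9\t0.8\n"
    "v2\t0.4\t0.3\t0.1\n"
    "v3\t0.1\t\t0.0\n"
)
hits = passing_variants(io.StringIO(example), threshold=1e-6)
print([row["variant"] for row in hits])
```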

out
|____panfeed
| |____kmers_to_hashes.tsv
| |____kmers.tsv
| |____hashes_to_patterns.tsv

  • kmers_to_hashes.tsv: file used to match gene clusters, k-mer sequences and the hash for the respective presence/absence pattern.

  • kmers.tsv: k-mers metadata file

  • hashes_to_patterns.tsv: contains the binary presence/absence matrix for all unique k-mer patterns (rows) across samples (columns)
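
The way these files fit together can be sketched as a two-step lookup: from a k-mer to its pattern hash, then from the hash to the samples carrying it. The column names used below ("cluster", "k-mer", "hashed_pattern") and the example content are assumptions made for this illustration and should be checked against the actual file headers.

```python
import csv
import io


def samples_with_kmer(kmers_to_hashes, hashes_to_patterns, kmer):
    """Return the samples in which the given k-mer is present."""
    wanted = None
    for row in csv.DictReader(kmers_to_hashes, delimiter="\t"):
        if row["k-mer"] == kmer:            # assumed column name
            wanted = row["hashed_pattern"]  # assumed column name
            break
    if wanted is None:
        return []
    for row in csv.DictReader(hashes_to_patterns, delimiter="\t"):
        if row["hashed_pattern"] == wanted:
            # every other column is a sample; "1" marks presence
            return [s for s, v in row.items()
                    if s != "hashed_pattern" and v == "1"]
    return []


# Invented example content for the two tables
k2h = "cluster\tk-mer\thashed_pattern\ngroup_1\tACGTA\th1\ngroup_1\tTTGCA\th2\n"
h2p = "hashed_pattern\ts1\ts2\ts3\nh1\t1\t0\t1\nh2\t0\t1\t0\n"
carriers = samples_with_kmer(io.StringIO(k2h), io.StringIO(h2p), "ACGTA")
print(carriers)
```

Storing each unique pattern once and referring to it by hash keeps the matrix small, since many k-mers share the same presence/absence pattern.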

out
|____similarity.tsv
|____distances.tsv
|____annotated_reference.tsv
  • similarity.tsv and distances.tsv provide information about the genetic relatedness of the test strains. They are both used to account for population structure during the association analysis.

  • annotated_reference.tsv is the functional annotation of the reference using eggnog-mapper. It provides mappings to COG categories, KEGG terms, pathways and more.

out
|____snps
| |____common.vcf.gz
| |____rare.vcf.gz
| |____unet
| | |____PROTEIN_ID_1.tsv.gz
| | |____PROTEIN_ID_2.tsv.gz
| | |____[...]

  • common.vcf.gz: all common short variants with respect to the chosen reference genome identified across all samples merged into a single VCF file.

  • rare.vcf.gz: all rare deleterious variants identified across all samples merged into a single VCF file.

  • unet: this directory contains, for each protein sequence encoded in the reference genome, the estimated impact of every possible non-synonymous variant. The pred column indicates the probability that a variant is deleterious; the pipeline uses a threshold of 0.5.
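
As a minimal sketch of how the pred column and the 0.5 threshold could be applied to one of the per-protein tables, the function below keeps the rows predicted to be deleterious. Only the pred column is taken from the description above; the "variant" column, the example values, and the choice of treating the threshold as inclusive are assumptions.

```python
import csv
import io


def deleterious_variants(handle, threshold=0.5):
    """Return rows whose 'pred' probability reaches the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    # assumed: the threshold is inclusive (pred >= threshold)
    return [row for row in reader if float(row["pred"]) >= threshold]


# Invented example table; "variant" is a hypothetical column name
example = "variant\tpred\nA10V\t0.91\nG25S\t0.12\nL40P\t0.50\n"
hits = deleterious_variants(io.StringIO(example))
print([row["variant"] for row in hits])
```

Since the real files are gzip-compressed, they could be opened in text mode with gzip.open(path, "rt") before being passed to the function.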

out
|____wg
| |____inputs
| | |____phenotype
| | | |____distances.tsv
| | | |____lineages.tsv
| | | |____phenotypes.tsv
| | | |____similarity.tsv
| | | |____variants.npz
| | | |____variants.pkl
| |____phenotype
| | |____annotated_summary_lasso.tsv
| | |____annotated_summary_ridge.tsv
| | |____COG_lasso.png
| | |____COG_lasso.tsv
| | |____COG_ridge.png
| | |____COG_ridge.tsv
| | |____GO_lasso.png
| | |____GO_lasso.tsv
| | |____GO_ridge.png
| | |____GO_ridge.tsv
| | |____KEGG_lasso.png
| | |____KEGG_lasso.tsv
| | |____KEGG_ridge.png
| | |____KEGG_ridge.tsv
| | |____lasso.pkl
| | |____lasso.tsv
| | |____lasso_predictions.tsv
| | |____metrics_lasso.tsv
| | |____mapped_lasso.tsv
| | |____mapped_ridge.tsv
| | |____ridge.pkl
| | |____ridge.tsv
| | |____ridge_predictions.tsv
| | |____metrics_ridge.tsv

The contents of the wg folder are very similar to the equivalent files in the associations folder. The differences are:

  • in the inputs subfolder: variants.* are the pyseer checkpoint files, used to avoid loading the full set of unitigs multiple times

  • lasso.tsv and ridge.tsv: association output between each unitig and the phenotype

  • lasso_predictions.tsv and ridge_predictions.tsv: table showing the true and predicted values for each sample

  • metrics_lasso.tsv and metrics_ridge.tsv: model prediction performance metrics on the training set. The actual metrics depend on whether the phenotype is binary or continuous

  • lasso.pkl and ridge.pkl: pyseer checkpoint file containing the trained machine learning model, which can be used to predict the phenotype in new samples
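
To make the link between the predictions tables and the metrics tables concrete, the sketch below computes two common regression metrics (RMSE and R²) from a table of true and predicted values, as one might for a continuous phenotype. The column names "true" and "predicted" and the example values are assumptions; the metrics actually reported by the pipeline may differ.

```python
import csv
import io
import math


def regression_metrics(handle, true_col="true", pred_col="predicted"):
    """Compute RMSE and R^2 from a tab-separated predictions table."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    y = [float(r[true_col]) for r in rows]
    p = [float(r[pred_col]) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))  # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in y)        # total sum of squares
    rmse = math.sqrt(ss_res / len(y))
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"rmse": rmse, "r2": r2}


# Invented example; column names are hypothetical
example = "sample\ttrue\tpredicted\ns1\t1\t1\ns2\t2\t2\ns3\t3\t4\n"
metrics = regression_metrics(io.StringIO(example))
print(metrics)
```

Because these values are computed on the samples used for training, they describe the fit of the model and are likely to overestimate performance on new samples.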

out
|____ggcaller
| |____gene_calls.faa
| |____gene_calls.ffn
| |____GFF
| |____ORF_dir
| |____Path_dir

Note

If user-provided GFF files are specified via the input table (use_user_gffs: true), the pipeline bypasses de novo structural annotation and assembly processing entirely. As a result, the out/ggcaller folder shown above will not be populated, except for manifest validation logs.

  • gene_calls.faa: contains the predicted protein sequences

  • gene_calls.ffn: contains the predicted nucleotide sequences

  • GFF: contains the gene calling results in GFF format

  • ORF_dir: contains the predicted open reading frames

  • Path_dir: contains the predicted pathways

out
|____panaroo
| |____gene_presence_absence.Rtab
| |____gene_presence_absence.csv
| |____struct_presence_absence.Rtab
| |____core_gene_alignment.aln
| |____core_gene_alignment.aln.treefile
| |____core_gene_alignment.vcf.gz
| |____pangenome_sample.faa
| |____pangenome.emapper.annotations

  • gene_presence_absence.Rtab: binary presence/absence file for gene clusters; for each orthologous gene identified by panaroo, its presence (1) or absence (0) is reported for all samples and the selected references

  • gene_presence_absence.csv: describes which gene clusters are present in which samples, and if so, it provides the gene IDs/locus tags; paralogs are separated by the ; character

  • struct_presence_absence.Rtab: presence/absence file for gene ordering variants, with the involved genes listed in the first column, separated by the - character

  • core_gene_alignment.aln: contains the core genome alignment generated through the concatenation of the alignment of each gene

  • core_gene_alignment.aln.treefile: contains a phylogenetic tree constructed from the core genome alignment file core_gene_alignment.aln

  • core_gene_alignment.vcf.gz: contains the core genome alignment in VCF format

  • pangenome_sample.faa: contains a sampled FASTA file of protein sequences from the pangenome, including genes from the focus reference strains

  • pangenome.emapper.annotations: contains functional annotations for the pangenome generated by eggnog-mapper, including COG categories, GO terms, KEGG pathways, and other functional information for each gene cluster
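
As an example of working with the presence/absence matrix, the sketch below computes the fraction of genomes carrying each gene cluster from a table in the Rtab layout (tab-separated, gene identifier in the first column, one 0/1 column per genome). The gene and sample names are invented for this illustration.

```python
import csv
import io


def gene_frequencies(handle):
    """Return {gene cluster: fraction of genomes in which it is present}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the gene identifier
    freqs = {}
    for row in reader:
        if not row:
            continue  # ignore empty lines
        present = sum(1 for value in row[1:] if value == "1")
        freqs[row[0]] = present / n_genomes
    return freqs


# Invented example in Rtab layout
example = "Gene\ts1\ts2\ts3\ts4\ngeneA\t1\t1\t1\t1\ngeneB\t1\t0\t0\t1\n"
freqs = gene_frequencies(io.StringIO(example))
print(freqs)
```

A frequency of 1.0 marks a gene cluster found in every genome, while lower values point to accessory genes.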

out
|____unitigs
| |____unitigs.unique_rows.Rtab.gz
| |____unitigs.unique_rows_to_all_rows.txt
| |____unitigs.txt.gz

  • unitigs.unique_rows.Rtab.gz: contains the unique unitig patterns found across the input genomes. The number of lines represents the number of unique tests that need to be corrected for in the association analysis

  • unitigs.unique_rows_to_all_rows.txt: provides information on the mapping from the unique unitig patterns to all instances of those patterns observed across the input genomes

  • unitigs.txt.gz: contains the list of unitigs counted across the input genomes and the samples that encode them
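
Since the number of unique patterns determines the number of tests to correct for, a multiple testing threshold can be derived from it. The sketch below counts the patterns in a table and applies a Bonferroni correction; the presence of a header line, the significance level of 0.05, and the example content are assumptions, and the pipeline may compute its threshold differently.

```python
import io


def count_patterns(handle, has_header=True):
    """Count non-empty lines, optionally excluding one header line."""
    n_lines = sum(1 for line in handle if line.strip())
    if has_header and n_lines > 0:
        return n_lines - 1
    return n_lines


def bonferroni_threshold(n_tests, alpha=0.05):
    """Divide the significance level by the number of unique tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Invented example with a header line and three unique patterns
example = "pattern_id\ts1\ts2\np1\t1\t0\np2\t0\t1\np3\t1\t1\n"
n_patterns = count_patterns(io.StringIO(example))
threshold = bonferroni_threshold(n_patterns)
print(n_patterns, threshold)
```

For the real, gzip-compressed file, the handle could be obtained with gzip.open(path, "rt").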