Outputs
General output folder structure
Running the suggested pipeline rules (annotate_summary find_amr_vag map_back manhattan_plots heritability enrichment_plots qq_plots tree)
will yield the following
directory strcture in the out outputs folder:
out
|____associations
|
|____inputs
| | |____phenotype
| |____phenotype
| |
| |____panfeed_plots
|____wg
| |____inputs
| |
|____phenotype
|
|____phenotype
|____panfeed
|____snps
|
|____unet
|____panaroo
|____unitigs
|____abritamr
|____logs
associations: contains the inputs and outputs for the locus and lineage associations, with one subfolder for each target phenotypeinputs: contains the inputs for the associationsphenotype: contains the association outputs, annotated summaries and functional enrichments
wg: contains the inputs and outputs for the whole genome associations, with one subfolder for each target phenotypeinputs: contains the inputs for the whole genome associationsphenotype: contains the associations output, annotated summaries and functional enrichmennts (for both lasso and ridge models)
panfeed: contains the input for the gene cluster specific k-mers associationssnps: contains the input for the rare variants associationsunet: contains the estimated impact of all possible non-synonymous variants across the reference genome’s proteome
panaroo: contains the pangenome of all samples and references, as well as the core genome phylogenetic treeunitigs: contains the unitigs variant set based on a “global” de Brujin graph.abritamr: contains the predicted virulence associated genes (VAGs) and antimicrobial resistance gene (ARGs) for each samplelogs: contains the log files generated during the execution of each rule by snakemake, which can be used to inspect errors
Note
If multiple phenotypes are defined in the config file, there will be multiple folders in associations and wg.
Output files
The above directories will contain the following files:
out
|____abritamr
|
|____summary_virulence.txt
| |____summary_matches.txt
The virulence associated genes (VAGs) will be listed in the summary_virulence.txt file
(under the column Virulence),
while the antimicrobial resistance genes (ARGs) will be listed in the summary_matches.txt file,
with one column per antimicrobial “class”.
out
|____associations
|
|____inputs
| | |____phenotype
| | |
|____distances.tsv
| | |
| | | |____lineages.tsv
| | |
|____lineages_covariance.tsv
| | | |____phenotypes.tsv
| |
| |
|____similarity.tsv
| |____phenotype
| | |____annotated_summary.tsv
| |
| | |____annotated_gpa_summary.tsv
| | |____annotated_panfeed_summary.tsv
| |
|____annotated_rare_summary.tsv
| |
| | |____annotated_vcf.tsv
| |
|____heritability_all.tsv
| | |____unitigs_lineage.txt
| |
| |
|____mapped.tsv
| | |____mapped_all.tsv
| | |____panfeed.tsv
|
|
| | |____panfeed_filtered.tsv
| | |____rare.tsv
| |
|____rare_filtered.tsv
| |
| | |____struct.tsv
| |
|____struct_filtered.tsv
| | |____unitigs.tsv
| |
| |
|____unitigs_filtered.tsv
| | |____unitigs_patterns.txt
| | |____vcf.tsv
|
|
| | |____vcf_filtered.tsv
| | |____vcf_patterns.txt
| |
|____gpa.tsv
| |
| | |____gpa_filtered.tsv
| |
|____manhattan.png
| | |____qq_gpa.png
| |
| |
|____qq_rare.png
| | |____qq_unitigs.png
| | |____COG.png
|
|
| | |____COG.tsv
| | |____COG_gpa.png
| |
|____COG_gpa.tsv
| |
| | |____COG_panfeed.png
| |
|____COG_panfeed.tsv
| | |____COG_rare.png
| |
| |
|____COG_rare.tsv
| | |____GO.png
| | |____GO.tsv
|
|
| | |____GO_gpa.png
| | |____GO_gpa.tsv
| |
|____GO_panfeed.png
| |
| | |____GO_panfeed.tsv
| |
|____GO_rare.png
| | |____GO_rare.tsv
| |
| |
|____KEGG.png
| | |____KEGG.tsv
| | |____KEGG_gpa.png
|
|
| | |____KEGG_gpa.tsv
| | |____KEGG_panfeed.png
| |
|____KEGG_panfeed.tsv
| |
| | |____KEGG_rare.png
| |
|____KEGG_rare.tsv
| | |____panfeed_annotated_kmers.tsv.gz
| |
|
|____panfeed_plots
| | | |____hybrid_GENE.png
| | |
|____sequence_GENE.png
| |
| | | |____significance_GENE.png
| | |
|____sequence_legend.png
inputsfolder: thedistances.tsv,lineages.tsv,lineages_covariance.tsv,phenotypes.tsv, andsimilarity.tsvfiles contain the association inputs for each target phenotype, so that they only contain the samples for which the phenotypic data is availableannotated_*.tsv: contains the annotations of genes to which variants passing the association threshold map to; each row contains a gene, followed by the average associations’ summary statistics, the frequency of the gene in the pangenome, the locus tag and gene name of the gene if it’s encoded in the chosen reference(s), and finally the annotations given byeggnog-mapper, including COGs, GO terms and KEGG annotations.annotated_vcf.tsvhas a different format, since it reports individual short variants against the chosen reference along with their predicted effectheritability_all.tsv: contains information about what proportion of the phenotypic variation can be explained by either the lineage membership or the genetic variants. The genetics column indicates the likelihood model used for the heritability estimation, lik the likelihood model used for the heritability estimation, h2, the proportion of phenotypic variance explained by the genetic effects.unitigs_lineage.txt: lineage associations output; for each lineage the association p-value is reported; the name is misleading, as the unitigs presence/absence patterns have not been used for this association testsmapped.tsv: mapping information on the unitigs passing the association threshold, across all samples and reference(s)mapped_all.tsv: mapping information for all tested unitigs to the reference genome(s)panfeed.tsv,rare.tsv,vcf.tsv,struct.tsv,unitigs.tsv, andgpa.tsv: contain the raw association results as given bypyseer, with one file per variant setpanfeed_filtered.tsv,rare_filtered.tsv,vcf_filtered.tsv,struct_filtered.tsv,unitigs_filtered.tsv, andgpa_filtered.tsv: contain the variants passing the association thresholdmanhattan.png: manhattan plot for all unitigs mapping to the main reference genomeqq_*.png: QQ plot to assess the distribution of observed p-values with the expected distribution under the null hypothesis of the test statisticsCOG_*.tsv,GO_*.tsv, andKEGG_*.tsv: functional enrichment tests results for each variant setCOG_*.png,GO_*.png, andKEGG_*.png: plots to visualise the results of the functional enrichment testspanfeed_annotated_kmers.tsv.gz: detailed annotation of all k-mers mapping to associated gene clusters, as given bypanfeedpanfeed_plots: visualizaion of the gene-cluster specific k-mers, with 3 files for each associated gene cluster, as given bypanfeed
out
|____panfeed
|
|____kmers_to_hashes.tsv
| |____kmers.tsv
|
|____hashes_to_patterns.tsv
kmers_to_hashes.tsv: file used to match gene clusters, k-mer sequences and the hash for the respective presence/absence pattern.kmers.tsv: k-mers metadata filehashes_to_patterns.tsv: file contains binary presence/absence matrix for all unique k-mer patterns (rows) across samples (columns)
out
|____similarity.tsv
|____distances.tsv
|____annotated_reference.tsv
similarity.tsvanddistances.tsvprovides information about the genetic reletedness of the test strains. They are both used to account for population structure during the association analysis.annotated_reference.tsvis the functional annotation of the reference usingeggnog-mapper. It provides mappings to COG categories, KEGG terms, pathways and more.
out
|____snps
|
|____common.vcf.gz
| |____rare.vcf.gz
| |____unet
| |
|
|____PROTEIN_ID_1.tsv.gz
| | |____PROTEIN_ID_2.tsv.gz
| |
|
|____[...]
common.vcf.gz: all common short variants with respect to the chosen reference genome identified across all samples merged into a single VCF file.rare.vcf.gz: all rare deleterious variants identified across all samples merged into a single VCF file.unet: this directory contains, for each protein sequence encoded in the reference genome, the estimated impact of every possible non-synonymous variants. Thepredcolumn indicates the probability that a variant is deleterious; the pipeline uses a threshold of 0.5.
out
|____inputs
|
|____phenotype
| | |____distances.tsv
| | |____lineages.tsv
|
|
|____phenotypes.tsv
| | |____similarity.tsv
| |
|____variants.npz
| |
|____variants.pkl
|____wg
|
|____phenotype
| | |____annotated_summary_lasso.tsv
|
| |____annotated_summary_ridge.tsv
|
| |____COG_lasso.png
| | |____COG_lasso.tsv
| |
|____COG_ridge.png
| | |____COG_ridge.tsv
| | |____GO_lasso.png
|
|
|____GO_lasso.tsv
| | |____GO_ridge.png
| |
|____GO_ridge.tsv
| |
|____KEGG_lasso.png
| | |____KEGG_lasso.tsv
|
| |____KEGG_ridge.png
| |
|____KEGG_ridge.tsv
| |
|____lasso.pkl
| | |____lasso.tsv
| |
|____lasso_predictions.tsv
|
| |____metrics_lasso.tsv
| | |____mapped_lasso.tsv
| |
|____mapped_ridge.tsv
| | |____ridge.pkl
| | |____ridge.tsv
|
|
|____ridge_predictions.tsv
| |
|____metrics_ridge.tsv
The contents of the wg are very similar to the equivalent files in the associations folder.
The differences are:
in the
inputs subfolder:variants.*are thepyseercheckpoint files to avoid loading the full set of unitigsmultiple timeslasso.tsvandridge.tsv: association output between each unitig and the phenotypelasso_predictions.tsvandridge_predictions.tsv: table showing the true and predicted values for each samplemetrics_lasso.tsvandmetrics_ridge.tsv: model prediction performance metrics on the training set. The actual metrics depend on whether the phenotype is binary or continuouslasso.pklandridge.pkl:pyseercheckpoint file containing the trained machine learning model, which can be used to predict the phenotype in new samples
out
|____ggcaller
|
|____gene_calls.faa
| |____gene_calls.ffn
| |____GFF
|
|____ORF_dir
|
|____Path_dir
Note
If user-provided GFF files are specified via the input table (use_user_gffs: true), the pipeline completely circumvents de novo structural annotations and assembly processing.
As a result, the default out/ggcaller/ internal pipeline folder outputs shown above will not be populated, except for manifest validation logs.
gene_calls.faa: contains the predicted protein sequencesgene_calls.ffn: contains the predicted nucleotide sequencesGFF: contains the gene calling results in GFF formatORF_dir: contains the predicted open reading framesPath_dir: contains the predicted pathways
out
|____panaroo
|
|____gene_presence_absence.Rtab
| |____gene_presence_absence.csv
| |____struct_presence_absence.Rtab
|
|____core_gene_alignment.aln
|
|____core_gene_alignment.aln.treefile
| |____core_gene_alignment.vcf.gz
|
|____pangenome_sample.faa
|
|____pangenome.emapper.annotations
gene_presence_absence.Rtab: gene clusters binary presence/absence file: for each orthologous gene identified by panaroo, its presence (1) and absence (0) is reported for all samples and the selected referencesgene_presence_absence.csv: describes which gene clusters are present in which samples, and if so, it provides the gene IDs/locus tags; paralogs are separated by the;characterstruct_presence_absence.Rtab: gene ordering variants presence/absence file, with the involved genes enlisted in the first column, separated with the-charactercore_gene_alignment.aln: contains the core genome alignment generated through the concatenation of the alignment of each genecore_gene_alignment.aln.treefile: contains a phylogenetic tree constructed from the core genome alignment filecore_gene_alignment.alncore_gene_alignment.vcf.gz: contains the core genome alignment in VCF formatpangenome_sample.faa: contains a sampled FASTA file of protein sequences from the pangenome, including genes from the focus reference strainspangenome.emapper.annotations: contains functional annotations for the pangenome generated by eggnog-mapper, including COG categories, GO terms, KEGG pathways, and other functional information for each gene cluster
out
|____unitigs
|
|____unitigs.unique_rows.Rtab.gz
| |____unitigs.unique_rows_to_all_rows.txt
|
|____unitigs.txt.gz
unitigs.unique_rows.Rtab.gz: contains the unique unitig patterns found across the input genomes. The number of lines represents the number of unique tests that need to be corrected for in the association analysisunitigs.unique_rows_to_all_rows.txt: provides information on the mapping from the unique unitig patterns to all instances of those patterns observed across the input genomesunitigs.txt.gz: contains the list of unitigs counted across the input genomes and which samples encode for them