Rules
The pipeline contains the following endpoint rules:
ggcaller
pangenome
annotate_pangenome
lineage_st
find_amr_vag
tree
unitigs
panfeed
combine_heritability
pyseer
pyseer_rare
wg
wg_metrics
map_back
qq_plots
manhattan_plots
enrichment
enrichment_plots
annotate_summary
These rules accomplish the following functions:
ggcaller: will generate GFF annotations for all sample genomes using ggCaller, providing gene predictions and functional annotations. If pre-computed GFF files are specified via the input table, this rule will instead validate the user-provided paths and skip the time-consuming de novo gene calling step.
unitigs: will generate a variant set from the input samples based on a “global” de Bruijn graph.
lineage_st: will generate a tab-separated file with the predicted sequence types (STs).
find_amr_vag: will generate a summary file with the predicted antimicrobial resistance genes, organized into functionally relevant groups, and a summary file with the predicted virulence associated genes.
pangenome: will find the orthologous gene clusters across all samples and the chosen references using panaroo.
annotate_pangenome: will generate functional annotations for the pangenome using eggnog-mapper, including COG categories, GO terms, and KEGG pathway mappings.
tree: will generate a phylogenetic tree from the core genome alignment produced by panaroo.
combine_heritability: will generate a file with the combined heritabilities, estimated from the lineage of each strain and from a kinship matrix built from the unitig presence/absence matrix.
pyseer: will test for associations between each unitig and the phenotype, as well as gene presence/absence patterns and lineage effects.
pyseer_rare: will test for rare variants based on the predicted deleterious protein-coding variants.
panfeed: will test gene-cluster specific k-mers for their association with the phenotype(s), and produce output plots.
wg: will train two machine learning models (lasso and ridge elastic nets) based on the presence/absence patterns of all unitigs.
wg_metrics: will calculate prediction performance metrics for the machine learning models trained by the wg rule, providing a quantitative assessment of model accuracy.
qq_plots: will create a Q-Q plot to check that p-values are not inflated (large ‘shelves’ are symptomatic of poorly controlled confounding population structure).
map_back: will map back the associated genetic variants to the provided reference genomes.
manhattan_plots: will generate a Manhattan plot of the unitigs that map to the chosen reference genome.
annotate_summary: will generate an annotated summary table for all associations, including: the identity of the gene the variants map to, the number of strains, the average association p-value, the gene ID across the selected reference genomes, and automatic annotations provided by eggnog-mapper.
enrichment: will generate a file with the functional enrichment of the associated variants for GO terms, COG categories and KEGG pathways.
enrichment_plots: will generate visualizations from the results of the enrichment rule.
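To make the purpose of the qq_plots rule more concrete, the following is a minimal, generic sketch of how the points of a Q-Q plot can be computed. It is not the pipeline's own code: the function name qq_points and the example p-values are made up for illustration. Points that sit well above the diagonal (observed larger than expected) suggest inflated p-values.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Assumes every p-value is strictly greater than 0.
    """
    observed = sorted(pvalues)
    n = len(observed)
    # Under the null hypothesis, the i-th smallest of n p-values
    # is expected to lie near (i - 0.5) / n.
    return [
        (-math.log10((i - 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed, start=1)
    ]


# Made-up p-values, for illustration only.
points = qq_points([0.9, 0.5, 0.1, 0.3, 0.7])
for expected, observed in points:
    print(f"{expected:.2f}\t{observed:.2f}")
```

In this toy input the p-values are evenly spread, so every point falls on the diagonal; a set dominated by very small p-values would instead rise above it.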
Tip
Note that some of the rules above depend on each other.
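As a rough illustration of what such dependencies imply, the sketch below resolves which rules would need to run, and in what order, for a requested endpoint. The dependency map is hypothetical: it is inferred only from the rule descriptions above (for example, wg_metrics evaluates the models trained by wg) and may not match the pipeline's actual dependency graph. The helper name rules_needed is also an assumption made for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map, inferred only from the rule descriptions;
# the pipeline's real dependency graph may differ.
DEPENDS_ON = {
    "pangenome": {"ggcaller"},
    "annotate_pangenome": {"pangenome"},
    "tree": {"pangenome"},
    "wg": {"unitigs"},
    "wg_metrics": {"wg"},
    "enrichment_plots": {"enrichment"},
}


def rules_needed(target, depends_on=DEPENDS_ON):
    """Return the target and all of its upstream rules in a valid run order."""
    needed = {}
    stack = [target]
    while stack:
        rule = stack.pop()
        if rule not in needed:
            # Rules without an entry are treated as having no dependencies.
            needed[rule] = set(depends_on.get(rule, ()))
            stack.extend(needed[rule])
    return list(TopologicalSorter(needed).static_order())


print(rules_needed("wg_metrics"))  # ['unitigs', 'wg', 'wg_metrics']
```

Under these assumed dependencies, requesting wg_metrics also requires unitigs and wg to run first.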