Inputs
Phenotype file
Save your phenotype data as a tab-separated file at data/data.tsv.
The phenotype file should contain at least 3 columns with headers:
strain: sample names.
fasta: relative or absolute path to the assemblies (SAMPLE.fasta).
phenotype: target phenotype(s).
An optional column named gff can also be provided, giving the absolute or relative path to pre-computed annotations (SAMPLE.gff). When this column is present, the ggCaller gene-calling step is skipped entirely.
There can be more than one target phenotype, and each target's column name is used when populating the output directory. Columns after the first three can contain other target phenotypes and/or covariates. Any further columns are allowed and are simply ignored. Here is an example phenotype file from the test data:
strain fasta gff phenotype covariate1 covariate2
ECOR-01 test/small_fastas/ECOR-01.fasta test/gffs/ECOR-01.gff 0 0.20035297602710966 1
ECOR-02 test/small_fastas/ECOR-02.fasta test/gffs/ECOR-02.gff 1 0.8798471273587852 1
ECOR-03 test/small_fastas/ECOR-03.fasta test/gffs/ECOR-03.gff 0 0.008404161045130532 0
ECOR-04 test/small_fastas/ECOR-04.fasta test/gffs/ECOR-04.gff 0 0.04728873355931962 1
Note
Only the target variables/phenotypes listed in the config/config.yaml file are used for the associations.
See Usage for more information.
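Before launching the pipeline, it can help to confirm that the header of the phenotype file contains the expected columns. The following is a minimal sketch using only the Python standard library. The function name missing_columns and its targets parameter are hypothetical and not part of microGWAS; targets is assumed to mirror whatever target names you list in config/config.yaml.

```python
import csv

# Columns the phenotype file is expected to contain besides the targets.
BASE_COLUMNS = ("strain", "fasta")


def missing_columns(path, targets=("phenotype",)):
    """Return the expected column names absent from a tab-separated file.

    `targets` is an assumption: it should list the target phenotype
    columns you intend to analyse.
    """
    with open(path, newline="") as handle:
        # Only the first row (the header) is needed.
        header = next(csv.reader(handle, delimiter="\t"), [])
    expected = set(BASE_COLUMNS) | set(targets)
    return sorted(expected - set(header))
```

Calling missing_columns("data/data.tsv") returns an empty list when every expected header is present, and otherwise the names that still need to be added.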
Sample’s genome sequences
By default, the microGWAS pipeline expects assemblies with the .fasta extension.
Make sure that each sample's assembly file follows this naming convention before running the analysis.
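One way to check the naming convention up front is to scan the fasta column of the phenotype file for paths that do not end in .fasta. This is only a sketch: the helper name non_fasta_paths is hypothetical, and it checks the extension alone, not whether the files exist or whether the file name matches the strain name.

```python
import csv
from pathlib import Path


def non_fasta_paths(phenotype_path):
    """Return (strain, path) pairs whose assembly path lacks a .fasta suffix."""
    offenders = []
    with open(phenotype_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Path.suffix is the final extension only, so "x.fasta.gz"
            # is reported as well.
            if Path(row["fasta"]).suffix != ".fasta":
                offenders.append((row["strain"], row["fasta"]))
    return offenders
```

An empty result from non_fasta_paths("data/data.tsv") means every listed assembly already uses the .fasta extension.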
Note
When the gff column is absent from the phenotype file, the pipeline uses ggCaller to generate GFF annotations automatically, so you no longer need to provide GFF files for your samples.
However, ggCaller can take a long time on large datasets (more than roughly 2,000 genomes).
If you are dealing with a large number of samples, providing pre-computed GFF files via the optional gff column is highly recommended to speed up the analysis.
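If you already have annotations on disk, the gff column can be added to an existing phenotype file with a short script. The sketch below assumes that every GFF file sits in a single directory and is named after its strain (for example test/gffs/ECOR-01.gff, as in the test data); the function name add_gff_column and its arguments are hypothetical. It does not check that the GFF files actually exist.

```python
import csv
from pathlib import Path


def add_gff_column(src, dst, gff_dir):
    """Copy the TSV at `src` to `dst`, setting gff to <gff_dir>/<strain>.gff."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        fields = list(reader.fieldnames or [])
        if "gff" not in fields:
            fields.append("gff")
        writer = csv.DictWriter(
            fout, fieldnames=fields, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            # Assumption: one GFF per strain, named <strain>.gff.
            row["gff"] = (Path(gff_dir) / f"{row['strain']}.gff").as_posix()
            writer.writerow(row)
```

Writing to a separate destination file keeps the original phenotype file untouched, so the result can be inspected before it replaces data/data.tsv.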