Macroevolutionary diversity of traits and genomes in the model yeast genus Saccharomyces

Assemblies, raw data and scripts from Peris et al 2023.

Authors: David Peris, Emily J. Ubbelohde, Meihua Christina Kuang, Jacek Kominek, Quinn K. Langdon, Marie Adams, Justin A. Koshalek, Amanda Beth Hulfachor, Dana A. Opulente, David J. Hall, Katie Hyma, Justin C. Fay, Jean-Baptiste Leducq, Guillaume Charron, Christian R. Landry, Diego Libkind, Carla Gonçalves, Paula Gonçalves, José Paulo Sampaio, Qi-Ming Wang, Feng-Yan Bai, Russel L. Wrobel, Chris Todd Hittinger

Journal: Nature Communications 16(1): 690.

Year: 2023

Abstract: Species is the fundamental unit to quantify biodiversity. In recent years, the model yeast Saccharomyces cerevisiae has seen an increased number of studies related to its geographical distribution, population structure, and phenotypic diversity. However, seven additional species from the same genus have been less thoroughly studied, which has limited our understanding of the macroevolutionary leading to the diversification of this genus over the last 20 million years. Here, we report the geographies, hosts, substrates, and phylogenetic relationships for approximately 1,800 Saccharomyces strains, covering the complete genus with unprecedented breadth and depth. We generated and analyzed complete genome sequences of 163 strains and phenotyped 128 phylogenetically diverse strains. This dataset provides insights about genetic and phenotypic diversity within and between species and populations, quantifies reticulation and incomplete lineage sorting, and demonstrates how gene flow and selection have affected traits, such as galactose metabolism. These findings elevate the genus Saccharomyces as a model to understand biodiversity and evolution in microbial eukaryotes.

Phylogenetic Tree

Saccharomyces strains

Request information for Saccharomyces strains used in this study can be found in Supplementary Data 1.

Raw Reads

The Illumina sequencing reads for this project has been deposed in the Short Reads Archive (SRA) under bioproject PRJNA475869

Saccharomyces genomes

New nuclear, mitochondrial and genome assemblies for different Saccharomyces lineages assembled by using Paired-End and Mate Pair Illumina libraries. FIles were deposited to the European Nucleotide Archive under bioproject PRJEB48264.

BLAST server: Sac2.0 BLAST server

Strain	Species	Pop	ENA Accession no
yHAB335	S.cerevisiae	CHN IV	GCA_918239655
yHDPN23	S.paradoxus	Am-B	GCA_918221835
yHDPN24	S.paradoxus	Am-C	GCA_918281175
yHDPN32	S.paradoxus	EU/Am-A	GCA_918265645
IFO1815	S.mikatae	Asia-A	GCA_918250775
yHAB336	S.mikatae	Asia-B	GCA_918272125
IFO1802	S.kudriavzevii	Asia-A	GCA_918252685
IFO1803	S.kudriavzevii	Asia-B	GCA_918257225
yHKS300	S.kudriavzevii	EU	GCA_918282095
ZP591	S.kudriavzevii	EU	GCA_918268175
ZP960	S.arboricola	Oceania	GCA_918268255
CDFM21L.1	S.eubayanus	HOL	GCA_918271895
yHAB94	S.eubayanus	PA-A	GCA_918265725
yHCT69	S.eubayanus	PA-B	GCA_918298305
yHCT99	S.eubayanus	PA-A	GCA_918246195
yHRVM108	S.eubayanus	HOL	GCA_918265445
CBS7001	S.uvarum	HOL-EU	GCA_918261815
yHAB60	S.uvarum	HOL/SA-A	GCA_918251765
yHAB447	S.uvarum	HOL/SA-A	GCA_918298155
yHAB482	S.uvarum	HOL/SA-A	GCA_918264245
yHAB521	S.uvarum	SA-B	GCA_918280685
yHCT77	S.uvarum	HOL-NA	GCA_918221285
ZP964	S.uvarum	Australasia	GCA_918269355

Corrections or genome assemblies used in the recent work. Files were compressed using tar, to uncompress them use tar -xvzf FILENAME

Strain	Species	Pop	Nuclear	mtDNA	2 micron plasmid
NCYC3947	S.jurei	EU	Naseebetal2018	Naseebetal2018(Corrected)	Incomplete
CBS10644	S.arboricola	Asia-A	Liti et al 2013	Suloetal2017	In the nuclear genome

Other genome assemblies from recent papers stored in Github:

Yue et al 2017: CBS432, DBVPG6044, DBVPG6765, N44, SK1, UFRJ50816, UWOPS03-461.4, UWOPS91-917.1, Y12, YPS128, YPS138

The rest of genome assemblies for Saccharomyces strains, using just Paired-End Illumina reads, are in the FigShare repository.

COX2 and COX3 sequences were deposited in GenBank under accession nos. MH813536-MH813939.

GAL genes that were Sanger sequenced were deposited in GenBank under accession nos. OL660614-OL660618.

Source data

Raw data, alignments and results were deposited in Figshare. The repository contains:

Alignments of 2um plasmid genes REP1 and REP2 used in Supplementary Figure 17.
ASTRAL run and outputs to generate Figure 4a.
MrBayes and BUCKy raw data and results to generate Figure 4a, Supplementary Figure 13.
Single copy ortholog gene and protein alignments annotated using BUSCO for animals and Saccharomyces and used to generate the Figure 3c.
GAL/MEL pathway genes and their encoded protein alignments, together with the IQTree log and Maximum Likelihood tree files. This dataset was used to generate Figure 6a, Supplementary Figure 27,28.
Alignemtns of annotated genes and their proteins with Yeast Genome Annotation Pipeline (YGAP) and IQtree log and Maximum Likelihood tree files. This dataset was used to generate Figure 4, Supplementary Figure 11, 12, 14.
Individual mitochondrial gene alignments used to generate Figure 2, Supplementary Figure 4,. COX2 and COX3 Sanger sequenced alignments to generate Figure 2, Supplementary Figure 3.
iWGS genome assemblies for those strains with just Paired-end Illumina reads.
Variant Call Format files necessary for population genomic analyses resulting in Figure 3a,b, Supplementary Figure 9, 10, 15.
Optican density 600nm values, 96-well plate pictures and additional information for each media condition run necessary to generate Figure 5, 6, Supplementary Figure 18-27, 29.
The rest of genome assemblies for Saccharomyces strains.

Used command lines

For in-house scripts, please check instructions by calling reading the script with a text editor (my first scripts) or by running python script.py --help for the new scripts. If it is not specified, python scripts were run with python2 version 6+ []: replace with the corresponding option or indication [OPTION]: when help is checked you can decide what to use here.

Illumina reads download and trimming

[INPUTFile]: input file
[INPUTF]: path to the input folder containing illumina reads
[OUTPUTF]: output folder to store results
[READ]: read name
[FQEXTENSION]: extension name of the illumina read

python download_SRA_serially.py [INPUTFile] [OUTPUTF] YES
python SRA_rename_v1.4.py [INPUTF] [FQEXTENSION] [INPUTF] [OUTPUTF]

Paired-ended illumina reads
fastqc -o [OUTPUTF] -f fastq [READ]_1.fq
java -jar trimmomatic-0.33.jar PE -threads 4 -phred33 -trimlog analytics.txt [READ]_1.fq [READ]_2.fq [READ]_1.tm.fq [READ]_1.tm-se.fq [READ]_2.tm.fq [READ]_2.tm-se.fq ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 TRAILING:3 MINLEN:25
fastqc -o [OUTPUTF] -f fastq [READ]_1.tm.fq

Mate pair illumina reads
nextclip --input_one [READ]_1.fq --input_two [READ]_2.fq --output_prefix trimmed --log log.txt --trim_ends 5

Quick assesment of phylogenetic position with AAF (Alignment and Assembly Free)

[INPUTF]: path to the input folder containing a strain folder with the PE illumina reads for that strain
[KMER]: Number of kmers to be used
[RUNNAME]: output file name

python2.7 AAF_pipeline-kmer.py -i [INPUTF] -o [RUNNAME] -k [KMER]
Convert the generated infile file to MEGA format and open it in MEGA to generate a Neighbor-Joining phylogenetic tree.

Assemblies and quality check with iWGS v1.0

[NAME].ctl: describe the information about PE and MP illumina reads, assemblers to use, checkpoints to perform and quality assessment of the genome assemblies (example included in the scripts folder STRAINNAME.ctl)

XXXXreads2iWGS2Assembly.py
./iWGS -s [NAME].ctl --Real
XXXXqualimap

Ultrascaffolding PE+MP genome assemblies

[GENOMEANAME]: genome assembly name
[INPUTF]: input folder
[INPUTFile]: input file

python runMummer_serially.py [INPUTFile]
Generated coords file is useful in the next step

Rscript --vanilla reorderContigsUsingNucmer.R [GENOMEANAME].coords [GENOMEANAME].scaffolds.fa
A file called remap.coords.result.csv is generated with the information about scaffolds mapping to chromosomes and if they were reverse complemented. This is useful information for manual curation in Geneious.

python combining_orderedScaffolds.py [INPUTF]
python generating_FinalAssembly.py -l 10000 -i [INPUTF]
Some strains will need manual curation in Geneious. After Geneious curation, run mummer for confirmation.
python runMummer_serially.py [INPUTFile]

#Correction step using pilon
python mapping-and-pilon_serially.py -i [INPUTFile] -t [OPTION] -q [OPTION] -e [OPTION]

GC content and other information

[INPUTName]: name of the input
[INPUTF]: path to the input folder containing genome assemblies
[INPUTFile]: input file
[OUTPUTF]: output folder to store results
[RUNNAME]: output file name
[REFGENOME]: the reference genome where reads will be mapped

python seq_gc_count.py [INPUTF] [OUTPUTF] --window 25000 --window_print --min_contig 20000
python seq_gc_plot.py [OUTPUTF] --window 25000 --out test.pdf --out_stats [RUNNAME].txt --min_contig 20000 --plot_len
python infoseqout_v1.1.py -o [OUTPUTF] -f [INPUTFile]
python quast.py --fast -t 8 -o [RUNNAME] --eukaryote -R [REFGENOME] --gage -l [INPUTName] [INPUTFile]

mtDNA asembly pipeline

[INPUTF]: path to the input folder containing SPAdes genome assemblies
[INPUTFile]: input file
[OUTPUTF]: output folder to store results
[INPUTf]: list with the sequence IDs (i.e. strain names) to be retained

python infoseqout_v1.1.py -o [OUTPUTF] -f [INPUTFile]
python filter_fasta_v3.py -i [INPUTf] -o [OUTPUTF] -p [OPTION] -f [INPUTF]
#In the table look for a scaffold with a size between 40-90 Kbp 
#Annotate with [MFannot](https://megasun.bch.umontreal.ca/cgi-bin/mfannot/mfannotInterface.pl "MFannot")
#Convert sequin output to genbank with Sequin v16.0. Open genbank file in Geneious
#Sort the fasta sequence to start with trnS(tga), export fasta and reannotate in MFannot. Repeat previous step.
#Extract gene sequences and make multisequence alignments

Large structural variation (nuclear genome and mitochondrial genome)

[INPUTFile]: input file
python runMummer_serially.py [INPUTFile]

Extract mtDNA and 2um plasmid genes with HybPiper

[FASTA]: input fasta file with target gene sequences, follow HybPiper target fasta format
[INPUTFile]: strains and read paths to explore
[PATH2HybPiper]: path to HybPiper folder with HybPiper scripts
[OUTPUTF]: output folder to store results
[iFASTAln] :input fasta alignment
[OUTPUTFile]: output file
[BLASTF]: folder with stored blast databases
[BLASTl]: list with name databases to use
[ASSEMBLY]: assembly or text file with a list of assemblies

#Blast
mkblastdb_v5.1.py -a [ASSEMBLY] -o [BLASTF]
run_blast_db_v3.0.py -i [FASTA] -o [OUTPUTF] -b [BLASTl] -d [BLASTF]

#or HybPiper
python HybPiper_serially_v2.py -t [FASTA] -i [INPUTFile] -y [PATH2HybPiper] -o [OUTPUTF] -s [OPTION] -p [OPTION] -d [OPTION] --threads [OPTION]
trimal -in [iFASTAln] -out [OUTPUTFile] -htmlout summary_stats -mega -gt 0.7
trimal -in [iFASTAln] -out [OUTPUTFile] -htmlout summary_stats -fasta -gt 0.7
#or blastp in Geneious R6

Single Copy Orthologous Gene with BUSCO

[INPUTF]: folder with assemblies
[OUTPUTF]: output folder to store results
[iFASTAln] :input fasta alignment
[OUTPUTFile]: output file

python runBUSCO_serially.py -i [INPUTF] -t [OPTION] -o [OUTPUTF]
python busco_get_single_genes_nt.py [OUTPUTF] --out single_genes_alignment_all
python rename_names_fasta-busco.py [OUTPUTF] -o [OUTPUTF2]
MAFFT_serially.py -i [OUTPUTF2] -o [OUTPUTF3] -t [OPTION]
trimal -in [iFASTAln] -out [OUTPUTFile] -fasta -gt 0.9

#in folder where mafft alignments are stored, run FASconCAT.pl to concatenate
perl FASconCAT.pl -s 

Single Copy Orthologous Gene with BUSCO for Multicell Eukaryotes vs Yeast

[INPUTFile]: list of assemblies to use
[INPUTf]: input file
[INPUTF]: folder with assemblies and gff
[INPUTFfasta]: folder with the rename files or gene/protein sequences
[OUTPUTF]: output folder to store results
[FASTAf]: list of fasta to parse

python prepareBUSCO_v2.py -i [INPUTFile] -t [OPTION] -l BUSCO_Tirant-layout.sh -s eukaryota_odb10 -d [OPTION]]
genome2genes_combineGenes.py -i [INPUTf] -f [INPUTFfasta] -o [OUTPUTF]
MAFFT_pal2nal_serially_v1ScarcityCONDOR.py -s [OUTPUTF] -g [GENENAME] #It was run in a CONDOR script to submit multiple alignments at the same time

#Get some statistics
python getBUSCO_statsTirant.py -f [OUTPUTF] -i [INPUTf] -s eukaryota_odb10
python checkLength_OriginalvsTrimmed.py -i genes_trimmed/ -o genes_aligned/ -a [OPTION] -t [OPTION]

#Continue BUSCO pipeline
python pull_buscoTirant.py -d eukaryota_odb10 -l [FASTAf] -o [OUTPUTF] -s [OPTION]

Genome annotation with YGAP

[INPUTf]: input file
[INPUTF]: folder with assemblies and gff
[OUTPUTF]: output folder to store results
[INPUTFfasta]: folder with the rename files for the strains

YGAP_down_serially.py -i [INPUTf] -o [OUTPUTF]
#Convert Genbank file to gff in Geneious
#Paralogue gene annotation > final gff file

#Extract individual genes
python genome2genes_v1.2.py -i [INPUTF] -a [OPTION] -c [OPTION] #includes the step to use process_gff_cds_proteins.pl
genome2genes_combineGenes.py -i [INPUTf] -f [INPUTFfasta] -o [OUTPUTF]
MAFFT_pal2nal_serially_v1ScarcityCONDOR.py -s [OUTPUTF] -g [GENENAME] #It was run in a CONDOR script to submit multiple alignments at the same time

#Percentage amino acid identity in R as indicated in the manuscript.

Phylogenetics

[INPUTf]: list with the sequence IDs (i.e. strain names) to be retained
[INPUTF]: folder with gene/protein trimmed sequence alignments
[OUTPUTF]: output folder to store results
[INPUTN]: input name
[OUTPUTN]: output name

python filter_fasta_v3.py -i [INPUTf] -o [OUTPUTF] -p [OPTION] -f [INPUTF]

#CONDOR script to parallelize multiple IQTree runs that contains the next code line
iqtree -s $(iGENE)_nt_tr_fil.fas -bb 1000 -wbt -nt AUTO -seed 225494 -st DNA -m TEST

#ASTRAL pipeline
cat results/*.treefile > YGAP_ML_best.trees
java -jar astral.5.7.7.jar -i YGAP_ML_best.trees -o Saccharomyces_species_pp.tre 2> [RUNNAME].log
#Add the CFs
iqtree -t Saccharomyces_species_pp.tre --gcf YGAP_ML_best.trees --prefix gene_concordance

#RAxML with the SNP dataset generated by mapping reads to reference (see below)
raxmlHPC-PTHREADS-SSE3 -T 4 --no-bfgs -f d -m ASC_GTRGAMMA --asc-corr=stamatakis -p 54877 -s [INPUTN]_AllGenomes_SNPs.fasta -#100 -n [OUTPUTN] -w [OUTPUTF]/best_tree/ -q part
raxmlHPC-PTHREADS-SSE3 -T 4 --no-bfgs -m ASC_GTRGAMMA --asc-corr=stamatakis -#1000 -b 12345 -s [INPUTN]_AllGenomes_SNPs.fasta -n [OUTPUTN] -t [OUTPUTF]/best_tree/RAxML_bestTree.SNP -w [OUTPUTF]/boots/ -q part
raxmlHPC-PTHREADS-SSE3 -T 4 --no-bfgs -p 12345 -m ASC_GTRCAT --asc-corr=stamatakis -f b -t [OUTPUTF]/best_tree/RAxML_bestTree.SNP -n [OUTPUTN] -z [OUTPUTF]/boots/RAxML_bootstrap.SNP -w [OUTPUTF]/bipartitions/ -q part

#Phylonetwork, just open your fasta file in SplitsTree
#Supernetworks/Galled Trees/ SuperTrees, open your collection of phylogenetic trees in SplitsTree or Dendoscrope

Phylogenetic network of COX2 and COX3 sequences

[iFASTAln] :input fasta alignment
[oFASTAln] :output fasta alignment
[Hapf]: haplotype file generated by DnaSP
[OUTPUTf]: output file
[INPUTf]: input file
[InfoStrain]: table with information about strains and traits for PopART

perl fastaSortByName.pl [iFASTAln] > [oFASTAln]
#Assign haplotype numbers in DnaSPv5
python NexusHap2Table.py -i [Hapf] -o [OUTPUTf]
Rscript Hap2Freq2Nex.R -i [INPUTf] -s [InfoStrain] -c [OPTION] -o [OUTPUTf]
#Add the new generate information to a nexus file for preparing the file for PopART
#Run PopART TCS method

BUCKy analysis

[INPUTF]: folder with sequence alignments
[INPUTb]: folder with *.in files
[OUTPUTF]: output folder to store results
[OUTGROUP]: strain name of the outgroup to root the tree
[#GENES2PICK]: number of genes to pick

python genes2bucky_v1.0.py -i [INPUTF] -o [OUTPUTF] -l [OPTION] -m MrBayes_param.txt -g [OUTGROUP] -r [OPTION] -t [OPTION]  -n [OPTION] > [LOGFILE].txt
bucky -a 1 -k 3 -n 100000 -c 2 -o [OUTPUTF] --calculate-pairs --create-joint-file --create-single-file [INPUTb]/*.in

#You want to randomize which gene you pick for bucky analyses or computational expensive programs
python ~/software/scripts/bucky_RandomPick_v1.0.py -s [#GENES2PICK] -a 1 -i [INPUTF] -o [OUTPUTF]

Mapping and genotyping

[INPUTf]: input file
[INPUTF]: input folder
[REFGENOME]: the reference genome where reads will be mapped
[OUTPUTCov]: mapping4SNP_v2.0_multipleMaps output file with coverage
chromosome.txt: a list of the chromosome names (example included in scripts folder)
[OUTPUTf]: output file
[OUTPUTF]: output folder
pairwiseDivFile_chromosome.txt: tab tabulated file with information of Chromosome name, size and position of start in a concatenated format (example included in scripts folder)

python mapping4SNP_v2.0_multipleMaps.py -i [INPUTf] -t [OPTION] -q [OPTION] -r [REFGENOME] -m [OPTION] -W [OPTION] -p [OPTION] #It runs multiple scripts and programs (bwa, samtools, picard, GTAK, genomeCoverageBed)
python filter_fasta_v3.py -i [INPUTf] -o [OUTPUTF] -p [OPTION] -f [INPUTF] #In case you want to remove strains run in mapping4SNP_v2.0_multipleMaps.py

#FASTA alignments split by chromosomes were processed as follows:
python splitFASTAintoWindows.py -m chromosome.txt -W [OPTION] -i [INPUTF] -o [OUTPUTF] -t [OPTION] -T [OPTION]

Population genomic analyses

[INPUTf]: input file
[OUTPUTf]: output file
[INPUTF]: input folder
[OUTGROUP]: strain name of the outgroup to root the tree
[OUTPUTN]: output name
[OUTPUTF]: output folder
[REFGENOME]: the reference genome where reads will be mapped
[InfoStrain]: table with information about strains and population designation
[QAF]: folder where qaf.tab is stored
pairwiseDivFile_chromosome.txt: tab tabulated file with information of Chromosome name, size and position of start in a concatenated format (example included in scripts folder)
[VCF]: merged vcf generated by ConvertOutputs_Mapping2GenomPopAnalysisTools.py
[INPUTN]: input name

#Combined fasta (from mapping4SNP_v2.0_multipleMaps.py) was trimmed with trimal -gt 0.9 and opened in MEGA v7 to get diversity stats and generate a fasta just for variant sites.
#Variant sites were opened in DnaSP v5 to get polymorphism statistics

#Diversity stats using PopGenome
python check_alignments4PopGenome.py -i [INPUTF] #Input folder contains the fasta splited by windows
trimal -in [INPUTf] -out [OUTPUTf] -gt 0.3
Rscript pairwiseDiversity_validSitesComp_v2.1.R [INPUTF] pairwiseDivFile_chromosome.txt [InfoStrain] [OUTPUTN] #Input folder contains the fasta splited by windows

#Convert results from mapping4SNP_v2.0_multipleMaps.py to other formats
#STRUCTURE
python ConvertOutputs_Mapping2GenomPopAnalysisTools.py -i [INPUTf] -t [OPTION] -r [REFGENOME] -o [OUTPUTN] -s [OUTGROUP] -f [INPUTF] -W [OPTION] -R [OPTION]
python structure_serially.py -k [OPTION] -r [OPTION] -o [OUTPUTF] -m [OPTION] -e [OPTION]

#FineSTRUCTURE
python FASTAtoFineStr_v2.0.py -i [INPUTf] -o [OUTPUTN] -s [InfoStrain] -c pairwiseDivFile_chromosome.txt -r Scer_recombinationRate.csv -e [OPTION]
fs [INPUTN]_AllGenomes_fs.cp -idfile [INPUTN]_AllGenomes_fineStr.idfile -phasefiles [INPUTN]_AllGenomes_fineStr.phase -recombfiles [INPUTN]_AllGenomes_fineStr.recomb -ploidy 1 -go 
fs -X -Y -e X2 [INPUTN]_AllGenomes_fs_linked_hap.chunkcounts.out [INPUTN]_AllGenomes_fs_linked_hap_tree_run0.xml [INPUTN]_AllGenomes_fs_linked_hap_mapstate.csv
fs -X -Y -e meancoincidence [INPUTN]_AllGenomes_fs_linked_hap.chunkcounts.out [INPUTN]_AllGenomes_fs_linked_hap_mcmc_run0.xml [INPUTN]_AllGenomes_fs_linked_hap_meancoincidence.csv
RScript fineStructure_plots_v2.0.R -p [OUTPUTN] -s [OPTION] -I [InfoStrain] -i [INPUTF] -N [OPTION] 

#Convert to PCAdmix
python FASTAtoPCAdmix_v2.0.py -i [INPUTf] -p [OUTPUTN] -c pairwiseDivFile_chromosome.txt -s [InfoStrain] -r Scer_recombinationRate.csv

#Convert to PED format 
python FASTAtoPED_v2.py -i [INPUTf] -o [OUTPUTN] -p [OPTION] -r Scer_recombinationRate.csv -s [InfoStrain] -c pairwiseDivFile_chromosome.txt

#Heterozygosity plots
Rscript snp_HTZplot_v1.r -i [QAF] -o [OUTPUTN]
python heterozigosity_distribution.py -i [INPUTF] -p pairwiseDivFile_chromosome.txt -f [OPTION]
Rscript snp_distrib_v2.r [OUTPUTN].qaf.tab pairwiseDivFile_chromosome.txt

#Convert to PLINK, PCAdmix, ADMIXTOOLS, Genesis in one script
python PopAnalysisFormatGenerator_v1.py -i [VCF] -o [OUTPUTN] -s [InfoStrain] -m [OPTION] -r [OPTION] -Q [OPTION] -c Scer_recombinationRate.csv