Summary
Resource Type
Genome Assembly
Organism
Name
Whole Genome Assembly (v2.0) and Annotation (v2.2) of Setaria italica - cv. Yugu1 (JGI)
Program, Pipeline, Workflow or Method Name
Assembly & annotation, performed by JGI
Program Version
v 2.2
Algorithm
Date Performed
Wednesday, January 30, 2019
Data Source
Source Name: JGI-Phytozome Setaria italica assembly/annotation
Source Version: v2.2 [312]

Assembly : https://www.ncbi.nlm.nih.gov/assembly/GCF_000263155.2


About Setaria italica

Setaria italica (foxtail millet) is a grain crop widely grown in Asia, with particular significance in the semi-arid regions of northern China. It is also grown on a moderate scale in other parts of the world as a forage crop. It is one of the oldest domesticated crops, with archaeological remains in northern China dated to about 5,900 to 5,500 BC. Motivation for sequencing foxtail millet includes its close relationship, both genetic and physiological, to the biofuel crop switchgrass (Panicum virgatum). Direct study of switchgrass is complicated by its large genome size and polyploidy. Data from the foxtail millet genome assist in the study and improvement of switchgrass and related biofuel crops. The nuclear genome (~490 Mb) is diploid with nine chromosomes (2n=18).

Assembly

Setaria italica cv. Yugu1 was sequenced and assembled by the Joint Genome Institute (JGI) in collaboration with community researchers. Sanger sequencing was performed on whole-genome shotgun clone libraries having different insert sizes. Reads totalling 8.29x coverage were assembled with Arachne giving scaffolds that were arranged predominantly into nine pseudomolecules.

Annotation

Protein-coding genes were predicted using the standard JGI plant gene annotation pipeline. The pipeline used ESTs and homologous peptides from Arabidopsis, Brachypodium, rice and sorghum, aligned to the genome using PASA (BLAT alignments) and EXONERATE, together with the gene predictors GenomeScan, FGENESH+ and FGENESH_EST. BACs and fosmids were annotated using AUGUSTUS with maize parameters to predict gene models, which were compared to GenBank, TAIR and IRGSP/RAP proteins and manually inspected. Gene models were validated with RNA-seq data.

References
  1. Reference genome sequence of the model plant Setaria.
    Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, Estep M, Feng L, Vaughn JN, Grimwood J et al. 2012. Nat. Biotechnol. 30:555-561.
  2. Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential.
    Zhang G, Liu X, Quan Z, Cheng S, Xu X, Pan S, Xie M, Zeng P, Yue Z, Wang W et al. 2012. Nat. Biotechnol. 30:549-554.

 

Source : Gramene


 

NCBI GenBank Records
Release Date: 10/30/2015
BioProject: PRJNA32913
Accession ID: AGNK01000000
Overview

Foxtail millet (Setaria italica) is a diploid grass with a relatively small genome (~515 Mb). It is an important grain crop in temperate, subtropical, and tropical Asia and in parts of southern Europe, and is grown for forage in North America, South America, Australia, and North Africa. The genetic map of foxtail millet is highly colinear with that of rice, despite the fact that these lineages last shared a common ancestor more than 50 million years ago. Hence, comparison of the rice and foxtail millet genomes will facilitate reconstruction of the ancestral grass genome. Most important, foxtail millet is a close relative of an important biofuel crop, switchgrass (Panicum virgatum). It is also closely related to pearl millet (Pennisetum glaucum), which is under investigation as a biofuel grain feedstock in regions unsuitable for maize cultivation, and napiergrass (Pennisetum purpureum), a grass with biofuel potential in hot/humid regions such as the southeastern United States. Switchgrass is a polyploid species with a large genome that will not be an easy target for full genome sequence analysis. However, switchgrass and foxtail millet are both temperate, C4 grass species (C3 and C4 represent different metabolic approaches to CO2 metabolism in plants), so foxtail millet should share many genetic and physiological processes with switchgrass. Hence, foxtail millet should serve as an excellent surrogate genome to assist future study and improvement of switchgrass and related biofuel crops.

(from The JGI Genome Portal).

Statistics

This is the chromosome-scale release of the 8.3x whole genome shotgun assembly of Setaria italica. The first 9 scaffolds are pseudomolecules on which over 98.9% of the sequence data could be placed. The mapping data for Setaria italica chromosomes were generated by Katrien Devos and Xuewen Wang. The telomeric signature for foxtail millet is "AAACCCT". The following chromosome arms have telomeric signatures: 1Q, 2Q, 3P, 4P, 4Q, 5P, 6P, 7Q, 8Q, 9P.
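As an illustration of what looking for the telomeric signature can involve, the sketch below counts copies of the "AAACCCT" motif and its reverse complement in a sequence taken from a chromosome end. It is not the procedure used for this release: the function names and the `min_copies` threshold are hypothetical, chosen only for the example.

```python
# Minimal sketch: count telomeric repeat copies at a chromosome end.
# TELOMERE comes from the text above; everything else is an assumption.
TELOMERE = "AAACCCT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_motif(seq, motif=TELOMERE):
    """Count non-overlapping occurrences of the motif on either strand."""
    seq = seq.upper()
    return seq.count(motif) + seq.count(reverse_complement(motif))


def has_telomeric_signature(end_seq, min_copies=3):
    """Hypothetical rule: call a signature when enough copies are present."""
    return count_motif(end_seq) >= min_copies
```

A caller would pass in the first or last few hundred bases of a pseudomolecule; how much sequence to inspect and how many copies to require are choices the text does not specify.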

Genome

The main genome assembly is approximately 405.7 Mb arranged in 336 scaffolds

Approximately 400.9 Mb are arranged in 6791 contigs (~ 1.2% gap)

Scaffold N50 (L50) = 4 (47.3 Mb)

Contig N50 (L50) = 982 (126.3 Kb)

98.9% of the sequence data is represented in the 9 pseudomolecules
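
The summary figures above can be derived from a list of sequence lengths. The sketch below is illustrative only: the function names are hypothetical, and it follows the labelling used on this page, where the first number is the count of sequences needed to reach half the total length and the value in parentheses is the length of the last sequence counted.

```python
# Minimal sketch of the N50/L50-style statistics and the gap percentage.
# Function names are assumptions made for this example.

def half_coverage_stats(lengths):
    """Return (count, length) for the point where the longest sequences
    first cover at least half of the total assembled length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("no sequence lengths supplied")


def gap_percent(scaffold_total, contig_total):
    """Percentage of the scaffold length that is not covered by contigs."""
    return 100 * (scaffold_total - contig_total) / scaffold_total


# Using the totals reported above (in Mb): 405.7 in scaffolds, 400.9 in contigs.
print(round(gap_percent(405.7, 400.9), 1))  # 1.2
```

With the real scaffold lengths as input, `half_coverage_stats` would be expected to reproduce the count of 4 and the 47.3 Mb length reported above, though that has not been checked here.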

Loci

34,584 loci containing protein-coding transcripts

Transcripts

43,001 protein-coding transcripts

Sequencing, Assembly, and Annotation

The current annotation is version 2.2. Transcript assemblies were constructed using PASA from ~1.28 million Setaria italica EST reads sequenced at JGI against the 8.3X version 2.0 release of the Setaria italica genome. Loci were determined by BLAT alignments of the above transcript assemblies and/or BLASTX alignments of proteins from the sorghum, rice, Arabidopsis thaliana, and grapevine genomes to the S. italica genome, following genome soft-masking of consensus repeat families provided by Hao Wang and Jeff Bennetzen. Gene models were predicted by the homology-based predictors FGENESH+ and GenomeScan. The best prediction at each locus was selected based on protein coverage and homology, as well as intron/exon junctions and EST coverage. These transcripts were UTR-extended and/or improved by PASA to match the EST evidence. The final gene set selection is based on EST support or protein homology support, subject to filtering of repeats/transposable elements.

 

Gene Prediction and Locus Naming

Various Illumina RNA-seq reads (both paired-end and single-end) were used to construct transcript assemblies using PERTRAN (Shu et al., manuscript in preparation): 1B read pairs from geneAtlas, 391M other paired-end reads, and some 454 reads. 77,451 transcript assemblies were constructed using PASA (Haas, 2003) from the above sequences. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, rice, sorghum, Brachypodium distachyon, grape, soybean and Swiss-Prot eukaryote proteins to the Setaria italica genome, soft-masked for repeats with RepeatMasker (Smit, 1996-2012), with up to 2 kb of extension on both ends unless the extension would run into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).
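
The locus extension rule described above (up to 2 kb on each end, stopping short of another locus on the same strand) can be sketched as follows. This is an assumption-laden illustration, not the pipeline's code: the tuple layout, the function name, the 1-based inclusive coordinates, and the choice to clip at the neighbouring locus's original boundary are all hypothetical.

```python
# Minimal sketch: extend loci by up to max_ext bases on each side without
# running into the neighbouring locus on the same strand.
# Loci are (start, end, strand) tuples, 1-based inclusive, and are assumed
# not to overlap one another within a strand.

def extend_loci(loci, chrom_length, max_ext=2000):
    extended = []
    for strand in ("+", "-"):
        same = sorted(locus for locus in loci if locus[2] == strand)
        for i, (start, end, _) in enumerate(same):
            # Leftmost allowed position: just past the previous locus.
            left_limit = same[i - 1][1] + 1 if i > 0 else 1
            # Rightmost allowed position: just before the next locus.
            right_limit = same[i + 1][0] - 1 if i + 1 < len(same) else chrom_length
            new_start = max(start - max_ext, left_limit)
            new_end = min(end + max_ext, right_limit)
            extended.append((new_start, new_end, strand))
    return sorted(extended)
```

Loci on opposite strands do not constrain each other in this sketch, which matches the "same strand" wording; how two facing extensions on the same strand share the space between them is not stated in the text.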

The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. Proteins of the PASA-improved gene models were subjected to homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore and protein coverage were both at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and homology coverage at least 70% to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. After Setaria viridis underwent a similar gene annotation effort, and before that annotation was finalized, a high-confidence set of S. viridis gene model peptides was used as an additional homology seed in a second round of gene calling.
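
The selection thresholds in the paragraph above can be written as a small filter. The sketch below is only one reading of the text: the `Transcript` class and its field names are hypothetical, all quantities are expressed as fractions between 0 and 1, and a repeat overlap of exactly 20% (which the text leaves unspecified) is assumed to fall under the stricter rule.

```python
# Minimal sketch of the transcript selection thresholds described above.
from dataclasses import dataclass


@dataclass
class Transcript:
    cscore: float              # BLASTP score / MBH BLASTP score
    protein_coverage: float    # fraction of protein aligned to best homolog
    has_est_support: bool      # any EST coverage
    cds_repeat_overlap: float  # fraction of CDS overlapping repeats
    pfam_te_fraction: float    # fraction of protein in Pfam TE domains


def cscore(blastp_score, mbh_blastp_score):
    """Ratio of a protein's BLASTP score to the mutual-best-hit score."""
    return blastp_score / mbh_blastp_score


def is_selected(t):
    # Models dominated by transposable-element domains are removed.
    if t.pfam_te_fraction > 0.30:
        return False
    if t.cds_repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST support suffices.
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_support
    # High repeat overlap (assumed to include exactly 20%): stricter rule.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70
```

In the text the Pfam filter is applied after selection; testing it first here gives the same keep-or-drop outcome for each transcript.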

References:

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654

Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker Open-3.0. 1996-2011.

Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

Salamov, A. A. and Solovyev, V. V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.

Contacts

Principal Collaborators:

  • J.L. Bennetzen (email: Maize AT uga DOT edu)
  • K.M. Devos (email: kdevos AT uga DOT edu)
  • A.N. Doust
  • E.A. Kellogg
  • D. Ware
  • J. Zale

JGI Contact: Daniel Rokhsar (email: dsrokhsar AT gmail DOT com)

Reference Publication(s)

Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, Estep M, Feng L, Vaughn JN, Grimwood J, Jenkins J, Barry K, Lindquist E, Hellsten U, Deshpande S, Wang X, Wu X, Mitros T, Triplett J, Yang X, Ye CY, Mauro-Herrera M, Wang L, Li P, Sharma M, Sharma R, Ronald PC, Panaud O, Kellogg EA, Brutnell TP, Doust AN, Tuskan GA, Rokhsar D, Devos KM. Reference genome sequence of the model plant Setaria. Nature biotechnology. 2012 May 13; 30(6):555-61.

 

Source : Phytozome

Publication
Bennetzen JL, Schmutz J, Wang H, Percifield R, Hawkins J, Pontaroli AC, Estep M, Feng L, Vaughn JN, Grimwood J, Jenkins J, Barry K, Lindquist E, Hellsten U, Deshpande S, Wang X, Wu X, Mitros T, Triplett J, Yang X, Ye CY, Mauro-Herrera M, Wang L, Li P, Sharma M, Sharma R, Ronald PC, Panaud O, Kellogg EA, Brutnell TP, Doust AN, Tuskan GA, Rokhsar D, Devos KM. Reference genome sequence of the model plant Setaria. Nature biotechnology. 2012 May 13; 30(6):555-61.