Summary
Resource Type
Genome Assembly
Organism
Name
Whole Genome Assembly (v3.0) and Annotation (v3.1) of Sorghum bicolor - cv. BTx623 (JGI)
Program, Pipeline, Workflow or Method Name
Assembly & annotation, performed by JGI
Program Version
v 3.1
Algorithm
Date Performed
Monday, December 10, 2018
Data Source
Source Name
: JGI-Phytozome Sorghum bicolor assembly/annotation
Source Version
: 3.1
Source URI
: https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Sbicolor
About Sorghum bicolor

Sorghum bicolor is a widely grown cereal crop, particularly in Africa, ranking 5th in global cereal production. It is also used as a biofuel crop and a potential cellulosic feedstock. The diploid genome (~730 Mb) has a haploid chromosome number of 10. Although highly repetitive, the genome is more tractable for sequencing than that of its close relative, Zea mays.

 

Assembly

The first genome assembly of Sorghum bicolor (L.) Moench was published in 2009. Sequencing by the US Department of Energy Joint Genome Institute (JGI) Community Sequencing Program, in collaboration with the Plant Genome Mapping Laboratory, followed a whole genome shotgun strategy reaching 8x coverage, with scaffolds assigned to the genetic map where possible. Since then, JGI has made two rounds of improvements. The most recent update, release v3.0, includes ~351 Mb of finished sorghum sequence. A total of 349 clones were manually inspected, then finished and validated using a variety of technologies. They were integrated into chromosomes by aligning them to the v1.0 assembly. As a result, 4,426 gaps were closed, and a total of 4.96 Mb of sequence was added to the assembly. Overall contiguity (contig N50) increased by a factor of 5.8x, from 204.5 Kb to 1.2 Mb.

 

Annotation

This browser presents data from the v3.0.1 assembly and v3.1.1 gene set (March 2007). Gene prediction is an improved process based on the resources used in the original v1.0 release (Sbi1 assembly and Sbi1.4 gene set), supplemented with new geneAtlas RNA-seq data.

 

Source : Gramene


 

NCBI GenBank Records
Release Date: 4/7/2017 BioProject: PRJNA38691;PRJNA13876 Accession ID: ABXC03000000

 

Overview

Sorghum bicolor has been selected as a JGI Plant Flagship genome for its use as a biofuel crop and potential cellulosic feedstock. Because it is a Flagship Plant, we are continuing to improve the genomic resources of sorghum, including this release, which represents the first available improved Sorghum bicolor genome over the published v1.0 whole genome shotgun release.

This represents the second improved Sorghum bicolor release, which includes ~351 Mb of finished sorghum sequence. These regions were finished by dividing the gene space into ~1 Mb overlapping pieces. Each region was manually inspected and then finished using a variety of technologies, including Sanger (primer walks on subclones and fosmid templates, transposon sequencing on subclone templates, shotgun sequencing and finishing of complete fosmid and BAC clones that were then integrated back into the genome pieces), 454 (small insert shatter libraries), and Illumina (small insert shatter libraries). Following completion, each assembly was validated by an independent quality assessment. This included a visual examination of subclone paired ends and repeat structures, and validation of any remaining lower quality regions and regions with high quality base pair discrepancies. These improved sequences have an estimated error rate of less than one error in 100,000 base pairs.

Integration of the finished regions began by aligning the regions to the existing v1.0 assembly and assigning each region to a chromosome. The orientation of individual regions was adjusted so that all regions were on the same strand. Overlapping regions were first merged together to produce megaRegions. The ends of the megaRegions were then placed on the genome to identify the start and end of each insertion. Sequence and quality scores were then integrated into the assembly, and statistics on the overlap region were obtained (number of gaps, total Ns, etc.). The integrated assembly was screened for E. coli and adapter sequences, and a total of 10,339 bp was identified as foreign and excised from the assembly.
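
The merging of overlapping regions into megaRegions can be pictured as interval merging on chromosome coordinates. The following is only a minimal sketch under that assumption, not the JGI integration code: the function name `merge_regions`, the half-open coordinate convention, and the chromosome labels and coordinates are all hypothetical.

```python
from collections import defaultdict


def merge_regions(regions):
    """Merge overlapping (chrom, start, end) intervals per chromosome.

    Coordinates are assumed to be half-open [start, end) and already
    expressed on the same strand. Returns a list sorted by chromosome
    and start position.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start < current_end:
                # Overlaps the running interval: extend it.
                current_end = max(current_end, end)
            else:
                # No overlap: close the running interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical ~1 Mb finished pieces; not real sorghum coordinates.
finished = [
    ("Chr01", 0, 1_000_000),
    ("Chr01", 900_000, 1_900_000),
    ("Chr01", 2_500_000, 3_400_000),
    ("Chr02", 0, 800_000),
]
mega_regions = merge_regions(finished)
```

In this toy input the first two Chr01 pieces overlap by 100 kb and collapse into one megaRegion, while the other two pieces are kept as they are.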

A total of 349 clones were integrated, representing 344.4 Mb of sequence. 4,426 gaps were closed, and a total of 4.96 Mb of sequence was added to the assembly. Overall contiguity (contig N50) increased by a factor of 5.8x, from 204.5 Kb to 1.2 Mb.
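
For readers unfamiliar with the contig N50 statistic quoted above: it is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembled length. A minimal sketch follows; the function name `n50` and the contig lengths are illustrative toy values, not taken from the sorghum assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one contig length")


# Toy contig lengths (hypothetical); the total is 300, so half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # -> 70

# Fold change computed from the rounded figures quoted in the text.
# The reported 5.8x presumably derives from unrounded values.
fold_change = 1_200_000 / 204_500
```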

Note: 4 scaffolds (super_10, super_11, super_12, and super_610) were removed from assembly v3.0; the resulting assembly is called v3.0.1.

 

Statistics
Genome Size

732.2 Mb arranged in 2n=20 chromosomes

Loci

34,129 loci containing protein-coding transcripts

Transcripts

47,121 protein-coding transcripts

 

Sequencing, Assembly, and Annotation

The present v3.1 release, comprising the v3.0 assembly and v3.1 gene set, is a modern annotation using resources used in the original v1.0 release (Sbi1 assembly and Sbi1.4 gene set) and geneAtlas RNA-seq data. The main genome is in 10 chromosomes with small unmapped pieces, some of which contain annotated genes.

This release is essentially the same as v3.1.1, except that 82 genes/loci were inactivated in v3.1.1 because 4 scaffolds whose sequence is entirely present in the chromosome(s) were removed.

127,415 RNA-seq transcript assemblies were constructed from ~3.3 billion pairs of stranded paired-end Illumina RNA-seq reads from sorghum using PERTRAN (Shu et al., unpublished pipeline that uses GSNAP as the read aligner). 111,994 transcript assemblies were then constructed using PASA from these RNA-seq transcript assemblies and 209,835 ESTs. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from the Arabidopsis thaliana, rice, maize, or grape genomes to the S. bicolor genome, which was soft-masked for repeats using RepeatMasker.

Gene models were predicted by homology-based predictors, mainly FGENESH+, FGENESH_EST (similar to FGENESH+, but using ESTs rather than protein/translated ORFs as splice site and intron input), and GenomeScan. The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; the improvements include adding UTRs, correcting splicing, and adding alternative transcripts.

PASA-improved gene model proteins were subjected to protein homology analysis against the proteomes mentioned above to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed.
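
The selection thresholds described above can be written out as a small decision function. This is a sketch of the stated rules only, not the JGI pipeline itself: the `Transcript` class, its field names, and the function names are hypothetical, fractions are used in place of percentages, and a repeat overlap of exactly 20% (which the text leaves unspecified) is assumed here to fall under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    cscore: float            # BLASTP score / MBH BLASTP score
    protein_coverage: float  # fraction of protein aligned to best homolog
    has_est_support: bool    # True if the transcript has EST coverage
    repeat_overlap: float    # fraction of CDS overlapping repeats


def cscore(blastp_score, mbh_blastp_score):
    """Ratio of a BLASTP score to the mutual-best-hit BLASTP score.

    Assumes mbh_blastp_score is positive.
    """
    return blastp_score / mbh_blastp_score


def is_selected(t):
    """Apply the Cscore / coverage / repeat-overlap thresholds from the text."""
    if t.repeat_overlap < 0.20:
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_support
    # Assumption: exactly 20% overlap is treated like "more than 20%".
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def passes_te_filter(te_domain_fraction):
    """Models with more than 30% of the protein in Pfam TE domains are removed."""
    return te_domain_fraction <= 0.30


# Hypothetical transcripts illustrating each branch.
well_supported = Transcript(0.8, 0.9, False, 0.05)
est_only = Transcript(0.2, 0.1, True, 0.10)
repeat_rich_weak = Transcript(0.8, 0.9, True, 0.40)
repeat_rich_strong = Transcript(0.95, 0.75, False, 0.40)
```

With these toy values, the first, second, and fourth transcripts are kept and the third is rejected, because EST support alone is not sufficient once the repeat overlap exceeds the 20% threshold.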

v2.1 loci were tentatively mapped to v3.1 loci using BLAT, aligning to the v2.0 assembly both the v2.1 locus sequence (including introns) bounded by its CDS range and the v2.1 locus sequence (including introns) bounded by its range extended by up to 1 kb. For each locus pairing, the proteins were aligned to each other. When the MBH protein is >= 70% identical, the v2.1 locus name becomes the v3.1 locus name (88% of v2.1 loci were mapped this way). When the MBH protein is >= 90% identical, the v2.1 synonym and defLine, if any, become the v3.1 synonym and defLine, respectively.
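
The two identity thresholds for carrying annotation forward can be summarized as follows. This is an illustrative sketch only: the function name `transfer_annotation`, the dictionary layout, and the locus names are hypothetical and do not reflect the actual mapping scripts.

```python
def transfer_annotation(identity_pct, old_name, old_synonym=None, old_defline=None):
    """Decide which v2.1 attributes carry over to the paired v3.1 locus.

    identity_pct is the percent identity of the mutual-best-hit (MBH)
    protein alignment between the paired loci.
    """
    carried = {"name": None, "synonym": None, "defline": None}
    if identity_pct >= 70.0:
        # The locus name is retained at >= 70% identity.
        carried["name"] = old_name
    if identity_pct >= 90.0:
        # The synonym and defLine (if any) are retained at >= 90% identity.
        carried["synonym"] = old_synonym
        carried["defline"] = old_defline
    return carried


# Hypothetical locus; the names are placeholders, not real sorghum identifiers.
moderate = transfer_annotation(75.0, "locus_A", "syn_A", "example defline")
strong = transfer_annotation(95.0, "locus_A", "syn_A", "example defline")
weak = transfer_annotation(60.0, "locus_A", "syn_A", "example defline")
```

Here `moderate` keeps only the name, `strong` keeps the name, synonym, and defLine, and `weak` keeps nothing.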

Contacts

JGI Contact: Jeremy Schmutz (email: jschmutz AT hudsonalpha DOT org)

Reference Publication(s)

McCormick RF, Truong SK, Sreedasyam A, Jenkins J, Shu S, Sims D, Kennedy M, Amirebrahimi M, Weers BD, McKinley B, Mattison A, Morishige DT, Grimwood J, Schmutz J, Mullet JE. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant journal : for cell and molecular biology. 2017 Nov 21.

 

Source : Phytozome

Publication
  1. Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, Schmutz J, Spannagl M, Tang H, Wang X, Wicker T, Bharti AK, Chapman J, Feltus FA, Gowik U, Grigoriev IV, Lyons E, Maher CA, Martis M, Narechania A, Otillar RP, Penning BW, Salamov AA, Wang Y, Zhang L, Carpita NC, Freeling M, Gingle AR, Hash CT, Keller B, Klein P, Kresovich S, McCann MC, Ming R, Peterson DG, Mehboob-ur-Rahman, Ware D, Westhoff P, Mayer KF, Messing J, Rokhsar DS. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009 Jan 29; 457(7229):551-6.
  2. McCormick RF, Truong SK, Sreedasyam A, Jenkins J, Shu S, Sims D, Kennedy M, Amirebrahimi M, Weers BD, McKinley B, Mattison A, Morishige DT, Grimwood J, Schmutz J, Mullet JE. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant journal : for cell and molecular biology. 2018 01; 93(2):338-354.