Name | Whole Genome Assembly and Annotation (v2.1) of Sorghum bicolor Rio |
---|---|
Description | Assembly: The Sorghum Rio genome assembly was constructed by Cooper et al (2019) using FALCON (Chin et al, 2016) and polished with Quiver (Chin et al, 2013). The Sorghum Rio v2.1 assembly in SorghumBase corresponds to release v2.0 of Phytozome. A total of 35,627 unique, non-repetitive, non-overlapping 1 kb sequences were generated using the existing Sorghum bicolor v3.0 assembly and aligned to the polished Sorghum Rio assembly. Scaffolds were oriented, ordered, and assembled into 10 chromosomes. NCBI accession: GCA_015952705.1. Annotation: Genome-guided transcript assemblies were made from close to 1 billion bp of 2x151 bp paired-end Illumina RNAseq reads using PERTRAN (Shu, unpublished; see Cooper et al, 2019). PASA (Haas et al, 2003) alignment assemblies were constructed using the PERTRAN output from the Rio RNAseq data along with sequences from known S. bicolor expressed sequence tags (ESTs) associated with the current reference genome. As further described in Phytozome, loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, soybean, maize, rice, foxtail, Sorghum bicolor BTx623, Brachypodium, grape, and Swiss-Prot proteomes to the Sorghum bicolor Rio genome, which was soft-masked for repeats using RepeatMasker (RepeatMasker Open-3.0 by AFA Smit, R Hubley & P Green, 1996-2011). Each locus was extended by up to 2 kb on both ends unless the extension ran into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov and Solovyev, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh et al, 2001), from PASA assembly ORFs (in-house homology-constrained ORF finder), and from AUGUSTUS via BRAKER1 (Hoff et al, 2016). The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA (Haas et al, 2003). PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain Cscore and protein coverage; PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. Selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed. For additional details, see Sorghum bicolor Rio v2.1 (Sorghum Rio) in Phytozome v12.1. |
Program, Pipeline, Workflow or Method Name | PASA-improved |
Program Version | n/a |
Algorithm | |
Date Performed | Monday, February 13, 2023 |
Data Source | |
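
The final filtering step in the description (removing gene models whose protein is more than 30% in Pfam TE domains) can be illustrated with a minimal sketch. This is not the Phytozome pipeline code. It assumes that "in Pfam TE domains" means the fraction of residues covered by the union of TE-domain hits, given as 1-based inclusive coordinates on the protein. The function names `te_domain_fraction` and `keep_gene_model` and the input layout are hypothetical.

```python
from typing import Iterable, Tuple


def te_domain_fraction(protein_length: int,
                       te_hits: Iterable[Tuple[int, int]]) -> float:
    """Fraction of residues covered by the union of TE-domain hits.

    Assumption: hits are (start, end) pairs, 1-based and inclusive.
    Overlapping hits are merged so no residue is counted twice.
    """
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")

    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(te_hits):
        # Clamp each hit to the protein boundaries.
        start, end = max(start, 1), min(end, protein_length)
        if start > end:
            continue
        if cur_end is None:
            cur_start, cur_end = start, end
        elif start > cur_end + 1:
            # Gap before this hit: close the current merged interval.
            covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent hit: extend the merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / protein_length


def keep_gene_model(protein_length: int,
                    te_hits: Iterable[Tuple[int, int]],
                    max_te_fraction: float = 0.30) -> bool:
    """Keep a model unless more than max_te_fraction of it is TE domain."""
    return te_domain_fraction(protein_length, te_hits) <= max_te_fraction
```

For example, under these assumptions a 200-residue protein with TE-domain hits at 10-60 and 50-90 has 81 covered residues (about 40%), so `keep_gene_model(200, [(10, 60), (50, 90)])` would reject it.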