Whole Genome Assembly and Annotation (v1.1) of Sorghum bicolor BTx642

Transcript assemblies were made from ~349M pairs of 2x150 stranded paired-end Illumina RNA-seq reads and ~2B pairs of 2x150 stranded paired-end Illumina RNA-seq reads (expression profile experiments) using PERTRAN (Shu, unpublished). About 1.5M PacBio Iso-Seq CCSs were corrected and collapsed by a genome-guided correction pipeline (Shu, unpublished) to obtain ~1M putative full-length transcripts. A total of 215,476 transcript assemblies were constructed using PASA (Haas, 2003) from the RNA-seq transcript assemblies above. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, soybean, rice, Setaria viridis, Aquilegia, grape and Swiss-Prot proteomes to the Sorghum bicolor BTx642 genome, repeat-soft-masked using RepeatMasker (Smit, 2013-2015), with up to 2 kb extension on both ends unless extending into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs to compute splice site and intron input instead of protein/translated ORFs), and EXONERATE (Slater and Birney, 2005); from PASA assembly ORFs (in-house homology-constrained ORF finder); and from AUGUSTUS via BRAKER1 (Hoff, 2015). The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvement includes adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs.
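The two homology metrics defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the scores are assumed to be already extracted from BLASTP output, and coverage is assumed to be measured as aligned residues over protein length.

```python
from typing import Iterable


def cscore(blastp_score: float, mbh_blastp_score: float) -> float:
    """Ratio of a protein's BLASTP score to the mutual-best-hit BLASTP score.

    Assumption: both scores are of the same kind (e.g. both bit scores).
    """
    if mbh_blastp_score <= 0:
        raise ValueError("MBH BLASTP score must be positive")
    return blastp_score / mbh_blastp_score


def protein_coverage(protein_length: int, aligned_lengths: Iterable[int]) -> float:
    """Highest fraction of the protein aligned to any of its homologs.

    Assumption: each entry of aligned_lengths is the number of residues of
    the predicted protein covered by the alignment to one homolog.
    """
    if protein_length <= 0:
        raise ValueError("protein length must be positive")
    fractions = [min(a, protein_length) / protein_length for a in aligned_lengths]
    return max(fractions, default=0.0)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(cscore(450.0, 500.0))
    print(protein_coverage(200, [100, 150]))
```

Coverage is returned here as a fraction between 0 and 1, which matches the 0.5 threshold used for selection; multiply by 100 where a percentage is wanted.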
PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% covered by Pfam TE domains were removed, as were weak gene models. Incomplete gene models, gene models with low homology support and without full transcriptome support, and short single-exon gene models (< 300 bp CDS) with neither a protein domain nor good expression were manually filtered out.
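The selection thresholds above can be read as a simple decision rule. The sketch below is one possible reading, not the pipeline's implementation: the `Transcript` record and its field names are hypothetical, all percentages are expressed as fractions, and a CDS repeat overlap of exactly 20% (which the description leaves unspecified) is assumed to fall under the stricter rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical per-transcript summary; field names are illustrative."""
    cscore: float
    protein_coverage: float       # fraction of protein aligned to best homolog
    has_est_coverage: bool
    cds_repeat_overlap: float     # fraction of CDS overlapping repeats
    pfam_te_fraction: float = 0.0  # fraction of protein in Pfam TE domains


def passes_selection(t: Transcript) -> bool:
    """Apply the Cscore / coverage / EST / repeat-overlap thresholds."""
    if t.cds_repeat_overlap < 0.20:
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # Repeat-rich CDS: stricter homology requirement (assumed to include
    # the boundary case of exactly 20% overlap).
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def passes_pfam_te_filter(t: Transcript) -> bool:
    """Reject models whose protein is more than 30% in Pfam TE domains."""
    return t.pfam_te_fraction <= 0.30


if __name__ == "__main__":
    candidates = [
        Transcript(0.6, 0.55, False, 0.05),
        Transcript(0.6, 0.60, True, 0.35),
        Transcript(0.95, 0.80, False, 0.35, pfam_te_fraction=0.40),
    ]
    for t in candidates:
        print(t, passes_selection(t) and passes_pfam_te_filter(t))
```

The manual filtering of incomplete, weakly supported, and short single-exon models is not represented here, since the description gives no numeric criteria for it beyond the 300 bp CDS length.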

Source : Phytozome

Program, Pipeline, Workflow or Method Name
Program Version
Date Performed : Tuesday, February 14, 2023
Data Source
Source Name : Phytozome
Source Version : 1.1
Source URI