Summary
Resource Type: Genome Assembly
Organism:
Name: Whole Genome Assembly (v7.0) and Annotation (v7.1) of Miscanthus sinensis - cv. DH1 (JGI)
Program, Pipeline, Workflow or Method Name: Assembly & annotation, performed by JGI
Program Version: v7.1
Algorithm:
Date Performed: Wednesday, December 20, 2017
Data Source
Source Name: JGI-Phytozome Miscanthus sinensis assembly/annotation
Source Version: v7.1 [497]
Source URI: https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Msinensis_er

 

JGI Assembly and Gene Annotation workflow (Miscanthus sinensis)

 

Release Date (with Restrictions) : December 20, 2017 (Ft. Lauderdale)

Data Source : Energy Biosciences Institute and The Joint Genome Institute

 

Overview

This release (v7.0) is the first chromosome-scale assembly of Miscanthus sinensis doubled haploid DH1 (IGR-2011-001). The wild grass Miscanthus spp. is one of the world’s most widely adapted and productive plants. This chromosome-scale assembly provides a foundational sequence resource for the Miscanthus genus, an important link between the diploid Sorghum bicolor and the complex polyploid Saccharum species, and a reference for the highly productive triploid M. x giganteus. Analysis of structural and regulatory changes among these genomes offers insights into the evolution of rhizome development, nutrient recycling, and self-incompatibility traits that contribute to highly efficient and sustainable biomass accumulation in a perennial temperate grass. Completion of the Miscanthus genome sequence represents a significant advance in both basic and applied plant genomics, creating a powerful system for comparative genomics among the Andropogoneae grasses, which are of global importance as leading bioenergy feedstocks and food crops.

 

Statistics

This release of Phytozome includes the JGI v7.0 assembly of Miscanthus sinensis and the JGI v7.1 annotation.

Genome

Approximately 2 Gb arranged in 19 chromosomes, plus some unmapped scaffolds.

Loci

67,789 loci containing protein-coding transcripts

Transcripts

89,486 protein-coding transcripts

 

Sequencing, Assembly, and Annotation

 

Assembly

The genome was assembled using Meraculous (Goltsman, 2017) with Illumina fragment libraries and 2.5 kb and 6 kb mate-pair libraries. An additional round of scaffolding was performed using Lucigen's 40 kb NGS mate-pair fosmid library, and scaffolds smaller than 1 kb were discarded. Chromosome-scale scaffolding was performed by Dovetail using their HiRISE assembler with Chicago and Hi-C libraries.
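The length cutoff described above (scaffolds smaller than 1 kb are dropped) can be illustrated with a short Python sketch. This is not the JGI pipeline code: the function names `read_fasta` and `drop_short_scaffolds` and the constant `MIN_SCAFFOLD_LENGTH` are hypothetical, and the sketch assumes the scaffolds are available as plain FASTA text.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: "1 kb" is taken to mean 1000 bases; scaffolds of exactly
# 1000 bases are kept because only *smaller* scaffolds are discarded.
MIN_SCAFFOLD_LENGTH = 1000


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_short_scaffolds(
    records: Iterable[Tuple[str, str]],
    min_length: int = MIN_SCAFFOLD_LENGTH,
) -> Iterator[Tuple[str, str]]:
    """Keep only scaffolds whose sequence length is at least min_length."""
    for name, seq in records:
        if len(seq) >= min_length:
            yield name, seq


# Hypothetical toy input: one 1,500-base scaffold and one 40-base scaffold.
example = [">scaffold_1", "A" * 1500, ">scaffold_2", "ACGT" * 10]
kept = [name for name, _ in drop_short_scaffolds(read_fasta(example))]
print(kept)  # ['scaffold_1']
```

Both functions are generators, so a multi-gigabase assembly can be filtered one record at a time without holding every scaffold in memory.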

 

Gene Prediction and Locus Naming

RNA-Seq reads from leaves, stems, and roots of Miscanthus x giganteus, sampled over all seasons, were used to make 306,698 transcript assemblies (TAs) from ~1.4 billion pairs of 150 bp paired-end Illumina reads. Additional RNA-Seq reads from Miscanthus sinensis DH1 leaves and rhizome were used to create 116,719 TAs from ~260 million pairs of 150 bp paired-end Illumina reads using PERTRAN (Shu et al., unpublished). The assemblies were further assembled into 330,146 PASA transcript assemblies using PASA (Haas, 2003). Loci were determined by PASA transcript assembly alignments and/or EXONERATE alignments of proteins from Zea mays, Setaria italica, Brachypodium distachyon, Arabidopsis thaliana Columbia, Sorghum bicolor, Oryza sativa Kitaake, Vitis vinifera, and Swiss-Prot proteomes to the M. sinensis genome, which was soft-masked for repeats using RepeatMasker (Smit, 1996-2012) and a RepeatModeler (Smit, 2008-2015) repeat library. Each locus was extended by up to 2 kb on both ends unless the extension ran into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001). The best-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvement includes adding UTRs, correcting splicing, and adding alternative transcripts. The proteins of PASA-improved gene models were subjected to protein homology analysis against the above-mentioned proteomes to obtain Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best homolog. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats.
A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam and Panther analysis, and gene models whose protein was more than 30% in Pfam/Panther TE domains were removed.
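The selection thresholds in this paragraph can be written out as a small decision function. The Python sketch below is an illustration of the stated rules, not the JGI implementation: the `Transcript` class, its field names, and the function names are hypothetical, and all percentages are represented as fractions between 0 and 1. The text does not say how a transcript with exactly 20% repeat overlap is handled; the sketch assumes it falls under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical per-transcript evidence; field names are illustrative."""
    name: str
    cscore: float              # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float    # fraction of the protein aligned to the best homolog
    has_est_coverage: bool     # True if the transcript has EST coverage
    cds_repeat_overlap: float  # fraction of the CDS overlapping repeats
    te_domain_fraction: float  # fraction of the protein in Pfam/Panther TE domains


def passes_homology_filter(t: Transcript) -> bool:
    """Apply the Cscore / coverage / EST rules, split by repeat overlap."""
    if t.cds_repeat_overlap < 0.20:
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_coverage
    # Assumption: exactly 20% overlap is treated like "more than 20%".
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


def passes_te_filter(t: Transcript) -> bool:
    """Reject models whose protein is more than 30% in TE domains."""
    return t.te_domain_fraction <= 0.30


def is_selected(t: Transcript) -> bool:
    """A model is kept only if it passes both filters."""
    return passes_homology_filter(t) and passes_te_filter(t)


# Hypothetical example values, chosen only to exercise each branch.
candidates = [
    Transcript("low_repeat_homology", 0.6, 0.6, False, 0.10, 0.0),
    Transcript("low_repeat_est_only", 0.2, 0.2, True, 0.10, 0.0),
    Transcript("high_repeat_weak", 0.6, 0.6, True, 0.50, 0.0),
    Transcript("high_repeat_strong", 0.95, 0.8, False, 0.50, 0.0),
    Transcript("te_rich", 0.95, 0.8, False, 0.50, 0.4),
]
selected = [t.name for t in candidates if is_selected(t)]
print(selected)
```

Keeping the homology filter and the TE-domain filter as separate functions mirrors the two stages in the text, where Pfam/Panther screening is applied only to models that were already selected.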

References:

Goltsman, E., Ho, I. and Rokhsar, D. (2017) Meraculous-2D: Haplotype-sensitive Assembly of Highly Heterozygous genomes. bioRxiv doi: 10.1101/070425

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654

Hoff, K.J., Lange, S., Lomsadze, A., Borodovsky, M., and Stanke, M. (2015) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 2015 Nov 11. pii: btv661.

Putnam NH, O'Connell BL, Stites JC, Rice BJ, Blanchette M, Calef R, Troll CJ, Fields A, Hartley PD, Sugnet CW, Haussler D, Rokhsar DS, Green RE. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 2016 Mar;26(3):342-50. doi: 10.1101/gr.193474.115. Epub 2016 Feb 4. PubMed PMID: 26848124; PubMed Central PMCID: PMC4772016.

Salamov, A. A. and Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516-22.

Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. 1996-2011.

Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

 

Restrictions on dataset usage

I would like to use this data to help clone a gene, analyse a gene family, etc.
Please use this data to advance your studies. Please cite "Miscanthus sinensis v7.1 DOE-JGI, http://phytozome.jgi.doe.gov/".

I would like to do a large-scale comparison of Miscanthus sinensis to other genomes, and/or a global analysis of its gene content.
As a public service, the Department of Energy's Joint Genome Institute (JGI) is making the completed Miscanthus sinensis genome sequence available before scientific publication according to the Ft. Lauderdale Accord. This balances the imperative of the DOE and the JGI that the data from its sequencing projects be made available as soon and as completely as possible with the desire of contributing scientists and the JGI to reserve a reasonable period of time to publish on the genome sequencing and analysis without concerns about preemption by other groups. JGI policy is that early release should aid the progress of science. By accessing these data, you agree not to publish any articles containing analyses of genes or genomic data on a whole genome or chromosome scale prior to publication by JGI and/or its collaborators of a comprehensive genome analysis ("Reserved Analyses"). "Reserved analyses" include the identification of complete (whole genome) sets of genomic features such as genes, gene families, regulatory elements, repeat structures, GC content, or any other genome feature, and whole-genome- or chromosome- scale comparisons with other species. The embargo on publication of Reserved Analyses by researchers outside of the Miscanthus sinensis Genome Sequencing Project is expected to extend until the publication of the results of the sequencing project is accepted. Scientific users are free to publish papers dealing with specific genes or small sets of genes using the sequence data. If these data are used for publication, the following acknowledgment should be included: 'These sequence data were produced by the US Department of Energy Joint Genome Institute'. This letter has been circulated to Journal Editors so that they are aware of the conditions of access and publication detailed above. These data may be freely downloaded and used by all who respect the restrictions in the previous paragraphs. 
The assembly and sequence data should not be redistributed or repackaged without permission from the JGI. Any redistribution of the data during the embargo period should carry this notice: "The Joint Genome Institute provides these data in good faith, but makes no warranty, expressed or implied, nor assumes any legal liability or responsibility for any purpose for which the data are used. Once the sequence is moved to unreserved status, the data will be freely available for any subsequent use."

We prefer that potential users of this sequence assembly contact the individuals listed under Contacts with their plans, to ensure that the proposed usage of sequence data is not considered a Reserved Analysis.

 

Contacts

Principal Collaborators:

  • Kankshita Swaminathan (HudsonAlpha Institute for Biotechnology) (email: kswaminathan AT hudsonalpha DOT org)

JGI Contact:

  • JGI Contact and the Miscanthus Genome Sequencing Project coordinator: Dan Rokhsar (email: drokhsar AT lbl DOT gov)

 

Source : Phytozome

Publication
There are no publications associated with this record.
Cross Reference
There are no cross references.