Genome Project Solutions    
 
   
    Nannochloropsis Genome Project, in collaboration with  
    Colorado School of Mines  
 Partnering for Discovery
 

 

 
Colorado School of Mines

This project is led by Professor Matthew Posewitz, Dr. Randor Radakovits, and Dr. Robert Jinkerson of the Colorado School of Mines in Golden, Colorado.

For information or to report a problem, write to: Nannochloropsis@GenomeProjectSolutions.com

The Genome of the stramenopile alga, Nannochloropsis gaditana

Randor Radakovits and this research project were supported with funding provided by Conoco-Phillips through a grant to the Colorado Center for Biofuels and Biorefining (C2B2). Robert Jinkerson was supported by a Graduate Research Fellowship from the National Science Foundation.

Research Interests (from the Posewitz website):

Energy is inextricably linked to a society's standard of living and the 21st century will see dramatic changes in how energy is generated, distributed and utilized. It is clear that diminishing fossil energy resources, climate change concerns, and growing energy demands will require cutting edge solutions in renewable energy technologies. Our group studies the diverse portfolio of bioenergy carriers that can be obtained from algae including hydrogen, lipids for transformation into diesel fuel surrogates, and starch and osmolytes for conversion into alcohols, lipids or hydrogen. Micro-algae have among the highest photosynthetic conversion efficiencies documented, are able to thrive in salt water, and are among the most metabolically versatile organisms known. Currently, laboratory projects include the study of (a) hydrogenase enzymes and the production of hydrogen from phototrophic micro-organisms, (b) starch and lipid metabolisms in algae, (c) Genome-based approaches applied to defining whole cell metabolic and regulatory pathways, (d) the diversity of water-oxidizing phototrophs that are adapted to saline ecosystems, and (e) the enzymatic control of metabolic flux in algae. Our research is firmly entrenched in developing a more informed understanding of central metabolism in these fascinating organisms, which can hopefully be applied in viable bioenergy technologies.

Project Status

The Nannochloropsis genus, with six recognized species, is part of the Stramenopiles (sometimes called the Heterokonts), a group that includes oomycetes, diatoms, and brown algae. General information on Nannochloropsis and its potential for biofuel production can be found on Wikipedia at Nannochloropsis and Biofuels.

The genome has been sequenced and assembled, a gene set has been created, and gene expression has been measured and compared with and without added nitrogen. A manuscript describing these results has been published: Radakovits R, Jinkerson RE, Fuerstenberg SI, Tae H, Settlage RE, Boore JL, Posewitz MC, 2012. Draft genome sequence and genetic transformation of the oleaginous alga Nannochloropsis gaditana. Nature Communications 3: 686.

DNA sequencing was done by Eureka Genomics and the assembly was done in collaboration with Robert Settlage and Hongseok Tae at Virginia Tech.

Genome Project Solutions advised on the sequencing strategy, led the genome assembly, determined the gene content and relative gene expression levels, assisted with informatics to interpret and present this genome to the scientific community, and participated in drafting the manuscript describing this project.

 
 

Genome sequencing and assembly

In order to capitalize on their differing strengths, we employed both Roche (“454”) and Illumina sequencing. The relatively long sequencing reads produced by the Roche technology are especially useful for resolving short repeated sequences during assembly. The reads produced by the Illumina technology are much less expensive, enabling deep coverage at moderate cost, and do not suffer from the high error rates within homopolymer runs that characterize Roche sequencing. Further, we performed sequencing using an Illumina protocol called LIPES (“Long Interval Paired-End Sequencing”), which pairs the sequencing reads at approximately 4 kb separation, useful for ordering and orienting contiguous portions of the assembly that are not extended because of repeated elements in the genome.

Download the raw Roche ("454") and Illumina genome sequencing data HERE (9.17 GB compressed file).
(See below for downloading the RNAseq reads.)

The Roche sequences were processed to trim off the primer sequences. All sequencing reads were then trimmed to an error probability of approximately 1:100 and to contain no ambiguous nucleotide identities (e.g. ‘N’), and any reads shorter than 30 nucleotides after trimming were removed.
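The trimming algorithm itself is not described on this page. Purely as an illustration, the sketch below applies the three stated criteria to one read, assuming Phred-scaled quality scores (an error probability of 1:100 corresponds to Q20) and assuming that trimming keeps the longest stretch in which every base passes. The function name and the per-base rule are hypothetical, not the procedure actually used.

```python
MIN_QUALITY = 20  # Phred Q20 corresponds to an error probability of 1 in 100
MIN_LENGTH = 30   # reads shorter than this after trimming are discarded


def trim_read(seq, quals, min_quality=MIN_QUALITY, min_length=MIN_LENGTH):
    """Return the longest run of unambiguous bases whose quality is at least
    min_quality, or None if that run is shorter than min_length.

    Assumption: a simple per-base rule, chosen only for illustration.
    """
    best_start, best_len = 0, 0
    run_start = 0
    for i, (base, q) in enumerate(zip(seq, quals)):
        if base.upper() in "ACGT" and q >= min_quality:
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            # An ambiguous or low-quality base breaks the run.
            run_start = i + 1
    if best_len < min_length:
        return None
    return seq[best_start:best_start + best_len]
```

A read with an 'N' in the middle is cut back to its longest clean stretch, and a read whose best stretch is under 30 nucleotides is dropped entirely.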

The performance for each type of genome sequencing was as follows:

Type                 Raw reads   Reads after trimming  Nts after trimming  Mean trimmed read length (nts)  Genome coverage after trimming *
Roche unpaired       1,123,414   1,103,775             338,614,903         307                             12x
Illumina unpaired    24,709,613  24,484,194            2,313,756,333       95                              81x
Illumina paired-end  57,978,246  56,644,602            5,845,722,926       103                             206x
Total                83,811,273  82,232,571            8,498,094,162       N/A                             299x

A detailed trim report can be downloaded for each read type.
 

* Coverage assumes a genome size of 28.4 Mb.
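The coverage figures follow directly from the trimmed nucleotide counts. A minimal sketch of the arithmetic, using the values from the table above and the assumed genome size of 28.4 million nucleotides (the variable and function names are illustrative):

```python
GENOME_SIZE = 28_400_000  # assumed genome size in nucleotides

# Nucleotides remaining after trimming, copied from the table above.
TRIMMED_NTS = {
    "Roche unpaired": 338_614_903,
    "Illumina unpaired": 2_313_756_333,
    "Illumina paired-end": 5_845_722_926,
}


def coverage(nts, genome_size=GENOME_SIZE):
    """Fold coverage: sequenced nucleotides divided by genome size."""
    return nts / genome_size


per_type = {name: round(coverage(nts)) for name, nts in TRIMMED_NTS.items()}
total = round(coverage(sum(TRIMMED_NTS.values())))

for name, fold in per_type.items():
    print(f"{name}: {fold}x")
print(f"Total: {total}x")
```

Rounded to whole numbers, this reproduces the 12x, 81x, 206x, and 299x entries in the table.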

Each set of reads was assembled separately using several different software packages with varying parameters, then the best of each of these assemblies were merged using custom, in-house software to create the version 1.0 assembly. Careful examination identified some scaffolds as being from contaminating bacterial DNA and a few from the organelle genomes (which were analyzed separately). Removing these produced the version 1.1 genome assembly of 2,087 scaffolds. There are 35 scaffolds longer than 100 kb, a total of 561 longer than 20 kb, and a total of 1,447 that are longer than 2 kb. Download a fasta-formatted file of the sequences of these scaffolds HERE and an Excel file with a listing of these scaffolds in order of size, their lengths, the cumulative lengths, and the cumulative genome coverage HERE.
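Summary figures like these can be recomputed from the scaffold FASTA file. The sketch below is one way to do it, assuming a standard FASTA layout and interpreting "cumulative genome coverage" as cumulative length divided by the assumed genome size; the function names and the file name in the usage note are hypothetical.

```python
def read_fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def scaffold_table(lengths, genome_size):
    """Rows of (name, length, cumulative length, cumulative fraction of the
    genome), ordered from the longest scaffold to the shortest."""
    rows, running = [], 0
    ordered = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
    for name, length in ordered:
        running += length
        rows.append((name, length, running, running / genome_size))
    return rows


def count_longer_than(lengths, threshold):
    """Number of sequences strictly longer than threshold nucleotides."""
    return sum(1 for length in lengths.values() if length > threshold)
```

For example, with `open("scaffolds_v1.1.fasta")` (a placeholder file name) passed to `read_fasta_lengths`, calling `count_longer_than(lengths, 100_000)` would give the number of scaffolds longer than 100 kb.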

Version 1.1 of the assembly, described above, was the basis of the published genome description. However, some scaffolds were removed during submission to GenBank: a few as bacterial contamination, and most because they were shorter than 200 nts. The remaining scaffolds comprise assembly version 1.2, which corresponds to what is available at GenBank. Download a fasta-formatted file of the sequences of these v 1.2 scaffolds HERE and an Excel file with a listing of these scaffolds in order of size, their lengths, the cumulative lengths, and the cumulative genome coverage HERE.

 
 

Identifying and annotating genes

RNAseq data were obtained using Illumina sequencing on cDNA that had been prepared from a variety of conditions. All cDNA sequencing was performed in paired-end format at about 150 nucleotide separation ("SIPES"). The performance is shown in the table below. The reads can be downloaded as paired ends, with "LEFT" being the set of reads from one end of each fragment and "RIGHT" being their mates in exactly the same order (as is standard input for many viewing and analysis programs).

Type                         Raw reads   Reads after trimming  Nts after trimming  Mean trimmed read length (nts)
Minus nitrogen               18,244,626  18,148,466            923,756,919         50.9
Plus nitrogen                19,736,016  19,646,689            1,000,016,470       50.9
Pooled from many conditions  17,823,072  17,723,662            895,044,931         50.5
Total                        55,803,714  55,518,817            2,818,818,320       N/A

The raw paired-end reads and a detailed trim report can be downloaded for each sample.

The “pooled” sample was assembled using a de Bruijn graph method to create 37,056 transcriptome contigs. A summary report on the matching of these reads to the transcriptome assembly can be downloaded HERE, and a more detailed report HERE.
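For readers unfamiliar with the approach, a de Bruijn graph method breaks reads into overlapping k-mers and links each k-mer's prefix to its suffix, so that contigs can be read off as paths through the graph. The toy sketch below shows the idea only. It is not the assembler used for this project (which is not named here), it ignores sequencing errors, k-mer counts, and reverse complements, and the function names and the value of k are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Walk forward from start while each node has exactly one successor,
    spelling out the sequence along the way."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop rather than loop around a cycle
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


reads = ["ATGGCG", "GGCGTA", "CGTACC"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

Here three overlapping six-nucleotide reads are joined into a single ten-nucleotide sequence. Repeats in real data create nodes with several successors, which is where the walk in this sketch would stop.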

After masking repeated genomic elements, genes were modeled in the 2,087 genome scaffolds using several methods:

(1) Exonerate, using the Nannochloropsis transcriptome assembly.
(2) BLASTn alignment of this same transcriptome assembly as EST evidence.
(3) tBLASTx alignment of the 66,106 Ectocarpus siliculosus EST sequences.
(4) Alignment of the RNAseq data using Bowtie, followed by gene modeling with TopHat and Cufflinks.
(5) BLASTx alignment of all proteins of 10 genomes to the Nannochloropsis genome scaffolds.
(6) Exonerate, using all proteins from these same 10 organisms.
(7) Augustus, for ab initio models trained on the gene structures of Neurospora crassa.
(8) SNAP, for ab initio models trained on Hidden Markov Models (HMMs) of the genes of Pythium ultimum.
(9) GeneMark, for ab initio models trained on HMMs of the genes of Pythium ultimum.

All of these lines of evidence were reconciled into a single gene set using Maker.

We also examined the transcriptome assembly for any additional genes that may be present. We searched all transcript contigs for homology to the gene sets of P. tricornutum, T. pseudonana, C. reinhardtii, E. siliculosus, and A. anophagefferens using BLASTn with an e-value cutoff of 1e-10. This subset was then searched against the complete Maker gene set using BLASTn with a cutoff of 1e-20, narrowing it to 2,531 transcriptome contigs that have homology support for being real genes but were not present in our Maker gene set. We then used EST2genome to align these to the genome assembly v1.1. Of these, 1,310 could be assigned to the genome and 1,221 could not. Presumably, the unmatched genes are in regions missing from the assembly altogether, are in misassembled regions, match a Maker-predicted gene model better than BLASTn had detected, or span two or more scaffolds.
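The two-step filter described above amounts to a set difference: contigs with a hit in the reference gene sets, minus contigs that already hit a Maker gene. The sketch below shows that logic, assuming both searches were saved in BLAST's default 12-column tabular output (query ID in column 1, e-value in column 11). The function names are hypothetical, and the output format is an assumption; the page does not say how results were stored.

```python
def queries_with_hit(blast_lines, max_evalue):
    """Return the query IDs that have at least one hit with an e-value at or
    below max_evalue, from tab-separated BLAST tabular output."""
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def candidate_missing_genes(vs_reference_sets, vs_maker_set):
    """Contigs with homology support (e-value <= 1e-10) that do not match
    any gene in the Maker set (e-value <= 1e-20)."""
    supported = queries_with_hit(vs_reference_sets, 1e-10)
    already_modeled = queries_with_hit(vs_maker_set, 1e-20)
    return supported - already_modeled
```

Each argument can be an open file or a list of lines, so the same functions work on full result files and on small test inputs.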

An all-by-all BLASTn analysis of all of these gene models identified 619 genes that appear as duplications with identical sequence and 670 with large regions of identical sequence, in each case at different loci. These may be actual biological duplications or might be errors in the assembly. Those that are completely identical are distinguished by suffixes of 0.01 and 0.02, and those with only large regions of identity by suffixes of 0.1 and 0.2 (with the longer sequence called .1). This 9,052-member gene set is referred to as version 1.1.

In addition to this gene set, nearly complete mitochondrial and plastid genome sequences were found in five of the scaffolds, which were interpreted separately. Using DOGMA, 124 chloroplast genes and 36 mitochondrial genes were identified and annotated.

These files are available here for download:

A fasta-formatted file of the sequences of all 37,056 transcript contigs.
An Excel file with statistics on the alignment of the pooled RNA reads to the transcriptome assembly.
A fasta-formatted file of all 9,052 unique gene sequences in gene set version 1.1 (See NCBI for the slightly reduced set of v.1.2)
A fasta-formatted file of the peptide sequence of the longest plus-strand ORF in each gene of this set.

The genome assembly scaffolds have been annotated with all gene models. This annotated file for genome assembly v.1.1 (on which the manuscript is based) can be downloaded HERE. This file can be viewed using free software from CLC Bio called the Sequence Viewer (or several other alternatives) that works on any platform. The GFF for the version 1.1 annotation can be downloaded HERE. The assembly version 1.2 files that are in GenBank differ from this in only three ways: any scaffold shorter than 200 nts was eliminated (by GenBank policy), a few scaffolds identified as bacterial contamination during the submission were eliminated, and the scaffolds were renamed sequentially.

 
 
Gene expression changes with added nitrogen

The trimmed reads in the table above for conditions with and without added nitrogen were separately aligned to the 9,052 sequences of gene set version 1.1, and the RNAseq reads aligned to each gene sequence were counted. These counts were normalized by the length of each transcript to get the mean depth of coverage, then normalized for the differing numbers of input reads, and then the ratio was calculated for each gene to assay the differences in gene expression under these two conditions. The RNAseq data matched 8,948 of the 9,052 gene models; presumably, the 104 genes that were not matched were not expressed at all in these conditions.
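The normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: mean depth is taken as aligned reads times read length divided by gene length, the read-number normalization is expressed per million input reads, and the ratio is taken as minus nitrogen over plus nitrogen. The function names and the example numbers are hypothetical.

```python
def normalized_expression(counts, gene_lengths, total_reads, read_length):
    """Mean depth of coverage per gene, scaled per million input reads.

    counts:       {gene: number of reads aligned to that gene}
    gene_lengths: {gene: gene length in nucleotides}
    """
    per_million = total_reads / 1e6
    return {
        gene: (counts.get(gene, 0) * read_length / length) / per_million
        for gene, length in gene_lengths.items()
    }


def expression_ratios(minus_n, plus_n):
    """Ratio of normalized expression (minus nitrogen / plus nitrogen).

    Genes with no expression in the plus-nitrogen sample get None,
    because the ratio is undefined.
    """
    ratios = {}
    for gene, value in minus_n.items():
        denominator = plus_n.get(gene, 0)
        ratios[gene] = value / denominator if denominator > 0 else None
    return ratios


# Made-up example values, for illustration only.
gene_lengths = {"gene_a": 1000, "gene_b": 500}
minus_n = normalized_expression({"gene_a": 200, "gene_b": 50}, gene_lengths,
                                total_reads=1_000_000, read_length=50)
plus_n = normalized_expression({"gene_a": 200, "gene_b": 0}, gene_lengths,
                               total_reads=2_000_000, read_length=50)
ratios = expression_ratios(minus_n, plus_n)
```

In this made-up example, gene_a has the same raw count in both samples, but the plus-nitrogen sample has twice as many input reads, so its normalized expression is half as high and the ratio is 2.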

An Excel file can be downloaded HERE that shows these results.
A report on the mapping of all reads from transcripts isolated from a culture deprived of nitrogen ("minus nitrogen") can be downloaded HERE.
A report on the mapping of all reads from transcripts isolated from a culture with added nitrogen ("plus nitrogen") can be downloaded HERE.

 
  Last update: January 16, 2012