Identifying and annotating genes
RNAseq data were obtained
by Illumina sequencing of cDNA that had been prepared from a
variety of conditions. All cDNA sequencing was performed in paired-end
format with about 150 nucleotides of separation ("SIPES"). The
sequencing yields are shown in the table below. The reads can be downloaded
as paired-end files, with "LEFT" being the set of reads from one
end of each fragment and "RIGHT" being their mates in
exactly the same order (the standard input for many viewing and
analysis programs).
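Because the LEFT and RIGHT files list mates in the same order, pairs can be recovered simply by reading both files in lockstep. The following is a minimal sketch of that idea, assuming the reads are in plain four-line FASTQ format (the page does not state the file format); the function names `read_fastq` and `paired_reads` are illustrative only.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file (or a blank line) stops iteration
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual  # drop the leading '@'


def paired_reads(left_handle, right_handle):
    """Yield (left, right) mates, relying on identical record order."""
    pairs = zip_longest(read_fastq(left_handle), read_fastq(right_handle))
    for left, right in pairs:
        if left is None or right is None:
            raise ValueError("LEFT and RIGHT files have different numbers of reads")
        yield left, right
```

A mismatch in read counts raises an error rather than silently pairing the wrong mates, which is the main risk when order is the only link between the two files.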
Type | Raw reads | Download raw paired-end reads | Reads after trimming | Nts after trimming | Mean trimmed read length | Download detailed trim report |
Minus nitrogen | 18,244,626 | | 18,148,466 | 923,756,919 | 50.9 | |
Plus nitrogen | 19,736,016 | | 19,646,689 | 1,000,016,470 | 50.9 | |
Pooled from many conditions | 17,823,072 | | 17,723,662 | 895,044,931 | 50.5 | |
Total | 55,803,714 | N/A | 55,518,817 | 2,818,818,320 | N/A | N/A |
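The mean trimmed read length in the table is the number of nucleotides after trimming divided by the number of reads after trimming, and the Total row is the sum of the three samples. A short sketch that recomputes these figures from the table values (the variable and function names are illustrative):

```python
# Values copied from the table: (reads after trimming, nts after trimming)
trimmed = {
    "Minus nitrogen": (18_148_466, 923_756_919),
    "Plus nitrogen": (19_646_689, 1_000_016_470),
    "Pooled from many conditions": (17_723_662, 895_044_931),
}


def mean_read_length(reads, nts):
    """Mean length of a trimmed read, in nucleotides."""
    return nts / reads


# Per-sample means, rounded to one decimal place as in the table
means = {name: round(mean_read_length(r, n), 1) for name, (r, n) in trimmed.items()}

# Column totals across the three samples
total_reads = sum(r for r, _ in trimmed.values())
total_nts = sum(n for _, n in trimmed.values())
```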
The “pooled”
sample was assembled using a de Bruijn graph method to create 37,056
transcriptome contigs. A summary report on the matching of these
reads to the transcriptome assembly can be downloaded HERE,
and a more detailed report HERE.
After masking repeated
genomic elements, genes were modeled in the 2,087 genome scaffolds
using several lines of evidence:
(1) Exonerate alignments of the Nannochloropsis transcriptome assembly.
(2) BLASTn alignments of this same transcriptome assembly, treated as EST evidence.
(3) tBLASTx alignments of the 66,106 Ectocarpus siliculosus EST sequences.
(4) Bowtie alignments of the RNAseq data, followed by gene modeling with Tophat and Cufflinks.
(5) BLASTx alignments of all proteins of 10 genomes to the Nannochloropsis genome scaffolds.
(6) Exonerate alignments of all proteins from these same 10 organisms.
(7) Augustus ab initio models trained on the gene structures of Neurospora crassa.
(8) SNAP ab initio models trained on Hidden Markov Models (HMMs) of the genes of Pythium ultimum.
(9) Genemark ab initio models trained on HMMs of the genes of Pythium ultimum.
All of these lines of evidence were reconciled into
a single gene set using Maker.
We also examined the
transcriptome assembly for any additional genes that may be present.
We searched all transcript contigs for homology to the gene sets
of P. tricornutum, T. pseudonana, C. reinhardtii,
E. siliculosus, and A. anophagefferens using BLASTn
with a cutoff of 1e-10. This subset was then searched against the
complete Maker gene set using BLASTn with a cutoff of 1e-20, narrowing
it to 2,531 transcriptome contigs that have homology support for
being real genes but were not present in our Maker gene set. We
then used EST2genome to align these to the genome assembly v1.1.
Of these, 1,310 could be assigned to the genome and 1,221 could
not. Presumably these unmatched genes lie in regions missing from
the assembly altogether, lie in regions that are misassembled, correspond
to a Maker-predicted gene model more closely than BLASTn had detected,
or span two or more scaffolds.
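The two-stage BLASTn filter can be sketched as a set difference: contigs with a hit to the reference gene sets at 1e-10, minus contigs with a hit to the Maker gene set at 1e-20. This is only an illustration of that reading of the text. It assumes BLAST tabular output (`-outfmt 6`), where the first column is the query ID and the eleventh is the e-value; the page does not say which output format was used, and the function names are hypothetical.

```python
def queries_with_hits(blast_lines, max_evalue):
    """Return query IDs having at least one hit with e-value <= max_evalue.

    Assumes tab-separated BLAST output (-outfmt 6): column 1 is the
    query ID and column 11 is the e-value.
    """
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def candidate_missing_genes(vs_reference_sets, vs_maker_set):
    """Contigs with homology support that have no match in the Maker set."""
    supported = queries_with_hits(vs_reference_sets, 1e-10)
    already_modeled = queries_with_hits(vs_maker_set, 1e-20)
    return supported - already_modeled
```

Both arguments are iterables of lines, so open file handles can be passed directly. The counts reported above are consistent with each other: 1,310 assigned plus 1,221 unassigned gives the 2,531 candidates.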
An all-by-all BLASTn
analysis of all of these gene models identified 619 genes that appear
as duplications with identical sequence and 670 with large regions
of identical sequence, in each case at different loci.
These may be actual biological duplications or might be errors in
the assembly. Those that are completely identical are distinguished
by suffixes of 0.01 and 0.02 and those with only large regions of
identity by suffixes of 0.1 and 0.2 (with the longer sequence called
.1). This 9,052-member gene set is referred to as version 1.1.
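The fully identical duplicates are the simplest case to detect. As a rough illustration, the sketch below groups gene IDs by exact sequence match; this is a simpler stand-in for the all-by-all BLASTn analysis described above, not the method that was used, and it does not find genes that share only large identical regions. The function name and input shape are assumptions.

```python
from collections import defaultdict


def identical_sequence_groups(records):
    """Group gene IDs whose sequences are exactly identical.

    records: iterable of (gene_id, sequence) tuples.
    Returns a list of ID lists, one per sequence seen more than once.
    """
    by_sequence = defaultdict(list)
    for gene_id, seq in records:
        by_sequence[seq.upper()].append(gene_id)  # case-insensitive match
    return [ids for ids in by_sequence.values() if len(ids) > 1]
```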
In addition to this
gene set, nearly complete mitochondrial and plastid genome sequences
were found in five of the scaffolds, which were interpreted separately.
124 chloroplast genes and 36 mitochondrial genes were identified
and annotated using DOGMA.
These files are available
here for download:
A FASTA-formatted
file of the sequences of all 37,056 transcript contigs.
An Excel file
with statistics on the alignment of the pooled RNA reads to the
transcriptome assembly.
A FASTA-formatted
file of all 9,052 unique gene sequences in gene set version
1.1 (see NCBI for the slightly reduced set of v.1.2).
A FASTA-formatted
file of the peptide sequence of the longest plus-strand ORF
in each gene of this set.
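For readers who want to reproduce something like the peptide file from the gene sequences, the following is a minimal sketch of finding the longest plus-strand ORF. It makes two assumptions the page does not confirm: that an ORF starts at ATG and runs to the first in-frame stop codon (or the end of the sequence), and that the standard genetic code applies. All names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide string; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_plus_strand_orf(seq):
    """Return the longest peptide starting at M in any forward frame."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        protein = translate(seq[frame:])
        # Stop codons ('*') split the frame into candidate segments
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best
```

Only the three forward frames are scanned, matching the "plus-strand" wording; the reverse complement is deliberately ignored.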
The genome assembly
scaffolds have been annotated with all gene models. This annotated
file for genome assembly v.1.1 (on which the manuscript is based)
can be downloaded HERE.
This file can be viewed using free software from CLC Bio called
the Sequence
Viewer (or several other alternatives) that works on any platform.
The GFF for the version 1.1 annotation can be downloaded HERE.
The assembly version 1.2 files that are in GenBank differ from this
in three ways: scaffolds shorter than 200 nts were eliminated (by
GenBank policy), a few scaffolds identified as bacterial contamination
during submission were removed, and the remaining scaffolds were
renamed sequentially.
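The relationship between the v1.1 and v1.2 scaffold sets can be sketched as a filter followed by renumbering. The code below is only an illustration: the `scaffold_` prefix, the `exclude` set of contaminant names, and the returned tuple layout are hypothetical, and the actual GenBank names are not reproduced here.

```python
def filter_and_rename(scaffolds, min_length=200, exclude=frozenset(), prefix="scaffold_"):
    """Drop short or excluded scaffolds, then number the rest sequentially.

    scaffolds: iterable of (name, sequence) tuples in their original order.
    Returns a list of (new_name, old_name, sequence) tuples.
    """
    kept = [
        (name, seq)
        for name, seq in scaffolds
        if len(seq) >= min_length and name not in exclude
    ]
    return [
        (f"{prefix}{i}", old_name, seq)
        for i, (old_name, seq) in enumerate(kept, start=1)
    ]
```

Keeping the old name alongside the new one makes it possible to map v1.1 annotations onto the renamed scaffolds.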