- Open Access
A quantitative reference transcriptome for Nematostella vectensis earlyembryonic development: a pipeline for de novo assembly in emergingmodel systems
EvoDevo volume 4, Article number: 16 (2013)
The de novo assembly of transcriptomes from short shotgun sequencesraises challenges due to random and non-random sequencing biases andinherent transcript complexity. We sought to define a pipeline for denovo transcriptome assembly to aid researchers working withemerging model systems where well annotated genome assemblies are notavailable as a reference. To detail this experimental and computationalmethod, we used early embryos of the sea anemone, Nematostellavectensis, an emerging model system for studies of animal body planevolution. We performed RNA-seq on embryos up to 24 h of developmentusing Illumina HiSeq technology and evaluated independent de novoassembly methods. The resulting reads were assembled using either theTrinity assembler on all quality controlled reads or both the Velvet andOases assemblers on reads passing a stringent digital normalization filter.A control set of mRNA standards from the National Institute of Standards andTechnology (NIST) was included in our experimental pipeline to invest ourtranscriptome with quantitative information on absolute transcript levelsand to provide additional quality control.
We generated >200 million paired-end reads from directional cDNA librariesrepresenting well over 20 Gb of sequence. The Trinity assembler pipeline,including preliminary quality control steps, resulted in more than 86% ofreads aligning with the reference transcriptome thus generated.Nevertheless, digital normalization combined with assembly by Velvet andOases required far less computing power and decreased processing time whilestill mapping 82% of reads. We have made the raw sequencing reads andassembled transcriptome publically available.
Nematostella vectensis was chosen for its strategic position in thetree of life for studies into the origins of the animal body plan, however,the challenge of reference-free transcriptome assembly is relevant to allsystems for which well annotated gene models and independently verifiedgenome assembly may not be available. To navigate this new territory, wehave constructed a pipeline for library preparation and computationalanalysis for de novo transcriptome assembly. The gene modelsdefined by this reference transcriptome define the set of genes transcribedin early Nematostella development and will provide a valuabledataset for further gene regulatory network investigations.
Nematostella vectensis, the starlet sea anemone, offers many advantages asa model system for the evolution of animal developmental programs. As an anthozoancnidarian, it is strategically positioned as an outgroup to Bilateria [1–3] and is well situated to reveal the early steps in the evolution of thebilaterian body plan. Two of these evolutionary steps are likely to include theformation of a secondary body axis and a mesodermal germ layer which are bothessential, defining characteristics of a bilaterian animal. Embryonic dorsal-ventralpatterning and mesodermal development have been studied in many bilaterian modelsyet the origins of these significant body plan innovations are not well understood.Initial studies of gene expression in Nematostella and non-anthozoancnidarians have revealed that genes important to bilaterian mesoderm specificationare expressed in the endoderm of the sea anemone, and suggests that the bilaterianmesoderm may have originated from the endoderm of diploblastic ancestors [4–6]. Genes encoding factors involved in dorsal-ventral axis specification inBilaterians are likewise asymmetrically expressed in Nematostella,indicating the possibility that a secondary axis was present in theCnidarian-Bilaterian ancestor [7, 8]. Defining the mechanisms controlling Nematostella developmentwill help address these questions about the early evolutionary steps that led tobilaterian body plans with three germ layers and bilateral symmetry.
Gene regulatory networks (GRN) provide predictive models of gene regulation, as inthe several examples that now exist for normal animal development (for example,Drosophila, sea urchin [10, 11], ascidians , chick , and zebrafish ). To gain a comprehensive view of the control system, it is necessary toidentify all genes whose products make up the regulatory network. This applies toour current research efforts but is also generally applicable to studies ofvirtually any regulatory system. Advanced sequencing platforms now allow us to dothis through RNA-seq techniques. Yet, deep RNA-seq brings challenges in analysisreflecting the scale and complexity of transcriptomes, the primary problem beingadequate assembly of RNA-seq reads in order to define a reference set of gene models [15–17]. Transcriptome assembly can be achieved using a reference-based strategy,a de novo strategy or a combination of the two. The main drawback to usinga genome reference for assembly is that it relies on the quality of the referencegenome being used . This is a particular problem for emerging model systems with recentlycompleted genomes because misassemblies, poor annotation and large gaps in coverageplague the genome assemblies of all but a few of the major model systems . There is also a challenge in assigning reads that align equally well tomultiple places in the genome. The aligner must decide to either exclude these readswhich can result in gaps or to choose which alignments to retain which could lead towrong assignments or predictions of a transcript in a region that has notranscription.
A comprehensive GRN for early embryonic development in Nematostella willenable researchers to investigate the extent to which the bilaterian regulatorytoolkit is present in this representative cnidarian, down to the level of precisesignaling systems and transcription factor cis-regulatory interactions. Byharnessing the power of high-throughput sequencing and perturbation techniques, weaim to build the sea anemone GRN in an unbiased and efficient manner that will serveas a GRN construction pipeline for other model systems to follow.
The current Nematostella genome assemblies [20, 21] fall into the category of young genome models that are still incompleteand contain gaps thus making the reference-based method alone insufficient for ourneeds. Taking these and all of the above complications into account and consideringour goal to define an experimental and computational pipeline for emerging modelsystems, we elected to use the de novo assembly approach. This approachwill be especially useful for evo-devo researchers aiming to harness the power ofnext-generation sequencing to bring their research into the genomics era; a trendalready underway, for example Parhyale, Oncopeltus, sponge , and sea urchin .
The scale of reads, random and non-random sequencing errors, and inherent transcriptcomplexity due to alternate transcription start sites or splice junctions all posechallenges for de novo transcriptome assembly. Indeed, the scale of theproblem is only set to increase with the expanding capacity for transcriptomesequencing from advances in next-generation sequencing (NGS) platforms. In the lastfew years several assembly algorithms have been released to meet these challenges:Trans-ABySS , SOAPdenovo , Velvet/Oases [27, 28], and Trinity . The millions of short reads produced from NGS platforms result inmillions of overlapping sequences. Short-read de novo assemblers exploitthese overlaps to reconstruct the original transcripts by using the de Bruijn graphdata structure, which encodes overlapping k-mers as adjacent vertices.Assembly algorithms then compute paths through the de Bruijn graph that correspondto valid assemblies of the sequence reads.
The Trinity assembler breaks the sequence data into many de Bruijn graphs in order tocapture transcript complexity resulting from alternative splicing, gene duplicationsor allelic variation . Trinity consists of three modules called Inchworm, Chrysalis, andButterfly. Inchworm assembles the RNA-seq reads into transcripts and reports onlythe unique portions of alternate transcripts. Chrysalis clusters the Inchwormcontigs so that each cluster represents all known transcripts variants for each geneor related genes and then constructs De Bruijn graphs for each cluster. All readsare segregated to one of these separate graphs. Butterfly then processes theseseparate graphs in parallel by tracing a path through each one and reports fulllength transcripts separately for alternate splice forms and paralogs. The Oasesassembler uploads a preliminary assembly created by Velvet, which was originallydesigned for genome assembly. Oases corrects this assembly using a range ofk- mers to create separate assemblies, which are then combined into one.The longer k-mers perform better on high expression transcripts and theshorter k- mers have an advantage on low expression transcripts . While the multiple k-mer approach has been found to result inan increase of longer transcripts, it can also lead to an accumulation of incorrectassemblies or artificially fused transcripts .
In this study we designed a next-generation sequencing and analysis pipeline toproduce a minimally biased and quantitative reference transcriptome. The resultingtranscriptome represents the first 24 h of Nematostella developmentand will be the basis for further gene regulatory network studies. The experimentaland computational pipeline will be used by us and others to produce transcriptomesfor other model systems, particularly those evo-devo models that do not yet have anannotated genome but would benefit from an in depth molecular analysis.
Nematostella vectensis adults following normal culture at 18°Cwere spawned with a 9-h cycle of light at 25°C in an incubator. Maleand female spawning adults were in separate bowls and egg sacs were removed to afresh bowl and fertilized with sperm from male bowls for 10 minutes. Theegg sacs were then dejellied with a 4% cysteine solution (pH 7.4) in 50%filtered sea water (FSW) for 8 minutes and rinsed five times with 50% FSW.All embryo processing was performed in an 18°C room and the embryos werecultured from the time of fertilization for 0, 6, 12, 18 or 24 h (fivetimepoints). An additional 24-h sample was prepared in the same way from aseparate spawning event. Cultured embryos were transferred to an eppendorf tube,allowed to settle, gently spun to a pellet and the supernatant removed,approximately 600 embryos per sample. The embryo pellet was immediately immersedin 100 μl of lysis buffer from the Invitrogen mRNA DIRECT kit(Invitrogen, Life Technologies, Grand Island, NY, USA) and homogenized with aKontes Pellet Pestle (distributed by Thermo Fisher Scientific, Pittsburgh, PA,USA) attached to a 12 V/700 rpm drill. Another 100 μl oflysis buffer was used to rinse the Kontes Pellet Pestle tip and collected in thesame tube. Samples were then stored at −80°C until all timepoints hadbeen collected.
To thawed lysates, a third aliquot of 100 μl of lysis buffer was addedand then the normal protocol for the Invitrogen mRNA DIRECT kit was followedusing 50 μl Dynabeads per sample and low adhesion microcentrifugetubes, following the manufacturer’s recommendations. The mRNA yields werebetween 108 ng and 344 ng total per sample. The mRNA was used asstarting material for the ScriptSeq V.1 kit from Epicentre (EpicentreBiotechnologies, Madison, WI, USA). A total of 9.0 μl of mRNAcorresponding to between 74 ng to 233 ng per sample was combined with1.0 μl of a 1:10 dilution of External RNA Controls Consortium (ERCC)spike-in control RNA for the first reaction (available from Invitrogen/LifeTechnologies). The protocol was followed exactly, using 12 cycles total ofPCR in the amplification step with Phusion High Fidelity polymerase (availablefrom Therm Scientific) and barcoded Illumina-compatible primers 1 to 6 fromEpicentre.
The libraries were size selected with a 2% Pippin prep gel (from Sage Science,Beverly, MA, USA) for 450 bp and checked on a Agilent 2100 Bioanalyzer witha high sensitivity DNA chip (from Agilent Technologies, Santa Clara, CA, USA)and then by qPCR. The samples were combined and run on a single lane of theIllumina High Seq 1000 with version III chemistry with 200 cycles of pairedend sequencing plus indexing reads. All raw read files are available on theWoods Hole Data Archive at http://hdl.handle.net/1912/5613, DOI[DOI:http://10.1575/1912/5613].
Quality control was implemented using a combination of Bowtie2 (version2.0.0-beta6), Basic Local Alignment Search Tool (BLAST), btrim (build date 9September 2011), and the FASTX-Toolkit. First, we computed overrepresentedk-mers in the raw sequence data and ran BLAST against the setof non-redundant (nr) sequences from the National Center for BiotechnologyInformation (NCBI). The top several BLAST hits were analyzed for homologywith ribosomal or mitochondrial RNA. Using the BLAST data, we constructed acatalog of rRNA and mtRNA sequences which could serve as a reference set tofilter the overly abundant non-protein coding RNA prior to de novotranscriptome assembly. A Bowtie2 index was built from the rRNA and mtRNAsequences of Montastraea franksi, Savalia savaglia,Actiniaria, Nematostella vectensis, andClathrina. Sequence reads successfully aligning by Bowtie2 tothis set were removed. The overrepresented k-mers also containedadapter sequences that remained in the sequence. We retrieved the exactIllumina adapter sequences and used the software tool btrim to clipadapters.
Sequence reads demonstrating low complexity (containing only one or twounique bases) are likely due to technical artifacts and were removed. Next,the GC content distribution was computed for the set of all reads. GCcontent biases in the first 13 bases of Illumina RNA-seq data are known toexist due to random hexamer priming . This bias may cause an imbalance in read coverage and persistthrough the assembly process, which can affect the quality of assembly andquantification levels. Because we had extremely high sequence coverage, weremoved the bias by simply trimming the start of the reads. Using theFASTX-Toolkit, we removed the initial 13 bases from the reads at eachtimepoint. Finally, btrim was also used to adaptively trim low quality basesfrom the end of the read. Adaptive trimming is performed by sliding a windowof 5 bp from the end of the read to the start, removing bases andshifting the sliding window by 1 base if the average quality score is lessthan 30 until the average quality score is at least 30.
Digital normalization, Velvet and Oases
Digital normalization is a method to reduce the total number of reads to beassembled, thereby also reducing the computing power and time required forassembly. It preferentially removes high abundance reads but retains readcomplexity in order to remove errors but preserve low abundance transcriptsprior to assembly. All links to digital normalization software are availableelectronically through http://ged.msu.edu/papers/2012-diginorm/.Raw paired-end read files were first interleaved into pairs using a pythonscript, available athttp://github.com/ged-lab/khmer/tree/2012-paper-diginorm/sandbox.Then, three rounds of digital normalization were applied to removeoverabundant and erroneous reads. These depend on the khmer softwarepackage, available at http://github.com/ged-lab/khmer/. The khmersoftware also relies on the screed package for loading sequences, availableat http://github.com/ged-lab/screed/ (khmer and screed are©2010 Michigan State University, and are free software available fordistribution, modification, redistribution under the BSD license). Thedigital normalized files were assembled with Velvet (version 1.2.03) andOases (version 0.2.06). The details of the execution commands are availablein Additional file 1. The most currentrecommendations for use of digital normalization for de novotranscriptome assembly recommend using only one round of normalizationinstead of three. Fewer low abundance transcripts may be lost by foregoingfurther rounds of digital normalization at the expense of increasedcomputing time and power to assemble the greater number of remainingreads.
Trinity assembly and quantification
The 20 August 2011 release of the Trinity pipeline was run on the readsremaining after quality control(http://trinityrnaseq.sourceforge.net/). We ran Trinity withthe options to use eight CPU cores and the RF library type to reflect thedirectionality of the sequence reads (full execution commands are given inAdditional file 2). The assembled transcriptome isavailable on the Woods Hole Data Archive at:http://hdl.handle.net/1912/5613,[DOI:10.1575/1912/5613]. Bowtie was then used to alignthe post quality control sequence reads to the transcriptome assembly andthe ERCC spike-in control sequences. We then computed (1) the set ofconcordant paired-end mapped sequence pairs and (2) the set of all mappedsequences for both the transcriptome and the ERCC controls. Fragments perkilobase of exon per million fragments mapped (FPKM) values were computedfor the transcriptome assembly transcripts and the ERCC controls using RSEM(version 1.2.0). To quantify the expression of transcripts in terms ofmolecules we computed the dose–response curve by plotting FPKM valuesversus the known concentration of each ERCC spike-in for each timepoint. Aset of ordinary least squares (Additional file 3)and robust linear regressions (Figure 1)  were computed, and we observed that the set of concordant mappedreads yielded higher R2 (Table 1) thanthe set of all mapped reads, and thus, we used the concordant mapped readsfor downstream analyses. Using the fitted line, we inferred the number ofmolecules present for each Trinity assembled transcript, in eachtimepoint.
To compute overexpressed GO terms in our transcriptome, we used BLASTx2.2.26+, BLOSUM62 similarity matrix, Blast2GO database version August 2011,and pipeline B2G4Pipe version 2.3.5. The definition of each GO term isdetermined by the GO Consortium: http://www.geneontology.org/,and can be found using the EMBL-European Bioinformatics Institute QuickGO:http://www.ebi.ac.uk/QuickGO/, or the Gene Ontology NormalUsage Tracking System, GONUTS:http://gowiki.tamu.edu/wiki/index.php/Main_Page. Definitionsfor all GO terms presented in this paper can be found in Additional file4.
Library preparation and quality control of reads
In theory, RNA-seq can catalog all expressed transcripts as complete mRNAsequences. To determine the set of transcripts expressed from fertilization togastrulation, we chose to sample five timepoints during the first 24 h ofNematostella development: 0, 6, 12, 18, and 24 h afterfertilization (Figure 2A). First, 600 embryos pertimepoint were immediately homogenized and lysed with lysis buffer from theInvitrogen mRNA DIRECT kit and stored at −80°C until all samples hadbeen collected. Lysing immediately was important, as freezing the embryos as apellet results in significant RNA degradation (Antje Fischer, Marine BiologicalLaboratory, Woods Hole, MA, USA, personal communication). Next, mRNA waspositively selected with the Invitrogen mRNA DIRECT kit, which is a magneticbead-based method. We tried an alternate mRNA enrichment method with total RNAextraction combined with negative selection for ribosomal RNAs, but the yieldswere too low from total RNA extraction with a Qiagen total RNA kit even for usewith the low input version of the RiboZero kit from Epicentre. This is probablydue to low RNA levels or difficult to extract RNA (not an uncommon problem inembryo systems) in Nematostella embryos, so for this step in thepipeline the alternate negative selection method may work better for speciessuch as Xenopus, which have large amounts of RNA in their eggs. Ourresearch group has more recently used this alternate method successfully fordeveloping embryos of the slipper snail, Crepidula fornicata.
The polyA-RNA enriched sample was then processed with the Script Seq kit, version1, from Epicentre. The main advantages of this kit are the resulting directionalreads, short preparation time (4 h), and low input requirements (as low as50 ng mRNA). The adapter-ligated libraries were then size selected foruniformity at 450 bp using a Pippin Prep gel electrophoresis apparatus fromSage Science and combined in one lane on an Illumina HiSeq 1000 to produce 2× 100 bp paired-end reads.
The sequencing run produced 238.5 million total raw reads (1.19E + 08 pairs;Table 2), yielding far more than 20 GB ofdata. Quality control was then implemented in two phases (Figure 2B). The first phase removes adapter sequence contaminationand ribosomal and mitochondrial RNA sequence. The second phase filters lowcomplexity artifacts that may have resulted from technical failures duringsequencing, removes low quality bases from the ends of reads, and trims theGC-content bias sequence bases from the start of the each read (see Methodssection for more details). Both quality control phases may remove reads entirelyor a subset of bases. If the length of a read is less than the k-merlength of the de novo assembler it cannot be used for assembly and isremoved from the read set. These filters may therefore result in read fragmentsthat have one of the paired reads removed while the other passes qualitycontrol. The raw sequence dataset contains 24 billion bases in 119 millionpaired directional sequence reads. After quality control phase 1 (QC phase 1),71.7% of the bases and 70.3% of the paired sequence reads remained(Table 2). Phase 1 removed one of the pairs from1.3 million fragments effectively introducing unpaired sequences into the readset. After QC phase 2, 59.1% of the original sequence bases remained and 67.5%of the original paired sequence reads remained. A total of 5.4 million unpairedsequences remained after QC phase 2.
De novo assembly with Trinity assembler
Assembling a transcriptome from short reads is computationally challenging andseveral methods exist for assembly using an annotated genome as a reference,assembling the reads de novo without a genome reference, or acombination of the two. Due to the aforementioned difficulties with using thecurrent Nematostella genome for assembly, we chose to compare twoalternate pipelines which both use de novo assembly exclusively. Thefirst uses the Trinity platform, which has been shown to recover morefull-length transcripts across a range of levels at a sensitivity levelcomparable to assemblers that use a genome reference [33, 34]. Additionally, Trinity can recognize alternate splice forms asbelonging to the same gene and keep them together with the same prefix. Sequencereads that passed quality control were assembled into 119,911 contigs by Trinity(Table 3); 14.85% of assembled contigs were morethan 1,000 bp long (Figure 3). The totalalignment rate for the Trinity assembly was 85.90% (Table 4).
Determining adequate coverage of transcriptomes is more challenging thandetermining coverage of genomes because every transcript species (includingsplice variants or those using alternate transcription start sites) is presentat a different level across a large range. Several studies have used anindependent assembly of randomly selected subsets of their reads to compare therate of new transcript discovery, determine the lower abundance limit ofdetection and compare the average length of isotigs. While analyzing the seaurchin embryonic transcriptome, Tu et al. compared a 20 M readsubset with a 2 M read subset and a 0.2 M read subset, and found that20 M reads were sufficient to reliably detect levels of transcripts at 400molecules/embryo, which they estimate as the lower limit for proteins ofdevelopmental significance (such as transcription factors, which may befunctionally relevant at levels as low as 10 copies of transcript per cell) . In their analysis of the milkweed transcriptome, Ewen-Campen etal. created eight subsets of reads, assembled them separately and usedBLASTx to compare gene discovery rates . They found that the rate of new transcript discovery plateaued at1.5 M reads, although the N50 isotig length continued to increase whenusing 2 M to 17 M read subsets. After 2 rounds of quality controlfiltering of our reads, we were left with 80,537,812 paired and 5,362,854unpaired reads, a depth which has been shown to produce good sensitivity inthese other systems for identifying all protein-coding transcripts expressed inthe early embryo.
To restate, these previous studies indicate that with the volume of reads comingoff the latest Illumina platforms (250 million to 400 million reads/lane), andonly multiplexing 6 samples in a lane, we should be beyond the necessarycoverage to represent all relevant regulatory transcripts. The best indicationof the quality of our assembly is that we have been able to use it as areference to map reads from more recent experiments in our lab at a median 93%rate (with 90% of samples mapping 90% of their reads to the Trinity-assembledreference).
Digital normalization followed by Oases assembly
To evaluate more closely the quality of our de novo transcriptomeassembly, we compared Trinity with an alternate strategy that combines a digitalnormalization step  with the assemblers Velvet  and Oases . Digital normalization is a computational normalization method thatpreferentially removes high abundance reads but retains read complexity in orderto remove errors and preserve low abundance transcripts prior to assembly. Thequality controlled reads were assembled using Velvet and Oases and then mappedback to the resulting assembly (commands in Additional file 1). The main advantage of this method is it greatly decreases thecomputing power and time required to process millions of reads. We also testedthe Amazon Elastic Cloud Computing Service (Amazon EC2),http://aws.amazon.com/ec2/, to perform the digital normalizationand Velvet-Oases assembly. This method proves to be a great alternative to usinga home institution’s core computers if the institution does not havesufficient computing power for running an assembler or if the computers areexpensive to rent, slow or unreliable. Whereas the computation of Trinityrequired over 50 h and 100 GB of RAM in addition to the qualitycontrol steps, the pipeline using digital normalization, Velvet and Oases canall be run in a single day using an XL computer rented from Amazon. We foundthat this alternative assembly approach gave competitive results when mappingour Nematostella vectensis RNA-seq reads. The overall mapping successrate was 82.17% for digital normalization-Oases as compared to 85.90% forTrinity (Table 4).
Quantification of Trinity-assembled transcriptome using known RNA‘spike-ins’
A key component of our de novo transcriptome pipeline is a method toobtain absolute quantitative information for each transcript using externalstandards or ‘spike-ins’, that is, control RNAs of knownconcentration. Before absolute quantification of inferred transcripts can beperformed, the dynamic range and transcript detection limits must be evaluatedusing spike-ins. For this we employed the ERCC RNA spike-in set as recommendedby the National Institute of Standards and Technology (NIST) [35–37]. First, we computed the properly mapped read alignments from qualitycontrolled sequence read pairs to the ERCC spike-in reference sequences for eachembryological timepoint. Quantification of spike-in transcripts was thenperformed using RSEM . Read alignments that were not concordant with directionalityconstraints were not considered for quantification. We then compared the knownconcentration of each ERCC spike-in transcript to the RSEM calculated FPKMvalues. The dose–response curve for each timepoint containing the knownconcentrations and FPKM values for the spike-ins were plotted. We computed anordinary least squares (Additional file 3) and robustlinear regression to determine the best fit (Figure 1) . The robust linear regression provided a larger detection range andwas used to compute absolute quantification for the assembled transcripts. Intotal, we computed 9,516 transcripts expressed above 0 molecules per embryo and8,154 transcripts expressed above 100 molecules per embryo (Figure 4). When restricting the set of transcripts to those with atleast 1 BLAST hit against nr, we observed 6,169 transcripts expressed above 100molecules per embryo (Table 3).
Blast2GO analysis reveals genes involved in gene regulation
We used Blast2GO to quantify how many transcripts from the Trinity transcriptomefell into well defined gene ontology (GO) categories over all time periodssampled . A sample of the top overexpressed GO terms computed usingFisher’s exact test from the topGO package version 2.10.0 for R  are visualized in Figure 5A. GO termsbelong to a top level designation of biological process (BP), cellular component(CC), or molecular function (MF) where the titles of the GO terms are inreference to the top level designation; for example, ‘nucleus’refers to the location of the gene product in the nucleus while ‘geneexpression’ refers to a gene product involved in the process of convertinggene sequence into RNA or proteins. Definitions for all of the GO terms inFigure 5 can be found in Additional file 3. In order to understand how many transcripts arepotentially a part of the embryonic gene regulatory control system, we focusedon terms enriched for transcription factors and signaling pathway components(Figure 5B). Nearly 1,000 transcripts combinedfell into the 2 transcription factor categories while nearly 1,500 transcriptscombined to make up signaling molecules, their receptors, modulators andtransducers. Taken together, these 2,500 transcripts provide an estimate of thenumber of regulatory factors (transcription factors, ligands, receptors,modulators and transducers) present in the Nematostella developmentalgene regulatory network.
Transcript family analysis
The transcripts inferred by the Trinity assembler were tested for sequencesimilarity with known genes. Specifically, the sequence of each transcript wastranslated and BLASTx was run using the NCBI non-redundant RefSeq database ofprotein sequences (nr). We combined sequences sharing at least 1 BLAST resultwith an e-value <5e-5 and a percent identity >80% into a transcriptfamily. The e-value threshold filtered out low confidence BLAST results, whilethe percent identity filter requires a large percentage of the transcript tomatch. We systematically tested multiple identity thresholds and found that 80%sequence match yielded a realistic transcript family distribution. Given thenatural tradeoff between identification of transcript variants versus paralogousgenes, overly stringent requirements of similarity result in anunderrepresentation of true homologous relationships; conversely, whensimilarity thresholds are set too low, distinct transcripts are erroneouslygrouped together. Because the assembled Nematostella transcripts arealso included in nr, the sequence similarity threshold is required to be sethigh. The transcript families therefore represent a mixture of fully assembledgene transcripts, pieces of transcripts, paralogs, or multiple splice forms,entirely compatible with our overarching goals. For BLAST analysis with sequencedatabases of very different sizes (as in our case, nr vs Nematostella)e-values are not an appropriate measure of similarity; in this case a similaritythreshold is more informative. Thus, at a similarity threshold of 80% we wereable to annotate 48,235 transcripts from the nr database, compared to only47,193 transcripts when using the Nematostella genome alone(Table 3). To do transcript family analysis weused the BLAST hits from the nr database. We inferred 13,293 total transcriptfamilies from the 61,835 assembled transcripts with at least 1 BLAST hit. Whenrestricting the transcript family analysis to transcripts expressed over 100molecules per embryo (MPE) and at least 1 BLAST hit (MPE >100, ≥1 BLASThit) across the 5 timepoints, we observed a total of 4,055 transcript familiesfrom the 6,169 transcripts passing the filter (Table 3). These computations likely represents an underestimate of thetrue number of genes expressed due to an inability to assemble very lowlyexpressed transcripts and, in a few cases, grouping paralogous genestogether.
As an example of using the quantitative transcriptome for transcript familyanalysis, we located the transcripts corresponding to Notch as identified byBLAST. There were three transcripts in this transcript family, one of which istoo short and too lowly expressed to be relevant, while the other two are nearlyidentical and apparently full length. As shown in Figure 6, total Notch molecules per embryo increased from virtually zerocopies at 0 h and 6 h, reflecting little or no maternal and earlyzygotic expression, to significant levels by 12 h, and thence to 1,000copies at 24 h. A previous study has shown that Notch protein can be seenby in situ as early as 20 h, however, that same study onlydetected Delta-Notch signaling pathway function at later developmental stages . We also see low levels of a putative Delta-like ligand, though wecannot conclude whether it is expressed at functional levels, nor can we makethe simplistic conclusion that because both ligand and receptor are present thesignaling pathway is functional. Rather our data suggest further investigationis merited, as also stated in Röttinger et al. , the most systematic study of early Nematostellaendomesoderm specification to date.
An estimation of the percentage of the genome transcribed during these timeperiods was computed by taking the length of the longest transcript in eachtranscript family and dividing that by the length of the genome taken from theestimate in the Nematostella genome paper (450 Mbp) . An important caveat is that this is likely an underestimate becausesome transcripts are not full length. The percent of the genome transcribedabove 100 molecules per embryo according to this calculation is 0.368%. Theaverage transcript length for all assembled transcripts is 622.53 bp. Whenonly taking transcripts expressed after spike-in control correction above 0molecules per embryo, the average transcript length is 456.18 bp. Bytimepoint, the average of assembled and expressed transcripts is: (0 h)479.95 bp, (6 h) 487.82 bp, (12 h) 490.54 bp,(18 h) 494.03 bp, and (24 h) 441.29 bp.
The goals of the project discussed in this paper were (1) to identify all theprotein-coding genes expressed during the first 24 h of Nematostellavectensis development and in so doing (2) to detail a modern,cost-effective, efficient and quantitative series of experimental and computationalmethods that together make up a transcriptome pipeline for non-model organisms. Akey component of our pipeline is the inclusion of NIST RNA spike-in standards forquantification. This allows us to get around the problem of normalizing data toestimate gene expression levels , and provides an absolute measure of transcript abundance per embryo.
Many evolutionary developmental biology ‘evo-devo’ research projects haverevealed candidate genes in non-model organisms leading to intriguing hypothesesregarding the conservation, or conversely, invention of pathways controllingdevelopment [5, 44, 45]. However, to answer the questions these hypotheses have generated, it isnot only the gene homology, presence, absence or spatial localization that needs tobe known. To say that a developmental program or subcircuit has been conserved orevolved in a specific way, the cis-regulatory network connections betweenall the regulatory genes involved must be at a minimum known and validated.Candidate genes identified from BLAST analysis will typically only make up a smallfraction of the regulatory genes in any pathway. With the advent of next-generationsequencing platforms, identifying all the protein coding genes expressed at anygiven time during embryonic development is now within the reach of any model systemwhere embryos can be acquired. The lack of a sequenced, annotated genome is nolonger a major setback to GRN analysis.
The use of polyadenylated spike-in RNAs provides quantitative information on theabsolute abundance of transcripts per embryo. It is important to note the differencebetween this method of standardization and normalization approaches. The ERCCspike-ins allow us to build a standard curve, in our case a 92-point standard curve.As the quantities of the spike-ins are known, this allows us to infer from thestandard curve absolute quantities. Note that since spike-ins are added at thebeginning of the library preparation procedures, any variation in preparationefficiencies (that is, technical noise) is in theory accounted for by the spike-ins.Thus, even without absolute quantitation, the use of spike-ins allows directcomparison between samples without the distorting effects of normalization tominimize the effects of technical variation. Further, quantitation by spike-ins alsoallows us to know the limits of our ability to detect and quantify lowly expressedtranscripts. Since low expressed transcripts account for many of the problems inbioinfomatics analysis, our 100 molecules per embryo cut-off allows us to focus ouranalysis on those transcripts expressed at biologically relevant levels which arealso within the linear range of our standard curve. Increasing the sequencing depthand being less conservative with mapping stringency could improve our ability toquantify these lowly expressed transcripts.
This transcriptome pipeline is part of a larger GRN construction pipeline that we arein the process of defining empirically. A visualization of the proposed workflow forconstructing a GRN starting from a sequenced and assembled transcriptome is shown inFigure 7. The transcriptome is the starting point andfoundation of the GRN because it represents all the transcripts present in the scopeof the network. The next datasets to be produced are: a high-density, quantitativeRNA-seq timecourse which will be mapped to the full transcriptome, for the purposeof high resolution covariance analysis; a ‘Perturbation-seq’ datasetwhere RNA-seq is used on embryos treated with drugs against components of majorsignaling pathways; and a genome-wide sequencing-based search forcis- regulatory elements using either FAIRE-seq  or DNase I hypersensitivity. A custom computational comparison of thesedatasets will produce an interactome with clusters representing transcripts thatchange together from timepoint to timepoint or after a perturbation. More sensitiveinvestigation of spatial expression, coexpression and perturbation expression (aftermorpholino treatment) will take the interactome to the level of a preliminary GRN.Finally, cis- regulatory analysis using bacterial artificial chromosome(BAC) recombination to evaluate subcircuit function will produce a verified GRN withpredictive power.
The embryo of the sea anemone Nematostella vectensis provides an importantevo-devo model for understanding early animal development, particularly in relationto the question of how initial patterns of differential gene expression emerge alongorthogonal body axes. Given Nematostella’s position among cnidariansand the molecular evidence thus far, it is possible that a bilaterally-symmetricpattern formation network stretches back to before the Cambrian to a time precedingthe Cnidaria-Bilateria bifurcation. However, to make this argument we need amechanistic understanding of early development in both cnidarian and canonicalbilaterian models systems. Moreover, in light of the compare-and-contrast nature ofthese studies, we need to move away from a candidate gene approach as such methodsclearly bias towards the ‘discovery’ of similarities as opposed todifferences between regulatory networks. With the advent of genomics, we can nowattempt exhaustive de novo approaches to define regulatory networks, thoughchallenges in handling RNA-seq data sets still exist. In this report, we haveundertaken preliminary steps in defining the Nematostella gene regulatorynetwork for early pattern formation by building a comprehensive model of geneexpression through 24 h of development. This quantitative referencetranscriptome will help us identify, in a minimally biased manner, the most relevantgenes to the pattern formation control system. The regulatory network for patternformation in Nematostella will in turn provide a powerful basis forcomparison with early development networks from canonical bilaterians.
In summary, we have presented our quantitative reference transcriptome forNematostella vectensis early embryogenesis, which is available on theWoods Hole Data Archive at http://hdl.handle.net/1912/5613[DOI:10.1575/1912/5613]. Additionally, our de novotranscriptome pipeline, based on the Trinity assembler, has been designed to meetthe needs of the evo-devo community.
External RNA Controls Consortium
fragments per kilobase of exon permillion fragments mapped
filtered sea water
gene regulatory network
next generation sequencing
National Institute ofStandards and Technology
non-redundant BLAST database
(quantitative)polymerase chain reaction.
Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sørensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G: Broad phylogenomic sampling improves resolution of the animal tree oflife. Nature. 2008, 452: 745-749. 10.1038/nature06614.
Hejnol A, Obst M, Stamatakis A, Ott M, Rouse GW, Edgecombe GD, Martinez P, Baguna J, Bailly X, Jondelius U, Wiens M, Muller WEG, Seaver E, Wheeler WC, Martindale MQ, Giribet G, Dunn CW: Assessing the root of bilaterian animals with scalable phylogenomicmethods. Proc R Soc B. 2009, 276: 4261-4270. 10.1098/rspb.2009.0896.
Mallatt J, Craig CW, Yoder MJ: Nearly complete rRNA genes assembled from across the metazoan animals:effects of more taxa, a structure-based alignment, and paired-sitesevolutionary models on phylogeny reconstruction. Mol Phylogenet Evol. 2010, 55: 1-17. 10.1016/j.ympev.2009.09.028.
Martindale MQ, Pang K, Finnerty JR: Investigating the origins of triploblasty: “mesodermal” geneexpression in a diploblastic animal, the sea anemone Nematostellavectensis (phylum, Cnidaria; class, Anthozoa). Development. 2004, 131: 2463-2474. 10.1242/dev.01119.
Wikramanayake AH, Hong M, Lee PN, Pang K, Byrum CA, Bince JM, Xu R, Martindale MQ: An ancient role for nuclear beta-catenin in the evolution of axial polarityand germ layer segregation. Nature. 2003, 426: 446-450. 10.1038/nature02113.
Fritzenwanker JH, Saina M, Technau U: Analysis of forkhead and snail expression reveals epithelial-mesenchymaltransitions during embryonic and larval development of Nematostellavectensis. Dev Biol. 2004, 275: 389-402. 10.1016/j.ydbio.2004.08.014.
Matus DQ, Thomsen GH, Martindale MQ: Dorso/ventral genes are asymmetrically expressed and involved in germ-layerdemarcation during cnidarian gastrulation. Curr Biol. 2006, 16: 499-505. 10.1016/j.cub.2006.01.052.
Rentzsch F, Anton R, Saina M, Hammerschmidt M, Holstein TW, Technau U: Asymmetric expression of the BMP antagonists chordin and gremlin in the seaanemone Nematostella vectensis: implications for the evolution ofaxial patterning. Dev Biol. 2006, 296: 375-387. 10.1016/j.ydbio.2006.06.003.
Stathopoulos A, Levine M: Genomic regulatory networks and animal development. Dev Cell. 2005, 9: 449-462. 10.1016/j.devcel.2005.09.005.
Davidson EH, Levine MS: Properties of developmental gene regulatory networks. Proc Natl Acad Sci USA. 2008, 105: 20063-20066. 10.1073/pnas.0806007105.
Davidson EH: The Regulatory Genome. 2006, London, UK: Academic Press
Imai KS, Levine M, Satoh N, Satou Y: Regulatory blueprint for a chordate embryo. Science. 2006, 312: 1183-1187. 10.1126/science.1123404.
Betancur P, Bronner-Fraser M, Sauka-Spengler T: Assembling neural crest regulatory circuits into a gene regulatorynetwork. Annu Rev Cell Dev Biol. 2010, 26: 581-603. 10.1146/annurev.cellbio.042308.113245.
Swiers G, Patient R, Loose M: Genetic regulatory networks programming hematopoietic stem cells anderythroid lineage specification. Dev Biol. 2006, 294: 525-540. 10.1016/j.ydbio.2006.02.051.
Tu Q, Cameron RA, Worley KC, Gibbs RA, Davidson EH: Gene structure in the sea urchin Strongylocentrotus purpuratus based ontranscriptome analysis. Genome Res. 2012, 22: 2079-2087. 10.1101/gr.139170.112.
Martin J, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.
Brown CT, Howe A, Zhang Q, Pyrkosz A, Brom T: A reference-free algorithm for computational normalization of shotgunsequencing data. [http://arxiv.org/abs/1203.4802],
Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95: 315-327. 10.1016/j.ygeno.2010.03.001.
Salzberg SL, Yorke JA: Beware of mis-assembled genomes. Bioinformatics. 2005, 21: 4320-4321. 10.1093/bioinformatics/bti769.
Ryan JF, Burton PM, Mazza ME, Kwong GK, Mullikin JC, Finnerty JR: The cnidarian-bilaterian ancestor possessed at least 56 homeoboxes: evidencefrom the starlet sea anemone, Nematostella vectensis. Genome Biol. 2006, 7: R64-10.1186/gb-2006-7-7-r64.
Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, Jurka J, Genikhovich G, Grigoriev IV, Lucas SM, Steele RE, Finnerty JR, Technau U, Martindale MQ, Rokhsar DS: Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomicorganization. Science. 2007, 317: 86-94. 10.1126/science.1139158.
Zeng V, Villanueva KE, Ewen-Campen BS, Alwes F, Browne WE, Extavour CG: De novo assembly and characterization of a maternal anddevelopmental transcriptome for the emerging model crustacean Parhyalehawaiensis. BMC Genomics. 2011, 12: 581-10.1186/1471-2164-12-581.
Ewen-Campen B, Shaner N, Panfilio KA, Suzuki Y, Roth S, Extavour CG: The maternal and early embryonic transcriptome of the milkweed bugOncopeltus fasciatus. BMC Genomics. 2011, 12: 61-10.1186/1471-2164-12-61.
Conaco C, Neveu P, Zhou H, Arcila ML, Degnan SM, Degnan BM, Kosik KS: Transcriptome profiling of the demosponge Amphimedon queenslandicareveals genome-wide events that accompany major life cycle transitions. BMC Genomics. 2012, 13: 209-10.1186/1471-2164-13-209.
Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu A-L, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short readsequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.
Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijngraphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.
Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range ofexpression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-seq data without a referencegenome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.
Zhao Q-Y, Wang Y, Kong Y-M, Luo D, Li X, Hao P: Optimizing de novo transcriptome assembly from short-read RNA-seqdata: a comparative study. BMC Bioinformatics. 2011, 12 (Suppl 14): S2-10.1186/1471-2105-12-S14-S2.
Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamerpriming. Nucleic Acids Res. 2010, 38: e131-e131. 10.1093/nar/gkq224.
Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, Berlin, Germany: Springer Verlag
Ren X, Liu T, Dong J, Sun L, Yang J, Zhu Y, Jin Q: Evaluating de bruijn graph assemblers on 454 transcriptomic data. PLoS ONE. 2012, 7: e51188-10.1371/journal.pone.0051188.
Vijay N, Poelstra JW, Künstner A, Wolf JBW: Challenges and strategies in transcriptome assembly and differential geneexpression quantification. A comprehensive in silico assessment ofRNA-seq experiments. Mol Ecol. 2013, 22: 620-634. 10.1111/mec.12014.
Baker SC, Bauer SR, Beyer RP, Brenton JD, Bromley B, Burrill J, Causton H, Conley MP, Elespuru R, Fero M, Foy C, Fuscoe J, Gao X, Gerhold DL, Gilles P, Goodsaid F, Guo X, Hackett J, Hockett RD, Ikonomi P, Irizarry RA, Kawasaki ES, Kaysser-Kranich T, Kerr K, Kiser G, Koch WH, Lee KY, Liu C, Liu ZL, Lucas A: The External RNA Controls Consortium: a progress report. Nat Methods. 2005, 2: 731-734. 10.1038/nmeth1005-731.
Devonshire AS, Elaswarapu R, Foy CA: Evaluation of external RNA controls for the standardisation of geneexpression biomarker measurements. BMC Genomics. 2010, 11: 662-10.1186/1471-2164-11-662.
External RNA Controls Consortium: Proposed methods for testing and selecting the ERCC external RNA controls. BMC Genomics. 2005, 6: 150
Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-seq data with or without areference genome. BMC Bioinformatics. 2011, 12: 323-10.1186/1471-2105-12-323.
Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis infunctional genomics research. Bioinformatics. 2005, 21: 3674-3676. 10.1093/bioinformatics/bti610.
Alexa A, Rahnenfuher J: topGO: enrichment analysis for gene ontology. [http://www.bioconductor.org/packages/2.11/bioc/html/topGO.html]
Marlow H, Roettinger E, Boekhout M, Martindale MQ: Functional roles of Notch signaling in the cnidarian Nematostellavectensis. Dev Biol. 2012, 362: 295-308. 10.1016/j.ydbio.2011.11.012.
Röttinger E, Dahlin P, Martindale MQ: A framework for the establishment of a Cnidarian gene regulatory network for“endomesoderm” specification: the inputs of β-catenin/TCFsignaling. PLoS Genet. 2012, 8: e1003164-10.1371/journal.pgen.1003164.
Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNA-seq: a matter of depth. Genome Res. 2011, 21: 2213-2223. 10.1101/gr.124321.111.
Yamada A, Pang K, Martindale MQ, Tochinai S: Surprisingly complex T-box gene complement in diploblastic metazoans. Evol Dev. 2007, 9: 220-230. 10.1111/j.1525-142X.2007.00154.x.
Adamska M, Larroux C, Adamski M, Green K, Lovas E, Koop D, Richards GS, Zwafink C, Degnan BM: Structure and expression of conserved Wnt pathway components in thedemosponge Amphimedon queenslandica. Evol Dev. 2010, 12: 494-518. 10.1111/j.1525-142X.2010.00435.x.
Giresi PG: Chromatin profiles of human cells in health and disease using FAIRE. 2012, Ann Arbor, MI: ProQuest
We give special thanks to Liliana Florea for her valuable contributions to thedesign of the Trinity de novo assembly pipeline used in this paper. Wethank Kasia Hammar for animal care and technical assistance; Likit Preeyan and CTitus Brown for help with digital normalization and for credit on the Amazon EC2system; Antje Fischer, Rebecca Helm, Freya Goetz, and Casey Dunn for discussionson experimental protocols; and C Titus Brown also for comments on themanuscript.
The authors declare they have no competing interests.
ST and JS designed the experiment; DA and SI, in consultation with ST and JS,designed the computational pipeline for the de novo transcriptome assembly.ST performed the experiments. DA performed the Trinity assembly and computationalanalysis. ST performed the digital normalization and Oases assembly. All authorscontributed to the writing of the manuscript. All authors read and approved thefinal manuscript.
Sarah Tulin, Derek Aguiar contributed equally to this work.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Tulin, S., Aguiar, D., Istrail, S. et al. A quantitative reference transcriptome for Nematostella vectensis earlyembryonic development: a pipeline for de novo assembly in emergingmodel systems. EvoDevo 4, 16 (2013) doi:10.1186/2041-9139-4-16
- Gene regulatory network
- Nematostella embryonic development
- Body plan evolution
- Next-generation sequencing
- Illumina HiSeq