A quantitative reference transcriptome for Nematostella vectensis earlyembryonic development: a pipeline for de novo assembly in emergingmodel systems
© Tulin et al.; licensee BioMed Central Ltd. 2013
Received: 4 January 2013
Accepted: 14 March 2013
Published: 3 June 2013
The de novo assembly of transcriptomes from short shotgun sequencesraises challenges due to random and non-random sequencing biases andinherent transcript complexity. We sought to define a pipeline for denovo transcriptome assembly to aid researchers working withemerging model systems where well annotated genome assemblies are notavailable as a reference. To detail this experimental and computationalmethod, we used early embryos of the sea anemone, Nematostellavectensis, an emerging model system for studies of animal body planevolution. We performed RNA-seq on embryos up to 24 h of developmentusing Illumina HiSeq technology and evaluated independent de novoassembly methods. The resulting reads were assembled using either theTrinity assembler on all quality controlled reads or both the Velvet andOases assemblers on reads passing a stringent digital normalization filter.A control set of mRNA standards from the National Institute of Standards andTechnology (NIST) was included in our experimental pipeline to invest ourtranscriptome with quantitative information on absolute transcript levelsand to provide additional quality control.
We generated >200 million paired-end reads from directional cDNA librariesrepresenting well over 20 Gb of sequence. The Trinity assembler pipeline,including preliminary quality control steps, resulted in more than 86% ofreads aligning with the reference transcriptome thus generated.Nevertheless, digital normalization combined with assembly by Velvet andOases required far less computing power and decreased processing time whilestill mapping 82% of reads. We have made the raw sequencing reads andassembled transcriptome publically available.
Nematostella vectensis was chosen for its strategic position in thetree of life for studies into the origins of the animal body plan, however,the challenge of reference-free transcriptome assembly is relevant to allsystems for which well annotated gene models and independently verifiedgenome assembly may not be available. To navigate this new territory, wehave constructed a pipeline for library preparation and computationalanalysis for de novo transcriptome assembly. The gene modelsdefined by this reference transcriptome define the set of genes transcribedin early Nematostella development and will provide a valuabledataset for further gene regulatory network investigations.
KeywordsTranscriptome Gene regulatory network Nematostella embryonic development Body plan evolution Next-generation sequencing Illumina HiSeq Trinity Oases RNA-seq
Nematostella vectensis, the starlet sea anemone, offers many advantages asa model system for the evolution of animal developmental programs. As an anthozoancnidarian, it is strategically positioned as an outgroup to Bilateria [1–3] and is well situated to reveal the early steps in the evolution of thebilaterian body plan. Two of these evolutionary steps are likely to include theformation of a secondary body axis and a mesodermal germ layer which are bothessential, defining characteristics of a bilaterian animal. Embryonic dorsal-ventralpatterning and mesodermal development have been studied in many bilaterian modelsyet the origins of these significant body plan innovations are not well understood.Initial studies of gene expression in Nematostella and non-anthozoancnidarians have revealed that genes important to bilaterian mesoderm specificationare expressed in the endoderm of the sea anemone, and suggests that the bilaterianmesoderm may have originated from the endoderm of diploblastic ancestors [4–6]. Genes encoding factors involved in dorsal-ventral axis specification inBilaterians are likewise asymmetrically expressed in Nematostella,indicating the possibility that a secondary axis was present in theCnidarian-Bilaterian ancestor [7, 8]. Defining the mechanisms controlling Nematostella developmentwill help address these questions about the early evolutionary steps that led tobilaterian body plans with three germ layers and bilateral symmetry.
Gene regulatory networks (GRN) provide predictive models of gene regulation, as inthe several examples that now exist for normal animal development (for example,Drosophila, sea urchin [10, 11], ascidians , chick , and zebrafish ). To gain a comprehensive view of the control system, it is necessary toidentify all genes whose products make up the regulatory network. This applies toour current research efforts but is also generally applicable to studies ofvirtually any regulatory system. Advanced sequencing platforms now allow us to dothis through RNA-seq techniques. Yet, deep RNA-seq brings challenges in analysisreflecting the scale and complexity of transcriptomes, the primary problem beingadequate assembly of RNA-seq reads in order to define a reference set of gene models [15–17]. Transcriptome assembly can be achieved using a reference-based strategy,a de novo strategy or a combination of the two. The main drawback to usinga genome reference for assembly is that it relies on the quality of the referencegenome being used . This is a particular problem for emerging model systems with recentlycompleted genomes because misassemblies, poor annotation and large gaps in coverageplague the genome assemblies of all but a few of the major model systems . There is also a challenge in assigning reads that align equally well tomultiple places in the genome. The aligner must decide to either exclude these readswhich can result in gaps or to choose which alignments to retain which could lead towrong assignments or predictions of a transcript in a region that has notranscription.
A comprehensive GRN for early embryonic development in Nematostella willenable researchers to investigate the extent to which the bilaterian regulatorytoolkit is present in this representative cnidarian, down to the level of precisesignaling systems and transcription factor cis-regulatory interactions. Byharnessing the power of high-throughput sequencing and perturbation techniques, weaim to build the sea anemone GRN in an unbiased and efficient manner that will serveas a GRN construction pipeline for other model systems to follow.
The current Nematostella genome assemblies [20, 21] fall into the category of young genome models that are still incompleteand contain gaps thus making the reference-based method alone insufficient for ourneeds. Taking these and all of the above complications into account and consideringour goal to define an experimental and computational pipeline for emerging modelsystems, we elected to use the de novo assembly approach. This approachwill be especially useful for evo-devo researchers aiming to harness the power ofnext-generation sequencing to bring their research into the genomics era; a trendalready underway, for example Parhyale, Oncopeltus, sponge , and sea urchin .
The scale of reads, random and non-random sequencing errors, and inherent transcriptcomplexity due to alternate transcription start sites or splice junctions all posechallenges for de novo transcriptome assembly. Indeed, the scale of theproblem is only set to increase with the expanding capacity for transcriptomesequencing from advances in next-generation sequencing (NGS) platforms. In the lastfew years several assembly algorithms have been released to meet these challenges:Trans-ABySS , SOAPdenovo , Velvet/Oases [27, 28], and Trinity . The millions of short reads produced from NGS platforms result inmillions of overlapping sequences. Short-read de novo assemblers exploitthese overlaps to reconstruct the original transcripts by using the de Bruijn graphdata structure, which encodes overlapping k-mers as adjacent vertices.Assembly algorithms then compute paths through the de Bruijn graph that correspondto valid assemblies of the sequence reads.
The Trinity assembler breaks the sequence data into many de Bruijn graphs in order tocapture transcript complexity resulting from alternative splicing, gene duplicationsor allelic variation . Trinity consists of three modules called Inchworm, Chrysalis, andButterfly. Inchworm assembles the RNA-seq reads into transcripts and reports onlythe unique portions of alternate transcripts. Chrysalis clusters the Inchwormcontigs so that each cluster represents all known transcripts variants for each geneor related genes and then constructs De Bruijn graphs for each cluster. All readsare segregated to one of these separate graphs. Butterfly then processes theseseparate graphs in parallel by tracing a path through each one and reports fulllength transcripts separately for alternate splice forms and paralogs. The Oasesassembler uploads a preliminary assembly created by Velvet, which was originallydesigned for genome assembly. Oases corrects this assembly using a range ofk- mers to create separate assemblies, which are then combined into one.The longer k-mers perform better on high expression transcripts and theshorter k- mers have an advantage on low expression transcripts . While the multiple k-mer approach has been found to result inan increase of longer transcripts, it can also lead to an accumulation of incorrectassemblies or artificially fused transcripts .
In this study we designed a next-generation sequencing and analysis pipeline toproduce a minimally biased and quantitative reference transcriptome. The resultingtranscriptome represents the first 24 h of Nematostella developmentand will be the basis for further gene regulatory network studies. The experimentaland computational pipeline will be used by us and others to produce transcriptomesfor other model systems, particularly those evo-devo models that do not yet have anannotated genome but would benefit from an in depth molecular analysis.
Nematostella vectensis adults following normal culture at 18°Cwere spawned with a 9-h cycle of light at 25°C in an incubator. Maleand female spawning adults were in separate bowls and egg sacs were removed to afresh bowl and fertilized with sperm from male bowls for 10 minutes. Theegg sacs were then dejellied with a 4% cysteine solution (pH 7.4) in 50%filtered sea water (FSW) for 8 minutes and rinsed five times with 50% FSW.All embryo processing was performed in an 18°C room and the embryos werecultured from the time of fertilization for 0, 6, 12, 18 or 24 h (fivetimepoints). An additional 24-h sample was prepared in the same way from aseparate spawning event. Cultured embryos were transferred to an eppendorf tube,allowed to settle, gently spun to a pellet and the supernatant removed,approximately 600 embryos per sample. The embryo pellet was immediately immersedin 100 μl of lysis buffer from the Invitrogen mRNA DIRECT kit(Invitrogen, Life Technologies, Grand Island, NY, USA) and homogenized with aKontes Pellet Pestle (distributed by Thermo Fisher Scientific, Pittsburgh, PA,USA) attached to a 12 V/700 rpm drill. Another 100 μl oflysis buffer was used to rinse the Kontes Pellet Pestle tip and collected in thesame tube. Samples were then stored at −80°C until all timepoints hadbeen collected.
To thawed lysates, a third aliquot of 100 μl of lysis buffer was addedand then the normal protocol for the Invitrogen mRNA DIRECT kit was followedusing 50 μl Dynabeads per sample and low adhesion microcentrifugetubes, following the manufacturer’s recommendations. The mRNA yields werebetween 108 ng and 344 ng total per sample. The mRNA was used asstarting material for the ScriptSeq V.1 kit from Epicentre (EpicentreBiotechnologies, Madison, WI, USA). A total of 9.0 μl of mRNAcorresponding to between 74 ng to 233 ng per sample was combined with1.0 μl of a 1:10 dilution of External RNA Controls Consortium (ERCC)spike-in control RNA for the first reaction (available from Invitrogen/LifeTechnologies). The protocol was followed exactly, using 12 cycles total ofPCR in the amplification step with Phusion High Fidelity polymerase (availablefrom Therm Scientific) and barcoded Illumina-compatible primers 1 to 6 fromEpicentre.
The libraries were size selected with a 2% Pippin prep gel (from Sage Science,Beverly, MA, USA) for 450 bp and checked on a Agilent 2100 Bioanalyzer witha high sensitivity DNA chip (from Agilent Technologies, Santa Clara, CA, USA)and then by qPCR. The samples were combined and run on a single lane of theIllumina High Seq 1000 with version III chemistry with 200 cycles of pairedend sequencing plus indexing reads. All raw read files are available on theWoods Hole Data Archive at http://hdl.handle.net/1912/5613, DOI[DOI:http://10.1575/1912/5613].
Quality control was implemented using a combination of Bowtie2 (version2.0.0-beta6), Basic Local Alignment Search Tool (BLAST), btrim (build date 9September 2011), and the FASTX-Toolkit. First, we computed overrepresentedk-mers in the raw sequence data and ran BLAST against the setof non-redundant (nr) sequences from the National Center for BiotechnologyInformation (NCBI). The top several BLAST hits were analyzed for homologywith ribosomal or mitochondrial RNA. Using the BLAST data, we constructed acatalog of rRNA and mtRNA sequences which could serve as a reference set tofilter the overly abundant non-protein coding RNA prior to de novotranscriptome assembly. A Bowtie2 index was built from the rRNA and mtRNAsequences of Montastraea franksi, Savalia savaglia,Actiniaria, Nematostella vectensis, andClathrina. Sequence reads successfully aligning by Bowtie2 tothis set were removed. The overrepresented k-mers also containedadapter sequences that remained in the sequence. We retrieved the exactIllumina adapter sequences and used the software tool btrim to clipadapters.
Sequence reads demonstrating low complexity (containing only one or twounique bases) are likely due to technical artifacts and were removed. Next,the GC content distribution was computed for the set of all reads. GCcontent biases in the first 13 bases of Illumina RNA-seq data are known toexist due to random hexamer priming . This bias may cause an imbalance in read coverage and persistthrough the assembly process, which can affect the quality of assembly andquantification levels. Because we had extremely high sequence coverage, weremoved the bias by simply trimming the start of the reads. Using theFASTX-Toolkit, we removed the initial 13 bases from the reads at eachtimepoint. Finally, btrim was also used to adaptively trim low quality basesfrom the end of the read. Adaptive trimming is performed by sliding a windowof 5 bp from the end of the read to the start, removing bases andshifting the sliding window by 1 base if the average quality score is lessthan 30 until the average quality score is at least 30.
Digital normalization, Velvet and Oases
Digital normalization is a method to reduce the total number of reads to beassembled, thereby also reducing the computing power and time required forassembly. It preferentially removes high abundance reads but retains readcomplexity in order to remove errors but preserve low abundance transcriptsprior to assembly. All links to digital normalization software are availableelectronically through http://ged.msu.edu/papers/2012-diginorm/.Raw paired-end read files were first interleaved into pairs using a pythonscript, available athttp://github.com/ged-lab/khmer/tree/2012-paper-diginorm/sandbox.Then, three rounds of digital normalization were applied to removeoverabundant and erroneous reads. These depend on the khmer softwarepackage, available at http://github.com/ged-lab/khmer/. The khmersoftware also relies on the screed package for loading sequences, availableat http://github.com/ged-lab/screed/ (khmer and screed are©2010 Michigan State University, and are free software available fordistribution, modification, redistribution under the BSD license). Thedigital normalized files were assembled with Velvet (version 1.2.03) andOases (version 0.2.06). The details of the execution commands are availablein Additional file 1. The most currentrecommendations for use of digital normalization for de novotranscriptome assembly recommend using only one round of normalizationinstead of three. Fewer low abundance transcripts may be lost by foregoingfurther rounds of digital normalization at the expense of increasedcomputing time and power to assemble the greater number of remainingreads.
Trinity assembly and quantification
Spike-in standard curve R values for the ordinary least squaresregression
To compute overexpressed GO terms in our transcriptome, we used BLASTx2.2.26+, BLOSUM62 similarity matrix, Blast2GO database version August 2011,and pipeline B2G4Pipe version 2.3.5. The definition of each GO term isdetermined by the GO Consortium: http://www.geneontology.org/,and can be found using the EMBL-European Bioinformatics Institute QuickGO:http://www.ebi.ac.uk/QuickGO/, or the Gene Ontology NormalUsage Tracking System, GONUTS:http://gowiki.tamu.edu/wiki/index.php/Main_Page. Definitionsfor all GO terms presented in this paper can be found in Additional file4.
Library preparation and quality control of reads
The polyA-RNA enriched sample was then processed with the Script Seq kit, version1, from Epicentre. The main advantages of this kit are the resulting directionalreads, short preparation time (4 h), and low input requirements (as low as50 ng mRNA). The adapter-ligated libraries were then size selected foruniformity at 450 bp using a Pippin Prep gel electrophoresis apparatus fromSage Science and combined in one lane on an Illumina HiSeq 1000 to produce 2× 100 bp paired-end reads.
Quality control read attrition
Raw paired end reads
QC step 1
QC step 2
1.74E + 07
1.29E + 07
1.24E + 07
1.98E + 07
1.46E + 07
1.40E + 07
1.44E + 07
9.17E + 06
8.76E + 06
2.88E + 07
1.96E + 07
1.89E + 07
3.61E + 07
2.57E + 07
2.46E + 07
2.75E + 06
1.87E + 06
1.77E + 06
1.19E + 08
8.39E + 07
8.05E + 07
De novo assembly with Trinity assembler
No. of distinct transcripts passing filters
Total number of assembled transcripts
≥1 BLASTx hit (Nematostella vectensis), e<5e-5
≥1 BLASTx hit (non-redundant), e <5e-5
≥1 BLASTx hit (non-redundant) and 80% similarity
≥1 BLASTx hit (Nematostella vectensis) and 80%similarity
Molecules per embryo (MPE) >100
MPE >100, ≥1 BLASTx hit (non-redundant)
Transcript families from transcripts with MPE >100,≥1 BLASTx hit (non-redundant)
Trinity assembly compared to digital normalization/Velvet/Oases
Digital normalization, Oases-Velvet
Aligned concordantly 0 times
Aligned concordantly 1 time
Of reads aligning concordantly or discordantly 0 times:
Aligned discordantly 1 time
Aligned 0 times
Aligned exactly 1 time
Aligned >1 times
Overall alignment rate
Determining adequate coverage of transcriptomes is more challenging thandetermining coverage of genomes because every transcript species (includingsplice variants or those using alternate transcription start sites) is presentat a different level across a large range. Several studies have used anindependent assembly of randomly selected subsets of their reads to compare therate of new transcript discovery, determine the lower abundance limit ofdetection and compare the average length of isotigs. While analyzing the seaurchin embryonic transcriptome, Tu et al. compared a 20 M readsubset with a 2 M read subset and a 0.2 M read subset, and found that20 M reads were sufficient to reliably detect levels of transcripts at 400molecules/embryo, which they estimate as the lower limit for proteins ofdevelopmental significance (such as transcription factors, which may befunctionally relevant at levels as low as 10 copies of transcript per cell) . In their analysis of the milkweed transcriptome, Ewen-Campen etal. created eight subsets of reads, assembled them separately and usedBLASTx to compare gene discovery rates . They found that the rate of new transcript discovery plateaued at1.5 M reads, although the N50 isotig length continued to increase whenusing 2 M to 17 M read subsets. After 2 rounds of quality controlfiltering of our reads, we were left with 80,537,812 paired and 5,362,854unpaired reads, a depth which has been shown to produce good sensitivity inthese other systems for identifying all protein-coding transcripts expressed inthe early embryo.
To restate, these previous studies indicate that with the volume of reads comingoff the latest Illumina platforms (250 million to 400 million reads/lane), andonly multiplexing 6 samples in a lane, we should be beyond the necessarycoverage to represent all relevant regulatory transcripts. The best indicationof the quality of our assembly is that we have been able to use it as areference to map reads from more recent experiments in our lab at a median 93%rate (with 90% of samples mapping 90% of their reads to the Trinity-assembledreference).
Digital normalization followed by Oases assembly
To evaluate more closely the quality of our de novo transcriptomeassembly, we compared Trinity with an alternate strategy that combines a digitalnormalization step  with the assemblers Velvet  and Oases . Digital normalization is a computational normalization method thatpreferentially removes high abundance reads but retains read complexity in orderto remove errors and preserve low abundance transcripts prior to assembly. Thequality controlled reads were assembled using Velvet and Oases and then mappedback to the resulting assembly (commands in Additional file 1). The main advantage of this method is it greatly decreases thecomputing power and time required to process millions of reads. We also testedthe Amazon Elastic Cloud Computing Service (Amazon EC2),http://aws.amazon.com/ec2/, to perform the digital normalizationand Velvet-Oases assembly. This method proves to be a great alternative to usinga home institution’s core computers if the institution does not havesufficient computing power for running an assembler or if the computers areexpensive to rent, slow or unreliable. Whereas the computation of Trinityrequired over 50 h and 100 GB of RAM in addition to the qualitycontrol steps, the pipeline using digital normalization, Velvet and Oases canall be run in a single day using an XL computer rented from Amazon. We foundthat this alternative assembly approach gave competitive results when mappingour Nematostella vectensis RNA-seq reads. The overall mapping successrate was 82.17% for digital normalization-Oases as compared to 85.90% forTrinity (Table 4).
Quantification of Trinity-assembled transcriptome using known RNA‘spike-ins’
Blast2GO analysis reveals genes involved in gene regulation
Transcript family analysis
The transcripts inferred by the Trinity assembler were tested for sequencesimilarity with known genes. Specifically, the sequence of each transcript wastranslated and BLASTx was run using the NCBI non-redundant RefSeq database ofprotein sequences (nr). We combined sequences sharing at least 1 BLAST resultwith an e-value <5e-5 and a percent identity >80% into a transcriptfamily. The e-value threshold filtered out low confidence BLAST results, whilethe percent identity filter requires a large percentage of the transcript tomatch. We systematically tested multiple identity thresholds and found that 80%sequence match yielded a realistic transcript family distribution. Given thenatural tradeoff between identification of transcript variants versus paralogousgenes, overly stringent requirements of similarity result in anunderrepresentation of true homologous relationships; conversely, whensimilarity thresholds are set too low, distinct transcripts are erroneouslygrouped together. Because the assembled Nematostella transcripts arealso included in nr, the sequence similarity threshold is required to be sethigh. The transcript families therefore represent a mixture of fully assembledgene transcripts, pieces of transcripts, paralogs, or multiple splice forms,entirely compatible with our overarching goals. For BLAST analysis with sequencedatabases of very different sizes (as in our case, nr vs Nematostella)e-values are not an appropriate measure of similarity; in this case a similaritythreshold is more informative. Thus, at a similarity threshold of 80% we wereable to annotate 48,235 transcripts from the nr database, compared to only47,193 transcripts when using the Nematostella genome alone(Table 3). To do transcript family analysis weused the BLAST hits from the nr database. We inferred 13,293 total transcriptfamilies from the 61,835 assembled transcripts with at least 1 BLAST hit. Whenrestricting the transcript family analysis to transcripts expressed over 100molecules per embryo (MPE) and at least 1 BLAST hit (MPE >100, ≥1 BLASThit) across the 5 timepoints, we observed a total of 4,055 transcript familiesfrom the 6,169 transcripts passing the filter (Table 3). These computations likely represents an underestimate of thetrue number of genes expressed due to an inability to assemble very lowlyexpressed transcripts and, in a few cases, grouping paralogous genestogether.
An estimation of the percentage of the genome transcribed during these timeperiods was computed by taking the length of the longest transcript in eachtranscript family and dividing that by the length of the genome taken from theestimate in the Nematostella genome paper (450 Mbp) . An important caveat is that this is likely an underestimate becausesome transcripts are not full length. The percent of the genome transcribedabove 100 molecules per embryo according to this calculation is 0.368%. Theaverage transcript length for all assembled transcripts is 622.53 bp. Whenonly taking transcripts expressed after spike-in control correction above 0molecules per embryo, the average transcript length is 456.18 bp. Bytimepoint, the average of assembled and expressed transcripts is: (0 h)479.95 bp, (6 h) 487.82 bp, (12 h) 490.54 bp,(18 h) 494.03 bp, and (24 h) 441.29 bp.
The goals of the project discussed in this paper were (1) to identify all theprotein-coding genes expressed during the first 24 h of Nematostellavectensis development and in so doing (2) to detail a modern,cost-effective, efficient and quantitative series of experimental and computationalmethods that together make up a transcriptome pipeline for non-model organisms. Akey component of our pipeline is the inclusion of NIST RNA spike-in standards forquantification. This allows us to get around the problem of normalizing data toestimate gene expression levels , and provides an absolute measure of transcript abundance per embryo.
Many evolutionary developmental biology ‘evo-devo’ research projects haverevealed candidate genes in non-model organisms leading to intriguing hypothesesregarding the conservation, or conversely, invention of pathways controllingdevelopment [5, 44, 45]. However, to answer the questions these hypotheses have generated, it isnot only the gene homology, presence, absence or spatial localization that needs tobe known. To say that a developmental program or subcircuit has been conserved orevolved in a specific way, the cis-regulatory network connections betweenall the regulatory genes involved must be at a minimum known and validated.Candidate genes identified from BLAST analysis will typically only make up a smallfraction of the regulatory genes in any pathway. With the advent of next-generationsequencing platforms, identifying all the protein coding genes expressed at anygiven time during embryonic development is now within the reach of any model systemwhere embryos can be acquired. The lack of a sequenced, annotated genome is nolonger a major setback to GRN analysis.
The use of polyadenylated spike-in RNAs provides quantitative information on theabsolute abundance of transcripts per embryo. It is important to note the differencebetween this method of standardization and normalization approaches. The ERCCspike-ins allow us to build a standard curve, in our case a 92-point standard curve.As the quantities of the spike-ins are known, this allows us to infer from thestandard curve absolute quantities. Note that since spike-ins are added at thebeginning of the library preparation procedures, any variation in preparationefficiencies (that is, technical noise) is in theory accounted for by the spike-ins.Thus, even without absolute quantitation, the use of spike-ins allows directcomparison between samples without the distorting effects of normalization tominimize the effects of technical variation. Further, quantitation by spike-ins alsoallows us to know the limits of our ability to detect and quantify lowly expressedtranscripts. Since low expressed transcripts account for many of the problems inbioinfomatics analysis, our 100 molecules per embryo cut-off allows us to focus ouranalysis on those transcripts expressed at biologically relevant levels which arealso within the linear range of our standard curve. Increasing the sequencing depthand being less conservative with mapping stringency could improve our ability toquantify these lowly expressed transcripts.
The embryo of the sea anemone Nematostella vectensis provides an importantevo-devo model for understanding early animal development, particularly in relationto the question of how initial patterns of differential gene expression emerge alongorthogonal body axes. Given Nematostella’s position among cnidariansand the molecular evidence thus far, it is possible that a bilaterally-symmetricpattern formation network stretches back to before the Cambrian to a time precedingthe Cnidaria-Bilateria bifurcation. However, to make this argument we need amechanistic understanding of early development in both cnidarian and canonicalbilaterian models systems. Moreover, in light of the compare-and-contrast nature ofthese studies, we need to move away from a candidate gene approach as such methodsclearly bias towards the ‘discovery’ of similarities as opposed todifferences between regulatory networks. With the advent of genomics, we can nowattempt exhaustive de novo approaches to define regulatory networks, thoughchallenges in handling RNA-seq data sets still exist. In this report, we haveundertaken preliminary steps in defining the Nematostella gene regulatorynetwork for early pattern formation by building a comprehensive model of geneexpression through 24 h of development. This quantitative referencetranscriptome will help us identify, in a minimally biased manner, the most relevantgenes to the pattern formation control system. The regulatory network for patternformation in Nematostella will in turn provide a powerful basis forcomparison with early development networks from canonical bilaterians.
In summary, we have presented our quantitative reference transcriptome forNematostella vectensis early embryogenesis, which is available on theWoods Hole Data Archive at http://hdl.handle.net/1912/5613[DOI:10.1575/1912/5613]. Additionally, our de novotranscriptome pipeline, based on the Trinity assembler, has been designed to meetthe needs of the evo-devo community.
External RNA Controls Consortium
fragments per kilobase of exon permillion fragments mapped
filtered sea water
gene regulatory network
next generation sequencing
National Institute ofStandards and Technology
non-redundant BLAST database
(quantitative)polymerase chain reaction.
We give special thanks to Liliana Florea for her valuable contributions to thedesign of the Trinity de novo assembly pipeline used in this paper. Wethank Kasia Hammar for animal care and technical assistance; Likit Preeyan and CTitus Brown for help with digital normalization and for credit on the Amazon EC2system; Antje Fischer, Rebecca Helm, Freya Goetz, and Casey Dunn for discussionson experimental protocols; and C Titus Brown also for comments on themanuscript.
- Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD, Sørensen MV, Haddock SHD, Schmidt-Rhaesa A, Okusu A, Kristensen RM, Wheeler WC, Martindale MQ, Giribet G: Broad phylogenomic sampling improves resolution of the animal tree oflife. Nature. 2008, 452: 745-749. 10.1038/nature06614.View ArticlePubMed
- Hejnol A, Obst M, Stamatakis A, Ott M, Rouse GW, Edgecombe GD, Martinez P, Baguna J, Bailly X, Jondelius U, Wiens M, Muller WEG, Seaver E, Wheeler WC, Martindale MQ, Giribet G, Dunn CW: Assessing the root of bilaterian animals with scalable phylogenomicmethods. Proc R Soc B. 2009, 276: 4261-4270. 10.1098/rspb.2009.0896.PubMed CentralView ArticlePubMed
- Mallatt J, Craig CW, Yoder MJ: Nearly complete rRNA genes assembled from across the metazoan animals:effects of more taxa, a structure-based alignment, and paired-sitesevolutionary models on phylogeny reconstruction. Mol Phylogenet Evol. 2010, 55: 1-17. 10.1016/j.ympev.2009.09.028.View ArticlePubMed
- Martindale MQ, Pang K, Finnerty JR: Investigating the origins of triploblasty: “mesodermal” geneexpression in a diploblastic animal, the sea anemone Nematostellavectensis (phylum, Cnidaria; class, Anthozoa). Development. 2004, 131: 2463-2474. 10.1242/dev.01119.View ArticlePubMed
- Wikramanayake AH, Hong M, Lee PN, Pang K, Byrum CA, Bince JM, Xu R, Martindale MQ: An ancient role for nuclear beta-catenin in the evolution of axial polarityand germ layer segregation. Nature. 2003, 426: 446-450. 10.1038/nature02113.View ArticlePubMed
- Fritzenwanker JH, Saina M, Technau U: Analysis of forkhead and snail expression reveals epithelial-mesenchymaltransitions during embryonic and larval development of Nematostellavectensis. Dev Biol. 2004, 275: 389-402. 10.1016/j.ydbio.2004.08.014.View ArticlePubMed
- Matus DQ, Thomsen GH, Martindale MQ: Dorso/ventral genes are asymmetrically expressed and involved in germ-layerdemarcation during cnidarian gastrulation. Curr Biol. 2006, 16: 499-505. 10.1016/j.cub.2006.01.052.View ArticlePubMed
- Rentzsch F, Anton R, Saina M, Hammerschmidt M, Holstein TW, Technau U: Asymmetric expression of the BMP antagonists chordin and gremlin in the seaanemone Nematostella vectensis: implications for the evolution ofaxial patterning. Dev Biol. 2006, 296: 375-387. 10.1016/j.ydbio.2006.06.003.View ArticlePubMed
- Stathopoulos A, Levine M: Genomic regulatory networks and animal development. Dev Cell. 2005, 9: 449-462. 10.1016/j.devcel.2005.09.005.View ArticlePubMed
- Davidson EH, Levine MS: Properties of developmental gene regulatory networks. Proc Natl Acad Sci USA. 2008, 105: 20063-20066. 10.1073/pnas.0806007105.PubMed CentralView ArticlePubMed
- Davidson EH: The Regulatory Genome. 2006, London, UK: Academic Press
- Imai KS, Levine M, Satoh N, Satou Y: Regulatory blueprint for a chordate embryo. Science. 2006, 312: 1183-1187. 10.1126/science.1123404.View ArticlePubMed
- Betancur P, Bronner-Fraser M, Sauka-Spengler T: Assembling neural crest regulatory circuits into a gene regulatorynetwork. Annu Rev Cell Dev Biol. 2010, 26: 581-603. 10.1146/annurev.cellbio.042308.113245.PubMed CentralView ArticlePubMed
- Swiers G, Patient R, Loose M: Genetic regulatory networks programming hematopoietic stem cells anderythroid lineage specification. Dev Biol. 2006, 294: 525-540. 10.1016/j.ydbio.2006.02.051.View ArticlePubMed
- Tu Q, Cameron RA, Worley KC, Gibbs RA, Davidson EH: Gene structure in the sea urchin Strongylocentrotus purpuratus based ontranscriptome analysis. Genome Res. 2012, 22: 2079-2087. 10.1101/gr.139170.112.PubMed CentralView ArticlePubMed
- Martin J, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.View ArticlePubMed
- Brown CT, Howe A, Zhang Q, Pyrkosz A, Brom T: A reference-free algorithm for computational normalization of shotgunsequencing data. [http://arxiv.org/abs/1203.4802],
- Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95: 315-327. 10.1016/j.ygeno.2010.03.001.PubMed CentralView ArticlePubMed
- Salzberg SL, Yorke JA: Beware of mis-assembled genomes. Bioinformatics. 2005, 21: 4320-4321. 10.1093/bioinformatics/bti769.View ArticlePubMed
- Ryan JF, Burton PM, Mazza ME, Kwong GK, Mullikin JC, Finnerty JR: The cnidarian-bilaterian ancestor possessed at least 56 homeoboxes: evidencefrom the starlet sea anemone, Nematostella vectensis. Genome Biol. 2006, 7: R64-10.1186/gb-2006-7-7-r64.PubMed CentralView ArticlePubMed
- Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, Jurka J, Genikhovich G, Grigoriev IV, Lucas SM, Steele RE, Finnerty JR, Technau U, Martindale MQ, Rokhsar DS: Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomicorganization. Science. 2007, 317: 86-94. 10.1126/science.1139158.View ArticlePubMed
- Zeng V, Villanueva KE, Ewen-Campen BS, Alwes F, Browne WE, Extavour CG: De novo assembly and characterization of a maternal anddevelopmental transcriptome for the emerging model crustacean Parhyalehawaiensis. BMC Genomics. 2011, 12: 581-10.1186/1471-2164-12-581.PubMed CentralView ArticlePubMed
- Ewen-Campen B, Shaner N, Panfilio KA, Suzuki Y, Roth S, Extavour CG: The maternal and early embryonic transcriptome of the milkweed bugOncopeltus fasciatus. BMC Genomics. 2011, 12: 61-10.1186/1471-2164-12-61.PubMed CentralView ArticlePubMed
- Conaco C, Neveu P, Zhou H, Arcila ML, Degnan SM, Degnan BM, Kosik KS: Transcriptome profiling of the demosponge Amphimedon queenslandicareveals genome-wide events that accompany major life cycle transitions. BMC Genomics. 2012, 13: 209-10.1186/1471-2164-13-209.PubMed CentralView ArticlePubMed
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, Griffith M, Raymond A, Thiessen N, Cezard T, Butterfield YS, Newsome R, Chan SK, She R, Varhol R, Kamoh B, Prabhu A-L, Tam A, Zhao Y, Moore RA, Hirst M, Marra MA, Jones SJM, Hoodless PA, Birol I: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7: 909-912. 10.1038/nmeth.1517.View ArticlePubMed
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short readsequencing. Genome Res. 2010, 20: 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMed
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijngraphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.PubMed CentralView ArticlePubMed
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range ofexpression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.PubMed CentralView ArticlePubMed
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-seq data without a referencegenome. Nat Biotechnol. 2011, 29: 644-652. 10.1038/nbt.1883.PubMed CentralView ArticlePubMed
- Zhao Q-Y, Wang Y, Kong Y-M, Luo D, Li X, Hao P: Optimizing de novo transcriptome assembly from short-read RNA-seqdata: a comparative study. BMC Bioinformatics. 2011, 12 (Suppl 14): S2-10.1186/1471-2105-12-S14-S2.View Article
- Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamerpriming. Nucleic Acids Res. 2010, 38: e131-e131. 10.1093/nar/gkq224.PubMed CentralView ArticlePubMed
- Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, Berlin, Germany: Springer VerlagView Article
- Ren X, Liu T, Dong J, Sun L, Yang J, Zhu Y, Jin Q: Evaluating de bruijn graph assemblers on 454 transcriptomic data. PLoS ONE. 2012, 7: e51188-10.1371/journal.pone.0051188.PubMed CentralView ArticlePubMed
- Vijay N, Poelstra JW, Künstner A, Wolf JBW: Challenges and strategies in transcriptome assembly and differential geneexpression quantification. A comprehensive in silico assessment ofRNA-seq experiments. Mol Ecol. 2013, 22: 620-634. 10.1111/mec.12014.View ArticlePubMed
- Baker SC, Bauer SR, Beyer RP, Brenton JD, Bromley B, Burrill J, Causton H, Conley MP, Elespuru R, Fero M, Foy C, Fuscoe J, Gao X, Gerhold DL, Gilles P, Goodsaid F, Guo X, Hackett J, Hockett RD, Ikonomi P, Irizarry RA, Kawasaki ES, Kaysser-Kranich T, Kerr K, Kiser G, Koch WH, Lee KY, Liu C, Liu ZL, Lucas A: The External RNA Controls Consortium: a progress report. Nat Methods. 2005, 2: 731-734. 10.1038/nmeth1005-731.View ArticlePubMed
- Devonshire AS, Elaswarapu R, Foy CA: Evaluation of external RNA controls for the standardisation of geneexpression biomarker measurements. BMC Genomics. 2010, 11: 662-10.1186/1471-2164-11-662.PubMed CentralView ArticlePubMed
- External RNA Controls Consortium: Proposed methods for testing and selecting the ERCC external RNA controls. BMC Genomics. 2005, 6: 150PubMed CentralView Article
- Li B, Dewey CN: RSEM: accurate transcript quantification from RNA-seq data with or without areference genome. BMC Bioinformatics. 2011, 12: 323-10.1186/1471-2105-12-323.PubMed CentralView ArticlePubMed
- Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis infunctional genomics research. Bioinformatics. 2005, 21: 3674-3676. 10.1093/bioinformatics/bti610.View ArticlePubMed
- Alexa A, Rahnenfuher J: topGO: enrichment analysis for gene ontology. [http://www.bioconductor.org/packages/2.11/bioc/html/topGO.html]
- Marlow H, Roettinger E, Boekhout M, Martindale MQ: Functional roles of Notch signaling in the cnidarian Nematostellavectensis. Dev Biol. 2012, 362: 295-308. 10.1016/j.ydbio.2011.11.012.PubMed CentralView ArticlePubMed
- Röttinger E, Dahlin P, Martindale MQ: A framework for the establishment of a Cnidarian gene regulatory network for“endomesoderm” specification: the inputs of β-catenin/TCFsignaling. PLoS Genet. 2012, 8: e1003164-10.1371/journal.pgen.1003164.PubMed CentralView ArticlePubMed
- Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A: Differential expression in RNA-seq: a matter of depth. Genome Res. 2011, 21: 2213-2223. 10.1101/gr.124321.111.PubMed CentralView ArticlePubMed
- Yamada A, Pang K, Martindale MQ, Tochinai S: Surprisingly complex T-box gene complement in diploblastic metazoans. Evol Dev. 2007, 9: 220-230. 10.1111/j.1525-142X.2007.00154.x.View ArticlePubMed
- Adamska M, Larroux C, Adamski M, Green K, Lovas E, Koop D, Richards GS, Zwafink C, Degnan BM: Structure and expression of conserved Wnt pathway components in thedemosponge Amphimedon queenslandica. Evol Dev. 2010, 12: 494-518. 10.1111/j.1525-142X.2010.00435.x.View ArticlePubMed
- Giresi PG: Chromatin profiles of human cells in health and disease using FAIRE. 2012, Ann Arbor, MI: ProQuest
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), whichpermits unrestricted use, distribution, and reproduction in any medium, provided theoriginal work is properly cited.