Nematostella vectensis, the starlet sea anemone, offers many advantages as a model system for the evolution of animal developmental programs. As an anthozoan cnidarian, it is strategically positioned as an outgroup to Bilateria [1–3] and is well situated to reveal the early steps in the evolution of the bilaterian body plan. Two of these evolutionary steps are likely to be the formation of a secondary body axis and of a mesodermal germ layer, both of which are essential, defining characteristics of a bilaterian animal. Embryonic dorsal-ventral patterning and mesodermal development have been studied in many bilaterian models, yet the origins of these significant body plan innovations are not well understood. Initial studies of gene expression in Nematostella and non-anthozoan cnidarians have revealed that genes important to bilaterian mesoderm specification are expressed in the endoderm of the sea anemone, and suggest that the bilaterian mesoderm may have originated from the endoderm of diploblastic ancestors [4–6]. Genes encoding factors involved in dorsal-ventral axis specification in bilaterians are likewise asymmetrically expressed in Nematostella, indicating the possibility that a secondary axis was present in the cnidarian-bilaterian ancestor [7, 8]. Defining the mechanisms controlling Nematostella development will help address these questions about the early evolutionary steps that led to bilaterian body plans with three germ layers and bilateral symmetry.
Gene regulatory networks (GRNs) provide predictive models of gene regulation, as in the several examples that now exist for normal animal development (for example, Drosophila, sea urchin [10, 11], ascidians, chick, and zebrafish). To gain a comprehensive view of the control system, it is necessary to identify all genes whose products make up the regulatory network. This applies to our current research efforts but is also generally applicable to studies of virtually any regulatory system. Advanced sequencing platforms now allow us to do this through RNA-seq techniques. Yet deep RNA-seq brings analysis challenges that reflect the scale and complexity of transcriptomes, the primary problem being adequate assembly of RNA-seq reads in order to define a reference set of gene models [15–17]. Transcriptome assembly can be achieved using a reference-based strategy, a de novo strategy, or a combination of the two. The main drawback of using a genome reference for assembly is that the result depends on the quality of the reference genome being used. This is a particular problem for emerging model systems with recently completed genomes, because misassemblies, poor annotation, and large gaps in coverage plague the genome assemblies of all but a few of the major model systems. There is also a challenge in assigning reads that align equally well to multiple places in the genome. The aligner must either exclude these reads, which can leave gaps, or choose which alignments to retain, which can lead to wrong assignments or to the prediction of a transcript in a region that is not transcribed.
A comprehensive GRN for early embryonic development in Nematostella will enable researchers to investigate the extent to which the bilaterian regulatory toolkit is present in this representative cnidarian, down to the level of precise signaling systems and transcription factor cis-regulatory interactions. By harnessing the power of high-throughput sequencing and perturbation techniques, we aim to build the sea anemone GRN in an unbiased and efficient manner that will serve as a GRN construction pipeline for other model systems to follow.
The current Nematostella genome assemblies [20, 21] fall into the category of young genome models that are still incomplete and contain gaps, making the reference-based method alone insufficient for our needs. Taking these and all of the above complications into account, and considering our goal to define an experimental and computational pipeline for emerging model systems, we elected to use the de novo assembly approach. This approach will be especially useful for evo-devo researchers aiming to harness the power of next-generation sequencing to bring their research into the genomics era, a trend already underway in, for example, Parhyale, Oncopeltus, sponge, and sea urchin.
The scale of reads, random and non-random sequencing errors, and inherent transcript complexity due to alternate transcription start sites or splice junctions all pose challenges for de novo transcriptome assembly. Indeed, the scale of the problem is only set to increase with the expanding capacity for transcriptome sequencing from advances in next-generation sequencing (NGS) platforms. In the last few years several assembly algorithms have been released to meet these challenges: Trans-ABySS, SOAPdenovo, Velvet/Oases [27, 28], and Trinity. The millions of short reads produced from NGS platforms result in millions of overlapping sequences. Short-read de novo assemblers exploit these overlaps to reconstruct the original transcripts by using the de Bruijn graph data structure, which encodes overlapping k-mers as adjacent vertices. Assembly algorithms then compute paths through the de Bruijn graph that correspond to valid assemblies of the sequence reads.
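As a minimal illustration of the de Bruijn graph idea described above, the following Python sketch builds a graph whose vertices are k-mers and whose edges join k-mers that occur consecutively in a read, then follows an unambiguous path to recover a sequence. The function names (build_de_bruijn, walk_unambiguous) and the toy reads are hypothetical and chosen only for illustration; the sketch ignores sequencing errors, reverse complements, and coverage, all of which real assemblers must handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each k-mer (vertex) to the set of k-mers
    that directly follow it in at least one read (its out-edges)."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Consecutive k-mers in a read overlap by k-1 bases: add an edge.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a path from `start` while the current vertex has exactly
    one successor, and spell out the sequence along that path."""
    path = [start]
    seen = {start}
    node = start
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle rather than loop forever
            break
        seen.add(node)
        path.append(node)
    # First k-mer in full, then the last base of each following k-mer.
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy example: three overlapping reads from one hypothetical transcript.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATGG")
```

In this toy case every vertex has a single successor, so the walk spans all three reads. Branch points, where a vertex has more than one successor, are where alternative splicing, paralogs, or sequencing errors would show up in a real graph.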
The Trinity assembler breaks the sequence data into many de Bruijn graphs in order to capture transcript complexity resulting from alternative splicing, gene duplications, or allelic variation. Trinity consists of three modules called Inchworm, Chrysalis, and Butterfly. Inchworm assembles the RNA-seq reads into transcripts and reports only the unique portions of alternate transcripts. Chrysalis clusters the Inchworm contigs so that each cluster represents all known transcript variants for each gene or set of related genes, and then constructs a de Bruijn graph for each cluster. Each read is assigned to one of these separate graphs. Butterfly then processes these separate graphs in parallel by tracing paths through each one, and reports full-length transcripts separately for alternate splice forms and paralogs. The Oases assembler starts from a preliminary assembly created by Velvet, which was originally designed for genome assembly. Oases corrects this assembly using a range of k-mers to create separate assemblies, which are then combined into one. The longer k-mers perform better on high expression transcripts and the shorter k-mers have an advantage on low expression transcripts. While the multiple k-mer approach has been found to result in an increase of longer transcripts, it can also lead to an accumulation of incorrect assemblies or artificially fused transcripts.
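One way to see why shorter k-mers help with low expression transcripts is that sparsely covered transcripts yield reads with short overlaps, and two reads are joined in a k-mer graph only if they share at least one k-mer. The Python sketch below illustrates this with two toy reads that overlap by five bases. The helper names (kmer_set, reads_linked) and the sequences are hypothetical, and the sketch is not how Oases or Velvet are implemented; it only demonstrates the connectivity trade-off.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reads_linked(read_a, read_b, k):
    """Two reads can be joined in a k-mer graph only if they share a k-mer."""
    return bool(kmer_set(read_a, k) & kmer_set(read_b, k))


# Two toy reads from one hypothetical low-coverage transcript,
# overlapping by only five bases ("GTGCA").
read_a = "ATGGCGTGCA"
read_b = "GTGCAATCCGA"

linked_at_k5 = reads_linked(read_a, read_b, k=5)  # overlap >= k: joined
linked_at_k7 = reads_linked(read_a, read_b, k=7)  # overlap < k: not joined
```

With k = 5 the reads share the k-mer GTGCA and can be assembled together, whereas with k = 7 they share no k-mer and the transcript would be split. Longer k-mers, in turn, are less likely to be shared by chance between unrelated transcripts, which is consistent with their better performance where coverage is deep.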
In this study we designed a next-generation sequencing and analysis pipeline to produce a minimally biased and quantitative reference transcriptome. The resulting transcriptome represents the first 24 h of Nematostella development and will be the basis for further gene regulatory network studies. The experimental and computational pipeline will be used by us and others to produce transcriptomes for other model systems, particularly those evo-devo models that do not yet have an annotated genome but would benefit from an in-depth molecular analysis.