Rapid isolation of gene homologs across taxa: Efficient identification and isolation of gene orthologs from non-model organism genomes, a technical report

Background Tremendous progress has been made in the field of evo-devo through comparisons of related genes from diverse taxa. While the vast number of species in nature precludes a complete analysis of the molecular evolution of even one single gene family, this would not be necessary to understand fundamental mechanisms underlying gene evolution if experiments could be designed to systematically sample representative points along the path of established phylogenies to trace changes in regulatory and coding gene sequence. This isolation of homologous genes from phylogenetically diverse, representative species can be challenging, especially if the gene is under weak selective pressure and evolving rapidly. Results Here we present an approach - Rapid Isolation of Gene Homologs across Taxa (RIGHT) - to efficiently isolate specific members of gene families. RIGHT is based upon modification and a combination of degenerate polymerase chain reaction (PCR) and gene-specific amplified fragment length polymorphism (AFLP). It allows targeted isolation of specific gene family members from any organism, only requiring genomic DNA. We describe this approach and how we used it to isolate members of several different gene families from diverse arthropods spanning millions of years of evolution. Conclusions RIGHT facilitates systematic isolation of one gene from large gene families. It allows for efficient gene isolation without whole genome sequencing, RNA extraction, or culturing of non-model organisms. RIGHT will be a generally useful method for isolation of orthologs from both distant and closely related species, increasing sample size and facilitating the tracking of molecular evolution of gene families and regulatory networks across the tree of life.


Findings
One focus of evolutionary biologists is to understand how changes in regulatory and coding regions of genes contribute to species evolution and adaptation [1,2]. This requires sequence comparisons across distantly related taxa as well as among closely related species. A major limitation in studying molecular evolution is the amount of comprehensive sequence data available to track these changes in genes and their networks. Standard approaches include comparisons across widely divergent model organisms, comparison of gene sequences that have been deposited in databases, and comparisons of whole genome sequences. This can result in an incomplete matrix of information about the lineages of particular gene families, making it difficult to trace steps leading to functional changes in regulatory and coding sequences. Additionally, the sequence conservation of duplicated and diverged genes within gene families [3,4] poses a challenge: How can we identify a particular member of a gene family without isolating and screening through closely-related homologs? Here we report a strategy to efficiently isolate genes from genomic DNA that can be used to obtain sequence information from un-sequenced genomes and nonmodel organisms not easily reared in the laboratory. Rapid Isolation of Gene Homologs across Taxa (RIGHT) is based on the fact that homologous genes (both paralogs and orthologs) generally show conservation of at least one domain, even if other parts of the sequence are under weaker selective pressure. For example, the Hox proteins have retained the conserved DNA binding domain after duplication and divergence [5,6]. While not forging fundamentally new technology, this approach combines and modifies existing procedures to facilitate the rapid isolation of genes, allowing sampling of a large number of taxa.

RIGHT Methodology
RIGHT methodology utilizes degenerate polymerase chain reaction (PCR) and gene-specific amplified-length fragment polymorphism (AFLP) to allow for rapid gene isolation. First, degenerate primers are designed to amplify a small region of less than 200 bases of the conserved domain characteristic of the gene family ( Figure  1, Step 1). One primer is derived from the unique signature motif in the homolog of interest, while the other can be shared with other family members. A variation of touchdown PCR is used that is optimized for these degenerate primers (see attached protocol, Additional file 1). Nested PCR is done to ensure that the correct gene-family member is amplified. The PCR product is run on an agarose gel, purified and sequenced. The product can be positively identified by characteristic residues in the homolog of interest, along with BLAST searches against other species. Together, this allows for increased confidence that the isolated gene product corresponds to the gene of interest. From this short DNA sequence obtained from degenerate PCR, gene-specific primers are designed for subsequent reactions. Sequence up-and downstream of the conserved region (obtained in Step 1, Figure 1) is next isolated by modifications of AFLP and TE-display techniques [7][8][9][10][11][12] that allow selective amplification of only the gene sequence of interest. Traditional AFLP uses restriction enzymes to digest genomic DNA followed by ligation of adapters of known sequence to DNA ends. Adapter-specific primers are used in subsequent PCRs to amplify DNA fragments, which are then separated on a gel and analyzed. RIGHT uses the basic idea of AFLP up to the amplification step; however, instead of amplifying DNA

S tep 4
Isolation of entire coding sequence by 'w alking ' along the gene using different D N A digests as tem plates S tep 5 PC R-am plify entire coding region from genom e and sequence to verify reading fram e Figure 1 Overview of RIGHT technique used to isolate homologous genes from large gene families. All steps are described in the text. Oligonucleotides that were annealed to make the adapter destroyed the restriction site. All reverse primers were ordered with the 5' end phosphorylated, or were phosphorylated before annealing with an appropriate enzyme. For example, MseI digest/ligation: F-5' GACGATGAGTCTTGAGTTCAGTCTGTA, R-5'PhosTATACAGACTGAACTCAAGACTCATC; XhoI: F-5' GACGATGAGTCTTGAGTTCAGTCTGTA, R-5'PhosTCGATACAGACTGAACTCAAGACTCATC. fragments using adapter sequences as both primers (which generates many fragments), an adapter-specific primer is used as one primer and a gene-specific primer (derived from degenerate PCR used in Figure 1, Step 1) as the other primer. Thus, only a sequence from the gene of interest is isolated. The digestion of genomic DNA and ligation of adapters is done in a single step ( Figure 1, Step 2). Adapter sequences are designed to anneal to, but destroy, restriction sites in order to avoid re-digestion in this combined restriction/ligation reaction. Several different restriction digests are set up in parallel to provide different-length PCR templates covering the gene of interest. This is also beneficial because restriction site locations are not known for genomes that have not been sequenced. The digestion/ligation is followed by two rounds of nested PCR (Figure 1, Step 3), which functions to increase specificity of primer binding and the amount of product. After the PCR product is amplified and sequenced, new gene-specific primers are designed at the sequence ends to repeat PCRs with a different restriction digest/ligation as template in order to extend the sequence. By repeating this process, one can "walk" along the genomic sequence to isolate the entire coding sequence (Figure 1, Step 4).
In most cases only one clear product was observed after nested PCR; however, occasionally there were several. In these situations, either all products were sequenced or products were re-amplified using the same primers or another nested set to reduce the number of products. In cases where multiple bands persisted, it was usually due to restriction sites that were very close together in the genome and almost all of the sequenced regions overlapped. After a new sequence has been isolated, its continuity is always checked by PCR with primers at extreme opposite ends of the sequence that has been obtained to make sure the sequence being isolated is contiguous with that upstream and/or downstream (Figure 1, Step 5). This is very important because, although infrequent, ligation may occur between genomic DNA fragments in Step 2. As demonstrated, RIGHT provides efficiency and saves time when compared to other protocols. This combination is a powerful method for obtaining full gene sequence information, including coding and regulatory regions.

RIGHT isolation of homeobox and nuclear receptor genes
RIGHT has been used successfully in our laboratory to efficiently isolate specific members of several large gene families. The technique was first developed to isolate a rapidly-evolving member of the Hox gene family, fushi tarazu (ftz) [13][14][15]. First, degenerate PCR primers were designed based upon signature residues encoding the amino-terminal end of the DNA-binding homeodomain and another highly conserved region with low degeneracy approximately 150 bases downstream (Figure 2A, arrows). Ftz homologs were positively identified by characteristic residues in the homeodomain. Next, gene-specific AFLP was carried out using a combination of unique restriction/ligation templates for PCR with one ftz-specific primer and one adapter-specific primer ( Figure 2B). All products were sequenced and genesequence continuity verified by PCR with genomic DNA and primers designed to the extreme 5' and 3' ends of sequence that had been isolated. Full-length ftz sequences, including putative introns, were isolated by genomic walking until translation initiation and stop codons were identified. Using RIGHT allowed us to isolate ftz genes from diverse arthropods representing approximately 550 million years of evolutionary divergence [16], including the dermapteran Forficula auricularia and archaeognathan Pedetontus saltator ( Figure  2C). Additionally, genomic DNA of two non-model beetles was used for degenerate PCR to obtain the ftz homeobox, and in combination with RACE on embryonic cDNA, full-length ftz sequences were obtained [16]. To date, we have isolated 2 full-length and 10 partial ftz genes from a range of non-model organisms using RIGHT.
In addition to ftz, we isolated other homeobox-containing genes such as extradenticle (exd) and the orphan nuclear receptor ftz-f1 from multiple species with great success (unpublished). RIGHT was used to isolate partial exd sequences from Thermobia domestica (firebrat), Callosobruchus maculatus (beetle), and Folsomia candida (collembolan). In combination with RACE, full-length exd coding regions were isolated from these species. Several partial ftz-f1 sequences were isolated, including Artemia salina (brine shrimp), Folsomia candida (collembolan), Thermobia domestica (firebrat), Callosobruchus maculatus (beetle), Dermestes maculatus (beetle) Oncopeltus fasciatus (milkweed bug), and Acyrthosiphon pisum (aphid). As for exd, full-length ftz-f1 sequences have been obtained from many of these organisms in combination with RACE. For this work, as per experimental design, sequences were obtained from species representing key points in arthropod phylogeny to allow for systematic analysis of a small network of functionally related genes from different families (ftz, ftz-f1, exd). Thus far, every gene that we have attempted to isolate from any chosen species using RIGHT has been obtained.

Conclusions
The ability to isolate homologous genes from diverse taxa will empower studies of molecular evolution of genes, families and gene networks. In the past, these approaches were limited by absence of genomic information. Even though genome sequencing is now practical for a larger number of species, it is unlikely to make P. saltator genomic DNA with Hox gene family

Digestion of genomic DNA and in the presence of annealed oligonucleotides that destroy the restriction site
Overnight ligation to add adapter to DNA ends PCR using different digestion/ligation reactions as templates to isolate the 3' and 5' ends of ftz sequence. One ftz-specific primer and one adapter-specific primer  a dent in the millions of species on Earth. Similarly, investments are being made in developing new model systems, to expand on the standard fly, mouse and worm systems. However, the investment to bring a new model system up to speed is substantial and it is neither necessary nor practical to fully develop hundreds of genetic model systems. We suggest that these approaches, while enormously important for the field of evo-devo, are not always necessary to answer specific evolutionary questions. RIGHT provides a fast and efficient way to isolate genes, including coding regions and candidate cis-regulatory regions, and overcomes many practical constraints, realistically allowing for the isolation of 10s if not 100s of genes from families or gene networks to study molecular evolution across divergent taxa or within specific clades. This approach obviates common limitations, such as genome sequence availability or rearing species in the lab. It has been used successfully to isolate specific members of several large gene families, allowing for a comparative analysis over millions of years of evolutionary time.