Detecting possibly saturated positions in 18S and 28S sequences and their influence on phylogenetic reconstruction of Annelida (Lophotrochozoa)

Torsten Struck

An accurately inferred phylogeny is important to the study of evolution. Factors affecting the accuracy of an inferred tree can be traced to several sequential steps leading to the inference of the phylogeny. We have examined here the impact of some features of nucleotide sequences in alignments on phylogenetic (topological) accuracy rather than any source of error during the process of sequence alignment or choice of the method of inference (as is usually done). Specifically, we have studied (using computer simulation) the implications of changing the values of the following five parameters, individually and in combination: sequence length (l), nucleotide substitution rate (r), nucleotide base composition (θ), the transition-transversion rate ratio (κ), and the substitution rate heterogeneity among the sites (α). An interesting, and unexpected, result was that κ has a strong positive relationship with phylogenetic accuracy, especially at high substitution rates. This simulation-based work has implications for empirical researchers in the field and should enable them to choose from among the multiple genes typically available today for a more accurate inference of the phylogeny being studied.

We present phylogenomic analyses of the most comprehensive molecular character set compiled for Annelida and its constituent taxa, including over 347 000 aligned nucleotide sites for 39 taxa. The nucleotide data set was recovered using a pre-existing amino acid data set of almost 48 000 aligned sites as a backbone for tBLASTn searches against NCBI. In addition, orthology determinations of the loci in the original amino acid data set were scrutinized using an All vs All Reciprocal Best Hit approach, employing BLASTp, and examining for statistical interdependency among the loci. This approach revealed considerable sequence redundancy among the loci in the original data set and a new data set was compiled, with the redundancy removed. The newly compiled nucleotide data set, the original amino acid data set, and the new reduced amino acid data set were subjected to parsimony analyses and two forms of bootstrap resampling. The last-named data set also was analysed using a maximum- likelihood approach. There were two main objectives to these analyses: (i) to examine the general topology, including support, resulting from the analyses of the new data sets and (ii) to assess the consistency of the branching patterns across optimality criteria by comparison with previous probabilistic approaches. The phylogenetic hypotheses resulting from analyses of the three data sets are largely unsupported, reflecting the continued difficulty of finding numerous, reliable, and suitable loci for a group as ancient as Annelida. Resulting parsimonious hypotheses disagree, in some respects, with the previous probabilistic approaches; Sedentaria and, in most cases, Errantia are not supported as monophyletic groups but Pleistoannelida is recovered as a (unsupported) monophyletic group in one of the three parsimony analyses as well as the likelihood analysis. In addition, we performed missing data titration studies to estimate the impact of missing data on overall support and support for specific clades.

Here, we describe the complete mitogenome of the polychaete Laeonereis culveri, and we use this genome along with eight other annelid species to test the performance of mitochondrial markers and tree-building methods to recover a reliable phylogeny of Annelida. We also compared the efficiency of tree-building methods (Neighbour-joining, maximum likelihood and Bayesian inference) and models. A total of 420 phylogenies were reconstructed and revealed that amino acids greatly outperformed the nucleotide data set. Bayesian inference was more susceptible to type I error, due to high rates of incorrect clades with high statistical support. Conversely, maximum likelihood and Neighbour-joining were more prone to type II error, as many accepted clades were poorly supported. Our results clearly support the idea that the choice of the method and model was not as important as the choice of the gene. Six genes recovered the reference topology at least once. COB had the best performance followed by CO1, CO3, ND1, ND6 and ND4L. Even for these genes, the support of the deepest clades was low in many cases. The concatenation of genes only improved the performance for maximum likelihood. These results showed that mitochondrial genes must be selected very carefully, mainly when used for deeper relationships.

Molecular Phylogenetics and Evolution 48 (2008) 628–645 Contents lists available at ScienceDirect Molecular Phylogenetics and Evolution journal homepage: www.elsevier.com/locate/ympev Detecting possibly saturated positions in 18S and 28S sequences and their influence on phylogenetic reconstruction of Annelida (Lophotrochozoa) Torsten H. Struck a,b,*, Maximilian P. Nesnidal b, Günter Purschke b, Kenneth M. Halanych a a b Auburn University, 101 Rouse Building, Auburn, AL 36849, USA Universität Osnabrück, FB05 Biologie/Chemie, AG Zoologie, Barbarastrabe 11, D-49076 Osnabrück, Germany a r t i c l e i n f o Article history: Received 11 November 2007 Revised 25 April 2008 Accepted 13 May 2008 Available online 20 May 2008 Keywords: C factor O/E ratio Annelida Saturation rRNA Phylogeny Iss a b s t r a c t Phylogenetic reconstructions may be hampered by multiple substitutions in nucleotide positions obliterating signal, a phenomenon called saturation. Traditionally, plotting ti/tv ratios against genetic distances has been used to reveal saturation by assessing when ti/tv stabilizes at 1. However, interpretation of results and assessment of comparability between different data sets or partitions are rather subjective. Herein, we present the new C factor, which quantifies convergence of ti/tv ratios, thus allowing comparability. Furthermore, we introduce a comparative value for homoplasy, the O/E ratio, based on alterations of tree length. Simulation studies and an empirical example, based on annelid rRNA-gene sequences, show that the C factor correlates with noise, tree length and genetic distance and therefore is a proxy for saturation. The O/E ratio correlates with the C factor, which does not provide an intrinsic threshold of exclusion, and thus both together can objectively guide decisions to exclude saturated nucleotide positions. However, analyses also showed that, for reconstructing annelid phylogeny using Maximum Likelihood, an increase in numbers of positions improves tree reconstruction more than does the exclusion of saturated positions. Ó 2008 Elsevier Inc. All rights reserved. 1. Introduction Reliable reconstruction of phylogenies using molecular data is affected by several factors such as long branches, heterogeneity of base frequencies, and rate heterogeneity among positions (e.g., Kuhner and Felsenstein, 1994; Lake, 1994; Lockhart et al., 1994; Xia et al., 2003). All of these factors can result in homoplasy, which can decrease phylogenetic signal and hamper reconstruction efforts. An indicator of homoplasy in molecular data sets can be saturation of transition events relative to other mutations (e.g., Halanych and Robinson, 1999; Jördens et al., 2004; Lopez et al., 1999; Nickrent et al., 2000; Philippe and Forterre, 1999; Simon et al., 1994; Struck et al., 2002a; Swofford et al., 1996). Saturation is invoked when no further increase in transitions can be observed despite increasing genetic distance, indicating that multiple substitutions at nucleotide positions have occurred. Therefore, when a data set is saturated, phylogenetic reconstruction may be misled by homoplasious signal (e.g., Milinkovitch et al., 1996; Simon et al., 1994). Saturation is especially problematic for taxa that radi- * Corresponding author. Address: Universität Osnabrück, FB05 Biologie/Chemie, AG Zoologie, Barbarastrabe 11, D-49076 Osnabrück, Germany. Fax: +49 541 969 2587. E-mail addresses: struck@biologie.uni-osnabrueck.de (T.H. Struck), nesnidal@ gmail.com (M.P. Nesnidal), Purschke@biologie.uni-osnabrueck.de (G. Purschke), ken@auburn.edu (K.M. Halanych). 1055-7903/$ - see front matter Ó 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.ympev.2008.05.015 ated rapidly long ago. For example, the radiation of annelid worms within Lophotrochozoa apparently was fast and resulted in branching patterns with short internodes in the basal part of the annelid tree and with branch lengths rapidly increasing towards the tips. Such branching patterns accumulate only a few informative substitutions along the internal and more basal branches, but numerous homoplasious substitutions on terminal branches (Struck et al., 2002a). For example, in Struck et al.’s (2007) analyses of three nuclear genes, the ratio of terminal branch length to internal branch length is 4.2/1 even excluding the extremely long branch leading to the highly divergent taxon, Ophryotrocha labronica. Thus, phylogenetic information is likely to be eroded due to multiple substitutions, or homoplasy, along the terminal branches. To avoid these problems, several authors have used conservative genes such as elongation factor 1a, 18S and/or 28S rDNA (e.g., Regier and Shultz, 1997; Struck et al., 2002a, 2007; Xia et al., 2003). However, even in these genes certain regions may be saturated and therefore mislead reconstruction. In contrast, excluding problematic regions may leave too little phylogenetic information to resolve the deep branches (Xia et al., 2003). Traditionally, saturation is shown by plotting either numbers of substitutions or the transition-to-transversion (ti/tv) ratios of all pairwise comparisons of taxa in an alignment against the genetic distances (p) between these pairs (e.g., Halanych and Robinson, 1999; Jördens et al., 2004; Struck et al., 2002a). In the latter case, convergence upon a ti/tv value of 1 is expected with the approach of T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 saturation at higher genetic distances. Generally, plotting ti/tv ratios is more useful because it integrates properties of two substitution types. However, identifying saturation is subjective when using such plots: Whether or not to include a gene or certain subsets of a gene (i.e., positions that exhibit only slight saturation or strong saturation) may be obvious, but whether to incorporate or eliminate positions with intermediate stages of saturation is not straightforward and lacks objectivity (Struck et al., 2007). Furthermore, with an increasing number of taxa and genes or subsets of genes, the representation of the plots within one graph becomes increasingly problematic (Struck et al., 2007). A comprehensive method of display would promote comparison of plots and aid objective decision-making. Due to convergence upon a value, the distribution of the ti/tv values of the pairwise comparisons covers an increasingly smaller range of values as genetic distance increases, and thus the standard deviation of this distribution [r(ti/tv)] decreases with increasing saturation. On the other hand, the distribution of the genetic distances p of these pairwise comparisons becomes more spread out and thus the standard deviation of this distribution [r(p)] is increasing. By extrapolating from this behavior of convergence, we developed a convergence factor (C factor), which is based on the ratio of the standard deviation of the ti/tv distribution to the standard deviation of the p distribution: C¼ rðtvti Þ rðpÞ ð1Þ The C factor is determined before, and independently of, the tree reconstruction and thus is an a priori approach. Next, using the parsimony criterion we developed another factor to assess the degree of homoplasy. This one being based on alteration of tree lengths is an a posteriori approach. Given a data set without homoplasy, removal of any subset of the data reduces tree length by an amount directly proportional to the reduction in the minimum number of substitutions necessary to explain the data set. Because each substitution is one step of the tree length, the relative change is the same for both, tree length and number of substitutions, and thus their ratio is 1. Given a starting tree length of l, alteration in tree length can be expressed as the observed change O and the reduction in minimum necessary substitutions s as the expected change, E, assuming no homoplasy in the data set: O¼ lr lc ð2Þ sr sc ð3Þ and E¼ For both l and s, the subscript c indicates the value for the complete, initial data set, and subscript r indicates the value for the reduced data set. Their ratio can be given as O/E. An easy way to calculate this ratio is to estimate the ratio of the consistency indices CIc/CIr: O lr =lc lr sc 1 CIc ¼ ¼ ¼ CIc ¼ E sr =sc sr lc CIr CIr ð4Þ In empirical data sets, homoplasious substitutions contribute more to the tree length than do the synapomorphic ones, because they require at least two steps instead of one. Therefore, if the removed subset is less homoplasious than the complete data set, the O/E ratio is >1, because fewer steps were excluded from the data set than expected from the reduction in substitutions. In other words, lr is larger than expected. On the other hand, if the removed subset is more homoplasious than the complete set, then O/E is <1, because lr is smaller than expected. An O/E ratio of 1 indicates that the removed part has the same degree of homoplasy as the complete data set. 629 Herein, we tested both our new factors—C and O/E—using simulation studies as well as a real data set derived from 18S and 28S sequences of Annelida. 1.1. On the Annelida As mentioned, Annelida are part of the rapid radiation of lophotrochozoan animals that occurred in or before the early Cambrian; besides annelids, Lophotrochozoa also includes mollusks, platyhelminths, brachiopods, nemerteans, rotifers and several other taxa (Halanych, 2004; McHugh, 2000, 2005). Annelida comprises an ecologically important animal phylum with over 16,500 described species, and its members include the dominant macrofauna of the deep sea. Traditionally two major groups have been distinguished: Clitellata (earthworms, leeches) and Polychaeta (e.g., Fauchald and Rouse, 1997; Westheide, 1997; Westheide et al., 1999). Monophyly of Clitellata is well-supported by both morphological and molecular data (e.g., Erséus, 2005; Erséus and Kallersjo, 2004; Purschke, 2002; Struck and Purschke, 2005). However, Polychaeta has proven more problematic. Cladistic analyses based on morphological data seem to support a monophyletic Polychaeta (Rouse and Fauchald, 1997), but recent molecular phylogenetic studies reject monophyly of polychaetes, as well as of their traditional subgroups (e.g., Rousset et al., 2007; Struck et al., 2007). Additionally, Siboglinidae (a.k.a. Pogonophora and Vestimentifera; beard worms) and Echiura (spoon worms; as sister to Capitellidae) have also to be included in Annelida (e.g., Bartolomaeus, 1995; Bleidorn et al., 2003b; Hessling, 2002; McHugh, 1997; Struck et al., 2007). Furthermore, Sipuncula (peanut worms) are discussed as either an annelid subtaxon or their sister group (e.g., Bleidorn et al., 2006; Struck et al., 2007; Tzetlin and Purschke, 2006). Unfortunately, well-supported more-inclusive groupings of the approximately 80 polychaete families are largely lacking. Although some recent studies used up to five genes, molecular analyses addressing annelid phylogeny are mainly based on the 18S and 28S genes (Colgan et al., 2006; Rousset et al., 2007; Struck et al., 2007). Different molecular analyses of annelid phylogeny have treated the problem of saturated positions in different ways. Some did not exclude any positions at all (e.g., Rousset et al., 2007), others excluded ambiguously aligned ones (e.g., Colgan et al., 2006), and others also excluded saturated ones (e.g., Jördens et al., 2004; Struck et al., 2007). Furthermore, although complete 18S was always used, different parts of the 28S have been included: the nearly complete sequence of 3.3 kilobases (kb) (Struck et al., 2005, 2007), the so-called D1 (0.3 kb) and/or D9/10 (0.6 kb) fragments (Brown et al., 1999; Colgan et al., 2006; Rousset et al., 2007) or a fragment comprising the region between these two fragments of about 2.2 kb (Jördens et al., 2004). Herein we evaluated our saturation factors, C and O/E, using simulation studies as well as a data set of 18S and 28S sequences from 98 annelid and 9 lophotrochozoan outgroup taxa. Furthermore, we determined how likely three different fragments of 28S are to saturate, compared to nearly complete sequences of 28S and 18S. Additionally, we propose an improvement of the sliding window analysis for rRNA and similar genes to group positions with similar genetic variation within one partition. 2. Material and methods 2.1. Simulated data To model simulation studies on an empirical tree, we selected 18S and 28S sequences of 9 polychaete and 1 outgroup from Struck et al. (2007). This taxonomic subsample was evenly distributed on their tree and represented the variety of branch lengths they recov- 630 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 ered. Chosen operational taxonomic units (OTUs) were Glycera, Sthenalanella, Drilonereis, Ninoe, Pectinaria, Ophelina, Notomastus, Polydora, Chaetopterus and Crassostrea (mollusk outgroup), and with these sequences we reconstructed the ML tree. The appropriate model of sequence evolution, Tamura & Nei + I + C, was determined based on the AIC criterion in Modeltest V 3.7 (Posada and Crandall, 1998, 2001). The estimated model parameters were: base frequencies: A = 0.2514, C = 0.2275, G = 0.3017, T = 0.2194; substitution rates: A M C, A M T, C M G, G M T = 1.0000, A M G = 1.8754, C M T = 4.7929; a = 0.6216, number of categories = 4; proportion of invariant sites = 0.4575. ML tree reconstruction based on these model parameters employed PAUP*4.0b (Swofford, 2002) with 10 random taxon additions and tree bisection-reconnection (TBR) branch swapping. Simulated data sets based on this ML tree (Fig. 1) and on the corresponding model parameters were generated using Mesquite v1.05 (Maddison and Maddison, 2004). By definition the initial data set had no or only a low degree of homoplasy or saturation, so we introduced these features in two ways. In the first strategy, we altered the evolutionary rate along the whole tree using different scale factors (0.1, 0.5, 1, 2, 10). This means that each branch of the tree was varied from one tenth to ten times the length of the branch shown in Fig. 1. Thus, the evolutionary rates were made to vary by a factor of 100 over the simulations. Note that, the simulations using a scale factor of 0.1 had the slowest evolutionary rates and the lowest degree of homoplasy. Subsequently, data sets generated for the second strategy were based on a scale factor of 0.1 to ensure that both strategies had similar, low, initial degrees of homoplasy. For the second strategy, we directly introduced random data (i.e., noise) using the option ‘‘random fill” in Mesquite v1.05 (Maddison and Maddison, 2004). In the different trials, 0%, 5%, 10%, 25% or 50% of the positions were randomized. For both strategies and for each individual option (i.e., 0%, 5%, 10%, 25% and 50% noise or scale factors of 0.1, 0.5, 1, 2 and 10), we simulated 50 individual data sets with 300 nucleotide positions resulting in total 10 times 50 (=500) data sets. For each individual data set, we were able to determine the C factor directly. However, the O/E ratio is a relative factor comparing a data set reduced by a certain partition against the complete data set. Therefore, to calculate the O/E ratio we had to concatenate individual data sets generated for each of the two strategies and five options, respectively. Because we were interested in the correlation of the O/E ratio with both saturation and the C factor, we concatenated for each strategy five individual data sets, one from each of the five options. For example, for the first strategy using scale factors we concatenated five individual data sets, one from each of the 50 data sets generated by a scale factor of 0.1, 0.5, 1, 2 or 10, respectively. For the second strategy, such a concatenated data set consisted of five individual data sets, one from each of the 50 data sets generated using 0%, 5%, 10%, 25% or 50% noise level, respectively. Thus, we got 2 times 50 concatenated data sets generated from the 10 times 50 individual data sets. This concatenation procedure allowed assessing the correlation of the O/E ratio with both saturation (i.e., 0%, 5%, 10%, 25% and 50% noise, or scale factors of 0.1, 0.5, 1, 2 and 10) and the determined C factor of the individual data. Finally, we also investigated if an increased number of positions results in an increased recovery of the true tree. Therefore, we simulated 50 data sets with 100, 300, 500, 1000, 2000 or 5000 positions, respectively, using a scale factor of 1. Herein, we used a scale factor of 1 in contrast to 0.1 as the initial degree of homoplasy to be able to compare these results with the results we got from the empirical data of Annelida. We determined the C factor as well as the MP and ML tree for each data set using 10 random taxon additions and TBR branch swapping in PAUP*4.0b (Swofford, 2002). 2.2. Empirical data The empirical data set comprised 107 OTUs of Annelida and outgroups including new sequence data of 18S and 28S determined for 11 and 16 polychaete OTUs, respectively. 2.3. Collection of data and sequence alignment Taxon 4 Taxon 10 Taxon 5 Taxon 6 Taxon 7 Taxon 3 Taxon 9 Taxon 8 Taxon 1 Taxon 2 0.1 Fig. 1. Tree used in simulation study. Phylogram of ML analysis of 9 annelid and 1 outgroup OTUs (Glycera, Sthenalanella, Drilonereis, Ninoe, Pectinaria, Ophelina, Notomastus, Polydora, Chaetopterus and Crassostrea). ln L = 12625.63033. OTU names have been concealed, because the tree was generated for simulation purposes only. Table 1 lists the taxa, gene sequences and GenBank accession numbers used in this study. Upon collection, tissue samples or whole individuals were preserved in >70% non-denatured EtOH or frozen at 80 °C. Genomic DNA was extracted using the DNeasy Tissue Kit (Qiagen). Amplification and sequencing of the nuclear 18S and 28S rDNA were carried out using protocols described by Struck et al. (2006). All products were verified on a 1% agarose gel and purified with the QIAquick PCR Purification Kit (Qiagen). As necessary, PCR products were size-selected on agarose gels and/or cloned using the pGEMÒ-T Easy Vector System (Promega). A CEQTM 8000 Genetic Analysis System (Beckman Coulter) using CEQ dye terminator chemistry was used for bidirectional sequencing of all products. Because Annelida are part of the poorly resolved lophotrochozoan clade (Halanych, 2004), their sister group is not known yet, so several other lophotrochozoan taxa were employed as outgroup taxa (see Table 1). Sequences were aligned with CLUSTAL W using default settings (Thompson et al., 1994) and subsequently corrected by hand in GeneDoc (Nicholas and Nicholas, 1997). Ambiguous positions were determined using GBlocks (Castresana, 2000) with default settings except the allowed gap positions, which were set to ‘‘with half”. Positions possessing a gap in less than 50% of the sequences are not excluded a priori, but can be selected in the final alignment if they are within an appropriate block. After the determination of classes of identity, all ambiguous positions were excluded from all subsequent analyses. The alignment (Accession #S2045) is available at TREEBASE (http://www.treebase.org). 631 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Table 1 List of the taxa used in the analyses with 18S and 28S sequence accession numbers Taxon Species 18S Arenicola brasiliensis (Nonato, 1958) Abarenicola affinis (Ashworth, 1903) DQ790076 Capitellidae Heteromastus filiformis (Claparede, 1864) Notomastus tenuis Moore, 1909 DQ790081 DQ790084 Maldanidae Clymenella torquata (Leidy, 1855) Clymenura clypeata (Saint-Joseph, 1894) Axiothella rubrocincta (Johnson, 1901) AF448152 DQ790078 DQ790027 Opheliidae Armandia brevis (Moore, 1906)a Ophelia rathkei McIntosh, 1908 Ophelina acuminata Oersted, 1843 EU418854 AF448157 DQ790085 EU418862 AY366513 DQ790045 Orbiniidae Orbinia swani Pettibone, 1957 Scoloplos fragilis (Verrill, 1873)b DQ790087 AY532360 DQ790048 EU418863 Paranoidae Aricidea sp.c EU418855 DQ790052 Scalibregmatidae Scalibregma inflatum Rathke, 1843 Travisia brevis Moore, 1923 Travisia forbesii Johnston, 1840 DQ790093 DQ790060 DQ790069 AF508127 Aphroditidae Aphrodita sp. AY894295 DQ790024 Polynoidae Gattyana ciliata Moore, 1902 Lepidonotus sublevis Verrill, 1873 AY894297 AY894301 DQ790035 DQ790039 Sigalionidae Sigalion spinosus (Hartman, 1939) Sthenelanella uniformis Moore, 1910 AY894304 AY894306 DQ790062 DQ790064 Hesionidae Ophiodromus pugettensis (Johnson, 1901) DQ790086 DQ790046 Nereididae Nereis succinea (Frey and Leuchart, 1847) Nereis vexillosa Grube, 1851 AY210447 DQ790083 AY210464 DQ790043 Pilargidae Ancistrosyllis groenlandica McIntosh, 1879 DQ790075 DQ790023 Syllidae Exogone naidina Oersted, 1845 Exogone verugera (Claparede, 1868) Proceraea cornuta (Agassiz, 1862) Typosyllis anoculata (Hartmann-Schröder, 1962) AF474290 AF474312 DQ790098 DQ790033 AF212165 DQ790071 Glyceridae Glycera dibranchiata Ehlers, 1868 Glycera americana Leidy, 1855d AY995208 EU418856 AY995207 EU418864 Goniadidae Goniada brunnea Treadwell, 1906 Glycinde armigera Moore, 1911 DQ790080 DQ790079 DQ790037 DQ790036 Nephtyidae Nephtys longosetosa (Oersted, 1842) Nephtys incisa Malmgren, 1865e Aglaophamus circinata (Verrill, 1874) DQ790082 EU418857 DQ790072 DQ790042 EU418865 DQ790020 Annelida Arenicolidae 28S DQ790025 DQ790038 DQ790044 DQ790030 Paralacydoniidae Paralacydonia paradoxa Fauvel, 1913 DQ790088 DQ790050 Phyllodocidae Phyllodoce groenlandica (Oersted, 1843) DQ790092 DQ790055 Alciopidae Alciopina sp. Torrea sp. DQ790073 DQ790096 DQ790021 DQ790068 Tomopteridae Tomopteris sp. DQ790095 DQ790067 Amphinomidae Paramphinome jeffreysii (Mcintosh, 1868) Eurythoë complanata (Pallas, 1766) AY838856 AY364851 AY838865 AY364849 Dorvilleidae Protodorvillea kefersteini (McIntosh, 1869) Parougia eliasoni (Oug, 1978) AF412799 AF412798 AY732230 DQ790053 Eunicidae Marphysa sanguinea (Montagu, 1815) Eunice sp. AY525621 AF412791 AY838861 AY732229 Lumbrineridae Lumbrineris latreilli Audouin & Milne-Edwards, 1834 Lumbrineris inflata (Moore, 1911) Ninoe nigripes Pettibone, 1982 AY525623 AY525622 AY838852 AY366512 AY364864 AY838862 Oenonidae Arabella semimaculata (Moore, 1911) Drilonereis longa Webster, 1879 Oenone fulgida Pettibone, 1982 AY838844 AY838847 AY838853 AY838857 AY838860 AY838863 Onuphidae Diopatra aciculata Knox and Cameron, 1971 Hyalinoecia tubicola O.F. Müller, 1776 AY838845 AF412794 AY838858 AY732228 Oweniidae Owenia fusiformis delle Chiaje, 1841 AF448160 DQ790049 Sabellaridae Sabellaria cementarium Moore, 1906 AY732223 AY732226 Sabellidae Serpulidae Schizobranchia insignis Bush, 1905 Serpula vermicularis Linnaeus, 1767 AY732222 AY732224 AY732225 AY732227 Siboglinidae Riftia pachyptila Jones, 1981 Siboglinum fiordicum Webb, 1963 AF168739 X79876 Z21534 DQ790061 Cirratulidae Cirratulus spectabilis (Kinberg, 1866) AY708536 DQ790029 (continued on next page) 632 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Table 1 (continued) Taxon Species 18S Ctenodrilidae Ctenodrilus serratus (Schmidt, 1857) AY364850 AY364864 Fauveliopsidae Fauveliopsis scabra Hartman & Fauchald, 1971 AY708537 DQ790034 Flabelligeridae Diplocirrus glaucus (Malmgren, 1867) Pherusa plumosa (Müller, 1776) AY708534 AY708528 DQ790031 DQ790056 Poeobiidae Poeobius meseres Heath, 1930 AY708526 DQ790058 Sternaspidae Sternaspis scutata (Ranzani, 1817) AY532329 DQ790063 Ampharetidae Auchenoplax crinita Ehlers, 1887 DQ790077 DQ790026 Pectinariidae Pectinaria gouldi (Verrill, 1873) DQ790091 DQ790054 Terebellidae Amphitrite ornata (Leidy, 1855) Pista cristata (O. F. Mueller, 1776) Polycirrus sp.f DQ790074 AY611461 EU418858 DQ790022 DQ790057 EU418866 Trichobranchidae Terebellides stroemi Sars, 1835 DQ790094 DQ790066 Chaetopteridae Chaetopterus variopedatus (Renier, 1804) U67324 AY145399 Spionidae Polydora ciliata (Johnston, 1838) Polydora sp. Prionospio dubia Maciolek, 1985g Scolecolepis viridis Verrill, 1873h U50971 EU418859 EU418860 28S DQ790059 EU418867 EU418868 Trochochaetidae Trochochaeta sp. DQ790097 DQ790070 Poecilochaetidae Poecilochaetus serpens Allen, 1904i AY569652 EU418869 Apistobranchidae Apistobranchus typicus (Webster & Benedict, 1887)i AF448150 EU418870 Clitellata Lumbricus terrestris Linnaeus, 1758 Lumbricus sp. Lumbriculus variegatus (Mueller, 1774) Lumbriculus sp. Eisenia foetida Savigny, 1826 Eisenia sp. Stylaria sp. Erpobdella octoculata (Linnaeus, 1758) Hirudo medicinalis Linnaeus, 1758 AJ272183 U95946 AF099949 Z83752 DQ790032 DQ790065 AY364865 AY364866 DQ790041 AY040693 DQ790040 AB076887 Aeolosomatidae Aeolosoma sp. Z83748 DQ790019 Parergodrilidae Parergodrilus heideri Reisinger, 1925 Stygocapitella subterranea Knöllner, 1934 Hrabeiella periglandulata Pizl & Chalupsky, 1984 AJ310504 AF412810 AJ310501 AY366514 AY366516 AY364867 Dinophilidae Trilobodrilus axi Westheide, 1967 Trilobodrilus heideri Remane, 1925 AF412806 AF412807 AY732231 AY894292 Nerillidae Leptonerilla prospera (Sterrer & Iliffe, 1982)j AY834758 EU418871 Polygordiidae Polygordius appendicularis (Fraipont, 1887)k AY525629 EU418872 Protodrilidae Protodrilus ciliatus Jägersten, 1952k Protodrilus purpureus (Schneider, 1868)k AY525631 AY527057 EU418873 EU418874 Protodriloidae Protodriloides chaetifer (Remane, 1926)l Protodriloides symbioticus (Giard, 1904)l AY527058 AF508125 EU418875 EU418876 Saccocirridae Saccocirrus sp.m EU418861 EU418877 Echiura Arhynchite pugettensis Fisher, 1949 Urechis caupo Fisher & MacGinitie, 1928 AY210441 AF119076 AY210455 AF519268 Sipuncula Phascolopsis gouldi (Pourtalès, 1851) AF342796 AF342795 Crassostrea virginica (Gmelin, 1791) Solemya velum Say, 1822 Yoldia limaluta (Say, 1831) AB064942 AF120524 AF120528 AY145400 AY145421 AY145424 Outgroups Bivalvia Gastropoda Ilyanassa obsoleta (Say, 1822) AY145379 AY145411 Polyplacophora Chaetopleura apiculata (Say, 1834) AY145370 AY145398 Brachiopoda Terebratalia transversa (Sowerby, 1846) Glottidia pyramidata (Stimpson, 1860) AF025945 U12647 AF342802 AY210459 Nemertea Cerebratulus lacteus (Leidy, 1851) AY145368 AY145396 633 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Table 1 (continued) Sequences that were newly obtained for the present study are indicated in boldface. a Locality: 48°32.971’N/122°55.378’W, Lopez Island, WA, USA; Specimen voucher: USNM 1112639. b Locality: Little Buttermilk Bay, MA, USA; Specimen voucher: USNM 1112638. c Locality: 39°58.397’N/70°40.281’W, Southern New England, MA, USA; Specimen voucher: USNM 1112641. d Locality: 41°43.250’N/70°20.150’W, Land grant area 1, Barnstable Harbor, MA, USA; Specimen voucher: USNM 1112637. e Locality: 40°53.019’N/70°25.003’W, Southern New England, MA, USA; Specimen voucher: USNM 1112640. f Locality: 40°35.596’N/71°31.911’W, Southern New England, MA, USA; Specimen voucher: USNM 1112642. g Locality: 39°56.172’N/69°34.573’W, Southern New England, MA, USA; completely used. h Locality: 41°41.484’N/70°37.612’W, Barlow’s Landing, MA, USA; Specimen voucher: USNM 1112636. i kindly provided by C. Bleidorn (University of Potsdam) (Bleidorn et al., 2003a, 2005). j kindly provided by K. Worsaae (University of Kopenhagen) (Worsaae et al., 2005). k Locality: North Sea island Helgoland, Germany; completely used. l Locality: List, North Sea island Sylt, Germany; completely used. m Locality: Giglio, Italy; completely used. homoplasious (Hennig’s auxiliary principle), the degree of variation is expressed as a percentage of the 27 maximally possible substitutions in a window of nine characters. For example, if such a window has six constant characters across some number of taxa and three characters that possess two nucleotide states across the same taxa, the window would be said to exhibit at least three substitutions assuming Hennig’s auxiliary principle. The degree of variation of this window would be 11.1% (3 characters with minimally one substitution/27 maximally possible substitutions). The window then slides over 6 characters (resulting in an overlap of three characters with the previous window) and the degree of variation (11.1% here) is assigned to the six characters left behind in the previous window. The degree of variation is determined anew each time the window slides down the alignment. The degree ranges from 0% to 100%. Classes of identity are assigned to 10% increments of genetic variation. To some degree, this procedure reflects a discrete C distribution (Yang, 1993, 1996). This sliding-window procedure has a problem, however. Inclusion of an insertion not in frame with the number of characters the window slides down each step (i.e., six characters herein) has a shifting effect on all windows downstream from this insertion, even if the sequence with the indel is otherwise completely identical to sequence(s) present in the alignment. To counterbalance such effects and increase the robustness of the procedure, we modified the sliding window analyses to restrict the effects of indels to their vicinity. Instead of six characters, the window is moved by 2.4. Determination of classes of identity Saturation can be investigated for complete genes or for subsets of characters (sites) within genes that share a certain property. Such groups are called classes of identity (Vogt, 2002). Different sites within genes can show tremendous differences in substitution rates with some characters seemingly static and others experiencing numerous substitutions (Yang, 1993, 1996). Therefore, various regions of a gene can be treated differently. For structural genes, like the small and large subunit ribosomal RNA genes, variable and conserved parts have been characterized (e.g., Hassouna et al., 1984; Mallatt et al., 2001; Zardoya and Meyer, 2001). The former are the so-called divergent domains (D1-12) in 28S and the variable regions (V1-9) in 18S. Plots on tertiary structure predictions reveal that the degree of variability depends on the distance from the ribosome center and is continuously changing (Wuyts et al., 2001). Therefore, we wanted to distinguish between more than just these two basic classes of identity (conserved vs. variable) to get more rate categories (classes of identity). An approach using information contained in the data set seems to be promising for discriminating more classes of identity. Various authors have performed sliding window analyses examining the degree of variation in nucleotide positions (e.g., Jördens et al., 2004; Pesole et al., 1992; Struck et al., 2002a; Sturmbauer and Meyer, 1992). Variation is examined within a window of nine characters (Fig. 2). Assuming a priori for each substitution that it is not A A A T C C C C C G G G T T T T A A A A A A A A G G G G A A G A T T T T T T T T G G G G A A A A A A A A A G C A T C C C A C T G G ... C .. . A .. . T . .. 0 1 0 0 0 0 1 0 0 0 0 0 2 1 3 3 ... Alignment ... ... ... ... Minimally necessary substitutions ... 1 Sliding window ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 11.1% move on 6 characters ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 14.8% 11.1% Assigned degree of variation ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 11.1% 14.8% 74.7% Fig. 2. Hypothetical scheme illustrating the traditional sliding window analysis. The minimal number of substitutions necessary to explain a pattern at a position are given below the alignment. Character positions in the alignment are shown as 61–77, and the gray rectangles indicate the sliding window. Brackets indicate the assigned genetic variation. 634 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Alignment ... ... ... ... A A A T C C C C C G G G T T T T A A A A A A A A G G G G A A G A T T T T T T T T G G G G A A A A A A A A A G C A T C C C A C T G G ... C .. . A ... T ... Minimally necessary substitutions ... 1 0 1 0 0 0 0 1 0 0 0 0 0 2 1 3 3 ... Modified sliding window ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 11.1% move on 1 character ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... ... 61 ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 7.4% 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 14.8% ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 33.3% 7.4% 3.7% 3.7% 11.1% 25.9% 13.2% Position Assigned degree of variation ... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ... 36.2% 24.7% 30.5% 15.2% 19.3% 11.1% 12.3% 13.2% 11.5% 22.2% 16.9% 36.2% 28.4% 53.9% 45.3% 70.8% 63.0% Fig. 3. Hypothetical scheme showing the modified sliding window analysis proposed here, one moves along just one character per step (as opposed to six characters in Fig. 2). The minimal number of substitutions necessary to explain a pattern at a position are given below the alignment. The numbers 61–77 are the character positions in the alignment, and gray rectangles indicate the sliding window. Below the modified sliding window analysis the individual degrees of variation for the positions 61–77 are shown. 0.12 Average contribution (to genetic variation) only one character and the variation is determined anew in the next window. The degree of variation of a character is calculated by averaging the percentage of variation of the nine windows comprising the character (Fig. 3). Thus, an individual degree of variation is assigned to each character contrary to the shared degree by six characters. Furthermore, the average contribution of adjacent characters to the genetic variation of a certain character depends on the distance between them (Fig. 4). That is, direct neighbors of a character contribute most strongly, and the more sites there are between two characters the less is the contribution; in a nutshell—the shorter the distance, the larger the influence. At last, this approach is independent of the direction and the starting point of the sliding window analysis. Thus, classes of identity (at 10% increments) were determined as in Fig. 3. For each class of identity, as well as for the complete 18S and 28S alignments, v2 tests for homogeneity of base frequencies across taxa were performed. Furthermore, single-factor analyses of variance (ANOVA) in Microsoft Excel were conducted to test the independence of each class of identity determined by the sliding-window approach from the other classes of identity with respect to the distribution of the genetic distances of the pairwise- 0.10 0.08 0.06 0.04 0.02 0.00 -8 -7 -6 -5 -4 -3 -2 -1 coi 1 2 3 4 5 6 7 8 Positions of character relative to the character of interest (coi) Fig. 4. Contribution of nearby positions to the genetic variation of a certain position depending on the distance to that position. taxa comparisons. First, we compared the two similar classes of identity (e.g., the two 0–10% classes) of the two rRNA genes, 28S and 18S, against each other to test if each was drawn from its 635 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 2.5. Determination of C factor and O/E ratio For the C factor, any genetic distance could be used, but herein we used the uncorrected genetic distance p. Values for ti/tv and p were determined for each pairwise comparison of taxa and the standard deviations of both distributions were estimated using Microsoft Excel. The C factor was determined for each simulated data set as well as for each class of identity of the empirical example. The O/E ratio is the ratio of the CI values with and without the subset of interest included in the concatenated data set. For the simulated data sets, equally weighted parsimony analyses using PAUP*4.0b (Swofford, 2002) and the branch and bound algorithm were performed to determine the CI values of the concatenated data sets as well as of the concatenated data sets reduced by each individual data set in turn (e.g., 0%, 5%, 10%, 25% or 50% noise, respectively). For the empirical data, with its many (100+) taxa, we used another, faster strategy to determine the CI values of the complete alignment as well as of the complete alignment reduced by each class of identity in turn; that is, the faster heuristic search with TBR branch swapping was used. Each tree search for the empirical data was done in at least two steps. In the first step, we used the option ‘‘random taxon addition” with 1000 replicates and a chuck score of 1. A maximum of 50 trees were saved per replicate to screen the tree space faster for the best solution(s). In the next step, the chuck score was increased to the value of the most-parsimonious solution plus 1 additional step (and all other settings remained unaltered) to obtain all most-parsimonious trees with the score of the first step. If the second step achieved a better solution than the first pass, the second step was repeated with a newly adjusted chuck score. The search was stopped when the last step resulted in the same most-parsimonious tree length as the previous step. Usually two passes were necessary; only occasionally a third pass was necessary. In all analyses, gaps in the alignment were treated as missing data. scribed by Posada and Buckley (2004) using Modeltest V 3.06 (Posada and Crandall, 1998, 2001). The most likely tree was reconstructed in PAUP*4.0b (Swofford, 2002) using the indicated parameters, 10 random taxon additions and TBR branch swapping. Furthermore, for each data set we also used the best tree of BI as an individual starting tree to test if a better ML tree could be obtained. The reliability of phylogenetic nodes was estimated in three ways: ML, maximum parsimony (MP), and neighbor-joining (NJ). First, using GARLI v0.951 (Zwickl, 2006), we determined ML bootstrap (BS) values of 100 replicates employing the GTR + I + C model and the option ‘‘gentreshfortopoterm” set to 5000. Second, in a MP-BS analysis 100 BS replicates were conducted with 10 random taxon additions per replicate, TBR branch swapping, chuck score set to 1 and no more than 50 trees kept. Third, 10,000 BS replicates using a NJ search algorithm and ML settings were performed. 3. Results 3.1. Simulation studies Fig. 5 shows how the C factor varies with noise and with scale factor in our simulated data sets. The C factor decreases with increasing noise (Fig. 5A) and with increasing scale factor (Fig. 5B) and thus with increasing homoplasy. The decreases follow a power function. Whereas at 0% noise or a scale factor of 0.1 the C factor is larger than 100, at 25% random data (i.e., 0.25 noise) or a doubled tree length (i.e., scale factor of 2) it is around 10, and at higher degrees of random data it drops to values around 5. This shape of the curve has to be expected, because the C factor decreases either as the standard deviation of the ti/tv distribution decreases, or as the standard deviation of the distribution of the genetic distances increases. In the case of increased saturation and homoplasy, both effects occur at the same time and thus the decrease of the C factor is amplified. 250 200 C factor own underlying probability distribution. Second, we compared each class of identity to its directly adjacent classes of identity within each gene. For example, the 10–20% class of the 18S gene was compared to the 0–10% class as well as to the 20–30% class of the 18S gene. Based on the ANOVA test, we wanted to ensure that each determined class of identity was truly distinct from the other ones and not just due to chance. 150 100 2.6. Phylogenetic analyses of the empirical data 50 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Noise 250 200 150 C factor Phylogenetic analyses using Bayesian Inference (BI) and ML were conducted on both the complete data set (ALL) and with saturated positions excluded (EX). For BI, MrModeltest 1.1b (Nylander, 2002) was used to determine appropriate models of sequence evolution for both genes, 18S and 28S. Then, to reconstruct trees, the parallel version of MrBayes 3.1 (Altekar et al., 2004; Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003) was used. The model parameters were set in accordance with the specified model for each gene separately. Thus, a partitioned likelihood analysis of the combined data set was implemented. Two simultaneous runs with four Markov chains each ran for 1.5 107 generations. Sampling every 250th tree, 2 60,001 trees were sampled in total. Based on average standard deviations above 0.1 for the implemented diagnosis feature, the first 18,000 trees in each run were discarded as burn in. The majority-rule consensus tree containing posterior probabilities (PP) of the topology was determined from the remaining 2*42,001 trees. For the ML analyses, appropriate models of sequence evolution for the data sets were assessed by model averaging procedures using the weighted Akaike Information criterion (=AIC) as de- 100 50 0 0 2 4 6 8 10 12 Scale factor Fig. 5. Plots of the C factor versus noise (A) and scale factor (B). Mean values with standard deviation are provided. 636 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 1.02 1.01 1 O/E ratio 0.99 0.98 0.97 0.96 0.95 0.94 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Noise 1.02 1.01 1 O/E ratio 0.99 0.98 0.97 0.96 0.95 0.94 0.93 0 4 2 6 8 10 12 Scale factor Fig. 6. Plots of the O/E ratio versus noise (A) and scale factor (B). Mean values with standard deviation are provided. Fig. 6 shows how the O/E ratio varies with noise and with scale factor in our simulated data sets. The O/E ratio remains relatively constant at 1.00–1.01 in the range from 0% to 10% noise (Fig. 6A) and the standard deviations overlap in this range. The same is true for scale factors in the range of 0.1–2 (Fig. 6B). However, the O/E ratio decreases sharply, to values of about 0.95, with 50% noisy 1.04 Scale factor Noise 1 O/E ratio increasing homoplasy 1.02 0.98 0.96 0.94 0.92 0.9 0.88 0 50 100 150 200 250 300 C factor increasing homoplasy Fig. 7. Plot of the O/E ratio versus the C factor for the simulated data that were generated using either scale factors (gray triangles) or degrees of noise (black squares). 637 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 data or a 10-fold longer tree length. Given that the O/E ratio measures the degree of homoplasy relative to the complete data set, such a result is not surprising in our simulated concatenated data sets. A partition with a high degree of homoplasy (e.g., 50% noise) can increase the O/E ratio of another partition even though this one has substantial homoplasy itself (e.g., 25% noise). The addition of the first one increases the average homoplasy of the complete data set while the degree of homoplasy in the other partition remains unaltered by the addition. Thus, due to the addition of the first partition the O/E ratio of the other partition increases, because it is relatively less homoplasious in comparison to the new average degree than before (see also Discussion). Fig. 7 shows the relation of the C factor to the O/E ratio in our simulated data sets. The C factor correlates positively and linearly with the O/E ratio at lower values, when homoplasy is high (Fig. 7). Overall, the figure shows that an increasing C factor indicates an increasing O/E ratio. However, the correlation does not stay linear. The O/E ratio increases strongly, from 0.9 to 1, up to a C factor of 10 and then it levels off. At a C factor of 20 and above the O/E ratio has a constant value of 1 or slightly higher. The value of the C factor declines more sharply with increasing homoplasy (scale or noise) than does the O/E ratio, as is obvious from comparing Figs. 5 and 6. More specifically, C starts declining at far lower levels of homoplasy (>0 in Fig. 5) than does O/E (>10% or >2 in Fig. 6). Thus, C is a much more sensitive indicator of homoplasy. However, both C factor and O/E ratio correlate with homoplasy and with each other when the degree of homoplasy increases. 3.2. Classes of identity in the annelid example The aligned, empirical, data set contained 2565 positions for the 18S and 5775 for the 28S partition, from which GBlocks excluded 1345 and 3429 ambiguously aligned positions, respectively. In the final alignment of 3566 positions 1368 positions were parsimony-informative, 1684 constant and 514 sites parsimony uninformative. In the sliding window analyses, 10% increments of variability could be determined ranging from 0–80% in the 18S partition and from 0–100% in the 28S partition. In both, most positions were within the classes of identity with low variability (Table 2: 0– 10%, 10–20%, etc.) and higher classes of variability contain progressively fewer characters. The last two increments of the 28S partition (80–90% and 90–100%) were pooled, because the latter one would have had only 28 positions, which is a rather small sample size for the calculations of the C factor and O/E ratio. That is, too small a sample size would produce too much random error. The complete data partitions (‘‘ALL”), as well as every class of identity within partitions, showed homogeneity of base frequencies across taxa (Table 2). Finally, we checked if either two neighbouring classes of identity within a partition (e.g., 0–10% with 10–20% within the 18S partition) or the same classes of identity of two partitions (e.g., 0–10% of 18S and 28S partitions) had to be merged with each other, because their characters were drawn from the same underlying probability distribution. To the contrary ANOVA indicated that the characters of each class of identity were drawn from independent underlying probability distributions and thus no classes have been merged (see the right side of Table 2). This means that the distribution of genetic distances of pairwise-taxa comparisons was unique for each class of identity and never similar or identical to the distribution of another one. This holds even though the mean-genetic distances of some classes of identity are very similar (see the points that nearly touch in Fig. 8). 3.3. Detection of saturation in the annelid example Fig. 8A shows how the C factor varies with increasing genetic distance across annelid taxa in our empirical example. With increasing genetic distance the C factor decreases following a power function. Classes of identity from 0–30% for the 28S and from 0–20% for the 18S achieve values of more than 100, whereas classes above 70% show values below 10 accompanied by much higher genetic distances. Thus, there is a difference of an order of a magnitude. This resembles the results of our simulation studies, which exhibited a strong decrease of the C factor with increasing homoplasy. Hence, the higher the variability in the class of identity the stronger is the degree of saturation indicated by the C factor. Turning to O/E ratio, all classes of identity in the 18S with degrees of variation below 60%, as well as in the 28S below 50%, have ratios larger than or close to 1, whereas all other classes have values below 1 (Fig. 8B). Generally, O/E ratio decreases with increasing variability. However, while the classes of identity of the 18S in the range from 0–50% degree of variation remain more or less constant the classes of the 28S exhibit an optimum of the O/E ratio in the class of 20–30% variability. The relation of the C factor to the O/E ratio is similar to our simulation studies (Fig. 8C). The O/E ratio increases sharply up to 1 (or slightly higher) at low C factors and then starts to level off at C factors in the range from 20 to 30. At higher C factors the O/E ratio might be viewed as levelling off at a value around 1.01, except for two maverick points for 28S. Furthermore, all six classes of identity possessing an O/E clearly smaller than 1 have a C factor equal to or smaller than 20, whereas nearly all others have a C above 20 (Fig. 8C). The lowest O/E value can be observed in the class of identity 80–100% of the 28S partition, which has also the lowest C factor (Fig. 8A and B). We decided to treat all classes of identity with an O/E ratio clearly below 1 (i.e., 0.99) as saturated, for three objective reasons. Table 2 Homogeneity of base frequencies and ANOVA results for the genetic distance p of the classes of identity of 18S and 28S Classes of identity All 0–10% 10–20% 20–30% 30–40% 40–50% 50–60% 60–70% 70–80% 80–100% Number of alignable characters Homogeneity of base frequencies 18S 28S 18S 28S 1220 250 251 209 138 133 99 87 53 n.a. 2346 432 431 435 269 224 209 139 104 103 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 n.a. 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 Significant values are in bold. n.a., not applicable. ANOVA results for the genetic distance p 18S vs. 28S 0–10% 10–20% 20–30% 30–40% 40–50% 50–60% 60–70% 70–80% 80–100% <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 0.005 n.a. Comparison 18S 28S All 0–10% 10–20% 20–30% 30–40% 40–50% 50–60% 60–70% 70–80% <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 n.a. <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 vs. vs. vs. vs. vs. vs. vs. vs. 10–20% 20–30% 30–40% 40–50% 50–60% 60–70% 70–80% 80–100% 638 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 300 18S 28S 0-10% 250 Power C factor 200 150 10-20% 20-30% 100 30-40% 50 50-60% 40-50% 60-70% 70-80% 80-100% 0 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 uncorrected genetic distance p 1.06 O/E ratio 1.03 1 20-30% 30-40% 10-20% 0-10% 40-50% 0.97 50-60% 60-70% 70-80% 0.94 80-100% 0.91 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 uncorrected genetic distance p 1.04 1.03 1.02 O/E ratio 1.01 1.00 0.99 0.98 0.97 0.96 0.95 0.94 0 50 100 150 200 250 300 C factor Fig. 8. Plots of both the C factor (A) and the O/E ratio (B) versus the genetic distance p for the actual data set, as well as a plot of the O/E ratio versus the C factor (C), all for the empirical data. The classes of identity of the 18S (black diamonds) and 28S (black squares) are indicated in the plots. In (A), the power function of the C factor is shown. First, an O/E ratio below 1 indicates that the degree of homoplasy in this class of identity is worse than the average degree in the complete data set. Second, given the relation of C factor and O/E ratio (Figs. 7 and 8C) the asymptote of the O/E ratios was in the range from 1.00 to 1.01. Third, as mentioned in the previous paragraph, all O/E ratios with values clearly below 1 were generally accompanied by low C factors of 20 or less (Figs. 8C and 7). This means we eliminated in the saturation-excluded EX data set all classes of identity with a degree of variation above 60% and also the class of identity 50–60% of the 28S. Such a strict and conservative procedure is similar to the one used by Struck et al. (2007). 3.4. Comparison of 28S fragments Of the three 28S fragments that have been employed in the literature, the D9/10 of annelids has a C factor of about 100, the 639 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Table 3 Comparison of the three different 28S fragments used in annelid phylogeny Fragment D1 (0.3 kb) 2.2 kb D9/10 (0.6 kb) C factor O/E ratio Genetic distance p With s.c. Without s.c. With s.c. Without s.c. With s.c. Without s.c. 9.86 15.66 103.77 51.46 64.77 159.80 0.97 0.93 1.03 1.01 1.06 1.03 0.1357 0.1010 0.0382 0.0477 0.0508 0.0269 s.c., saturated characters. 2.2 kb fragment has a C factor around 15, and D1 has a C factor around 10 (Table 3 with s.c.). Deleting supposedly saturated positions increases the C factor in all three fragments. In fact, the C factors for the 2.2 kb and D1 fragments increase markedly, by a factor of 4–5. The O/E ratio of D9/10 is above 1 and is unaltered by the exclusion of saturated positions (Table 3). On the other hand, the O/E ratios of the other two fragments are below 1 and deleting saturated characters from these fragments increases their O/E ratios to above 1. This effect is most marked for the 2.2 kb fragment, where an O/E ratio of 0.93 increases to 1.06. In any case, we have found a way to assure O/E ratios are above our threshold of 1, for all three fragments and therefore to minimize homoplasy. 3.5. Phylogenetic analyses of the annelid example For the annelid example of nearly complete 18S and 28S data, two data sets were analyzed, a complete one with only ambiguously aligned positions excluded, but saturated positions retained (ALL) and another one with the possibly saturated positions also excluded (EX). ML analyses of both data sets were conducted in two ways; that is, by using starting trees obtained by random taxon addition, and by using the best tree of the BI as a starting tree. In both cases, the tree obtained using the best BI tree achieved a better (smaller) ln L value than did the standard procedure of random taxon addition. Precisely, for the EX data set, ln L using random taxon addition was 31,647.03318 and using the best BI tree 31,643.88339. For the ALL data, the values were 62,622.70950 and 62,612.01527, respectively. Thus, for taxon-rich data sets its seems more promising to use near optimal starting trees obtained by other procedures such as BI to increase the probability to reconstruct the overall most likely tree. Additionally, such a tree search needed only about one third of the CPU time in our analyses. Analyses excluding possibly saturated characters (EX) had just four nodes showing zero branch length (Fig. 9) and the analyses not excluding these characters (ALL) had only one (Fig. 10). Thus, overall the resolution in the trees is good, but support is low beyond the ‘‘family” level in both. This is indicated by the paucity of support symbols on the basal nodes along the ‘‘backbones” (left side) of the trees. Thirty-eight of the 104 internal nodes of the EX data set exhibited maximum-parsimony-bootstrap support (MPBS) P 70, and for 32 of these nodes the MP-BS were P95. Similar values were obtained for neighbor-joining bootstrapping (NJ-BS): 38 nodes with NJ-BS P70 and 31 nodes P95. Forty-five of the internal nodes had Bayesian posterior probabilities (PP) P0.95. Finally, 40 nodes had a maximum-likelihood-bootstrap support (MLBS) P70, and 32 of these were P95. In analyses of the ALL data set, 47 nodes had MP-BS values P70, and 41 were P95. Fifty-two nodes exhibited ML-BS P70, and 40 of these nodes were P95. Similar results were obtained for NJ-BS and PP values: 47 nodes P70, 35 nodes P95 and 64 P 0.95, respectively. Thus, the ALL data (with more characters) consistently yielded more supported nodes than did the EX data (with fewer characters), meaning that support and resolution increased with the number of nucleotide positions. Comparing the two trees (Figs. 9 and 10) reveals that for both data sets these support values favor more recent events and are not randomly spread out. For both data sets, only a few taxa beyond the recognized ‘‘family” level are supported significantly as indicated by bootstrap values. Monophyly of Aphroditiformia as well as of Clitellata is substantiated. The close relationships of Echiura and Capitellidae and of Eunicidae and Onuphidae (Bleidorn et al., 2003a, 2003b; Struck et al., 2006, 2007) are significantly corroborated. Additionally, in the analyses of the ALL data set (Fig. 10), monophyly of Terebelliformia excluding Pectinariidae achieves significant support by ML-BS and MP-BS as does the close relationship of Spionidae, Trochochaetidae and Poecilochaetidae. These results corroborate previous findings (e.g., Bleidorn et al., 2003b; Colgan et al., 2006; Rousset et al., 2007; Struck and Purschke, 2005; Struck et al., 2007; Worsaae et al., 2005). For this study we also obtained new 18S and/or 28S data for some polychaete taxa. Our analyses allow tentative conclusions regarding some of these taxa. Both analyses (Figs. 9 and 10) grouped Spionidae, Trochochaetidae and Poecilochaetidae together. This clade gathered significant support in the analyses using ALL data (Fig. 10). This result is in agreement with morphological data, which place these taxa in Spionida (Rouse and Fauchald, 1997; Rouse and Pleijel, 2001). On the other hand, the other spionidan taxon for which we obtained new data, Apistobranchidae, is sister to the sabellidan taxon Oweniidae in both analyses (Figs. 9 and 10). In this study coverage of Opheliidae was increased in comparison to recent studies (Colgan et al., 2006; Struck et al., 2007) and Opheliidae were recovered as sister(s) to Clitellata, albeit with weak support (Figs. 9 and 10). This was also found in an analysis based on 18S data alone (Struck et al., 2005). Most new sequences were obtained for the former archiannelidan taxa Dinophilidae, Nerillidae, Polygordiidae, Protodrilidae, Protodriloidae, and Saccocirridae (see Hermans, 1969). Monophyly of the ‘‘Archiannelida” has been rejected based on morphological and molecular data (Purschke and Jouin, 1988; Struck et al., 2005, 2002a,b; Westheide, 1997), which is further substantiated by our results (Figs. 9 and 10). Monophyly of the morphologically well-defined taxon Protodrilida comprising Protodrilidae, Protodriloidae, and Saccocirridae (Purschke and Jouin, 1988; Rouse and Pleijel, 2001) was also not recovered by our analyses (Figs. 9 and 10). This is congruent with results mainly derived from 18S data (Hall et al., 2004; Rousset et al., 2007; Struck et al., 2002a; Struck and Purschke, 2005). 3.6. Number of positions Because our larger ALL data set with more positions (Fig. 10) gave better resolution than our smaller EX data set with fewer positions, we used simulations to explore the effect of number of positions on the quality of tree reconstructions, with simulation settings derived from a smaller, 10-taxon annelid example. The relation of number of positions to saturation, recovery of the true tree, and phylogenetic method employed was investigated using simulated data sets with 100 up to 5000 positions. (Note that there 640 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Fig. 9. Cladogram of the ML tree excluding possibly saturated positions (EX) leaving 2871 total positions. ln L = 31,643.88339. Support values are indicated at the branches. Model parameters: base frequencies: A = 0.2659, C = 0.2057, G = 0.2943, T = 0.2341; substitution rates: A M C = 1.3007, A M G = 3.0178, A M T = 1.2128, C M G = 0.8623, C M T = 6.8312, G M T = 1.0000; a = 0.4900, number of categories = 4; proportion of invariant sites = 0.3810. A phylogram, indicating the actual branch length, is provided in Supplementary Figure 1. was no deliberately introduced noise this time; the different-sized data sets were increases in constant quality.) With an increasing number of positions, the C factor slightly decreases (Fig. 11). On the other hand, recovery of the true tree (Fig. 1) increases with an increasing number of positions even though the C factor indicates that saturation is increasing (Fig. 11). T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 641 Fig. 10. Cladogram of the ML tree not excluding possibly saturated positions (ALL) with 3566 total positions. ln L = 62,612.01527. That is, this is from a larger data set. Support values are indicated at the branches. Model parameters: base frequencies: A = 0.2410, C = 0.2331, G = 0.3008, T = 0.2251; substitution rates: A M C = 0.9835, A M G = 2.3338, A M T = 1.1084, C M G = 0.8864, C M T = 4.8116, G M T = 1.0000; a = 0.5726, number of categories = 4; proportion of invariant sites = 0.3803. A phylogram, indicating the actual branch length, is provided in Supplementary Figure 2. However, the chance of recovering the true tree depends on the reconstruction method used. Neither ML nor MP recovers the true tree with only 100 or 300 positions. Starting with 500 characters, the chance of recovering the true tree increases linearly with increasing number of positions for both ML and MP (R2 for ML is 0.997 and 0.826 for MP). Probability of recovering the true tree 642 0.9 40 0.8 35 0.7 30 0.6 25 0.5 20 0.4 15 0.3 10 0.2 5 0.1 0 0 1000 2000 3000 4000 ML MP 45 Ptrue C factor T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 0 5000 Positions Fig. 11. Plots of the C Factor and Ptrue versus number of positions in a range of simulated data sets. Mean values with standard deviation are provided for the C factor. Ptrue shows the probability of recovering the true tree for ML (open squares) and MP analyses (open triangles). increases much more quickly with ML (5000 positions = 85% chance of recovery) than MP (5000 positions = 15%). By extrapolating these linear correlations, we can expect to recover the true tree with a probability of 0.95 at approximately 5500 positions for ML and 31,500 positions for MP. For a probability of 1.0 we would at least need 5800 and 33,000 positions, respectively. Similar discrepancies between ML and MP approaches can be seen in a fungal example for bootstrap support and non-stationary genes (Figs. 2 and 4 of Collins et al., 2005). 4. Discussion 4.1. Detection of saturation C factor correlates with saturation, providing a more objective mean than traditional approaches (e.g., Halanych and Robinson, 1999; Jördens et al., 2004; Struck et al., 2002a) of assessing saturation. Traditional approaches using plots are rather subjective because the assessment whether a partition is saturated depends on the subjective interpretation of graphs rather than on actual values. Furthermore, assessing the saturation of more than three to four partitions (see for example the Supplementary Materials of Struck et al., 2005; and Struck et al., 2007), the representation within one graph becomes less and less applicable, though such a comprehensive display does promote comparability and thus decision-making. Therefore the C factor, which condenses the information of the degree of saturation, shown in the graph, into a single value, allows more objective comparisons across partitions and to other analyses. Based on the notion of entropy in information theory, Xia et al. (2003) proposed another index of substitution saturation (Iss). Whereas the C factor is based on pairwise comparisons of taxa, Iss assesses saturation along the positions. Iss is the ratio of the average information entropy of positions to the expected entropy of the alignment. The smaller the Iss value, the less substitution saturation has occurred. The determination of Iss is based only on nucleotide frequencies in the alignment and thus is independent from any model of phylogenetic reconstruction or substitution. The C factor herein is also independent of such models: Prior to any tree reconstruction and without any explicit model assumption, the uncorrected genetic distance p between pairs of taxa as well as ratios of the transition and transversions events were calculated. However, in contrast to Iss the C factor can also be determined a posteriori and/or based on explicit model assumptions for both genetic distance and ti/tv. For example, the genetic distance between two taxa can also be determined based on an underlying tree and an explicit substitution model. Therefore, alteration in the C factor can be used in future studies to investigate if certain models increase this factor and thus are able to alleviate the saturation problem at least in part. Unfortunately, the C factor and Iss do not provide an intrinsic threshold indicating when saturation becomes problematic in phylogenetic reconstructions. Simulation studies can be conducted to obtain such threshold values for C factor or Iss (Xia et al., 2003). For sets of different topologies, substitution parameters, numbers of positions and taxa, values of the C factor or Iss can be determined below which the true tree is not recovered with a certain percentage, e.g., 95%. This generates a suite of critical C factor or Iss values that can be used to assess values of empirical data sets for their potential to mislead phylogenetic reconstructions (Xia et al., 2003). However, this procedure always bears the risk that such a suite of critical values is not fully appropriate for the empirical data at hand. For example, if the suite is based on topologies with terminal OTUs being equidistant to the root, in contrast to the topologies of the empirical data. Another approach would be to use the parameters of the empirical data for the simulations, parameters such as number of taxa and positions, substitution parameters and topology of the best tree. However, this approach always includes the possibility of circularity: This is because, first, the parameters obtained from the empirical data (e.g., branch length and model tree) are used to conduct the simulation studies; and second, these simulation studies are used to assess whether some parameters obtained from the empirical data set (e.g., topology) are misled by saturation. But if, in the empirical data some long-branched taxa grouped together due to increased substitution rates and saturation, this tree including the artificial group would be taken as the model tree for the simulation studies. Therefore, an indication that saturation poses no problem to the placement of the longbranched taxa by the empirical data may not be reliable, because the simulation studies were based on an artificial result in the first place. Therefore, to avoid this uncertainty we combined the C factor with the O/E ratio, which is an estimator independent from the C factor. While the C factor assesses saturation, the O/E ratio determines the degree of homoplasy in a partition relative to the complete data set. As expected, the correlation of C with O/E (Figs. 7 and 8) showed that saturation results in an increased degree of homoplasy. In contrast to the C factor, the O/E ratio contains an T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 intrinsic threshold value of 1. Values above 1 indicate that the partition is less homoplasious than the average of the complete data set and values below 1 that it is more homoplasious. In our analyses, O/E values clearly below 1 corresponded with a C factor below 10, and C factors in the range of 10–20 were associated with O/E ratios below or close to 1 (Fig. 7 and 8C). At higher C values, the O/E ratio leveled off. Therefore, the critical threshold of the C factor may be in the range of 10–20, because this is where the drop-off of the O/E ratio occurred. However, this range and the correlation between C and O/E should be further investigated by employing a wide variety of empirical data. Although the degrees of homoplasy and saturation are correlated in the present study, this might not hold up in all instances. For example, a data set showing no or only slight saturation can still exhibit some homoplasy if there are numerous parallel mutations, as might occur under strong selection. Thus, there will be O/E values below 1 in parts of such a data set, without indicating saturation. Conversely, if a data sets shows saturation overall, some parts of it will still have a O/E ratio above 1 among the many classes of identity. Moreover, because the O/E ratio is a relative measure to the degree of homoplasy of the complete data set, there will always be values above and below 1. Additionally, if a single class of identity covers all positions of a data set, the average of O/E ratios of the ‘‘classes” of identity is 1. Thus, the relative O/E ratio can only be used as an additional, supportive measurement for saturation in combination with direct factors such as the C factor. For the annelid ribosomal data, we excluded all partitions with an O/E value below 1 from the subsequent phylogenetic analysis. Given these arguments, this cut-off point might have been overly conservative, though the values correlated with low C factors below 20. O/E ratio is valuable not only for comparing possibly saturated characters, but also for assessing the quality of different data partitions in a concatenated analysis in terms of degree of homoplasy: The smaller the O/E ratio of a partition, the more homoplasy it contains (in comparison to the complete data set and other partitions in the analysis). Because it is a parsimony measurement and it is calculated using CIs, the O/E ratio can directly compare different types of characters such as amino acid, nucleotide and morphological traits, within one data set and one approach. Furthermore, in contrast to only using CIs of individual partitions, O/E ratios can tell how CIs are altered when positions are deleted or added. Alteration approaches are able to reveal not only contributions of individual partitions, but also their influence on other partitions in the data set (e.g., Gatesy et al., 1999; Struck, 2007). 4.2. Possibly saturated positions and Annelida phylogeny Though exclusion of possibly saturated positions can improve phylogenetic reconstructions (e.g., Simon et al., 1994), this approach seems not be helpful for the reconstruction of the annelid phylogeny using 18S and 28S sequence data. Instead, we found that using more positions (3566 nt, Fig. 10, ALL) was better than using fewer by excluding some (2871 nt, Fig. 9, EX). That is, both nodal support and tree resolution increased with increasing number of positions (cf. Figs. 9 and 10). Our EX analyses showed results similar to previous analyses (e.g., Colgan et al., 2006; Rousset et al., 2007; Struck and Purschke, 2005), with low resolution and paraphyly of many clades – even of clades that are well-supported by anatomical characters, such as Phyllodocida and Eunicida (Fig. 9). On the other hand, though the support is still weak, analyses based on the ALL data set (Fig. 10) showed some progress in comparison to the ones based on EX. Some of the clades recovered in this ALL tree have also been found by Struck et al. (2007) and/or are supported by morphological data. The analyses of Struck et al. (2007) used three nuclear genes and included nearly complete 643 18S and 28S, but excluded possibly saturated positions. These analyses were based on roughly the same number of parsimony-informative characters as the ALL analyses in the present study. For example, Struck et al. (2007) also recovered a clade comprising all Phyllodocida, Eunicida and Orbiniidae, albeit given low support (Fig. 10). Unlike the earlier analysis, this analysis (Fig. 10) also includes Parergodrilidae, and places this taxon in the clade as well; in agreement, a close relationship of Orbiniidae and Parergodrilidae has been shown previously (Bleidorn et al., 2003a; Erséus and Kallersjo, 2004; Jördens et al., 2004; Struck et al., 2002a). As recovered here, Struck et al. (2007) found a monophyletic Terebelliformia and basal positions of Amphinomida and Chaetopteridae, but again with low support. Basal Chaetopteridae has also been shown in the analyses of Colgan et al. (2006). Our and the analyses of Struck et al. (2007) recovered with low support part of Spionida as being a subgroup of paraphyletic Sabellida. Our analyses based on ALL data recovered Cirratuliformia with low support, a group proposed by Rouse and Fauchald (1997). The conclusion that adding more positions outweighs the impact of reducing saturation in annelid phylogeny in this study is further substantiated by the simulation studies (Fig. 11), which were based on actual substitution parameters of annelid taxa. With an increasing number of positions, recovery of the true tree increases tremendously using ML methods, which is superior to MP in reconstructing annelid phylogeny. This agrees with previous studies showing the general superiority of ML methods over MP methods in reconstructing difficult trees—because ML can account for multiple substitutions at a site (e.g., Gadagkar and Kumar, 2005; Hall, 2005; Huelsenbeck, 1995). With more than 5000 positions modeled for the 18S and 28S sequences, the ‘true’ phylogeny of the 9 annelid OTUs can be recovered with high certainty using ML (right side of Fig. 11). Thus, for small numbers of annelid taxa, the ‘true’ phylogeny can be recovered with high certainty using fewer than 10,000 positions (though the nodal support might still be weak). However, with increasing numbers of taxa, the amount of positions needed to recover the ‘true’ phylogeny with high certainty will likewise increase. (Note that although these considerations are valuable in showing that more sites yield better resolution, they are only of limited practical value because the nuclear rRNA-gene family has only about 5400 usable sites, and this number cannot be increased.) Our results of empirical and simulated data showed that increasing the amount of nucleotide positions outweighed the negative impact of saturated positions in the reconstruction of annelid phylogeny using nuclear rRNA data. Similar results have been obtained for a plastid gene in land plants and mitochondrial genes in birds (Crowe et al., 2006; Källersjö et al., 1999; Nickrent et al., 2000). In other studies, the third codon positions of protein-coding genes have been especially problematic due to saturation (Graybeal, 1993; Halanych and Robinson, 1999; Xia et al., 2003). Third codon positions of mitochondrial genes in annelids exhibit high mean-genetic distances, of about 0.6 (see Supplementary Material in Struck et al., 2007). This means that 60% of the third codon positions are different between any two annelid taxa. Additionally, these third codon positions do not show homogeneity of base frequencies across taxa (Supplementary Material in Struck et al., 2007). Thus, these positions begin to approximate random data. Therefore, neither should saturated positions be excluded a priori before any phylogenetic analysis of these positions nor should one fail to investigate saturation entirely, but the effect of saturation on the phylogenetic reconstruction should be assessed for each data set anew. The C and O/E factors provided here may help to guide the decision more objectively. Finally, increasing the number of characters will not always help. For example, ambiguously aligned positions should always be excluded from the data set a priori, because robust hypothesis 644 T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 about positional homology are important. Positional homology is the fundamental principal in phylogenetic analyses at the sequence level of nucleotides or amino acids and thus its violation has strong impacts on both the reconstruction procedure itself and its philosophical underpinnings (Swofford et al., 1996). saturated positions. Therefore, with respect to 28S rDNA, using nearly complete 28S sequences is preferable to using small chunks of the gene. 4.3. Comparison of 28S fragments We thank Christoph Bleidorn, Monika C. Müller, Anna Paul and Katrine Worsaae for providing tissue material. The crew of the R/V Point Sur and R/V Oceanus were most helpful in obtaining samples. For both cruises, we also acknowledge the help of the scientific crews (which are too numerous to list here). Friday Harbor Laboratories, University of Washington, is also acknowledged for their support. Computational assistance on GUMP (Genomics Using Multiple Processors) at Auburn University was kindly provided by Scott Santos. We also thank two anonymous reviewers for important contributions to the manuscript. This work was supported by the USA National Science Foundation (NSF) WormNet Grant (EAR-0120646) to K.M.H., NSF OCE-0425060 to K.M.H. and T.H.S., and by the Grant DFG-STR-683/3-1 from the Deutsche Forschungsgemeinschaft to T.H.S. This work is AU Marine Biology Program contribution #37. Given the above considerations about sequence length, when working with 28S data, it is preferable to use the nearly complete sequence than just small chunks of the gene. Of the three different fragments used in annelid phylogeny the D9/10 fragment is best in having the least genetic distance and the highest C factor (Table 3). Furthermore, the O/E ratio is clearly above 1 and is not tremendously altered due to the exclusion of possibly saturated positions. Thus, this fragment seems not be hampered by saturation. The 2.2 kb fragment, on the other hand, has the benefits of being longer and of achieving the best O/E ratio after exclusion of possibly saturated positions. On the other hand, before removal of its saturated characters, this fragment has the worst O/E ratio with 0.93. Thus, the degree of homoplasy of this fragment is strongly influenced by those partitions in it that have low C factors and thus, saturated characters. This strong impact might be due to the fact that this 2.2 kb fragment contains the large D2, D3, and D8 divergence domains, which are highly variable and homoplasious (THS and KMH pers. obs. and see Hassouna et al., 1984; Hillis and Dixon, 1991; Mallatt et al., 2001; Zardoya and Meyer, 2001). Furthermore, the marked increase in the O/E ratio upon removing saturated sites from the 2.2 kb piece shows the correlation of degree of homoplasy with the C factor: the higher the factor, the less the homoplasy. The lowest C factor is achieved for the D1 fragment (Table 3). It is below 10 and thus similar to the values of the classes of identity showing clear saturation (Fig. 8). Furthermore, and in contrast to the 2.2 kb fragment, the exclusion of possibly saturated characters from D1 only raises the low O/E ratio slightly above 1, even though it discards more than half of the already small fragment of 0.3 kb (Brown et al., 1999). Additionally, the D1 fragment shows the highest genetic distance (across taxa) of all fragments. The D1 fragment is located at the 50 end of 28S rDNA and contains the D1 divergence domain, which is like the D2, D3 and D8 divergence domains highly variable and homoplasious in annelids (THS pers. obs.). High substitution numbers in the D1 domain have also been shown for vertebrates as well (Zardoya and Meyer, 2001). Therefore, the D1 fragment is the least favorable one of the three in terms of both size and properties. The D9/10 fragment is superior to it in terms of properties and the 2.2 kb one in terms of size. 5. Conclusions Herein we introduce the C factor, a better way to assess saturation than is plotting ti/tv ratios against genetic distances. Our results showed that the C factor correlates with saturation as well as homoplasy. To determine the amount of homoplasy in a partition relative to that in the complete data set, we also developed the O/E ratio. Both factors in combination allow a more objective way to decide whether or not positions are saturated and are thus misleading the phylogenetic reconstruction. However, we also showed using empirical and simulated data that the effect of saturation can be counterbalanced by increased numbers of nucleotide positions if appropriate methods of phylogenetic reconstruction, such as ML, are used. For the annelid example employed here, based on recovered clades, nodal support and resolution, the increasing number of positions is more beneficial for improving tree resolution in phylogenetic reconstruction than is excluding Acknowledgments Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ympev.2008.05.015. References Altekar, G., Dwarkadas, S., Huelsenbeck, J.P., Ronquist, F., 2004. Parallel Metropoliscoupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20, 407–415. Bartolomaeus, T., 1995. Structure and formation of the uncini in Pectinaria koreni, Pectinaria auricoma (Terebellida) and Spirorbis spirorbis (Sabellida): implications for annelid phylogeny and the position of the Pogonophora. Zoomorphology 115, 161–177. Bleidorn, C., Podsiadlowski, L., Bartolomaeus, T., 2006. The complete mitochondrial genome of the orbiniid polychaete Orbinia latreilli (Annelida, Orbiniidae)–a novel gene order for Annelida and implications for annelid phylogeny. Gene 370, 96–103. Bleidorn, C., Vogt, L., Bartolomaeus, T., 2003a. A contribution to sedentary polychaete phylogeny using 18S rRNA sequence data. J. Syst. Zool. Evol. Res. 41, 186–195. Bleidorn, C., Vogt, L., Bartolomaeus, T., 2003b. New insights into polychaete phylogeny (Annelida) inferred from 18S rDNA sequences. Mol. Phylogenet. Evol. 29, 279–288. Bleidorn, C., Vogt, L., Bartolomaeus, T., 2005. Molecular phylogeny of lugworms (Annelida, Arenicolidae) inferred from three genes. Mol. Phylogenet. Evol. 34, 673–679. Brown, S., Rouse, G.W., Hutchings, P., Colgan, D., 1999. Assessing the usefulness of histone H3, U2 snRNA and 28S rDNA in analyses of polychaete relationships. Aust. J. Zool. 47, 499–516. Castresana, J., 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552. Colgan, D.J., Hutchings, P.A., Braune, M., 2006. A multigene framework for polychaete phylogenetic studies. Organisms Diversity & Evolution 6, 220–235. Collins, T.M., Fedrigo, O., Naylor, G.J.P., 2005. Choosing the best genes for the job: the case for stationary genes in genome-scale phylogenetics. Syst. Biol. 54, 493– 500. Crowe, T.M., Bowie, R.C.K., Bloomer, P., Mandiwana, T.G., Hedderson, T.A.J., Randi, E., Pereira, S.L., Wakeling, J., 2006. Phylogenetics, biogeography and classification of, and character evolution in, gamebirds (Aves: Galliformes): effects of character exclusion, data partitioning and missing data. Cladistics 22, 495–532. Erséus, C., 2005. Phylogeny of oligochaetous Clitellata. Hydrobiologia (535/536), 357–372. Erséus, C., Kallersjo, M., 2004. 18S rDNA phylogeny of Clitellata (Annelida). Zool. Scr. 33, 187–196. Fauchald, K., Rouse, G., 1997. Polychaete systematics: past and present. Zool. Scr. 26, 71–138. Gadagkar, S.R., Kumar, S., 2005. Maximum likelihood outperforms maximum parsimony even when evolutionary rates are heterotachous. Mol. Biol. Evol. 22, 2139–2141. Gatesy, J., O’Grady, P., Baker, R.H.(R.H., 1999. Corroboration among data sets in simultaneous analysis: hidden support for phylogenetic relationships among higher level artiodactyl taxa. Cladistics 15, 271–313. T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645 Graybeal, A., 1993. The phylogenetic utility of cytochrome b: lessons from bufonid frogs. Mol. Phylogenet. Evol. 2, 256–269. Halanych, K.M., 2004. The new view of animal phylogeny. Annu. Rev. Ecol. Evol. Syst. 35, 229–256. Halanych, K.M., Robinson, T.J.(T.J., 1999. Multiple substitutions affect the phylogenetic utility of cytochrome b and 12S rDNA data: examining a rapid radiation in leporid (Lagomorpha) evolution. J. Mol. Evol. 48, 369–379. Hall, B.G., 2005. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol. Biol. Evol. 22, 792–802. Hall, K.A., Hutchings, P.A., Colgan, D.J., 2004. Further phylogenetic studies of the Polychaeta using 18S rDNA sequence data. J. Mar. Biol. Ass. UK 84, 949–960. Hassouna, N., Michot, B., Bachellerie, J.-P., 1984. The complete nucleotide sequence of mouse 28S rRNA gene. Implications for the process of size increase of the large subunit rRNA in higher eukaryotes. Nucleic Acids Res. 12, 3563–3583. Hermans, C.O., 1969. The systematic position of the Archiannelida. Syst. Zool. 18, 85–102. Hessling, R., 2002. Metameric organisation of the nervous system in developmental stages of Urechis caupo (Echiura) and its phylogenetic implications. Zoomorphology 121, 221–234. Hillis, D.M., Dixon, M.T., 1991. Ribosomal DNA: molecular evolution and phylogenetic inference. Q. Rev. Biol. 66, 411–453. Huelsenbeck, J.P., 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44, 17–48. Huelsenbeck, J.P., Ronquist, F., 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17, 754–755. Jördens, J., Struck, T.H., Purschke, G., 2004. Phylogenetic inference regarding Parergodrilidae and Hrabeiella periglandulata (‘‘Polychaeta”, Annelida) based on 18S rDNA, 28S rDNA and COI sequences. Journal of Zoological Systematics and Evolutionary Research 42, 270–280. Källersjö, M., Albert, V.A., Farris, J.S., 1999. Homoplasy increases phylogenetic structure. Cladistics 15, 91–93. Kuhner, M.K., Felsenstein, J., 1994. A simulation comparison phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11, 459–468. Lake, J.A., 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91, 1455–1459. Lockhart, P.J., Steel, M.A., Hendy, M.D., Penny, D., 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11, 605–612. Lopez, P., Forterre, P., Philippe, H., 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49, 496–508. Maddison, W.P., Maddison, D.R. 2004. Mesquite: a modular system for evolutionary analysis. Version 1.05. http://mesquiteproject.org. Mallatt, J., Sullivan, J., Winchell, C.J., 2001. The relationship of lampreys to hagfishes: a spectral analysis of ribosomal DNA sequences. In: Ahlberg, P.E. (Ed.), Major Events in Early Vertebrate Evolution: Palaeontology, Phylogeny, Genetics, and Development. Taylor & Francis, London, pp. 106–118. McHugh, D., 1997. Molecular evidence that echiurans and pogonophorans are derived annelids. Proc. Natl. Acad. Sci. USA 94, 8006–8009. McHugh, D., 2000. Molecular phylogeny of the Annelida. Can. J. Zool. 78, 1873– 1884. McHugh, D., 2005. Molecular systematics of polychaetes (Annelida). Hydrobiologia, 309–318. Milinkovitch, M.C., LeDuc, R.G., Adachi, J., Farnir, F., Georges, M., Hasegawa, M., 1996. Effects of character weighting and species sampling on phylogeny reconstruction: a case study based on DNA sequence data in Cetaceans. Genetics 144, 1817–1833. Nicholas, K.B., Nicholas, H.B.J. 1997. GeneDoc: a tool for editing and annotating multiple sequence alignments. Distributed by the author. Nickrent, D.L., Parkinson, C.L., Palmer, J.D., Duff, R.J., 2000. Multigene phylogeny of land plants with special reference to bryophytes and the earliest land plants. Mol. Biol. Evol. 17, 1885–1895. Nylander, J.A.A. (2002) MrModeltest. http://morphbank.ebc.uu.se/MR.BAYES. Pesole, G., Sbisà, E., Preparata, G., Saccone, C., 1992. The evolution of the mitochondrial D-Loop and the origin of modern man. Mol. Biol. Evol. 9, 587– 598. Philippe, H., Forterre, P., 1999. The rooting of the universal tree of life is not reliable. J. Mol. Evol. 49, 509–523. Posada, D., Buckley, T.R., 2004. Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests. Syst. Biol. 53, 793–808. Posada, D., Crandall, K.A., 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14, 817–818. Posada, D., Crandall, K.A., 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50, 580–601. Purschke, G., 2002. On the ground pattern of Annelida. Org. Divers. Evol. 2, 181–196. Purschke, G., Jouin, C., 1988. Anatomy and ultrastructure of the ventral pharyngeal organs of Saccocirrus (Saccocirridae) and Protodriloides (Protodriloidae fam. n.) 645 with remarks on the phylogenetic relationships within Protodrilida (Annelida: Polychaeta). J. Zool. 215, 405–432. Regier, J., Shultz, J., 1997. Molecular phylogeny of the major arthropod groups indicates polyphyly of crustaceans and a new hypothesis for the origin of hexapods. Mol. Biol. Evol. 14, 902–913. Ronquist, F., Huelsenbeck, J.P., 2003. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574. Rouse, G.W., Fauchald, K., 1997. Cladistics and polychaetes. Zool. Scr. 26, 139–204. Rouse, G.W., Pleijel, F., 2001. Polychaetes. University Press, Oxford. Rousset, V., Pleijel, F., Rouse, G.W., Erséus, C., Siddall, M.E., 2007. A molecular phylogeny of annelids. Cladistics 23, 41–63. Simon, C., Frati, F., Beckenbach, A.T., Crespi, B., Liu, H., Flook, P., 1994. Evolution, weighting, and phylogenetic utility of mitochondrial gene sequences and a compilation of conserved polymerase chain reaction primers. Ann. Entomol. Soc. Am. 87, 651–701. Struck, T.H., 2007. Data congruence, paedomorphosis and salamanders. Front. Zool. 4, 22. Struck, T.H., Halanych, K.M., Purschke, G., 2005. Dinophilidae (Annelida) is most likely not a progenetic Eunicida: evidence from 18S and 28S rDNA. Mol. Phylogenet. Evol. 37, 619–623. Struck, T.H., Hessling, R., Purschke, G., 2002a. The phylogenetic position of the Aeolosomatidae and Parergodrilidae, two enigmatic oligochaete-like taxa of the ‘Polychaeta’, based on molecular data from 18SrDNA sequences. J. Zool. Syst. Evol. Res. 40, 155–163. Struck, T.H., Purschke, G., 2005. The sister group relationship of Aeolosomatidae and Potamodrilidae (Annelida: ‘‘Polychaeta”)—a molecular phylogenetic approach based on 18S rDNA and Cytochrome Oxidase I. Zool. Anz. 243, 281–293. Struck, T.H., Purschke, G., Halanych, K.M., 2006. Phylogeny of Eunicida (Annelida) and Exploring Data Congruence using a Partition Addition Bootstrap Alteration (PABA) approach. Syst. Biol. 55, 1–20. Struck, T.H., Schult, N., Kusen, T., Hickman, E., Bleidorn, C., McHugh, D., Halanych, K.M., 2007. Annelida phylogeny and the status of Sipuncula and Echiura. BMC Evol. Biol. 7, 57. Struck, T.H., Westheide, W., Purschke, G., 2002b. Progenesis in Eunicida (‘‘Polychaeta,” Annelida)—separate evolutionary events? Evidence from molecular data. Mol. Phylogenet. Evol. 25, 190–199. Sturmbauer, C., Meyer, A., 1992. Genetic divergence, speciation and morphological stasis in a lineage of African cichlid fishes. Nature 358, 578–581. Swofford, D.L. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland, MA. Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M., 1996. Chapter 11—phylogenetic inference. In: Hillis, D.M., Moritz, C., Mable, B.K. (Eds.), Molecular Systematics. Sinauer Associates, Sunderland, MA, pp. 407–514. Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680. Tzetlin, A., Purschke, G., 2006. Fine structure of the pharyngeal apparatus of the pelagosphera larva in Phascolosoma agassizii (Sipuncula) and its phylogenetic significance. Zoomorphology 125, 109–117. Vogt, L., 2002. Testing and weighting characters. Org. Diversity Evol. 2, 319–333. Westheide, W., 1997. The direction of evolution within the Polychaeta. J. Nat. Hist. 31, 1–15. Westheide, W., McHugh, D., Purschke, G., Rouse, G., 1999. Systematization of the Annelida: different approaches. Hydrobiologia 402, 291–307. Worsaae, K., Nygren, A., Rouse, G., Giribet, G., Persson, J., Sundberg, P., Pleijel, F., 2005. Phylogenetic position of Nerillidae and Aberranta (Polychaeta, Annelida), analysed by direct optimization of combined molecular and morphological data. Zool. Scr. 34, 313–328. Wuyts, J., Van de Peer, Y., De Wachter, R., 2001. Distribution of substitution rates and location of insertion sites in the tertiary structure of ribosomal RNA. Nucleic Acids Res. 29, 5017–5028. Xia, X., Xie, Z., Salemi, M., Chen, L., Wang, Y., 2003. An index of substitution saturation and its application. Mol. Phylogenet. Evol. 26, 1–7. Yang, Z., 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10, 1396–1401. Yang, Z., 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11, 367–372. Zardoya, R., Meyer, A., 2001. Vertebrate phylogeny: limits of inference of mitochondrial genome and nuclear rDNA sequence data due to an adverse phylogenetic signal/noise ratio. In: Ahlberg, P.E. (Ed.), Major Events in Early Vertebrate Evolution: Palaeontology, Phylogeny, Genetics, and Development. Taylor & Francis, London, pp. 135–155. Zwickl, D.J., 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation., The University of Texas at Austin.

RELATED PAPERS

RELATED TOPICS

Log In

Detecting possibly saturated positions in 18S and 28S sequences and their influence on phylogenetic reconstruction of Annelida (Lophotrochozoa)

Detecting possibly saturated positions in 18S and 28S sequences and their influence on phylogenetic reconstruction of Annelida (Lophotrochozoa)

Related Papers

RELATED PAPERS

RELATED TOPICS