Molecular Phylogenetics and Evolution 48 (2008) 628–645
Contents lists available at ScienceDirect
Molecular Phylogenetics and Evolution
journal homepage: www.elsevier.com/locate/ympev
Detecting possibly saturated positions in 18S and 28S sequences and their
influence on phylogenetic reconstruction of Annelida (Lophotrochozoa)
Torsten H. Struck a,b,*, Maximilian P. Nesnidal b, Günter Purschke b, Kenneth M. Halanych a
a
b
Auburn University, 101 Rouse Building, Auburn, AL 36849, USA
Universität Osnabrück, FB05 Biologie/Chemie, AG Zoologie, Barbarastrabe 11, D-49076 Osnabrück, Germany
a r t i c l e
i n f o
Article history:
Received 11 November 2007
Revised 25 April 2008
Accepted 13 May 2008
Available online 20 May 2008
Keywords:
C factor
O/E ratio
Annelida
Saturation
rRNA
Phylogeny
Iss
a b s t r a c t
Phylogenetic reconstructions may be hampered by multiple substitutions in nucleotide positions obliterating signal, a phenomenon called saturation. Traditionally, plotting ti/tv ratios against genetic distances
has been used to reveal saturation by assessing when ti/tv stabilizes at 1. However, interpretation of
results and assessment of comparability between different data sets or partitions are rather subjective.
Herein, we present the new C factor, which quantifies convergence of ti/tv ratios, thus allowing comparability. Furthermore, we introduce a comparative value for homoplasy, the O/E ratio, based on alterations
of tree length. Simulation studies and an empirical example, based on annelid rRNA-gene sequences,
show that the C factor correlates with noise, tree length and genetic distance and therefore is a proxy
for saturation. The O/E ratio correlates with the C factor, which does not provide an intrinsic threshold
of exclusion, and thus both together can objectively guide decisions to exclude saturated nucleotide positions. However, analyses also showed that, for reconstructing annelid phylogeny using Maximum Likelihood, an increase in numbers of positions improves tree reconstruction more than does the exclusion of
saturated positions.
Ó 2008 Elsevier Inc. All rights reserved.
1. Introduction
Reliable reconstruction of phylogenies using molecular data is
affected by several factors such as long branches, heterogeneity
of base frequencies, and rate heterogeneity among positions (e.g.,
Kuhner and Felsenstein, 1994; Lake, 1994; Lockhart et al., 1994;
Xia et al., 2003). All of these factors can result in homoplasy, which
can decrease phylogenetic signal and hamper reconstruction efforts. An indicator of homoplasy in molecular data sets can be saturation of transition events relative to other mutations (e.g.,
Halanych and Robinson, 1999; Jördens et al., 2004; Lopez et al.,
1999; Nickrent et al., 2000; Philippe and Forterre, 1999; Simon
et al., 1994; Struck et al., 2002a; Swofford et al., 1996). Saturation
is invoked when no further increase in transitions can be observed
despite increasing genetic distance, indicating that multiple substitutions at nucleotide positions have occurred. Therefore, when a
data set is saturated, phylogenetic reconstruction may be misled
by homoplasious signal (e.g., Milinkovitch et al., 1996; Simon
et al., 1994). Saturation is especially problematic for taxa that radi-
* Corresponding author. Address: Universität Osnabrück, FB05 Biologie/Chemie,
AG Zoologie, Barbarastrabe 11, D-49076 Osnabrück, Germany. Fax: +49 541 969
2587.
E-mail addresses: struck@biologie.uni-osnabrueck.de (T.H. Struck), nesnidal@
gmail.com (M.P. Nesnidal), Purschke@biologie.uni-osnabrueck.de (G. Purschke),
ken@auburn.edu (K.M. Halanych).
1055-7903/$ - see front matter Ó 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.ympev.2008.05.015
ated rapidly long ago. For example, the radiation of annelid worms
within Lophotrochozoa apparently was fast and resulted in
branching patterns with short internodes in the basal part of the
annelid tree and with branch lengths rapidly increasing towards
the tips. Such branching patterns accumulate only a few informative substitutions along the internal and more basal branches,
but numerous homoplasious substitutions on terminal branches
(Struck et al., 2002a). For example, in Struck et al.’s (2007) analyses
of three nuclear genes, the ratio of terminal branch length to internal branch length is 4.2/1 even excluding the extremely long
branch leading to the highly divergent taxon, Ophryotrocha labronica. Thus, phylogenetic information is likely to be eroded due to
multiple substitutions, or homoplasy, along the terminal branches.
To avoid these problems, several authors have used conservative
genes such as elongation factor 1a, 18S and/or 28S rDNA (e.g., Regier and Shultz, 1997; Struck et al., 2002a, 2007; Xia et al., 2003).
However, even in these genes certain regions may be saturated
and therefore mislead reconstruction. In contrast, excluding problematic regions may leave too little phylogenetic information to resolve the deep branches (Xia et al., 2003).
Traditionally, saturation is shown by plotting either numbers of
substitutions or the transition-to-transversion (ti/tv) ratios of all
pairwise comparisons of taxa in an alignment against the genetic
distances (p) between these pairs (e.g., Halanych and Robinson,
1999; Jördens et al., 2004; Struck et al., 2002a). In the latter case, convergence upon a ti/tv value of 1 is expected with the approach of
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
saturation at higher genetic distances. Generally, plotting ti/tv ratios is
more useful because it integrates properties of two substitution types.
However, identifying saturation is subjective when using such plots:
Whether or not to include a gene or certain subsets of a gene (i.e., positions that exhibit only slight saturation or strong saturation) may be
obvious, but whether to incorporate or eliminate positions with intermediate stages of saturation is not straightforward and lacks objectivity (Struck et al., 2007). Furthermore, with an increasing number of
taxa and genes or subsets of genes, the representation of the plots
within one graph becomes increasingly problematic (Struck et al.,
2007). A comprehensive method of display would promote comparison of plots and aid objective decision-making.
Due to convergence upon a value, the distribution of the ti/tv
values of the pairwise comparisons covers an increasingly smaller
range of values as genetic distance increases, and thus the standard
deviation of this distribution [r(ti/tv)] decreases with increasing
saturation. On the other hand, the distribution of the genetic distances p of these pairwise comparisons becomes more spread out
and thus the standard deviation of this distribution [r(p)] is
increasing. By extrapolating from this behavior of convergence,
we developed a convergence factor (C factor), which is based on
the ratio of the standard deviation of the ti/tv distribution to the
standard deviation of the p distribution:
C¼
rðtvti Þ
rðpÞ
ð1Þ
The C factor is determined before, and independently of, the tree
reconstruction and thus is an a priori approach.
Next, using the parsimony criterion we developed another factor
to assess the degree of homoplasy. This one being based on alteration
of tree lengths is an a posteriori approach. Given a data set without
homoplasy, removal of any subset of the data reduces tree length
by an amount directly proportional to the reduction in the minimum
number of substitutions necessary to explain the data set. Because
each substitution is one step of the tree length, the relative change
is the same for both, tree length and number of substitutions, and
thus their ratio is 1. Given a starting tree length of l, alteration in tree
length can be expressed as the observed change O and the reduction
in minimum necessary substitutions s as the expected change, E,
assuming no homoplasy in the data set:
O¼
lr
lc
ð2Þ
sr
sc
ð3Þ
and
E¼
For both l and s, the subscript c indicates the value for the
complete, initial data set, and subscript r indicates the value for
the reduced data set. Their ratio can be given as O/E. An easy
way to calculate this ratio is to estimate the ratio of the consistency
indices CIc/CIr:
O lr =lc
lr sc
1
CIc
¼
¼ ¼
CIc ¼
E sr =sc sr lc CIr
CIr
ð4Þ
In empirical data sets, homoplasious substitutions contribute
more to the tree length than do the synapomorphic ones, because
they require at least two steps instead of one. Therefore, if the removed subset is less homoplasious than the complete data set, the
O/E ratio is >1, because fewer steps were excluded from the data
set than expected from the reduction in substitutions. In other
words, lr is larger than expected. On the other hand, if the removed
subset is more homoplasious than the complete set, then O/E is <1,
because lr is smaller than expected. An O/E ratio of 1 indicates that
the removed part has the same degree of homoplasy as the complete
data set.
629
Herein, we tested both our new factors—C and O/E—using simulation studies as well as a real data set derived from 18S and 28S
sequences of Annelida.
1.1. On the Annelida
As mentioned, Annelida are part of the rapid radiation of lophotrochozoan animals that occurred in or before the early Cambrian;
besides annelids, Lophotrochozoa also includes mollusks, platyhelminths, brachiopods, nemerteans, rotifers and several other
taxa (Halanych, 2004; McHugh, 2000, 2005). Annelida comprises
an ecologically important animal phylum with over 16,500 described species, and its members include the dominant macrofauna of the deep sea. Traditionally two major groups have been
distinguished: Clitellata (earthworms, leeches) and Polychaeta
(e.g., Fauchald and Rouse, 1997; Westheide, 1997; Westheide et
al., 1999). Monophyly of Clitellata is well-supported by both morphological and molecular data (e.g., Erséus, 2005; Erséus and Kallersjo, 2004; Purschke, 2002; Struck and Purschke, 2005).
However, Polychaeta has proven more problematic. Cladistic analyses based on morphological data seem to support a monophyletic
Polychaeta (Rouse and Fauchald, 1997), but recent molecular phylogenetic studies reject monophyly of polychaetes, as well as of
their traditional subgroups (e.g., Rousset et al., 2007; Struck et
al., 2007). Additionally, Siboglinidae (a.k.a. Pogonophora and Vestimentifera; beard worms) and Echiura (spoon worms; as sister
to Capitellidae) have also to be included in Annelida (e.g., Bartolomaeus, 1995; Bleidorn et al., 2003b; Hessling, 2002; McHugh,
1997; Struck et al., 2007). Furthermore, Sipuncula (peanut worms)
are discussed as either an annelid subtaxon or their sister group
(e.g., Bleidorn et al., 2006; Struck et al., 2007; Tzetlin and Purschke,
2006). Unfortunately, well-supported more-inclusive groupings of
the approximately 80 polychaete families are largely lacking.
Although some recent studies used up to five genes, molecular
analyses addressing annelid phylogeny are mainly based on the
18S and 28S genes (Colgan et al., 2006; Rousset et al., 2007; Struck
et al., 2007). Different molecular analyses of annelid phylogeny
have treated the problem of saturated positions in different ways.
Some did not exclude any positions at all (e.g., Rousset et al., 2007),
others excluded ambiguously aligned ones (e.g., Colgan et al.,
2006), and others also excluded saturated ones (e.g., Jördens et
al., 2004; Struck et al., 2007). Furthermore, although complete
18S was always used, different parts of the 28S have been included: the nearly complete sequence of 3.3 kilobases (kb) (Struck
et al., 2005, 2007), the so-called D1 (0.3 kb) and/or D9/10 (0.6 kb)
fragments (Brown et al., 1999; Colgan et al., 2006; Rousset et al.,
2007) or a fragment comprising the region between these two fragments of about 2.2 kb (Jördens et al., 2004).
Herein we evaluated our saturation factors, C and O/E, using
simulation studies as well as a data set of 18S and 28S sequences
from 98 annelid and 9 lophotrochozoan outgroup taxa. Furthermore, we determined how likely three different fragments of 28S
are to saturate, compared to nearly complete sequences of 28S
and 18S. Additionally, we propose an improvement of the sliding
window analysis for rRNA and similar genes to group positions
with similar genetic variation within one partition.
2. Material and methods
2.1. Simulated data
To model simulation studies on an empirical tree, we selected
18S and 28S sequences of 9 polychaete and 1 outgroup from Struck
et al. (2007). This taxonomic subsample was evenly distributed on
their tree and represented the variety of branch lengths they recov-
630
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
ered. Chosen operational taxonomic units (OTUs) were Glycera,
Sthenalanella, Drilonereis, Ninoe, Pectinaria, Ophelina, Notomastus,
Polydora, Chaetopterus and Crassostrea (mollusk outgroup), and
with these sequences we reconstructed the ML tree. The appropriate model of sequence evolution, Tamura & Nei + I + C, was determined based on the AIC criterion in Modeltest V 3.7 (Posada and
Crandall, 1998, 2001). The estimated model parameters were: base
frequencies: A = 0.2514, C = 0.2275, G = 0.3017, T = 0.2194; substitution rates: A M C, A M T, C M G, G M T = 1.0000, A M G = 1.8754,
C M T = 4.7929; a = 0.6216, number of categories = 4; proportion
of invariant sites = 0.4575. ML tree reconstruction based on these
model parameters employed PAUP*4.0b (Swofford, 2002) with 10
random taxon additions and tree bisection-reconnection (TBR)
branch swapping.
Simulated data sets based on this ML tree (Fig. 1) and on the
corresponding model parameters were generated using Mesquite
v1.05 (Maddison and Maddison, 2004). By definition the initial
data set had no or only a low degree of homoplasy or saturation,
so we introduced these features in two ways. In the first strategy,
we altered the evolutionary rate along the whole tree using different scale factors (0.1, 0.5, 1, 2, 10). This means that each branch of
the tree was varied from one tenth to ten times the length of the
branch shown in Fig. 1. Thus, the evolutionary rates were made
to vary by a factor of 100 over the simulations. Note that, the simulations using a scale factor of 0.1 had the slowest evolutionary
rates and the lowest degree of homoplasy. Subsequently, data sets
generated for the second strategy were based on a scale factor of
0.1 to ensure that both strategies had similar, low, initial degrees
of homoplasy. For the second strategy, we directly introduced random data (i.e., noise) using the option ‘‘random fill” in Mesquite
v1.05 (Maddison and Maddison, 2004). In the different trials, 0%,
5%, 10%, 25% or 50% of the positions were randomized. For both
strategies and for each individual option (i.e., 0%, 5%, 10%, 25%
and 50% noise or scale factors of 0.1, 0.5, 1, 2 and 10), we simulated
50 individual data sets with 300 nucleotide positions resulting in
total 10 times 50 (=500) data sets.
For each individual data set, we were able to determine the C
factor directly. However, the O/E ratio is a relative factor comparing
a data set reduced by a certain partition against the complete data
set. Therefore, to calculate the O/E ratio we had to concatenate
individual data sets generated for each of the two strategies and
five options, respectively. Because we were interested in the correlation of the O/E ratio with both saturation and the C factor, we
concatenated for each strategy five individual data sets, one from
each of the five options. For example, for the first strategy using
scale factors we concatenated five individual data sets, one from
each of the 50 data sets generated by a scale factor of 0.1, 0.5, 1,
2 or 10, respectively. For the second strategy, such a concatenated
data set consisted of five individual data sets, one from each of the
50 data sets generated using 0%, 5%, 10%, 25% or 50% noise level,
respectively. Thus, we got 2 times 50 concatenated data sets generated from the 10 times 50 individual data sets. This concatenation procedure allowed assessing the correlation of the O/E ratio
with both saturation (i.e., 0%, 5%, 10%, 25% and 50% noise, or scale
factors of 0.1, 0.5, 1, 2 and 10) and the determined C factor of the
individual data.
Finally, we also investigated if an increased number of positions
results in an increased recovery of the true tree. Therefore, we simulated 50 data sets with 100, 300, 500, 1000, 2000 or 5000 positions, respectively, using a scale factor of 1. Herein, we used a
scale factor of 1 in contrast to 0.1 as the initial degree of homoplasy
to be able to compare these results with the results we got from
the empirical data of Annelida. We determined the C factor as well
as the MP and ML tree for each data set using 10 random taxon
additions and TBR branch swapping in PAUP*4.0b (Swofford,
2002).
2.2. Empirical data
The empirical data set comprised 107 OTUs of Annelida and
outgroups including new sequence data of 18S and 28S determined
for 11 and 16 polychaete OTUs, respectively.
2.3. Collection of data and sequence alignment
Taxon 4
Taxon 10
Taxon 5
Taxon 6
Taxon 7
Taxon 3
Taxon 9
Taxon 8
Taxon 1
Taxon 2
0.1
Fig. 1. Tree used in simulation study. Phylogram of ML analysis of 9 annelid and 1
outgroup OTUs (Glycera, Sthenalanella, Drilonereis, Ninoe, Pectinaria, Ophelina,
Notomastus, Polydora, Chaetopterus and Crassostrea). ln L = 12625.63033. OTU
names have been concealed, because the tree was generated for simulation
purposes only.
Table 1 lists the taxa, gene sequences and GenBank accession
numbers used in this study. Upon collection, tissue samples or
whole individuals were preserved in >70% non-denatured EtOH
or frozen at 80 °C. Genomic DNA was extracted using the DNeasy
Tissue Kit (Qiagen). Amplification and sequencing of the nuclear
18S and 28S rDNA were carried out using protocols described by
Struck et al. (2006). All products were verified on a 1% agarose
gel and purified with the QIAquick PCR Purification Kit (Qiagen).
As necessary, PCR products were size-selected on agarose gels
and/or cloned using the pGEMÒ-T Easy Vector System (Promega).
A CEQTM 8000 Genetic Analysis System (Beckman Coulter) using
CEQ dye terminator chemistry was used for bidirectional sequencing of all products.
Because Annelida are part of the poorly resolved lophotrochozoan clade (Halanych, 2004), their sister group is not known yet,
so several other lophotrochozoan taxa were employed as outgroup
taxa (see Table 1). Sequences were aligned with CLUSTAL W using
default settings (Thompson et al., 1994) and subsequently corrected by hand in GeneDoc (Nicholas and Nicholas, 1997). Ambiguous positions were determined using GBlocks (Castresana, 2000)
with default settings except the allowed gap positions, which were
set to ‘‘with half”. Positions possessing a gap in less than 50% of the
sequences are not excluded a priori, but can be selected in the final
alignment if they are within an appropriate block. After the determination of classes of identity, all ambiguous positions were excluded from all subsequent analyses. The alignment (Accession
#S2045) is available at TREEBASE (http://www.treebase.org).
631
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Table 1
List of the taxa used in the analyses with 18S and 28S sequence accession numbers
Taxon
Species
18S
Arenicola brasiliensis (Nonato, 1958)
Abarenicola affinis (Ashworth, 1903)
DQ790076
Capitellidae
Heteromastus filiformis (Claparede, 1864)
Notomastus tenuis Moore, 1909
DQ790081
DQ790084
Maldanidae
Clymenella torquata (Leidy, 1855)
Clymenura clypeata (Saint-Joseph, 1894)
Axiothella rubrocincta (Johnson, 1901)
AF448152
DQ790078
DQ790027
Opheliidae
Armandia brevis (Moore, 1906)a
Ophelia rathkei McIntosh, 1908
Ophelina acuminata Oersted, 1843
EU418854
AF448157
DQ790085
EU418862
AY366513
DQ790045
Orbiniidae
Orbinia swani Pettibone, 1957
Scoloplos fragilis (Verrill, 1873)b
DQ790087
AY532360
DQ790048
EU418863
Paranoidae
Aricidea sp.c
EU418855
DQ790052
Scalibregmatidae
Scalibregma inflatum Rathke, 1843
Travisia brevis Moore, 1923
Travisia forbesii Johnston, 1840
DQ790093
DQ790060
DQ790069
AF508127
Aphroditidae
Aphrodita sp.
AY894295
DQ790024
Polynoidae
Gattyana ciliata Moore, 1902
Lepidonotus sublevis Verrill, 1873
AY894297
AY894301
DQ790035
DQ790039
Sigalionidae
Sigalion spinosus (Hartman, 1939)
Sthenelanella uniformis Moore, 1910
AY894304
AY894306
DQ790062
DQ790064
Hesionidae
Ophiodromus pugettensis (Johnson, 1901)
DQ790086
DQ790046
Nereididae
Nereis succinea (Frey and Leuchart, 1847)
Nereis vexillosa Grube, 1851
AY210447
DQ790083
AY210464
DQ790043
Pilargidae
Ancistrosyllis groenlandica McIntosh, 1879
DQ790075
DQ790023
Syllidae
Exogone naidina Oersted, 1845
Exogone verugera (Claparede, 1868)
Proceraea cornuta (Agassiz, 1862)
Typosyllis anoculata (Hartmann-Schröder, 1962)
AF474290
AF474312
DQ790098
DQ790033
AF212165
DQ790071
Glyceridae
Glycera dibranchiata Ehlers, 1868
Glycera americana Leidy, 1855d
AY995208
EU418856
AY995207
EU418864
Goniadidae
Goniada brunnea Treadwell, 1906
Glycinde armigera Moore, 1911
DQ790080
DQ790079
DQ790037
DQ790036
Nephtyidae
Nephtys longosetosa (Oersted, 1842)
Nephtys incisa Malmgren, 1865e
Aglaophamus circinata (Verrill, 1874)
DQ790082
EU418857
DQ790072
DQ790042
EU418865
DQ790020
Annelida
Arenicolidae
28S
DQ790025
DQ790038
DQ790044
DQ790030
Paralacydoniidae
Paralacydonia paradoxa Fauvel, 1913
DQ790088
DQ790050
Phyllodocidae
Phyllodoce groenlandica (Oersted, 1843)
DQ790092
DQ790055
Alciopidae
Alciopina sp.
Torrea sp.
DQ790073
DQ790096
DQ790021
DQ790068
Tomopteridae
Tomopteris sp.
DQ790095
DQ790067
Amphinomidae
Paramphinome jeffreysii (Mcintosh, 1868)
Eurythoë complanata (Pallas, 1766)
AY838856
AY364851
AY838865
AY364849
Dorvilleidae
Protodorvillea kefersteini (McIntosh, 1869)
Parougia eliasoni (Oug, 1978)
AF412799
AF412798
AY732230
DQ790053
Eunicidae
Marphysa sanguinea (Montagu, 1815)
Eunice sp.
AY525621
AF412791
AY838861
AY732229
Lumbrineridae
Lumbrineris latreilli Audouin & Milne-Edwards, 1834
Lumbrineris inflata (Moore, 1911)
Ninoe nigripes Pettibone, 1982
AY525623
AY525622
AY838852
AY366512
AY364864
AY838862
Oenonidae
Arabella semimaculata (Moore, 1911)
Drilonereis longa Webster, 1879
Oenone fulgida Pettibone, 1982
AY838844
AY838847
AY838853
AY838857
AY838860
AY838863
Onuphidae
Diopatra aciculata Knox and Cameron, 1971
Hyalinoecia tubicola O.F. Müller, 1776
AY838845
AF412794
AY838858
AY732228
Oweniidae
Owenia fusiformis delle Chiaje, 1841
AF448160
DQ790049
Sabellaridae
Sabellaria cementarium Moore, 1906
AY732223
AY732226
Sabellidae
Serpulidae
Schizobranchia insignis Bush, 1905
Serpula vermicularis Linnaeus, 1767
AY732222
AY732224
AY732225
AY732227
Siboglinidae
Riftia pachyptila Jones, 1981
Siboglinum fiordicum Webb, 1963
AF168739
X79876
Z21534
DQ790061
Cirratulidae
Cirratulus spectabilis (Kinberg, 1866)
AY708536
DQ790029
(continued on next page)
632
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Table 1 (continued)
Taxon
Species
18S
Ctenodrilidae
Ctenodrilus serratus (Schmidt, 1857)
AY364850
AY364864
Fauveliopsidae
Fauveliopsis scabra Hartman & Fauchald, 1971
AY708537
DQ790034
Flabelligeridae
Diplocirrus glaucus (Malmgren, 1867)
Pherusa plumosa (Müller, 1776)
AY708534
AY708528
DQ790031
DQ790056
Poeobiidae
Poeobius meseres Heath, 1930
AY708526
DQ790058
Sternaspidae
Sternaspis scutata (Ranzani, 1817)
AY532329
DQ790063
Ampharetidae
Auchenoplax crinita Ehlers, 1887
DQ790077
DQ790026
Pectinariidae
Pectinaria gouldi (Verrill, 1873)
DQ790091
DQ790054
Terebellidae
Amphitrite ornata (Leidy, 1855)
Pista cristata (O. F. Mueller, 1776)
Polycirrus sp.f
DQ790074
AY611461
EU418858
DQ790022
DQ790057
EU418866
Trichobranchidae
Terebellides stroemi Sars, 1835
DQ790094
DQ790066
Chaetopteridae
Chaetopterus variopedatus (Renier, 1804)
U67324
AY145399
Spionidae
Polydora ciliata (Johnston, 1838)
Polydora sp.
Prionospio dubia Maciolek, 1985g
Scolecolepis viridis Verrill, 1873h
U50971
EU418859
EU418860
28S
DQ790059
EU418867
EU418868
Trochochaetidae
Trochochaeta sp.
DQ790097
DQ790070
Poecilochaetidae
Poecilochaetus serpens Allen, 1904i
AY569652
EU418869
Apistobranchidae
Apistobranchus typicus (Webster & Benedict, 1887)i
AF448150
EU418870
Clitellata
Lumbricus terrestris Linnaeus, 1758
Lumbricus sp.
Lumbriculus variegatus (Mueller, 1774)
Lumbriculus sp.
Eisenia foetida Savigny, 1826
Eisenia sp.
Stylaria sp.
Erpobdella octoculata (Linnaeus, 1758)
Hirudo medicinalis Linnaeus, 1758
AJ272183
U95946
AF099949
Z83752
DQ790032
DQ790065
AY364865
AY364866
DQ790041
AY040693
DQ790040
AB076887
Aeolosomatidae
Aeolosoma sp.
Z83748
DQ790019
Parergodrilidae
Parergodrilus heideri Reisinger, 1925
Stygocapitella subterranea Knöllner, 1934
Hrabeiella periglandulata Pizl & Chalupsky, 1984
AJ310504
AF412810
AJ310501
AY366514
AY366516
AY364867
Dinophilidae
Trilobodrilus axi Westheide, 1967
Trilobodrilus heideri Remane, 1925
AF412806
AF412807
AY732231
AY894292
Nerillidae
Leptonerilla prospera (Sterrer & Iliffe, 1982)j
AY834758
EU418871
Polygordiidae
Polygordius appendicularis (Fraipont, 1887)k
AY525629
EU418872
Protodrilidae
Protodrilus ciliatus Jägersten, 1952k
Protodrilus purpureus (Schneider, 1868)k
AY525631
AY527057
EU418873
EU418874
Protodriloidae
Protodriloides chaetifer (Remane, 1926)l
Protodriloides symbioticus (Giard, 1904)l
AY527058
AF508125
EU418875
EU418876
Saccocirridae
Saccocirrus sp.m
EU418861
EU418877
Echiura
Arhynchite pugettensis Fisher, 1949
Urechis caupo Fisher & MacGinitie, 1928
AY210441
AF119076
AY210455
AF519268
Sipuncula
Phascolopsis gouldi (Pourtalès, 1851)
AF342796
AF342795
Crassostrea virginica (Gmelin, 1791)
Solemya velum Say, 1822
Yoldia limaluta (Say, 1831)
AB064942
AF120524
AF120528
AY145400
AY145421
AY145424
Outgroups
Bivalvia
Gastropoda
Ilyanassa obsoleta (Say, 1822)
AY145379
AY145411
Polyplacophora
Chaetopleura apiculata (Say, 1834)
AY145370
AY145398
Brachiopoda
Terebratalia transversa (Sowerby, 1846)
Glottidia pyramidata (Stimpson, 1860)
AF025945
U12647
AF342802
AY210459
Nemertea
Cerebratulus lacteus (Leidy, 1851)
AY145368
AY145396
633
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Table 1 (continued)
Sequences that were newly obtained for the present study are indicated in boldface.
a
Locality: 48°32.971’N/122°55.378’W, Lopez Island, WA, USA; Specimen voucher: USNM 1112639.
b
Locality: Little Buttermilk Bay, MA, USA; Specimen voucher: USNM 1112638.
c
Locality: 39°58.397’N/70°40.281’W, Southern New England, MA, USA; Specimen voucher: USNM 1112641.
d
Locality: 41°43.250’N/70°20.150’W, Land grant area 1, Barnstable Harbor, MA, USA; Specimen voucher: USNM 1112637.
e
Locality: 40°53.019’N/70°25.003’W, Southern New England, MA, USA; Specimen voucher: USNM 1112640.
f
Locality: 40°35.596’N/71°31.911’W, Southern New England, MA, USA; Specimen voucher: USNM 1112642.
g
Locality: 39°56.172’N/69°34.573’W, Southern New England, MA, USA; completely used.
h
Locality: 41°41.484’N/70°37.612’W, Barlow’s Landing, MA, USA; Specimen voucher: USNM 1112636.
i
kindly provided by C. Bleidorn (University of Potsdam) (Bleidorn et al., 2003a, 2005).
j
kindly provided by K. Worsaae (University of Kopenhagen) (Worsaae et al., 2005).
k
Locality: North Sea island Helgoland, Germany; completely used.
l
Locality: List, North Sea island Sylt, Germany; completely used.
m
Locality: Giglio, Italy; completely used.
homoplasious (Hennig’s auxiliary principle), the degree of variation is expressed as a percentage of the 27 maximally possible substitutions in a window of nine characters. For example, if such a
window has six constant characters across some number of taxa
and three characters that possess two nucleotide states across
the same taxa, the window would be said to exhibit at least three
substitutions assuming Hennig’s auxiliary principle. The degree of
variation of this window would be 11.1% (3 characters with minimally one substitution/27 maximally possible substitutions). The
window then slides over 6 characters (resulting in an overlap of
three characters with the previous window) and the degree of variation (11.1% here) is assigned to the six characters left behind in
the previous window. The degree of variation is determined anew
each time the window slides down the alignment. The degree
ranges from 0% to 100%. Classes of identity are assigned to 10%
increments of genetic variation. To some degree, this procedure reflects a discrete C distribution (Yang, 1993, 1996).
This sliding-window procedure has a problem, however. Inclusion of an insertion not in frame with the number of characters the
window slides down each step (i.e., six characters herein) has a
shifting effect on all windows downstream from this insertion,
even if the sequence with the indel is otherwise completely identical to sequence(s) present in the alignment. To counterbalance
such effects and increase the robustness of the procedure, we modified the sliding window analyses to restrict the effects of indels to
their vicinity. Instead of six characters, the window is moved by
2.4. Determination of classes of identity
Saturation can be investigated for complete genes or for subsets
of characters (sites) within genes that share a certain property.
Such groups are called classes of identity (Vogt, 2002). Different
sites within genes can show tremendous differences in substitution rates with some characters seemingly static and others experiencing numerous substitutions (Yang, 1993, 1996). Therefore,
various regions of a gene can be treated differently. For structural
genes, like the small and large subunit ribosomal RNA genes, variable and conserved parts have been characterized (e.g., Hassouna
et al., 1984; Mallatt et al., 2001; Zardoya and Meyer, 2001). The former are the so-called divergent domains (D1-12) in 28S and the
variable regions (V1-9) in 18S. Plots on tertiary structure predictions reveal that the degree of variability depends on the distance
from the ribosome center and is continuously changing (Wuyts et
al., 2001). Therefore, we wanted to distinguish between more than
just these two basic classes of identity (conserved vs. variable) to
get more rate categories (classes of identity).
An approach using information contained in the data set seems
to be promising for discriminating more classes of identity. Various
authors have performed sliding window analyses examining the
degree of variation in nucleotide positions (e.g., Jördens et al.,
2004; Pesole et al., 1992; Struck et al., 2002a; Sturmbauer and
Meyer, 1992). Variation is examined within a window of nine characters (Fig. 2). Assuming a priori for each substitution that it is not
A
A
A
T
C
C
C
C
C
G
G
G
T
T
T
T
A
A
A
A
A
A
A
A
G
G
G
G
A
A
G
A
T
T
T
T
T
T
T
T
G
G
G
G
A
A
A
A
A
A
A
A
A
G
C
A
T
C
C
C
A
C
T
G
G ...
C .. .
A .. .
T . ..
0
1
0
0
0
0
1
0
0
0
0
0
2
1
3
3 ...
Alignment
...
...
...
...
Minimally necessary
substitutions
... 1
Sliding window
... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
11.1%
move on 6 characters
... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
14.8%
11.1%
Assigned degree
of variation
... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
11.1%
14.8%
74.7%
Fig. 2. Hypothetical scheme illustrating the traditional sliding window analysis. The minimal number of substitutions necessary to explain a pattern at a position are given
below the alignment. Character positions in the alignment are shown as 61–77, and the gray rectangles indicate the sliding window. Brackets indicate the assigned genetic
variation.
634
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Alignment
...
...
...
...
A
A
A
T
C
C
C
C
C
G
G
G
T
T
T
T
A
A
A
A
A
A
A
A
G
G
G
G
A
A
G
A
T
T
T
T
T
T
T
T
G
G
G
G
A
A
A
A
A
A
A
A
A
G
C
A
T
C
C
C
A
C
T
G
G ...
C .. .
A ...
T ...
Minimally necessary
substitutions
... 1
0
1
0
0
0
0
1
0
0
0
0
0
2
1
3
3 ...
Modified
sliding window
... 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
11.1%
move on 1 character
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
... 61
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
7.4%
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
14.8%
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
33.3%
7.4%
3.7%
3.7%
11.1%
25.9%
13.2%
Position
Assigned degree
of variation
... 61
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ...
36.2%
24.7%
30.5%
15.2%
19.3%
11.1%
12.3%
13.2%
11.5%
22.2%
16.9%
36.2%
28.4%
53.9%
45.3%
70.8%
63.0%
Fig. 3. Hypothetical scheme showing the modified sliding window analysis proposed here, one moves along just one character per step (as opposed to six characters in Fig. 2).
The minimal number of substitutions necessary to explain a pattern at a position are given below the alignment. The numbers 61–77 are the character positions in the
alignment, and gray rectangles indicate the sliding window. Below the modified sliding window analysis the individual degrees of variation for the positions 61–77 are
shown.
0.12
Average contribution
(to genetic variation)
only one character and the variation is determined anew in the
next window. The degree of variation of a character is calculated
by averaging the percentage of variation of the nine windows comprising the character (Fig. 3). Thus, an individual degree of variation is assigned to each character contrary to the shared degree
by six characters. Furthermore, the average contribution of adjacent characters to the genetic variation of a certain character depends on the distance between them (Fig. 4). That is, direct
neighbors of a character contribute most strongly, and the more
sites there are between two characters the less is the contribution;
in a nutshell—the shorter the distance, the larger the influence. At
last, this approach is independent of the direction and the starting
point of the sliding window analysis. Thus, classes of identity (at
10% increments) were determined as in Fig. 3.
For each class of identity, as well as for the complete 18S and
28S alignments, v2 tests for homogeneity of base frequencies
across taxa were performed. Furthermore, single-factor analyses
of variance (ANOVA) in Microsoft Excel were conducted to test
the independence of each class of identity determined by the sliding-window approach from the other classes of identity with respect to the distribution of the genetic distances of the pairwise-
0.10
0.08
0.06
0.04
0.02
0.00
-8 -7 -6 -5 -4 -3 -2 -1 coi 1
2
3
4
5
6
7
8
Positions of character relative to
the character of interest (coi)
Fig. 4. Contribution of nearby positions to the genetic variation of a certain position
depending on the distance to that position.
taxa comparisons. First, we compared the two similar classes of
identity (e.g., the two 0–10% classes) of the two rRNA genes, 28S
and 18S, against each other to test if each was drawn from its
635
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
2.5. Determination of C factor and O/E ratio
For the C factor, any genetic distance could be used, but herein we
used the uncorrected genetic distance p. Values for ti/tv and p were
determined for each pairwise comparison of taxa and the standard
deviations of both distributions were estimated using Microsoft Excel. The C factor was determined for each simulated data set as well
as for each class of identity of the empirical example.
The O/E ratio is the ratio of the CI values with and without the
subset of interest included in the concatenated data set. For the
simulated data sets, equally weighted parsimony analyses using
PAUP*4.0b (Swofford, 2002) and the branch and bound algorithm
were performed to determine the CI values of the concatenated
data sets as well as of the concatenated data sets reduced by each
individual data set in turn (e.g., 0%, 5%, 10%, 25% or 50% noise,
respectively).
For the empirical data, with its many (100+) taxa, we used another, faster strategy to determine the CI values of the complete
alignment as well as of the complete alignment reduced by each
class of identity in turn; that is, the faster heuristic search with
TBR branch swapping was used. Each tree search for the empirical
data was done in at least two steps. In the first step, we used the
option ‘‘random taxon addition” with 1000 replicates and a chuck
score of 1. A maximum of 50 trees were saved per replicate to
screen the tree space faster for the best solution(s). In the next
step, the chuck score was increased to the value of the most-parsimonious solution plus 1 additional step (and all other settings remained unaltered) to obtain all most-parsimonious trees with the
score of the first step. If the second step achieved a better solution
than the first pass, the second step was repeated with a newly adjusted chuck score. The search was stopped when the last step resulted in the same most-parsimonious tree length as the previous
step. Usually two passes were necessary; only occasionally a third
pass was necessary. In all analyses, gaps in the alignment were
treated as missing data.
scribed by Posada and Buckley (2004) using Modeltest V 3.06 (Posada and Crandall, 1998, 2001). The most likely tree was
reconstructed in PAUP*4.0b (Swofford, 2002) using the indicated
parameters, 10 random taxon additions and TBR branch swapping.
Furthermore, for each data set we also used the best tree of BI as an
individual starting tree to test if a better ML tree could be obtained.
The reliability of phylogenetic nodes was estimated in three
ways: ML, maximum parsimony (MP), and neighbor-joining
(NJ). First, using GARLI v0.951 (Zwickl, 2006), we determined
ML bootstrap (BS) values of 100 replicates employing the
GTR + I + C model and the option ‘‘gentreshfortopoterm” set to
5000. Second, in a MP-BS analysis 100 BS replicates were conducted with 10 random taxon additions per replicate, TBR
branch swapping, chuck score set to 1 and no more than 50
trees kept. Third, 10,000 BS replicates using a NJ search algorithm and ML settings were performed.
3. Results
3.1. Simulation studies
Fig. 5 shows how the C factor varies with noise and with scale
factor in our simulated data sets. The C factor decreases with
increasing noise (Fig. 5A) and with increasing scale factor (Fig.
5B) and thus with increasing homoplasy. The decreases follow a
power function. Whereas at 0% noise or a scale factor of 0.1 the C
factor is larger than 100, at 25% random data (i.e., 0.25 noise) or
a doubled tree length (i.e., scale factor of 2) it is around 10, and
at higher degrees of random data it drops to values around 5. This
shape of the curve has to be expected, because the C factor decreases either as the standard deviation of the ti/tv distribution decreases, or as the standard deviation of the distribution of the
genetic distances increases. In the case of increased saturation
and homoplasy, both effects occur at the same time and thus the
decrease of the C factor is amplified.
250
200
C factor
own underlying probability distribution. Second, we compared
each class of identity to its directly adjacent classes of identity
within each gene. For example, the 10–20% class of the 18S gene
was compared to the 0–10% class as well as to the 20–30% class
of the 18S gene. Based on the ANOVA test, we wanted to ensure
that each determined class of identity was truly distinct from the
other ones and not just due to chance.
150
100
2.6. Phylogenetic analyses of the empirical data
50
0
0
0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Noise
250
200
150
C factor
Phylogenetic analyses using Bayesian Inference (BI) and ML
were conducted on both the complete data set (ALL) and with saturated positions excluded (EX). For BI, MrModeltest 1.1b (Nylander, 2002) was used to determine appropriate models of sequence
evolution for both genes, 18S and 28S. Then, to reconstruct trees,
the parallel version of MrBayes 3.1 (Altekar et al., 2004; Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003) was
used. The model parameters were set in accordance with the specified model for each gene separately. Thus, a partitioned likelihood
analysis of the combined data set was implemented. Two simultaneous runs with four Markov chains each ran for 1.5 107 generations. Sampling every 250th tree, 2 60,001 trees were sampled
in total. Based on average standard deviations above 0.1 for the
implemented diagnosis feature, the first 18,000 trees in each run
were discarded as burn in. The majority-rule consensus tree containing posterior probabilities (PP) of the topology was determined
from the remaining 2*42,001 trees.
For the ML analyses, appropriate models of sequence evolution
for the data sets were assessed by model averaging procedures
using the weighted Akaike Information criterion (=AIC) as de-
100
50
0
0
2
4
6
8
10
12
Scale factor
Fig. 5. Plots of the C factor versus noise (A) and scale factor (B). Mean values with
standard deviation are provided.
636
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
1.02
1.01
1
O/E ratio
0.99
0.98
0.97
0.96
0.95
0.94
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Noise
1.02
1.01
1
O/E ratio
0.99
0.98
0.97
0.96
0.95
0.94
0.93
0
4
2
6
8
10
12
Scale factor
Fig. 6. Plots of the O/E ratio versus noise (A) and scale factor (B). Mean values with standard deviation are provided.
Fig. 6 shows how the O/E ratio varies with noise and with scale
factor in our simulated data sets. The O/E ratio remains relatively
constant at 1.00–1.01 in the range from 0% to 10% noise (Fig. 6A)
and the standard deviations overlap in this range. The same is true
for scale factors in the range of 0.1–2 (Fig. 6B). However, the O/E
ratio decreases sharply, to values of about 0.95, with 50% noisy
1.04
Scale
factor
Noise
1
O/E ratio
increasing homoplasy
1.02
0.98
0.96
0.94
0.92
0.9
0.88
0
50
100
150
200
250
300
C factor
increasing homoplasy
Fig. 7. Plot of the O/E ratio versus the C factor for the simulated data that were generated using either scale factors (gray triangles) or degrees of noise (black squares).
637
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
data or a 10-fold longer tree length. Given that the O/E ratio measures the degree of homoplasy relative to the complete data set,
such a result is not surprising in our simulated concatenated data
sets. A partition with a high degree of homoplasy (e.g., 50% noise)
can increase the O/E ratio of another partition even though this one
has substantial homoplasy itself (e.g., 25% noise). The addition of
the first one increases the average homoplasy of the complete data
set while the degree of homoplasy in the other partition remains
unaltered by the addition. Thus, due to the addition of the first partition the O/E ratio of the other partition increases, because it is relatively less homoplasious in comparison to the new average degree
than before (see also Discussion).
Fig. 7 shows the relation of the C factor to the O/E ratio in our
simulated data sets. The C factor correlates positively and linearly
with the O/E ratio at lower values, when homoplasy is high (Fig.
7). Overall, the figure shows that an increasing C factor indicates
an increasing O/E ratio. However, the correlation does not stay
linear. The O/E ratio increases strongly, from 0.9 to 1, up to a C
factor of 10 and then it levels off. At a C factor of 20 and above
the O/E ratio has a constant value of 1 or slightly higher. The value of the C factor declines more sharply with increasing homoplasy (scale or noise) than does the O/E ratio, as is obvious from
comparing Figs. 5 and 6. More specifically, C starts declining at
far lower levels of homoplasy (>0 in Fig. 5) than does O/E (>10%
or >2 in Fig. 6). Thus, C is a much more sensitive indicator of
homoplasy. However, both C factor and O/E ratio correlate with
homoplasy and with each other when the degree of homoplasy
increases.
3.2. Classes of identity in the annelid example
The aligned, empirical, data set contained 2565 positions for the
18S and 5775 for the 28S partition, from which GBlocks excluded
1345 and 3429 ambiguously aligned positions, respectively. In
the final alignment of 3566 positions 1368 positions were parsimony-informative, 1684 constant and 514 sites parsimony uninformative. In the sliding window analyses, 10% increments of
variability could be determined ranging from 0–80% in the 18S partition and from 0–100% in the 28S partition. In both, most positions
were within the classes of identity with low variability (Table 2: 0–
10%, 10–20%, etc.) and higher classes of variability contain progressively fewer characters. The last two increments of the 28S partition (80–90% and 90–100%) were pooled, because the latter one
would have had only 28 positions, which is a rather small sample
size for the calculations of the C factor and O/E ratio. That is, too
small a sample size would produce too much random error.
The complete data partitions (‘‘ALL”), as well as every class of
identity within partitions, showed homogeneity of base frequencies across taxa (Table 2).
Finally, we checked if either two neighbouring classes of identity within a partition (e.g., 0–10% with 10–20% within the 18S partition) or the same classes of identity of two partitions (e.g., 0–10%
of 18S and 28S partitions) had to be merged with each other, because their characters were drawn from the same underlying probability distribution. To the contrary ANOVA indicated that the
characters of each class of identity were drawn from independent
underlying probability distributions and thus no classes have been
merged (see the right side of Table 2). This means that the distribution of genetic distances of pairwise-taxa comparisons was unique
for each class of identity and never similar or identical to the distribution of another one. This holds even though the mean-genetic
distances of some classes of identity are very similar (see the
points that nearly touch in Fig. 8).
3.3. Detection of saturation in the annelid example
Fig. 8A shows how the C factor varies with increasing genetic
distance across annelid taxa in our empirical example. With
increasing genetic distance the C factor decreases following a
power function. Classes of identity from 0–30% for the 28S and
from 0–20% for the 18S achieve values of more than 100, whereas
classes above 70% show values below 10 accompanied by much
higher genetic distances. Thus, there is a difference of an order of
a magnitude. This resembles the results of our simulation studies,
which exhibited a strong decrease of the C factor with increasing
homoplasy. Hence, the higher the variability in the class of identity
the stronger is the degree of saturation indicated by the C factor.
Turning to O/E ratio, all classes of identity in the 18S with degrees of variation below 60%, as well as in the 28S below 50%, have
ratios larger than or close to 1, whereas all other classes have values below 1 (Fig. 8B). Generally, O/E ratio decreases with increasing
variability. However, while the classes of identity of the 18S in the
range from 0–50% degree of variation remain more or less constant
the classes of the 28S exhibit an optimum of the O/E ratio in the
class of 20–30% variability.
The relation of the C factor to the O/E ratio is similar to our simulation studies (Fig. 8C). The O/E ratio increases sharply up to 1 (or
slightly higher) at low C factors and then starts to level off at C factors in the range from 20 to 30. At higher C factors the O/E ratio
might be viewed as levelling off at a value around 1.01, except
for two maverick points for 28S. Furthermore, all six classes of
identity possessing an O/E clearly smaller than 1 have a C factor
equal to or smaller than 20, whereas nearly all others have a C
above 20 (Fig. 8C). The lowest O/E value can be observed in the
class of identity 80–100% of the 28S partition, which has also the
lowest C factor (Fig. 8A and B).
We decided to treat all classes of identity with an O/E ratio
clearly below 1 (i.e., 0.99) as saturated, for three objective reasons.
Table 2
Homogeneity of base frequencies and ANOVA results for the genetic distance p of the classes of identity of 18S and 28S
Classes of identity
All
0–10%
10–20%
20–30%
30–40%
40–50%
50–60%
60–70%
70–80%
80–100%
Number of alignable characters
Homogeneity of base frequencies
18S
28S
18S
28S
1220
250
251
209
138
133
99
87
53
n.a.
2346
432
431
435
269
224
209
139
104
103
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
n.a.
0.99
1.00
1.00
1.00
1.00
1.00
1.00
1.00
1.00
0.99
Significant values are in bold. n.a., not applicable.
ANOVA results for the genetic distance p
18S vs. 28S
0–10%
10–20%
20–30%
30–40%
40–50%
50–60%
60–70%
70–80%
80–100%
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
0.005
n.a.
Comparison
18S
28S
All
0–10%
10–20%
20–30%
30–40%
40–50%
50–60%
60–70%
70–80%
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
n.a.
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
<0.001
vs.
vs.
vs.
vs.
vs.
vs.
vs.
vs.
10–20%
20–30%
30–40%
40–50%
50–60%
60–70%
70–80%
80–100%
638
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
300
18S
28S
0-10%
250
Power
C factor
200
150
10-20%
20-30%
100
30-40%
50
50-60%
40-50%
60-70%
70-80%
80-100%
0
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
uncorrected genetic distance p
1.06
O/E ratio
1.03
1
20-30% 30-40%
10-20%
0-10%
40-50%
0.97
50-60%
60-70%
70-80%
0.94
80-100%
0.91
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
uncorrected genetic distance p
1.04
1.03
1.02
O/E ratio
1.01
1.00
0.99
0.98
0.97
0.96
0.95
0.94
0
50
100
150
200
250
300
C factor
Fig. 8. Plots of both the C factor (A) and the O/E ratio (B) versus the genetic distance p for the actual data set, as well as a plot of the O/E ratio versus the C factor (C), all for the
empirical data. The classes of identity of the 18S (black diamonds) and 28S (black squares) are indicated in the plots. In (A), the power function of the C factor is shown.
First, an O/E ratio below 1 indicates that the degree of homoplasy in
this class of identity is worse than the average degree in the complete data set. Second, given the relation of C factor and O/E ratio
(Figs. 7 and 8C) the asymptote of the O/E ratios was in the range
from 1.00 to 1.01. Third, as mentioned in the previous paragraph,
all O/E ratios with values clearly below 1 were generally accompanied by low C factors of 20 or less (Figs. 8C and 7). This means we
eliminated in the saturation-excluded EX data set all classes of
identity with a degree of variation above 60% and also the class
of identity 50–60% of the 28S. Such a strict and conservative procedure is similar to the one used by Struck et al. (2007).
3.4. Comparison of 28S fragments
Of the three 28S fragments that have been employed in the literature, the D9/10 of annelids has a C factor of about 100, the
639
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Table 3
Comparison of the three different 28S fragments used in annelid phylogeny
Fragment
D1 (0.3 kb)
2.2 kb
D9/10 (0.6 kb)
C factor
O/E ratio
Genetic distance p
With s.c.
Without s.c.
With s.c.
Without s.c.
With s.c.
Without s.c.
9.86
15.66
103.77
51.46
64.77
159.80
0.97
0.93
1.03
1.01
1.06
1.03
0.1357
0.1010
0.0382
0.0477
0.0508
0.0269
s.c., saturated characters.
2.2 kb fragment has a C factor around 15, and D1 has a C factor
around 10 (Table 3 with s.c.). Deleting supposedly saturated positions increases the C factor in all three fragments. In fact, the C
factors for the 2.2 kb and D1 fragments increase markedly, by a
factor of 4–5. The O/E ratio of D9/10 is above 1 and is unaltered
by the exclusion of saturated positions (Table 3). On the other
hand, the O/E ratios of the other two fragments are below 1
and deleting saturated characters from these fragments increases
their O/E ratios to above 1. This effect is most marked for the
2.2 kb fragment, where an O/E ratio of 0.93 increases to 1.06. In
any case, we have found a way to assure O/E ratios are above
our threshold of 1, for all three fragments and therefore to minimize homoplasy.
3.5. Phylogenetic analyses of the annelid example
For the annelid example of nearly complete 18S and 28S data,
two data sets were analyzed, a complete one with only ambiguously aligned positions excluded, but saturated positions retained
(ALL) and another one with the possibly saturated positions also
excluded (EX). ML analyses of both data sets were conducted in
two ways; that is, by using starting trees obtained by random taxon addition, and by using the best tree of the BI as a starting tree. In
both cases, the tree obtained using the best BI tree achieved a better (smaller) ln L value than did the standard procedure of random taxon addition. Precisely, for the EX data set, ln L using
random taxon addition was 31,647.03318 and using the best BI
tree 31,643.88339. For the ALL data, the values were
62,622.70950 and 62,612.01527, respectively. Thus, for taxon-rich
data sets its seems more promising to use near optimal starting
trees obtained by other procedures such as BI to increase the probability to reconstruct the overall most likely tree. Additionally,
such a tree search needed only about one third of the CPU time
in our analyses.
Analyses excluding possibly saturated characters (EX) had just
four nodes showing zero branch length (Fig. 9) and the analyses
not excluding these characters (ALL) had only one (Fig. 10). Thus,
overall the resolution in the trees is good, but support is low beyond the ‘‘family” level in both. This is indicated by the paucity
of support symbols on the basal nodes along the ‘‘backbones” (left
side) of the trees. Thirty-eight of the 104 internal nodes of the EX
data set exhibited maximum-parsimony-bootstrap support (MPBS) P 70, and for 32 of these nodes the MP-BS were P95. Similar
values were obtained for neighbor-joining bootstrapping (NJ-BS):
38 nodes with NJ-BS P70 and 31 nodes P95. Forty-five of the
internal nodes had Bayesian posterior probabilities (PP) P0.95. Finally, 40 nodes had a maximum-likelihood-bootstrap support (MLBS) P70, and 32 of these were P95. In analyses of the ALL data set,
47 nodes had MP-BS values P70, and 41 were P95. Fifty-two
nodes exhibited ML-BS P70, and 40 of these nodes were P95.
Similar results were obtained for NJ-BS and PP values: 47 nodes
P70, 35 nodes P95 and 64 P 0.95, respectively. Thus, the ALL data
(with more characters) consistently yielded more supported nodes
than did the EX data (with fewer characters), meaning that support
and resolution increased with the number of nucleotide positions.
Comparing the two trees (Figs. 9 and 10) reveals that for both data
sets these support values favor more recent events and are not randomly spread out.
For both data sets, only a few taxa beyond the recognized ‘‘family” level are supported significantly as indicated by bootstrap values. Monophyly of Aphroditiformia as well as of Clitellata is
substantiated. The close relationships of Echiura and Capitellidae
and of Eunicidae and Onuphidae (Bleidorn et al., 2003a, 2003b;
Struck et al., 2006, 2007) are significantly corroborated. Additionally, in the analyses of the ALL data set (Fig. 10), monophyly of
Terebelliformia excluding Pectinariidae achieves significant support by ML-BS and MP-BS as does the close relationship of Spionidae, Trochochaetidae and Poecilochaetidae. These results
corroborate previous findings (e.g., Bleidorn et al., 2003b; Colgan
et al., 2006; Rousset et al., 2007; Struck and Purschke, 2005; Struck
et al., 2007; Worsaae et al., 2005).
For this study we also obtained new 18S and/or 28S data for
some polychaete taxa. Our analyses allow tentative conclusions
regarding some of these taxa. Both analyses (Figs. 9 and 10)
grouped Spionidae, Trochochaetidae and Poecilochaetidae together. This clade gathered significant support in the analyses
using ALL data (Fig. 10). This result is in agreement with morphological data, which place these taxa in Spionida (Rouse and Fauchald, 1997; Rouse and Pleijel, 2001). On the other hand, the other
spionidan taxon for which we obtained new data, Apistobranchidae, is sister to the sabellidan taxon Oweniidae in both analyses
(Figs. 9 and 10). In this study coverage of Opheliidae was increased
in comparison to recent studies (Colgan et al., 2006; Struck et al.,
2007) and Opheliidae were recovered as sister(s) to Clitellata, albeit with weak support (Figs. 9 and 10). This was also found in
an analysis based on 18S data alone (Struck et al., 2005). Most
new sequences were obtained for the former archiannelidan taxa
Dinophilidae, Nerillidae, Polygordiidae, Protodrilidae, Protodriloidae, and Saccocirridae (see Hermans, 1969). Monophyly of the
‘‘Archiannelida” has been rejected based on morphological and
molecular data (Purschke and Jouin, 1988; Struck et al., 2005,
2002a,b; Westheide, 1997), which is further substantiated by our
results (Figs. 9 and 10). Monophyly of the morphologically well-defined taxon Protodrilida comprising Protodrilidae, Protodriloidae,
and Saccocirridae (Purschke and Jouin, 1988; Rouse and Pleijel,
2001) was also not recovered by our analyses (Figs. 9 and 10). This
is congruent with results mainly derived from 18S data (Hall et al.,
2004; Rousset et al., 2007; Struck et al., 2002a; Struck and Purschke, 2005).
3.6. Number of positions
Because our larger ALL data set with more positions (Fig. 10)
gave better resolution than our smaller EX data set with fewer
positions, we used simulations to explore the effect of number of
positions on the quality of tree reconstructions, with simulation
settings derived from a smaller, 10-taxon annelid example. The
relation of number of positions to saturation, recovery of the true
tree, and phylogenetic method employed was investigated using
simulated data sets with 100 up to 5000 positions. (Note that there
640
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Fig. 9. Cladogram of the ML tree excluding possibly saturated positions (EX) leaving 2871 total positions. ln L = 31,643.88339. Support values are indicated at the branches.
Model parameters: base frequencies: A = 0.2659, C = 0.2057, G = 0.2943, T = 0.2341; substitution rates: A M C = 1.3007, A M G = 3.0178, A M T = 1.2128, C M G = 0.8623,
C M T = 6.8312, G M T = 1.0000; a = 0.4900, number of categories = 4; proportion of invariant sites = 0.3810. A phylogram, indicating the actual branch length, is provided in
Supplementary Figure 1.
was no deliberately introduced noise this time; the different-sized
data sets were increases in constant quality.) With an increasing
number of positions, the C factor slightly decreases (Fig. 11). On
the other hand, recovery of the true tree (Fig. 1) increases with
an increasing number of positions even though the C factor indicates that saturation is increasing (Fig. 11).
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
641
Fig. 10. Cladogram of the ML tree not excluding possibly saturated positions (ALL) with 3566 total positions. ln L = 62,612.01527. That is, this is from a larger data set.
Support values are indicated at the branches. Model parameters: base frequencies: A = 0.2410, C = 0.2331, G = 0.3008, T = 0.2251; substitution rates: A M C = 0.9835,
A M G = 2.3338, A M T = 1.1084, C M G = 0.8864, C M T = 4.8116, G M T = 1.0000; a = 0.5726, number of categories = 4; proportion of invariant sites = 0.3803. A phylogram,
indicating the actual branch length, is provided in Supplementary Figure 2.
However, the chance of recovering the true tree depends on the
reconstruction method used. Neither ML nor MP recovers the true
tree with only 100 or 300 positions. Starting with 500 characters,
the chance of recovering the true tree increases linearly with
increasing number of positions for both ML and MP (R2 for ML is
0.997 and 0.826 for MP). Probability of recovering the true tree
642
0.9
40
0.8
35
0.7
30
0.6
25
0.5
20
0.4
15
0.3
10
0.2
5
0.1
0
0
1000
2000
3000
4000
ML MP
45
Ptrue
C factor
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
0
5000
Positions
Fig. 11. Plots of the C Factor and Ptrue versus number of positions in a range of simulated data sets. Mean values with standard deviation are provided for the C factor. Ptrue
shows the probability of recovering the true tree for ML (open squares) and MP analyses (open triangles).
increases much more quickly with ML (5000 positions = 85%
chance of recovery) than MP (5000 positions = 15%). By extrapolating these linear correlations, we can expect to recover the true tree
with a probability of 0.95 at approximately 5500 positions for ML
and 31,500 positions for MP. For a probability of 1.0 we would at
least need 5800 and 33,000 positions, respectively. Similar discrepancies between ML and MP approaches can be seen in a fungal
example for bootstrap support and non-stationary genes (Figs. 2
and 4 of Collins et al., 2005).
4. Discussion
4.1. Detection of saturation
C factor correlates with saturation, providing a more objective
mean than traditional approaches (e.g., Halanych and Robinson,
1999; Jördens et al., 2004; Struck et al., 2002a) of assessing saturation. Traditional approaches using plots are rather subjective because the assessment whether a partition is saturated depends
on the subjective interpretation of graphs rather than on actual
values. Furthermore, assessing the saturation of more than three
to four partitions (see for example the Supplementary Materials
of Struck et al., 2005; and Struck et al., 2007), the representation
within one graph becomes less and less applicable, though such
a comprehensive display does promote comparability and thus
decision-making. Therefore the C factor, which condenses the
information of the degree of saturation, shown in the graph, into
a single value, allows more objective comparisons across partitions
and to other analyses.
Based on the notion of entropy in information theory, Xia et al.
(2003) proposed another index of substitution saturation (Iss).
Whereas the C factor is based on pairwise comparisons of taxa,
Iss assesses saturation along the positions. Iss is the ratio of the average information entropy of positions to the expected entropy of the
alignment. The smaller the Iss value, the less substitution saturation
has occurred. The determination of Iss is based only on nucleotide
frequencies in the alignment and thus is independent from any
model of phylogenetic reconstruction or substitution. The C factor
herein is also independent of such models: Prior to any tree reconstruction and without any explicit model assumption, the uncorrected genetic distance p between pairs of taxa as well as ratios
of the transition and transversions events were calculated. However, in contrast to Iss the C factor can also be determined a posteriori and/or based on explicit model assumptions for both genetic
distance and ti/tv. For example, the genetic distance between two
taxa can also be determined based on an underlying tree and an
explicit substitution model. Therefore, alteration in the C factor
can be used in future studies to investigate if certain models increase this factor and thus are able to alleviate the saturation problem at least in part.
Unfortunately, the C factor and Iss do not provide an intrinsic
threshold indicating when saturation becomes problematic in phylogenetic reconstructions. Simulation studies can be conducted to
obtain such threshold values for C factor or Iss (Xia et al., 2003).
For sets of different topologies, substitution parameters, numbers
of positions and taxa, values of the C factor or Iss can be determined
below which the true tree is not recovered with a certain percentage, e.g., 95%. This generates a suite of critical C factor or Iss values
that can be used to assess values of empirical data sets for their potential to mislead phylogenetic reconstructions (Xia et al., 2003).
However, this procedure always bears the risk that such a suite
of critical values is not fully appropriate for the empirical data at
hand. For example, if the suite is based on topologies with terminal
OTUs being equidistant to the root, in contrast to the topologies of
the empirical data. Another approach would be to use the parameters of the empirical data for the simulations, parameters such as
number of taxa and positions, substitution parameters and topology of the best tree. However, this approach always includes the
possibility of circularity: This is because, first, the parameters obtained from the empirical data (e.g., branch length and model tree)
are used to conduct the simulation studies; and second, these simulation studies are used to assess whether some parameters obtained from the empirical data set (e.g., topology) are misled by
saturation. But if, in the empirical data some long-branched taxa
grouped together due to increased substitution rates and saturation, this tree including the artificial group would be taken as the
model tree for the simulation studies. Therefore, an indication that
saturation poses no problem to the placement of the longbranched taxa by the empirical data may not be reliable, because
the simulation studies were based on an artificial result in the first
place.
Therefore, to avoid this uncertainty we combined the C factor
with the O/E ratio, which is an estimator independent from the C
factor. While the C factor assesses saturation, the O/E ratio determines the degree of homoplasy in a partition relative to the complete data set. As expected, the correlation of C with O/E (Figs. 7
and 8) showed that saturation results in an increased degree of
homoplasy. In contrast to the C factor, the O/E ratio contains an
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
intrinsic threshold value of 1. Values above 1 indicate that the partition is less homoplasious than the average of the complete data
set and values below 1 that it is more homoplasious. In our analyses, O/E values clearly below 1 corresponded with a C factor below
10, and C factors in the range of 10–20 were associated with O/E
ratios below or close to 1 (Fig. 7 and 8C). At higher C values, the
O/E ratio leveled off. Therefore, the critical threshold of the C factor
may be in the range of 10–20, because this is where the drop-off of
the O/E ratio occurred.
However, this range and the correlation between C and O/E
should be further investigated by employing a wide variety of
empirical data. Although the degrees of homoplasy and saturation
are correlated in the present study, this might not hold up in all instances. For example, a data set showing no or only slight saturation can still exhibit some homoplasy if there are numerous
parallel mutations, as might occur under strong selection. Thus,
there will be O/E values below 1 in parts of such a data set, without
indicating saturation. Conversely, if a data sets shows saturation
overall, some parts of it will still have a O/E ratio above 1 among
the many classes of identity. Moreover, because the O/E ratio is a
relative measure to the degree of homoplasy of the complete data
set, there will always be values above and below 1. Additionally, if
a single class of identity covers all positions of a data set, the average of O/E ratios of the ‘‘classes” of identity is 1. Thus, the relative
O/E ratio can only be used as an additional, supportive measurement for saturation in combination with direct factors such as
the C factor. For the annelid ribosomal data, we excluded all partitions with an O/E value below 1 from the subsequent phylogenetic
analysis. Given these arguments, this cut-off point might have been
overly conservative, though the values correlated with low C factors below 20.
O/E ratio is valuable not only for comparing possibly saturated
characters, but also for assessing the quality of different data partitions in a concatenated analysis in terms of degree of homoplasy:
The smaller the O/E ratio of a partition, the more homoplasy it contains (in comparison to the complete data set and other partitions
in the analysis). Because it is a parsimony measurement and it is
calculated using CIs, the O/E ratio can directly compare different
types of characters such as amino acid, nucleotide and morphological traits, within one data set and one approach. Furthermore, in
contrast to only using CIs of individual partitions, O/E ratios can tell
how CIs are altered when positions are deleted or added. Alteration
approaches are able to reveal not only contributions of individual
partitions, but also their influence on other partitions in the data
set (e.g., Gatesy et al., 1999; Struck, 2007).
4.2. Possibly saturated positions and Annelida phylogeny
Though exclusion of possibly saturated positions can improve
phylogenetic reconstructions (e.g., Simon et al., 1994), this approach seems not be helpful for the reconstruction of the annelid
phylogeny using 18S and 28S sequence data. Instead, we found
that using more positions (3566 nt, Fig. 10, ALL) was better than
using fewer by excluding some (2871 nt, Fig. 9, EX). That is, both
nodal support and tree resolution increased with increasing number of positions (cf. Figs. 9 and 10). Our EX analyses showed results
similar to previous analyses (e.g., Colgan et al., 2006; Rousset et al.,
2007; Struck and Purschke, 2005), with low resolution and paraphyly of many clades – even of clades that are well-supported by
anatomical characters, such as Phyllodocida and Eunicida (Fig. 9).
On the other hand, though the support is still weak, analyses based
on the ALL data set (Fig. 10) showed some progress in comparison
to the ones based on EX. Some of the clades recovered in this ALL
tree have also been found by Struck et al. (2007) and/or are supported by morphological data. The analyses of Struck et al.
(2007) used three nuclear genes and included nearly complete
643
18S and 28S, but excluded possibly saturated positions. These analyses were based on roughly the same number of parsimony-informative characters as the ALL analyses in the present study. For
example, Struck et al. (2007) also recovered a clade comprising
all Phyllodocida, Eunicida and Orbiniidae, albeit given low support
(Fig. 10). Unlike the earlier analysis, this analysis (Fig. 10) also includes Parergodrilidae, and places this taxon in the clade as well; in
agreement, a close relationship of Orbiniidae and Parergodrilidae
has been shown previously (Bleidorn et al., 2003a; Erséus and Kallersjo, 2004; Jördens et al., 2004; Struck et al., 2002a). As recovered
here, Struck et al. (2007) found a monophyletic Terebelliformia and
basal positions of Amphinomida and Chaetopteridae, but again
with low support. Basal Chaetopteridae has also been shown in
the analyses of Colgan et al. (2006). Our and the analyses of Struck
et al. (2007) recovered with low support part of Spionida as being a
subgroup of paraphyletic Sabellida. Our analyses based on ALL data
recovered Cirratuliformia with low support, a group proposed by
Rouse and Fauchald (1997).
The conclusion that adding more positions outweighs the impact of reducing saturation in annelid phylogeny in this study is
further substantiated by the simulation studies (Fig. 11), which
were based on actual substitution parameters of annelid taxa. With
an increasing number of positions, recovery of the true tree increases tremendously using ML methods, which is superior to
MP in reconstructing annelid phylogeny. This agrees with previous
studies showing the general superiority of ML methods over MP
methods in reconstructing difficult trees—because ML can account
for multiple substitutions at a site (e.g., Gadagkar and Kumar,
2005; Hall, 2005; Huelsenbeck, 1995). With more than 5000 positions modeled for the 18S and 28S sequences, the ‘true’ phylogeny
of the 9 annelid OTUs can be recovered with high certainty using
ML (right side of Fig. 11). Thus, for small numbers of annelid taxa,
the ‘true’ phylogeny can be recovered with high certainty using
fewer than 10,000 positions (though the nodal support might still
be weak). However, with increasing numbers of taxa, the amount
of positions needed to recover the ‘true’ phylogeny with high certainty will likewise increase. (Note that although these considerations are valuable in showing that more sites yield better
resolution, they are only of limited practical value because the nuclear rRNA-gene family has only about 5400 usable sites, and this
number cannot be increased.)
Our results of empirical and simulated data showed that
increasing the amount of nucleotide positions outweighed the negative impact of saturated positions in the reconstruction of annelid
phylogeny using nuclear rRNA data. Similar results have been obtained for a plastid gene in land plants and mitochondrial genes
in birds (Crowe et al., 2006; Källersjö et al., 1999; Nickrent et al.,
2000). In other studies, the third codon positions of protein-coding
genes have been especially problematic due to saturation (Graybeal, 1993; Halanych and Robinson, 1999; Xia et al., 2003). Third codon positions of mitochondrial genes in annelids exhibit high
mean-genetic distances, of about 0.6 (see Supplementary Material
in Struck et al., 2007). This means that 60% of the third codon positions are different between any two annelid taxa. Additionally,
these third codon positions do not show homogeneity of base frequencies across taxa (Supplementary Material in Struck et al.,
2007). Thus, these positions begin to approximate random data.
Therefore, neither should saturated positions be excluded a priori
before any phylogenetic analysis of these positions nor should
one fail to investigate saturation entirely, but the effect of saturation on the phylogenetic reconstruction should be assessed for
each data set anew. The C and O/E factors provided here may help
to guide the decision more objectively.
Finally, increasing the number of characters will not always
help. For example, ambiguously aligned positions should always
be excluded from the data set a priori, because robust hypothesis
644
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
about positional homology are important. Positional homology is
the fundamental principal in phylogenetic analyses at the sequence level of nucleotides or amino acids and thus its violation
has strong impacts on both the reconstruction procedure itself
and its philosophical underpinnings (Swofford et al., 1996).
saturated positions. Therefore, with respect to 28S rDNA, using
nearly complete 28S sequences is preferable to using small chunks
of the gene.
4.3. Comparison of 28S fragments
We thank Christoph Bleidorn, Monika C. Müller, Anna Paul
and Katrine Worsaae for providing tissue material. The crew
of the R/V Point Sur and R/V Oceanus were most helpful in
obtaining samples. For both cruises, we also acknowledge the
help of the scientific crews (which are too numerous to list
here). Friday Harbor Laboratories, University of Washington, is
also acknowledged for their support. Computational assistance
on GUMP (Genomics Using Multiple Processors) at Auburn University was kindly provided by Scott Santos. We also thank two
anonymous reviewers for important contributions to the manuscript. This work was supported by the USA National Science
Foundation (NSF) WormNet Grant (EAR-0120646) to K.M.H.,
NSF OCE-0425060 to K.M.H. and T.H.S., and by the Grant
DFG-STR-683/3-1 from the Deutsche Forschungsgemeinschaft
to T.H.S. This work is AU Marine Biology Program contribution
#37.
Given the above considerations about sequence length, when
working with 28S data, it is preferable to use the nearly complete
sequence than just small chunks of the gene. Of the three different
fragments used in annelid phylogeny the D9/10 fragment is best in
having the least genetic distance and the highest C factor (Table 3).
Furthermore, the O/E ratio is clearly above 1 and is not tremendously altered due to the exclusion of possibly saturated positions.
Thus, this fragment seems not be hampered by saturation.
The 2.2 kb fragment, on the other hand, has the benefits of
being longer and of achieving the best O/E ratio after exclusion of
possibly saturated positions. On the other hand, before removal
of its saturated characters, this fragment has the worst O/E ratio
with 0.93. Thus, the degree of homoplasy of this fragment is
strongly influenced by those partitions in it that have low C factors
and thus, saturated characters. This strong impact might be due to
the fact that this 2.2 kb fragment contains the large D2, D3, and D8
divergence domains, which are highly variable and homoplasious
(THS and KMH pers. obs. and see Hassouna et al., 1984; Hillis
and Dixon, 1991; Mallatt et al., 2001; Zardoya and Meyer, 2001).
Furthermore, the marked increase in the O/E ratio upon removing
saturated sites from the 2.2 kb piece shows the correlation of degree of homoplasy with the C factor: the higher the factor, the less
the homoplasy.
The lowest C factor is achieved for the D1 fragment (Table 3). It
is below 10 and thus similar to the values of the classes of identity
showing clear saturation (Fig. 8). Furthermore, and in contrast to
the 2.2 kb fragment, the exclusion of possibly saturated characters
from D1 only raises the low O/E ratio slightly above 1, even though
it discards more than half of the already small fragment of 0.3 kb
(Brown et al., 1999). Additionally, the D1 fragment shows the highest genetic distance (across taxa) of all fragments. The D1 fragment
is located at the 50 end of 28S rDNA and contains the D1 divergence
domain, which is like the D2, D3 and D8 divergence domains
highly variable and homoplasious in annelids (THS pers. obs.). High
substitution numbers in the D1 domain have also been shown for
vertebrates as well (Zardoya and Meyer, 2001). Therefore, the D1
fragment is the least favorable one of the three in terms of both
size and properties. The D9/10 fragment is superior to it in terms
of properties and the 2.2 kb one in terms of size.
5. Conclusions
Herein we introduce the C factor, a better way to assess saturation than is plotting ti/tv ratios against genetic distances. Our results showed that the C factor correlates with saturation as well
as homoplasy. To determine the amount of homoplasy in a partition relative to that in the complete data set, we also developed
the O/E ratio. Both factors in combination allow a more objective
way to decide whether or not positions are saturated and are thus
misleading the phylogenetic reconstruction. However, we also
showed using empirical and simulated data that the effect of saturation can be counterbalanced by increased numbers of nucleotide
positions if appropriate methods of phylogenetic reconstruction,
such as ML, are used. For the annelid example employed here,
based on recovered clades, nodal support and resolution, the
increasing number of positions is more beneficial for improving
tree resolution in phylogenetic reconstruction than is excluding
Acknowledgments
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in
the online version, at doi:10.1016/j.ympev.2008.05.015.
References
Altekar, G., Dwarkadas, S., Huelsenbeck, J.P., Ronquist, F., 2004. Parallel Metropoliscoupled Markov chain Monte Carlo for Bayesian phylogenetic inference.
Bioinformatics 20, 407–415.
Bartolomaeus, T., 1995. Structure and formation of the uncini in Pectinaria koreni,
Pectinaria auricoma (Terebellida) and Spirorbis spirorbis (Sabellida): implications
for annelid phylogeny and the position of the Pogonophora. Zoomorphology
115, 161–177.
Bleidorn, C., Podsiadlowski, L., Bartolomaeus, T., 2006. The complete mitochondrial
genome of the orbiniid polychaete Orbinia latreilli (Annelida, Orbiniidae)–a
novel gene order for Annelida and implications for annelid phylogeny. Gene
370, 96–103.
Bleidorn, C., Vogt, L., Bartolomaeus, T., 2003a. A contribution to sedentary
polychaete phylogeny using 18S rRNA sequence data. J. Syst. Zool. Evol. Res.
41, 186–195.
Bleidorn, C., Vogt, L., Bartolomaeus, T., 2003b. New insights into polychaete
phylogeny (Annelida) inferred from 18S rDNA sequences. Mol. Phylogenet. Evol.
29, 279–288.
Bleidorn, C., Vogt, L., Bartolomaeus, T., 2005. Molecular phylogeny of lugworms
(Annelida, Arenicolidae) inferred from three genes. Mol. Phylogenet. Evol. 34,
673–679.
Brown, S., Rouse, G.W., Hutchings, P., Colgan, D., 1999. Assessing the usefulness of
histone H3, U2 snRNA and 28S rDNA in analyses of polychaete relationships.
Aust. J. Zool. 47, 499–516.
Castresana, J., 2000. Selection of conserved blocks from multiple alignments for
their use in phylogenetic analysis. Mol. Biol. Evol. 17, 540–552.
Colgan, D.J., Hutchings, P.A., Braune, M., 2006. A multigene framework for
polychaete phylogenetic studies. Organisms Diversity & Evolution 6, 220–235.
Collins, T.M., Fedrigo, O., Naylor, G.J.P., 2005. Choosing the best genes for the job:
the case for stationary genes in genome-scale phylogenetics. Syst. Biol. 54, 493–
500.
Crowe, T.M., Bowie, R.C.K., Bloomer, P., Mandiwana, T.G., Hedderson, T.A.J., Randi, E.,
Pereira, S.L., Wakeling, J., 2006. Phylogenetics, biogeography and classification
of, and character evolution in, gamebirds (Aves: Galliformes): effects of
character exclusion, data partitioning and missing data. Cladistics 22, 495–532.
Erséus, C., 2005. Phylogeny of oligochaetous Clitellata. Hydrobiologia (535/536),
357–372.
Erséus, C., Kallersjo, M., 2004. 18S rDNA phylogeny of Clitellata (Annelida). Zool. Scr.
33, 187–196.
Fauchald, K., Rouse, G., 1997. Polychaete systematics: past and present. Zool. Scr. 26,
71–138.
Gadagkar, S.R., Kumar, S., 2005. Maximum
likelihood outperforms maximum
parsimony even when evolutionary rates are heterotachous. Mol. Biol. Evol. 22,
2139–2141.
Gatesy, J., O’Grady, P., Baker, R.H.(R.H., 1999. Corroboration among data sets in
simultaneous analysis: hidden support for phylogenetic relationships among
higher level artiodactyl taxa. Cladistics 15, 271–313.
T.H. Struck et al. / Molecular Phylogenetics and Evolution 48 (2008) 628–645
Graybeal, A., 1993. The phylogenetic utility of cytochrome b: lessons from bufonid
frogs. Mol. Phylogenet. Evol. 2, 256–269.
Halanych, K.M., 2004. The new view of animal phylogeny. Annu. Rev. Ecol. Evol.
Syst. 35, 229–256.
Halanych, K.M., Robinson, T.J.(T.J., 1999. Multiple substitutions affect the
phylogenetic utility of cytochrome b and 12S rDNA data: examining a rapid
radiation in leporid (Lagomorpha) evolution. J. Mol. Evol. 48, 369–379.
Hall, B.G., 2005. Comparison of the accuracies of several phylogenetic methods
using protein and DNA sequences. Mol. Biol. Evol. 22, 792–802.
Hall, K.A., Hutchings, P.A., Colgan, D.J., 2004. Further phylogenetic studies of the
Polychaeta using 18S rDNA sequence data. J. Mar. Biol. Ass. UK 84, 949–960.
Hassouna, N., Michot, B., Bachellerie, J.-P., 1984. The complete nucleotide sequence
of mouse 28S rRNA gene. Implications for the process of size increase of the
large subunit rRNA in higher eukaryotes. Nucleic Acids Res. 12, 3563–3583.
Hermans, C.O., 1969. The systematic position of the Archiannelida. Syst. Zool. 18,
85–102.
Hessling, R., 2002. Metameric organisation of the nervous system in developmental
stages of Urechis caupo (Echiura) and its phylogenetic implications.
Zoomorphology 121, 221–234.
Hillis, D.M., Dixon, M.T., 1991. Ribosomal DNA: molecular evolution and
phylogenetic inference. Q. Rev. Biol. 66, 411–453.
Huelsenbeck, J.P., 1995. Performance of phylogenetic methods in simulation. Syst.
Biol. 44, 17–48.
Huelsenbeck, J.P., Ronquist, F., 2001. MRBAYES: Bayesian inference of phylogenetic
trees. Bioinformatics 17, 754–755.
Jördens, J., Struck, T.H., Purschke, G., 2004. Phylogenetic inference regarding
Parergodrilidae and Hrabeiella periglandulata (‘‘Polychaeta”, Annelida) based
on 18S rDNA, 28S rDNA and COI sequences. Journal of Zoological Systematics
and Evolutionary Research 42, 270–280.
Källersjö, M., Albert, V.A., Farris, J.S., 1999. Homoplasy increases phylogenetic
structure. Cladistics 15, 91–93.
Kuhner, M.K., Felsenstein, J., 1994. A simulation comparison phylogeny algorithms
under equal and unequal evolutionary rates. Mol. Biol. Evol. 11, 459–468.
Lake, J.A., 1994. Reconstructing evolutionary trees from DNA and protein
sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91, 1455–1459.
Lockhart, P.J., Steel, M.A., Hendy, M.D., Penny, D., 1994. Recovering evolutionary
trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11,
605–612.
Lopez, P., Forterre, P., Philippe, H., 1999. The root of the tree of life in the light of the
covarion model. J. Mol. Evol. 49, 496–508.
Maddison, W.P., Maddison, D.R. 2004. Mesquite: a modular system for evolutionary
analysis. Version 1.05. http://mesquiteproject.org.
Mallatt, J., Sullivan, J., Winchell, C.J., 2001. The relationship of lampreys to hagfishes:
a spectral analysis of ribosomal DNA sequences. In: Ahlberg, P.E. (Ed.), Major
Events in Early Vertebrate Evolution: Palaeontology, Phylogeny, Genetics, and
Development. Taylor & Francis, London, pp. 106–118.
McHugh, D., 1997. Molecular evidence that echiurans and pogonophorans are
derived annelids. Proc. Natl. Acad. Sci. USA 94, 8006–8009.
McHugh, D., 2000. Molecular phylogeny of the Annelida. Can. J. Zool. 78, 1873–
1884.
McHugh, D., 2005. Molecular systematics of polychaetes (Annelida). Hydrobiologia,
309–318.
Milinkovitch, M.C., LeDuc, R.G., Adachi, J., Farnir, F., Georges, M., Hasegawa, M.,
1996. Effects of character weighting and species sampling on phylogeny
reconstruction: a case study based on DNA sequence data in Cetaceans.
Genetics 144, 1817–1833.
Nicholas, K.B., Nicholas, H.B.J. 1997. GeneDoc: a tool for editing and annotating
multiple sequence alignments. Distributed by the author.
Nickrent, D.L., Parkinson, C.L., Palmer, J.D., Duff, R.J., 2000. Multigene phylogeny of
land plants with special reference to bryophytes and the earliest land plants.
Mol. Biol. Evol. 17, 1885–1895.
Nylander, J.A.A. (2002) MrModeltest. http://morphbank.ebc.uu.se/MR.BAYES.
Pesole, G., Sbisà, E., Preparata, G., Saccone, C., 1992. The evolution of the
mitochondrial D-Loop and the origin of modern man. Mol. Biol. Evol. 9, 587–
598.
Philippe, H., Forterre, P., 1999. The rooting of the universal tree of life is not reliable.
J. Mol. Evol. 49, 509–523.
Posada, D., Buckley, T.R., 2004. Model selection and model averaging in
phylogenetics: advantages of akaike information criterion and
bayesian
approaches over likelihood ratio tests. Syst. Biol. 53, 793–808.
Posada, D., Crandall, K.A., 1998. MODELTEST: testing the model of DNA substitution.
Bioinformatics 14, 817–818.
Posada, D., Crandall, K.A., 2001. Selecting the best-fit model of nucleotide
substitution. Syst. Biol. 50, 580–601.
Purschke, G., 2002. On the ground pattern of Annelida. Org. Divers. Evol. 2, 181–196.
Purschke, G., Jouin, C., 1988. Anatomy and ultrastructure of the ventral pharyngeal
organs of Saccocirrus (Saccocirridae) and Protodriloides (Protodriloidae fam. n.)
645
with remarks on the phylogenetic relationships within Protodrilida (Annelida:
Polychaeta). J. Zool. 215, 405–432.
Regier, J., Shultz, J., 1997. Molecular phylogeny of the major arthropod groups
indicates polyphyly of crustaceans and a new hypothesis for the origin of
hexapods. Mol. Biol. Evol. 14, 902–913.
Ronquist, F., Huelsenbeck, J.P., 2003. MRBAYES 3: Bayesian phylogenetic inference
under mixed models. Bioinformatics 19, 1572–1574.
Rouse, G.W., Fauchald, K., 1997. Cladistics and polychaetes. Zool. Scr. 26, 139–204.
Rouse, G.W., Pleijel, F., 2001. Polychaetes. University Press, Oxford.
Rousset, V., Pleijel, F., Rouse, G.W., Erséus, C., Siddall, M.E., 2007. A molecular
phylogeny of annelids. Cladistics 23, 41–63.
Simon, C., Frati, F., Beckenbach, A.T., Crespi, B., Liu, H., Flook, P., 1994. Evolution,
weighting, and phylogenetic utility of mitochondrial gene sequences and a
compilation of conserved polymerase chain reaction primers. Ann. Entomol.
Soc. Am. 87, 651–701.
Struck, T.H., 2007. Data congruence, paedomorphosis and salamanders. Front. Zool.
4, 22.
Struck, T.H., Halanych, K.M., Purschke, G., 2005. Dinophilidae (Annelida) is most
likely not a progenetic Eunicida: evidence from 18S and 28S rDNA. Mol.
Phylogenet. Evol. 37, 619–623.
Struck, T.H., Hessling, R., Purschke, G., 2002a. The phylogenetic position of the
Aeolosomatidae and Parergodrilidae, two enigmatic oligochaete-like taxa of the
‘Polychaeta’, based on molecular data from 18SrDNA sequences. J. Zool. Syst.
Evol. Res. 40, 155–163.
Struck, T.H., Purschke, G., 2005. The sister group relationship of Aeolosomatidae
and Potamodrilidae (Annelida:
‘‘Polychaeta”)—a molecular phylogenetic
approach based on 18S rDNA and Cytochrome Oxidase I. Zool. Anz. 243,
281–293.
Struck, T.H., Purschke, G., Halanych, K.M., 2006. Phylogeny of Eunicida (Annelida)
and Exploring Data Congruence using a Partition Addition Bootstrap Alteration
(PABA) approach. Syst. Biol. 55, 1–20.
Struck, T.H., Schult, N., Kusen, T., Hickman, E., Bleidorn, C., McHugh, D., Halanych,
K.M., 2007. Annelida phylogeny and the status of Sipuncula and Echiura. BMC
Evol. Biol. 7, 57.
Struck, T.H., Westheide, W., Purschke, G., 2002b. Progenesis in Eunicida
(‘‘Polychaeta,”
Annelida)—separate evolutionary events? Evidence from
molecular data. Mol. Phylogenet. Evol. 25, 190–199.
Sturmbauer, C., Meyer, A., 1992. Genetic divergence, speciation and morphological
stasis in a lineage of African cichlid fishes. Nature 358, 578–581.
Swofford, D.L. 2002. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other
Methods). Sinauer Associates, Sunderland, MA.
Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M., 1996. Chapter 11—phylogenetic
inference. In: Hillis, D.M., Moritz, C., Mable, B.K. (Eds.), Molecular Systematics.
Sinauer Associates, Sunderland, MA, pp. 407–514.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Nucleic
Acids Res. 22, 4673–4680.
Tzetlin, A., Purschke, G., 2006. Fine structure of the pharyngeal apparatus of the
pelagosphera larva in Phascolosoma agassizii (Sipuncula) and its phylogenetic
significance. Zoomorphology 125, 109–117.
Vogt, L., 2002. Testing and weighting characters. Org. Diversity Evol. 2, 319–333.
Westheide, W., 1997. The direction of evolution within the Polychaeta. J. Nat. Hist.
31, 1–15.
Westheide, W., McHugh, D., Purschke, G., Rouse, G., 1999. Systematization of the
Annelida: different approaches. Hydrobiologia 402, 291–307.
Worsaae, K., Nygren, A., Rouse, G., Giribet, G., Persson, J., Sundberg, P., Pleijel, F.,
2005. Phylogenetic position of Nerillidae and Aberranta (Polychaeta, Annelida),
analysed by direct optimization of combined molecular and morphological
data. Zool. Scr. 34, 313–328.
Wuyts, J., Van de Peer, Y., De Wachter, R., 2001. Distribution of substitution rates
and location of insertion sites in the tertiary structure of ribosomal RNA.
Nucleic Acids Res. 29, 5017–5028.
Xia, X., Xie, Z., Salemi, M., Chen, L., Wang, Y., 2003. An index of substitution
saturation and its application. Mol. Phylogenet. Evol. 26, 1–7.
Yang, Z., 1993. Maximum-likelihood estimation of phylogeny from DNA sequences
when substitution rates differ over sites. Mol. Biol. Evol. 10, 1396–1401.
Yang, Z., 1996. Among-site rate variation and its impact on phylogenetic analyses.
Trends Ecol. Evol. 11, 367–372.
Zardoya, R., Meyer, A., 2001. Vertebrate phylogeny: limits of inference of
mitochondrial genome and nuclear rDNA sequence data due to an adverse
phylogenetic signal/noise ratio. In: Ahlberg, P.E. (Ed.), Major Events in Early
Vertebrate Evolution: Palaeontology, Phylogeny, Genetics, and Development.
Taylor & Francis, London, pp. 135–155.
Zwickl, D.J., 2006. Genetic algorithm approaches for the phylogenetic analysis of
large biological sequence datasets under the maximum likelihood criterion.
Ph.D. dissertation., The University of Texas at Austin.