DnaSP manual

1. of nucleotides each on a microcomputer. Furthermore, DnaSP can easily exchange data with other programs, for example programs that perform multiple sequence alignments, phylogenetic tree analysis, or statistical analysis.

System requirements
See Also: Limitations

DnaSP is written in Visual Basic v. 6.0 (Microsoft) and runs on an IBM-compatible PC under 32-bit Windows. The minimum hardware requirements are a processor based on the Intel Pentium or higher, 32 megabytes of RAM, a mouse, and a hard disk. DnaSP also requires Microsoft Windows (versions 95/98/NT/ME/2000/XP). DnaSP can run on a Macintosh (PowerMac) using the SoftWindows or VirtualPC software emulators, or on Linux using VMware or Wine.
NOTE: Under an emulator, the computation speed of the program will decrease.

Limitations
See Also: System requirements

Both the number and the length of the sequences that DnaSP can handle depend mainly on the available memory. DnaSP can use all the RAM available in a computer, both conventional and extended memory. It can also use virtual memory (hard disk space used as memory), although in that case computation is much slower than with RAM. Thus, the program can handle large numbers of sequences of up to thousands of nucleotides each.

Input data file limitations
Maximum number of nucleotides per sequence: Depends on the availabl
2. A fixed nucleotide site between species is a site at which all sequences in one species contain nucleotide variants that are not in the second species.

Silent Substitutions Considered
Substitutions in Coding Regions Only: Only synonymous substitutions (coding region) will be considered.
In Coding and Noncoding Regions: All silent substitutions will be considered (synonymous substitutions and changes in noncoding positions). If the data file does not contain assigned coding regions, all sites will be considered as noncoding positions, i.e. all substitutions will be considered as silent.

How DnaSP estimates Synonymous and Nonsynonymous changes in a codon
In general, DnaSP uses a conservative criterion to decide whether a particular change at a nucleotide site is synonymous or replacement (see the following examples). Nevertheless, the user should check the complex cases: those triplets of sites segregating for several codons, i.e. in highly variable regions.

Example (using the Nuclear Universal Genetic Code):
Species 1 3 6 9 22 15 21 24 27 AGT TCT A CCC AAT A AGT UAU UAU gt p AGE TOT A CCC AGG T AGT UAU UAU AGA TCI CIG CAG ACT TT Q AGA CUG CUG AGG TCI Cite CAG ACT AT Q AGA CUG CUG Species 2 AGG CCT ATT CCC GGA T GGA CUG CUG AGG CCT ATT CCC GGA T GGA CUG CUG AGG CCT ATT CAC GGA T GGT CUG CUU AGG CCT ATT CAC GGA T GGT CUG CUU C
3. BEGIN DATA DIMENSIONS NTAX 4 NCHAR 55 FORMAT MISSING GAP DATATYPE DNA MATRIX
seq 1 ATATACGGGGTTA TTAGA AAAATGTGTGTGTGTT CATGTG
seq 2 ATATAC GGATA ACA AGAATCTATGTCTGCTTTC CATGTG
seq 3 ATATACGGGGATA TTATA AGAATGTGTGTGTGTT CATGTG
seq 4 ATATACGGGGATA GTAGT AAAATGTGTGTGTGTT CATGTG
END
BEGIN CODONS CODPOSSET UNTITLED 13 3 27 30 33 36 39 42 45 48 2 4 28 31 34 37 40 43 46 49 3s 9 Z9 32 25 38 41 44 47 9D Q ENCODE UNIVNUC END

PHYLIP format (Felsenstein 1993)
See Also: Input Data Files; PHYLIP Format Example
References: Felsenstein 1993

DnaSP can recognize interleaved and noninterleaved PHYLIP formats. PHYLIP files must contain two integers in the first line of the file: the first number indicates the number of sequences in the data file, while the second indicates the total number of sites. The sequence data starts in the second line. The sequence name can be up to 10 characters long. The nucleotide sequence starts immediately afterwards (position 11). Nucleotide data can be written in one or more lines. In PHYLIP interleaved formats, the sequence name must be indicated only in the first block.

Special characters
Blank spaces, tabs and carriage returns are ignored, i.e. they can be used to separate blocks of nucleotides. By default DnaSP uses the following symbols: the hyp
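The PHYLIP layout rules described above can be sketched in a few lines of Python. This is only an illustration of the noninterleaved case, not DnaSP's own reader: the function name `parse_phylip_sequential` and the two-sequence example are hypothetical, and the sketch assumes the name always occupies exactly the first 10 characters of the line.

```python
def parse_phylip_sequential(text):
    """Parse a noninterleaved PHYLIP alignment into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    # First line: number of sequences, then total number of sites.
    n_seqs, n_sites = (int(tok) for tok in lines[0].split()[:2])
    records = {}
    pos = 1
    for _ in range(n_seqs):
        # The name occupies the first 10 characters; data starts at position 11.
        name = lines[pos][:10].strip()
        # Blank spaces and tabs inside the nucleotide data are ignored.
        seq = "".join(lines[pos][10:].split())
        pos += 1
        # A sequence may continue over further lines until n_sites is reached.
        while len(seq) < n_sites and pos < len(lines):
            seq += "".join(lines[pos].split())
            pos += 1
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        records[name] = seq
    return records


# Hypothetical two-sequence file; the first sequence spans two lines.
example = "\n".join([
    "2 12",
    "seq_one".ljust(10) + "ATGC ATGC",
    "ATGC",
    "seq_two".ljust(10) + "ATGCATGCATGA",
])
alignment = parse_phylip_sequential(example)
```

Checking the sequence length against the header is a cheap way to catch a file whose second integer does not match the data.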
4. G+C Content
• G+Cn, G+C content at noncoding positions.
• G+Cc, G+C content at coding positions.
• G+C, G+C content in the genomic region.

Haplotype/Nucleotide Diversity
• The number of haplotypes, NHap (Nei 1987, p. 259).
• Haplotype (gene) diversity and its sampling variance (Nei 1987, equations 8.4 and 8.12, but replacing 2n by n).
• Nucleotide diversity, Pi (π), the average number of nucleotide differences per site between two sequences (Nei 1987, equations 10.5 or 10.6), and its sampling variance (Nei 1987, equation 10.7).
• The average number of nucleotide differences, k (Tajima 1983, equation A3).
• Theta (per gene or per site) from Eta (η) or from S (Watterson 1975, equation 1.4a; Nei 1987, equation 10.3). Theta (θ) = 4Nμ for an autosomal gene of a diploid organism (N and μ are the effective population size and the mutation rate per gene, or per site, per generation, respectively). Eta (η) is the total number of mutations, and S is the number of segregating (polymorphic) sites.

Neutrality tests
• Tajima's D (Tajima 1989, equation 38).
• Fu and Li's D (Fu and Li 1993, p. 700, bottom).
• Fu and Li's F (Fu and Li 1993, p. 702; see also Simonsen et al. 1995, equation 10).
• Fu's Fs (Fu 1997, equation 1).
• Strobeck's S (Strobeck 1987; see also Fu 1997).
• DnaSP also provides the probability of obtaining a sample with a number of haplotypes equal to the number observed.

More information in the specific modules
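As a rough illustration of the diversity measures listed above, the following Python sketch computes S, η, k, π and the Watterson θ for a small alignment. It is not DnaSP's implementation: the function name and example data are hypothetical, k is taken as the plain average over all pairs of sequences, η is counted as the number of variants at a site minus one, and columns containing anything other than A, C, G or T are skipped.

```python
from itertools import combinations


def polymorphism_summary(seqs):
    """Return S, eta, k, pi and Watterson's theta for aligned sequences."""
    n = len(seqs)
    # Keep only columns free of gaps or missing data (assumption: ACGT only).
    columns = [col for col in zip(*seqs) if all(base in "ACGT" for base in col)]
    n_sites = len(columns)
    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for col in columns if len(set(col)) > 1)
    # eta: total number of mutations, counted here as (variants - 1) per site.
    eta = sum(len(set(col)) - 1 for col in columns)
    # k: average number of nucleotide differences between pairs of sequences.
    pairs = list(combinations(range(n), 2))
    total_diffs = sum(
        sum(1 for col in columns if col[i] != col[j]) for i, j in pairs
    )
    k = total_diffs / len(pairs)
    # pi: k expressed per site.
    pi = k / n_sites
    # a1 = sum of 1/i for i = 1 .. n-1, the denominator of Watterson's estimator.
    a1 = sum(1.0 / i for i in range(1, n))
    return {
        "S": seg_sites,
        "eta": eta,
        "k": k,
        "pi": pi,
        "theta_per_gene": seg_sites / a1,
        "theta_per_site": seg_sites / a1 / n_sites,
    }


# Hypothetical sample of four sequences, five sites each.
stats = polymorphism_summary(["ATGCA", "ATGCT", "ATGGA", "ATGCA"])
```

With only two variants per segregating site, S and η coincide, as they do in this example.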
5. Note that DnaSP has not used the simplification indicated in Nei and Miller (1990, equation 25), i.e. performing the Jukes and Cantor (1969) correction directly on Pi (π) (Nei 1987, equation 10.5). Nevertheless, for low levels of polymorphism both methods give similar estimates.
• Theta values (per site) from Eta (η), i.e. the Watterson estimator (Watterson 1975, equation 1.4a, but on a base-pair basis; Nei 1987, equation 10.3). See how DnaSP estimates Synonymous and Nonsynonymous changes in a codon. Note that the number of mutations might differ from the number of synonymous and nonsynonymous differences obtained in each pairwise comparison (see below). Theta values will not be reported in some cases where codons might differ by multiple changes.

The DnaSP output also shows the following.
For each sequence:
• The total number of Synonymous (SS) and Nonsynonymous (NSS) sites.
For each pair of sequences:
• The total number of Synonymous, Nonsynonymous and silent sites.
• The total number of Synonymous, Nonsynonymous and silent differences.
• The estimates of Ka (the number of nonsynonymous substitutions per nonsynonymous site) and Ks (the number of synonymous, or silent, substitutions per synonymous, or silent, site) (Nei and Gojobori 1986, equations 1-3).

Assign codons
To assign noncoding and coding (protein) regions in a particular DNA sequence, you should use the Assign Coding Regions command. Geneti
6. Gilbert 1977; Sanger et al. 1977) until 1990, the use of DNA sequence data had had little impact on population genetics. This is because the effort, in terms of both money and time, required to obtain DNA sequence data from a relatively large number of alleles was substantial. The introduction of the polymerase chain reaction (PCR) (Saiki et al. 1985, 1988), which allows direct sequencing of PCR products and therefore avoids their cloning, has changed the situation. Undoubtedly, this has produced a revolutionary change in population genetics. Although at present population studies at the DNA sequence level are still scarce and primarily carried out in Drosophila (for example, McDonald and Kreitman 1991; Schaeffer and Miller 1993; Rozas and Aguadé 1994), they will certainly increase in the future.

DnaSP (DNA Sequence Polymorphism) is a software package addressed to molecular population geneticists. It can compute several measures of DNA sequence variation within and between populations (in noncoding, synonymous or nonsynonymous sites), as well as gene flow, gene conversion (Betrán et al. 1997), recombination and linkage disequilibrium parameters. In addition, DnaSP performs some neutrality tests: the Hudson, Kreitman and Aguadé (1987), the Tajima (1989), the McDonald and Kreitman (1991), and the Fu and Li (1993) tests. DnaSP takes advantage of the Microsoft Windows capabilities, so that it can handle a large number of sequences of thousands
7. Prepare Submission for EMBL/GenBank Databases
Tools: Coalescent Simulations; HKA test, Direct Mode; Discrete Distributions; Tests of Independence: 2 x 2 table; Evolutionary Calculator
Menu Commands: DnaSP user interface; File Menu; Data Menu; Display Menu; Analysis Menu; Overview Menu; Tools Menu; Generate Menu; Window Menu; Help Menu
Citation; Distribution Policy and Updates; Acknowledgements; References

What DnaSP can do
Abstracts: DnaSP v. 1.0; DnaSP v. 2.0; DnaSP v. 3.0; DnaSP v. 4.0

DnaSP (DNA sequence polymorphism) is an interactive computer program for the analysis of DNA polymorphism from nucleotide sequence data. The program, addressed to molecular population geneticists, calculates several measures of DNA sequence variation within and between populations (with or without the sliding window method; in noncoding, synonymous or nonsynonymous sites), linkage disequilibrium, recombination, gene flow and gene conversion parameters, and some neutrality tests (Fu and Li's; Hudson, Kreitman and Aguadé's; McDonald and Kreitman's; and Tajima's tests). DnaSP can also conduct computer simulations based on the coalescent process.

What DnaSP can not do
DnaSP cannot align sequences. Several available programs can do this; for example, you can perform the multiple alignment with CLUSTAL W (Thompson et al. 1994). This program produces an output (multiple aligned sequences in NBRF
8. CCC CTT CTT GGT GG ETT AAC CTT CTA AAT TTA CCA CTA CTA AGT GB CTA AAC CTA CTT AAC TTN CCT CTA CTT GGA GGT
Close Outgroup: CTT AAT GTC CTT A T TTT CCT OTT OTT GCA AGT
Distant Outgroup: CTT AAT CTT GTT AAT TTT CCT CTG CTA CCA Ter

One Species with an Outgroup (Close Outgroup)
Codon 1 2 3: CTT -> CTA. Polymorphic synonymous change (U -> U)
Codon 4 5 6: AAC. Monomorphic codon
Codon 7 8 9: CTT <-> CTA. Ambiguous Change
Codon 10 11 12: CTT -> CTA. Polymorphic synonymous change (U -> U)
Codon 13 14 15: AAT -> AAC. Polymorphic synonymous change (U -> P)
Codon 16 17 18: Not analyzed. Codon with missing data
Codon 19 20 21: Not analyzed. Multiple substitutions
Codon 22 23 24: CTT -> CTA. Polymorphic synonymous change (U -> U)
Codon 25 26 27: CTT -> CTA. Polymorphic synonymous change (U -> U)
Codon 28 29 30: Not analyzed. Multiple substitutions
Codon 31 32 33: GGT. Monomorphic codon

One Species with two Outgroups
Codon 1 2 3: CTT -> CTA. Polymorphic synonymous change (U -> U)
Codon 4 5 6: AAT -> AAC. Fixed synonymous change (U -> P)
Codon 7 8 9: There are two changes. CTT -> CTA, Polymorphic synonymous change (U -> U); GTT -> CTT, Fixed nonsynonymous change (Val -> Leu)
Codon 10 11 12: There are two changes. CTT -> CTA, Polymorphic synonymous change (U -> U); GTT -> CTT, Fixed nonsynonymous change (Val -> Leu)
Codon 13 14 15 Codon 19 2
9. Code used for translation. There are 9 pre-defined Genetic codes:
• Nuclear universal (Table 1)
• Mitochondrial of mammals (Table 2)
• Mitochondrial of Drosophila (Table 5)
• Mitochondrial of Yeast (Table 3)
• Mitochondrial of Mold, Protozoan and Coelenterate (Table 4)
• Mitochondrial of Echinoderm (Table 9)
• Mitochondrial of Flatworm (Table 14)
• Nuclear of Ciliate, Dasycladacean and Hexamita (Table 6)
• Nuclear of some Candida species (Table 12)
The GenBank translation table number is indicated in parentheses. More information on the Genetic Codes used by GenBank: http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c

Assign Preferred / Unpreferred Codons Table; Define Sequence Sets; Filter / Remove Positions; Include / Exclude Sequences

Note: This information will be stored if you save, export or update the data file in NEXUS file format.

Define Sequence Sets
This command allows you to define sequence sets (groups of sequences). A sequence set is a group of related sequences that could represent, for example, a population, a species or an outgroup. This allows analyses to be conducted on a specific group of sequences. Sequence set assignments can be stored in NEXUS data files; to do so, use the save, export or update commands.

Define Domain Sets
This command allows you to define domain sets. A domain set is a partial fragment of the multiple alignment that could represent, for example, an exon, a gen
10. Codon Usage Bias; DNA Polymorphism; Fu and Li's and other Tests; Fu and Li's and other Tests with an Outgroup; Tajima's Test

MultiDomain Analysis
See also: Define Domain Sets

DnaSP allows analysing polymorphism data in specific functional regions (see Define Domain Sets), for example exons, introns, etc. It can compute a number of measures of the extent of DNA polymorphism and can also perform some common neutrality tests.

Haplotype/Nucleotide Diversity
• The number of Segregating Sites, S.
• The total number of mutations, Eta (η).
• The number of haplotypes, NHap (Nei 1987, p. 259).
• Haplotype (gene) diversity and its sampling variance (Nei 1987).
• Nucleotide diversity, Pi (π) (Nei 1987), and its sampling variance (Nei 1987, equation 10.7).
• The average number of nucleotide differences, k (Tajima 1983).
• Theta (per gene or per site) from Eta (η) or from S (Watterson 1975; Nei 1987).

Neutrality tests
• Tajima's D (Tajima 1989), and its statistical significance.
• Fu and Li's D (Fu and Li 1993), and its statistical significance.
• Fu and Li's F (Fu and Li 1993), and its statistical significance.
• Fu's Fs (Fu 1997).
• G+Cn, G+C content at noncoding positions.
• G+Cc, G+C content at coding positions.

Concatenated Data File
See Also: Input Data Files

DnaSP allows you to create a concatenated data file (NEXUS format), that is, a big data file containing DNA sequence information from a number of
11. Fs and R2 statistics.
C.V.: Coefficient of variation (see Rogers and Harpending 1992, p. 554).
MAE: Mean Absolute Error (see Rogers et al. 1996, p. 896).

2. Segregating Sites
2.1. Constant Population Size
DnaSP shows, in tabular and graphic form, the distribution of the observed frequency spectrum (the distribution of the allelic frequency at a site; see Tajima 1989a, figure 6) and the expected values in a stable population, i.e. a population with constant size (Tajima 1989a, equation 50).
2.2. Population Growth-Decline
DnaSP shows, in tabular and graphic form, the distribution (at different times and for several sample sizes) of Sn(t), the expected number of segregating sites among n DNA sequences in generation t, and of Sn(t)/a1 (at equilibrium this value is equal to theta) after a population growth or decline (Tajima 1989b, equation 9). Time is measured in units of N generations, where N is the effective population size.
$a_1 = \sum_{i=1}^{n-1} 1/i$; n is the sample size, i.e. the number of nucleotide sequences.
Effective Population size

Fu and Li's and other Tests
See Also: Coalescent Simulations; Graphs Window; Input Data Files; Output
References: Ewens 1972; Fu and Li 1993; Fu 1995; Fu 1997; Kimura 1983; Simonsen et al. 1995; Strobeck 1987; Tajima 1983

This command calculates the statistical tests D and F proposed by Fu and Li (1993) for testing the hypothesis that all mutations are selectively neutral (Kim
12. Li 1993, p. 702, top).

Fay and Wu H test statistic
The H test statistic (Fay and Wu 2000, equations 1-3) is based on the difference between two estimators of θ: θπ (or k, the average number of nucleotide differences between pairs of sequences) and θH (Fay and Wu 2000, equation 3), an estimator based on the frequency of the derived variants.

The number of mutations in external branches
Assuming the infinite sites model, DnaSP calculates the total number of mutations in the external branches of the genealogy as follows: at a given polymorphic site, the number of mutations in external branches is counted as the number of distinct singleton nucleotide variants in the intraspecific data file that are not shared with the outgroup (a singleton mutation is a nucleotide variant that appears only once among the sequences). The total number of mutations in external branches of the genealogy is then computed as the sum, over all polymorphic sites, of the number of mutations in external branches.

Total number of mutations vs. number of segregating sites
The D and F test statistics can also be computed using S, the number of segregating sites, instead of η, the total number of mutations (see Simonsen et al. 1995). Under the infinite sites model (with two different nucleotides per site) both D and F values should be the same (S and η have the same value). However, if there are sites segregating for more than two nucleotides, values of S will be lo
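The counting rule for mutations in external branches, as described above, can be sketched as follows. This is an illustration under stated assumptions rather than DnaSP's code: the function name is hypothetical, a single outgroup sequence is used, and the alignment is assumed to be free of gaps and missing data.

```python
from collections import Counter


def external_branch_mutations(ingroup, outgroup):
    """Count singleton variants in the ingroup that the outgroup does not share."""
    total = 0
    for column, out_base in zip(zip(*ingroup), outgroup):
        counts = Counter(column)
        if len(counts) < 2:
            continue  # monomorphic site within the ingroup: nothing to count
        # A singleton is a variant that appears exactly once among the sequences.
        total += sum(
            1 for base, count in counts.items()
            if count == 1 and base != out_base
        )
    return total


# Hypothetical data: site 2 has the singleton T, site 3 has the singleton T.
ingroup = ["AAT", "AAC", "ATC", "AAC"]
n_external = external_branch_mutations(ingroup, outgroup="AAC")

# A singleton that matches the outgroup is not counted.
n_shared = external_branch_mutations(["A", "T", "T", "T"], outgroup="A")
```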
13. Montserrat Aguadé. DNA variation at the rp49 gene region of Drosophila simulans: evolutionary inferences from an unusual haplotype structure.
Abstract: An approximately 1.3-kb region including the rp49 gene plus its 5' and 3' flanking regions was sequenced in 24 lines of Drosophila simulans (10 from Spain and 14 from Mozambique). Fifty-four nucleotide and 8 length polymorphisms were detected. All nucleotide polymorphisms were silent: 52 in noncoding regions and 2 at synonymous sites in the coding region. Estimated silent nucleotide diversity was similar in both populations (π = 0.016, for the total sample). Nucleotide variation revealed an unusual haplotype structure showing a subset of 11 sequences with a single polymorphism. This haplotype was present at intermediate frequencies in both the European and the African samples. The presence of such a major haplotype in a highly recombining region is incompatible with the neutral equilibrium model. This haplotype structure in both a derived and a putatively ancestral population can be most parsimoniously explained by positive selection. As the rate of recombination in the rp49 region is high, the target of selection should be close to or within the region studied.

References
AKASHI, H., 1995. Inferring weak selection from patterns of polymorphism and divergence at silent sites in Drosophila DNA. Genetics 139: 1067-1076.
AKASHI, H., 1999. Inferring the fitness effects of DNA mutations frompolym
14. R. ROZAS, 1997. DnaSP version 2.0: a novel software package for extensive molecular population genetics analysis. Comput. Applic. Biosci. 13: 307-311.
ROZAS, J., and R. ROZAS, 1999. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15: 174-175.
ROZAS, J., M. GULLAUD, G. BLANDIN and M. AGUADÉ, 2001. DNA variation at the rp49 gene region of Drosophila simulans: Evolutionary inferences from an unusual haplotype structure. Genetics 158: 1147-1155.
ROZAS, J., J. C. SÁNCHEZ-DELBARRIO, X. MESSEGUER and R. ROZAS, 2003. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19: 2496-2497.
SAIKI, R. K., S. SCHARF, F. FALOONA, K. B. MULLIS, G. T. HORN, H. A. ERLICH and N. ARNHEIM, 1985. Enzymatic amplification of β-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230: 1350-1354.
SAIKI, R. K., D. H. GELFAND, S. STOFFEL, S. J. SCHARF, R. HIGUCHI, G. T. HORN, K. B. MULLIS and H. A. ERLICH, 1988. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239: 487-491.
SANGER, F., S. NICKLEN and A. R. COULSON, 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74: 5463-5467.
SCHAEFFER, S. W., and E. L. MILLER, 1993. Estimates of linkage disequilibrium and the recombination parameter determined from segregating nucleotide sites in t
15. SWOFFORD and D. R. MADDISON, 1997. NEXUS: an extendible file format for systematic information. System. Biol. 46: 590-621.
MAXAM, A. M., and W. GILBERT, 1977. A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74: 560-564.
McDONALD, J. H., and M. KREITMAN, 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652-654.
MORTON, B. R., 1993. Chloroplast DNA codon use: Evidence for selection at the psb A locus based on tRNA availability. J. Mol. Evol. 37: 273-280.
NEI, M., 1973. Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321-3323.
NEI, M., 1982. Evolution of human races at the gene level, pp. 167-181. In B. Bonne-Tamir, T. Cohen and R. M. Goodman (eds.), Human genetics, part A: The unfolding genome. Alan R. Liss, New York.
NEI, M., 1987. Molecular Evolutionary Genetics. Columbia Univ. Press, New York.
NEI, M., and T. GOJOBORI, 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3: 418-426.
NEI, M., and J. C. MILLER, 1990. A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics 125: 873-879.
OSAWA, S., T. H. JUKES, K. WATANABE and A. MUTO, 1992. Recent evidence for evolution of the genetic code. Microbiol. Rev. 56: 229-264.
PRESS, W. H., S. A. TEUKOLSKY, W. T. VETTERLING and B. P. FLANNERY, 1992. Numerical r
16. alignment gaps or missing data in any data file are not used, i.e. these sites (or codons) are completely excluded.

Output
The program estimates the following measures.
From the intraspecific data file:
• Nucleotide diversity, Pi (π) (Nei 1987, equation 10.5).
• Nucleotide diversity with Jukes and Cantor correction, Pi (JC) (Lynch and Crease 1990, equations 1-2).
• Theta (per site) from Eta (η), the total number of mutations (Watterson 1975, equation 1.4a, but on a base-pair basis; Nei 1987, equation 10.3). Theta values will not be reported in some cases where codons might differ by multiple changes; this will be indicated by n.a.
From both data files:
• Nucleotide divergence: the average proportion of nucleotide differences between populations or species, K or Dxy (Nei 1987, equation 10.20).
• K (JC): the average number of nucleotide substitutions per site between populations or between species, with Jukes and Cantor correction. Between populations: Dxy (Nei 1987, equation 10.20). Between species: K (Nei 1987, equation 5.3, but computed as the average of all comparisons between sequences of data files 1 and 2).
Nucleotide diversity and divergence are estimated separately for synonymous and nonsynonymous sites using Nei and Gojobori (1986), equations 1-3.

Implementation
Estimation of nucleotide diversity and of divergence by the Jukes and Cantor (1969) correction is performed using the simplification indicated in Ne
17. commands might be desirable. The total number of synonymous and nonsynonymous sites for a set of sequences is estimated as the average of the number of synonymous and nonsynonymous sites of all sequences (these values are used for all sequences). Note that in the Synonymous and Nonsynonymous Substitutions command the total number of synonymous and nonsynonymous sites is computed for every pairwise comparison. As a result, nucleotide diversity estimates in synonymous, nonsynonymous and silent sites obtained with the present command and with the Synonymous and Nonsynonymous Substitutions command could be slightly different.

Sites Considered
Silent (Synonymous and Noncoding): The analysis is limited to both synonymous sites and noncoding positions.
Only Synonymous Sites: The analysis is restricted to synonymous sites.
Only Nonsynonymous Sites: The analysis is restricted to nonsynonymous sites.
All (total) sites: All sites will be used (excluding those sites with gaps or missing data).
The synonymous and nonsynonymous sites and changes will be computed if the data file contains sequences with assigned coding regions (more help in Assign Coding Regions and Assign Genetic Code).
Note: See how DnaSP estimates the number of Synonymous and Nonsynonymous changes in a codon.

Sites
Silent (Synonymous and Noncoding): Indicates both synonymous sites in the coding region and noncoding positions.
Synonymous Sites: Indicates sites in the coding regi
18. dialog box with information about the authors and the DnaSP version number.

Citation
Abstracts: DnaSP v. 1.0; DnaSP v. 2.0; DnaSP v. 3.0; DnaSP v. 4.0
The suggested citation for the current DnaSP version is:
ROZAS, J., SÁNCHEZ-DELBARRIO, J. C., MESSEGUER, X. and ROZAS, R. (2003). DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 19: 2496-2497.

Distribution Policy
Copies of DnaSP can be freely distributed to academic users for research. This software is provided as is, without warranty of any kind. For other uses, please contact Julio Rozas (E-mail: jrozas@ub.edu). Queries, comments and suggestions may be addressed via e-mail to Julio Rozas.

Availability
The program, the help file and some examples of the different data files are available from http://www.ub.es/dnasp. DnaSP updates and bug reports will be advertised on the DnaSP Web page, on the Departament de Genètica, Universitat de Barcelona Web page (http://www.ub.es/dnasp), and in The Software Biocatalog (Molecular Evolution / Population Genetics): http://www.ebi.ac.uk/biocat.
Copyright © 1995-2006 by Julio Rozas & Universitat de Barcelona.

Acknowledgements
Our thanks are due to the following, who made comments and suggestions or tested the DnaSP program with their data. In particular, we would like to thank those who are or were in the Molecular Evolutionary Genetics group at the Departament de Genètic
19. DnaSP Version 4.5 Help Contents
Running DnaSP, press F1 to view the context-sensitive help.

What DnaSP can do
Introduction
System requirements
Input and Output: Input Data Files (FASTA format; MEGA format; NBRF/PIR format; NEXUS format; PHYLIP format); Open Multiple Data Files; Open Unphase/Genotype Data; Output; UCSC Browser
Data: Data Menu; Define Sequence Sets; Define Domain Sets; Filter / Remove Positions; Include / Exclude Sequences
Analysis: Polymorphic Sites; DNA Polymorphism; InDel (Insertion-Deletion) Polymorphism; DNA Divergence Between Populations; Polymorphism and Divergence; Polymorphism and Divergence in Functional Regions; Synonymous and Nonsynonymous Substitutions; Codon Usage Bias; Preferred and Unpreferred Synonymous Substitutions; Gene Conversion; Gene Flow and Genetic Differentiation; Linkage Disequilibrium; Recombination; Population Size Changes; Fu and Li's (and other) Tests; Fu and Li's (and other) Tests with an Outgroup; HKA (Hudson, Kreitman and Aguadé's) Test; McDonald and Kreitman's Test; Tajima's Test
Overview: Intraspecific Data; MultiDomain Analysis
Generate: Concatenated Data File; Shuttle to DNA Slider; ms (Dick Hudson) Data File Format; Polymorphic Sites File; Haplotype Data File; Translate to Protein Data File; Reverse Complement Data File
20. • Nei (1973): Gst (Nei 1973, equation 9) and Nm. DnaSP calculates Gst as equations 5 and 6 in Hudson et al. (1992a).
From nucleotide sequence data information:
• Hudson et al. (1992b): Fst (equation 3) and Nm (equation 4).
• Lynch and Crease (1990): Nst (equation 36) and Nm. The Nst estimator is almost the same as Fst (Hudson et al. 1992b); the difference is that Nst uses the Jukes and Cantor (1969) correction.
• Nei (1982): DeltaST (δst, equation 4), GammaST (γst, equation 5) and Nm.
Note: DnaSP calculates PiS (πs), the average of Pi (π) over populations, using Nei (1982) equation 2, i.e. making use of the relative size of each population.

The estimates of Nm are based on the island model of population structure (Wright 1951):
Haploids: $F_{ST}\ (\gamma_{ST}, N_{ST}) = 1/(1 + 2Nm)$
Diploids (autosome): $F_{ST}\ (\gamma_{ST}, N_{ST}) = 1/(1 + 4Nm)$
Diploids (X chromosome): $F_{ST}\ (\gamma_{ST}, N_{ST}) = 1/(1 + 3Nm)$
Diploids (Y chromosome): $F_{ST}\ (\gamma_{ST}, N_{ST}) = 1/(1 + Nm)$
Effective Population size
Note: n.a., not applicable. When the proportion of differences is equal to or higher than 0.75, the Jukes and Cantor correction cannot be computed.

Select Statistic
Use this command to choose the statistic to be included in the Genetic Distance (MEGA or PHYLIP) data files. That file allows subsequent phylogenetic analyses with the MEGA or PHYLIP software.
Precision Value: Number of decimals included in the distance data files.
Note: DnaSP can not rea
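Solving the island-model relations above for Nm gives Nm = (1/Fst - 1)/c, where c is 2, 4, 3 or 1 depending on the genomic type. A minimal Python sketch of that rearrangement follows; the function name and the dictionary keys are hypothetical labels chosen for illustration, and the same formula applies when γst or Nst is supplied in place of Fst.

```python
# Coefficient c in Fst = 1 / (1 + c * Nm) for each genomic type.
ISLAND_MODEL_COEFFICIENT = {
    "haploid": 2,
    "autosome": 4,
    "x_chromosome": 3,
    "y_chromosome": 1,
}


def nm_from_fst(fst, genomic_type="autosome"):
    """Return Nm under the island model, from Fst (or gamma-st, or Nst)."""
    if not 0.0 < fst <= 1.0:
        # Nm is undefined for Fst <= 0 and meaningless above 1.
        raise ValueError("Fst must be in the interval (0, 1]")
    c = ISLAND_MODEL_COEFFICIENT[genomic_type]
    return (1.0 / fst - 1.0) / c


nm_autosome = nm_from_fst(0.2, "autosome")  # (5 - 1) / 4
nm_haploid = nm_from_fst(0.5, "haploid")    # (2 - 1) / 2
```

Rejecting non-positive Fst values explicitly avoids a division by zero and keeps negative estimates from silently producing a negative Nm.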
21. gaps are not considered.

Positions with Alignment Gaps option
Excluded: These sites are removed.
Included: These sites are included.
Included if there is a polymorphism: These sites are included if there is a polymorphism.

Positions option
Remove Non-Selected Positions: Non-selected positions will be permanently removed from the active data.
Generate a NEXUS File with selected: Selected positions will be included in a NEXUS data file. The active data file will keep all the positions.

Include / Exclude Sequences
DnaSP allows you to analyse a subset of the sequences of the original data file. This command allows you to include or exclude sequences from the analysis. All analyses will be performed with the information of only the included sequences. Consequently, if you use the Save/Export Data As command, the saved/exported data file will not contain the excluded sequences.
Note: DnaSP also allows you to analyse a subset of sequences by using the Define Sequence Sets command.

Options
There are two options that deal with alignment gaps. Suppose the following original data file:
Seq1 ATCTCTTAGGGTCGATTTGTTG GTATTTAA
Seq2 AT TCTTATTTTCGA TTGTTG GTATITAA
Seq3 ATCGCTTA TCGATTTGT TGTATTTAA
Seq4 ATCTCITA TCGATTIGITG GTATTTAA
Seq5 ATCTCTTA TCGATTTGITG GTATITAA
DnaSP will not use any site with alignment gaps or missing data. Thus, if you are using the complete data f
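The effect of excluding sites with alignment gaps can be sketched as a column filter over the alignment. The snippet below is an illustration only: the function name is hypothetical, and the symbols treated as gap or missing data ("-", "?" and "N") are an assumption, since the symbols DnaSP uses are not given in this passage.

```python
def drop_gapped_sites(seqs, excluded_symbols="-?N"):
    """Remove every alignment column that contains a gap or missing-data symbol."""
    kept_columns = [
        column for column in zip(*seqs)
        if not any(symbol in excluded_symbols for symbol in column)
    ]
    # Transpose back from columns to one string per sequence.
    return ["".join(row) for row in zip(*kept_columns)] if kept_columns else [
        "" for _ in seqs
    ]


# Hypothetical three-sequence alignment; columns 2 and 4 contain a gap.
cleaned = drop_gapped_sites(["ATC-G", "A-CTG", "ATCTG"])
```

Because whole columns are dropped, excluding a gapped sequence from the analysis can bring back sites that were unusable in the complete data file.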
22. nucleotide distance after removing the alignment gaps, i.e. nucleotide distance (see Nucleotide Distance in the Linkage Disequilibrium command). Note that the average length is equal to the average nucleotide distance + 1.

The minimum number of recombination events, RM (Hudson and Kaplan 1985)
This parameter indicates the minimum number of recombination events in the history of the sample (note that RM underestimates the total number of recombination events). RM is obtained using the four-gamete test (see Figure 1 and Appendix 2 in Hudson and Kaplan 1985). From RM it is possible to estimate R by computer simulations (Coalescent Simulations).
The output shown by DnaSP is:
• The RM value.
• The list of all the pairs of sites with the four gametic types.
• The list of all RM pairs of sites where it is possible to assign at least one recombination event.
Note: for the present analysis, sites segregating for three or four nucleotides are completely excluded.

ZZ test statistic (Rozas et al. 2001)
This test statistic could be useful in detecting intragenic recombination (see Linkage Disequilibrium).

Statistical significance by the coalescent
DnaSP can provide the confidence intervals of the RM statistic by computer simulations using the coalescent algorithm (see Computer Simulations).
Effective Population size

Other methods
The recombination parameter can also be estimated by the method described in Hey and Wakele
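To make the four-gamete test concrete, the sketch below lists the pairs of sites showing all four gametic types and then counts the largest set of non-overlapping intervals among them, which is the usual way of obtaining the Hudson and Kaplan minimum. It is an assumption-laden illustration, not DnaSP's code: the function names are hypothetical, the alignment is assumed to be gap-free, and sites with more than two nucleotides are skipped as the text describes.

```python
from itertools import combinations


def four_gamete_pairs(seqs):
    """Return pairs of biallelic sites (0-based) showing all four gametic types."""
    columns = list(zip(*seqs))
    # Sites segregating for three or four nucleotides are excluded.
    biallelic = [i for i, col in enumerate(columns) if len(set(col)) == 2]
    pairs = []
    for i, j in combinations(biallelic, 2):
        gametes = {(columns[i][s], columns[j][s]) for s in range(len(seqs))}
        if len(gametes) == 4:
            pairs.append((i, j))
    return pairs


def minimum_recombination_events(pairs):
    """Count the largest set of intervals that overlap at most at an endpoint."""
    count = 0
    last_right = -1
    # Greedy choice: always take the interval that ends first.
    for left, right in sorted(pairs, key=lambda pair: pair[1]):
        if left >= last_right:
            count += 1
            last_right = right
    return count


# Hypothetical sample: sites 0 and 1 carry the gametes AA, AT, TA and TT.
sample = ["AAA", "ATA", "TAA", "TTT"]
incompatible = four_gamete_pairs(sample)
rm = minimum_recombination_events(incompatible)
```

Sorting by the right endpoint means an interval that fully contains a shorter one is never preferred over it, which mirrors the pruning step of the original procedure.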
23. of Drosophila subobscura. Maximum likelihood estimate of φ for this data set is 0.9918, which represents an average conversion tract length of 122 bp.

DnaSP version 1.0
Computer Applications in the Biosciences 11: 621-625 (1995)
Julio Rozas and Ricardo Rozas
DnaSP, DNA sequence polymorphism: an interactive program for estimating population genetics parameters from DNA sequence data.
Abstract: DnaSP (DNA sequence polymorphism) is an interactive computer program for the analysis of DNA polymorphism from nucleotide sequence data. The program, addressed to molecular population geneticists, calculates several measures of DNA sequence variation within and between populations, linkage disequilibrium parameters and Tajima's D statistic. The program, which is written in Visual Basic v. 3.0 and runs on an IBM-compatible PC under Windows, can handle a large number of sequences of up to thousands of nucleotides each.

DnaSP version 2.0
Computer Applications in the Biosciences 13: 307-311 (1997)
Julio Rozas and Ricardo Rozas
DnaSP version 2.0: a novel software package for extensive molecular population genetics analysis.
Abstract: Motivation: Several methods in molecular population genetics have recently been described to estimate the amount and pattern of the DNA polymorphism in natural populations, and also to test the neutral theory of molecular evolution. These methods are essential for understanding the molecular evolutionary pro
24. probability and the expected value of obtaining a particular number of haplotypes (Ewens 1972, equations 19, 21-24). See also the Fu and Li's Tests command.

Tests of Independence: 2 x 2 table
This command allows testing independence in 2 x 2 tables (contingency tables). DnaSP performs three types of independence tests: Fisher's exact test, the Chi-square test (standard and using Yates' correction), and the G test (standard and using Williams' or Yates' corrections); see Sokal and Rohlf (1981). The probability associated with a particular chi-square or G value (with 1 degree of freedom) is obtained by the trapezoidal method of numerical integration.

Evolutionary Calculator
This command displays a calculator that computes some commonly used molecular evolutionary parameters:
$a_1 = \sum_{i=1}^{n-1} 1/i$, where n is the number of nucleotide sequences (Watterson 1975; Tajima 1989, equation 3).
$a_2 = \sum_{i=1}^{n-1} 1/i^2$, where n is the number of nucleotide sequences (Watterson 1975; Tajima 1989, equation 4).
$K = -\frac{3}{4} \ln\left(1 - \frac{4p}{3}\right)$ is the Jukes and Cantor (1969) correction, where p is the proportion of different nucleotides between two sequences.

DnaSP user interface
DnaSP has a standard Microsoft Windows user interface, including the menu bar, pull-down menus, dialog boxes, and windows with scroll bars. The DnaSP menu bar displays the following pull-down menu titles: File, Data, Display, Analysis, Overview, Generate, To
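The three calculator quantities are simple enough to reproduce directly. The sketch below evaluates a1, a2 and the Jukes and Cantor correction in Python; the function names are hypothetical, and returning `None` when p is 0.75 or more is a choice made here because the logarithm is undefined at that point.

```python
import math


def a1(n):
    """Sum of 1/i for i = 1 .. n-1, with n the number of sequences."""
    return sum(1.0 / i for i in range(1, n))


def a2(n):
    """Sum of 1/i^2 for i = 1 .. n-1, with n the number of sequences."""
    return sum(1.0 / i ** 2 for i in range(1, n))


def jukes_cantor(p):
    """K = -(3/4) ln(1 - 4p/3), for a proportion p of differing nucleotides."""
    if p >= 0.75:
        return None  # the argument of the logarithm is zero or negative
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


a1_four = a1(4)             # 1 + 1/2 + 1/3
a2_three = a2(3)            # 1 + 1/4
k_corrected = jukes_cantor(0.3)
```

For small p the corrected value stays close to p, and it grows without bound as p approaches 0.75.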
25. significance of D and F test statistics. Note that these values were obtained by computer simulations considering that the true value of θ falls into the interval [2, 20], so the critical values are not applicable when the true value of θ is not in that interval. DnaSP will not determine the critical values for sample sizes larger than 300. For sample sizes 100-300, DnaSP uses the same critical values as for n = 100; the reason is that the critical values increase or decrease with ln n, so that when n is large the curve of critical values becomes flat (Fu, personal communication). n.d., not determined; P < 0.10; P < 0.05; P < 0.02. Statistical significance by the coalescent: DnaSP can also provide the confidence intervals of Fu and Li's D and F, and of Fu's Fs, by computer simulations using the coalescent algorithm (see Coalescent Simulations). Sliding window option: This option computes both D and F values and their statistical significance by the Sliding Window method. The output of the analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph, D and F values (Y axis) can be plotted against the nucleotide position (X axis). Fu and Li's (and other) Tests with an Outgroup. See Also: Coalescent Simulations, Graphs Window, Input Data Files, Output. References: Fay and Wu 2000; Fu and Li 1993; Fu 1995; Kimura 1983; Simonsen et al. 1995; Tajima 1983; T
26. the chromosome and the physical position of the reference sequence (the first one). If you don't know this information, you can obtain it by: performing a BLAT search (Kent et al. 2002) against the appropriate UCSC genome; or searching the appropriate UCSC genome by key words and importing the output information to DnaSP. Genomic position assignations can be stored in NEXUS data files (for that, use the save/export or update commands). Data Menu. Format: Use this command to indicate whether the data file contains sequences of DNA or RNA; the chromosomal (genomic) type where the region is located (Autosome, X chromosome, Y chromosome, Z chromosome, W chromosome, prokaryotic, mitochondrial, chloroplast); or the organism's genomic state (Diploid or Haploid). Gaps in Sliding Window: This command is used to exclude or include sites with alignment gaps in the length of the windows (Sliding Window method). Gaps in Sequence Sets: Use this command to choose how to treat alignment gaps in sequence sets. Segregating Sites/Mutations: Use this command to select between the number of segregating sites and the total number of mutations in computing some parameters of the Fu and Li's (and other) Tests, Fu and Li's (and other) Tests with an Outgroup, and Tajima's Test. Assign Coding Regions: Use this command to assign noncoding and coding (protein) regions to a particular data file. Assign Genetic Code: Use this command to assign the Genetic
27. the total number of mutations (Watterson 1975, equation 1.4a, but on a base pair basis; Nei 1987, equation 10.3). Theta values will not be reported in some cases where codons might differ by multiple changes; this feature will be indicated by n.a. From both data files: Nucleotide divergence, the average proportion of nucleotide differences between populations or species, K or Dxy (Nei 1987, equation 10.20). K(JC), the average number of nucleotide substitutions per site between populations or between species, with the Jukes and Cantor correction. Between populations: Dxy (Nei 1987, equation 10.20). Between species: K (Nei 1987, equation 5.3, but computed as the average of all comparisons between sequences of data files 1 and 2). Estimation of nucleotide diversity and divergence separately for synonymous and nonsynonymous sites is performed using Nei and Gojobori (1986), equations 1-3. Implementation: Estimation of nucleotide diversity and of divergence with the Jukes and Cantor (1969) correction is performed using the simplification indicated in Nei and Miller (1990), equation 25. That is, the correction of Pi and of K is performed directly on the uncorrected value, and not in each pairwise comparison of two sequences. Nevertheless, for low levels of polymorphism and of divergence, both methods give similar estimates. For high polymorphism and divergence levels, the use of the DNA Polymorphism and Synonymous and Nonsynonymous Substitutions
28. to 20 characters. Blank spaces and tabs are not allowed; underlines should be used to indicate a blank space. Example of NBRF/PIR format: DL seq 1 Comment on seq 1 example file EX N1 NBR ATATACGGGG TTA TTAG A AAAAT DL seq 2 Comment seq 2 GIGIGIGI amp I ITTITTTITITTGO ASG ATATAC GG ATA TTAC A AGAAT 2 DL seq 3 Comment seq 3 CTATGTCTGC ITTCTTITITIC ATSTG ATATACGGGG ATA TTAT A AGAAT DL seq 4 Comment seq 4 GTGTGTGTGT ITTITTTITIC ATSTG ATATACGGGG ATA GTAG I AAAAT GTGTGTGTGT ITITTTITIG ATGTG NEXUS File format (Maddison et al. 1997). See Also: Input Data Files, NEXUS Format Example 1, NEXUS Format Example 2. References: Maddison et al. 1997. DnaSP can read NEXUS file formats. These files are standard text files that have been designed (Maddison et al. 1997) to store systematic data. DnaSP can read NEXUS files (both old and new versions; Maddison et al. 1997) containing DNA or RNA sequence data. The file can contain one or more sequences; in the latter case the homologous nucleotide sequences must be aligned (i.e. the sequences must have the same length). Nucleotide sequences should be entered using the letters A, T or U, C, or G (in lower case, upper case, or any mixture of lower and upper case). Blank spaces and Tabs are ignored (i.e. they can be used to separate blocks of nucleotides). Carriage returns are also ignored in non interle
29. 0 21; Codon 22 23 24; Codon 25 26 27; Codon 28 29 30; Codon 31 32 33. Results: Not analyzed (codon with alignment gaps); Not analyzed (multiple substitutions); CTT > CTA, polymorphic synonymous change (U > U); CTT <> CTA, ambiguous polymorphic change (ancestral polymorphism); Not analyzed (multiple substitutions); Not analyzed (ambiguous fixed change). Abbreviations: MRCA, most recent common ancestor; U > U, unpreferred to unpreferred change; U > P, unpreferred to preferred change; P > U, preferred to unpreferred change; P > P, preferred to preferred change; Syn, synonymous change; NonSyn, nonsynonymous change. Table for Preferred and Unpreferred Codons. See Also: Preferred and Unpreferred Synonymous Substitutions analysis. References: Akashi 1995; Akashi and Schaeffer 1997; Duret and Mouchiroud 1999; Kanaya et al. 1999. Use this command to assign the specific table of preferred/unpreferred synonymous codons for the Preferred and Unpreferred Synonymous Substitutions analysis. There are 8 predefined tables; nevertheless, the user can define their own table. That information can be included in the NEXUS data file (for that, use the save/export or update commands). Create New Table: This button allows the user to define a new table of synonymous codon preferences. The table will be linked to a particular Genetic code. Codes: P preferred codon unknown preference none unpre
30. 2 equation 1. The Coalescent process: The computer simulations are based on the coalescent process for a neutral infinite-sites model, assuming a large constant population size (Hudson 1990). DnaSP uses the ran1 routine (Press et al. 1992) as a source of uniform random deviates (i.e. random numbers uniformly distributed within a specified range). No recombination: For no recombination, DnaSP generates the genealogy of the alleles using a modification of the routine make_tree (Hudson 1990). Intermediate level: For intermediate levels of recombination, the genealogy is generated as described in Hudson (1983, 1990). Free Recombination: For free recombination, DnaSP generates an independent genealogy for each segregating site. At each variable site, the number of sequences having one particular nucleotide variant (only two nucleotide variants per segregating site) is randomly obtained with probability proportional to their expected frequency (Tajima 1989, equation 50). Simulations. Given Theta (per gene): Mutations along the lineages are Poisson distributed (using the poidev routine; Press et al. 1992). Segregating Sites: The number of mutations (segregating sites) is fixed. Mutations are uniformly distributed at random along lineages. Recombination option. No Recombination: It is assumed that there is no intragenic recombination, R = 0 (e.g. mitochondrial DNA data). Intermediate level of Recombination: This is the case for most nuclear
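The "Given Theta, no recombination" case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Python's own `random` generator and a simple Knuth-style Poisson sampler instead of the ran1 and poidev routines, it does not build an explicit genealogy (only the total branch length is needed to draw the number of mutations), and all function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson deviate by Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def simulate_segregating_sites(n, theta, rng):
    """Number of mutations on a neutral coalescent tree of n sequences.

    Time is measured in units of 2N generations (assumption), so the
    waiting time while k lineages remain is exponential with rate
    k(k-1)/2, and mutations fall on the tree as Poisson(theta/2 * length).
    """
    total_length = 0.0
    for k in range(n, 1, -1):
        waiting_time = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * waiting_time
    return poisson(theta / 2.0 * total_length, rng)


if __name__ == "__main__":
    rng = random.Random(2024)
    draws = [simulate_segregating_sites(10, 5.0, rng) for _ in range(1000)]
    print(sum(draws) / len(draws))
```

Under these assumptions the mean of the simulated values should approach theta multiplied by a1 (the sum of 1/i for i from 1 to n - 1), which gives a quick sanity check on the sketch.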
31. 8 for testing the hypothesis that all mutations are selectively neutral (Kimura 1983). The D test is based on the differences between the number of segregating sites and the average number of nucleotide differences. Minimum number of sequences in data files: The data file must contain at least four sequences. Alignment gaps and missing data: Sites containing alignment gaps, or sites with missing data, are not used (these sites are completely excluded). Analysis: Tajima's test is based on the neutral model prediction that S/a1 and k are unbiased estimates of θ, where S is the total number of segregating sites; $a_1 = \sum_{i=1}^{n-1} 1/i$; n is the number of nucleotide sequences; k is the average number of nucleotide differences between pairs of sequences (Tajima 1983, equation A3); θ = 4Nu (for diploid-autosomal); N and u are the effective population size and the mutation rate per DNA sequence per generation, respectively. Total number of mutations vs. number of segregating sites: The D test statistic can also be computed using η, the total number of mutations (see Fu and Li's test), instead of S, the total number of segregating sites. Under the infinite-sites model (with two different nucleotides per site), estimates of the D test statistic based on S and on η should be the same (S and η have the same value). However, if there are sites segregating for more than two nucleotides, values of S will be lower than th
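The two estimates of θ that Tajima's test compares can be sketched as follows. This is a minimal Python illustration, not DnaSP's implementation: the function name and the set of characters treated as gaps or missing data are assumptions, and the variance term needed to turn the difference into the D statistic is deliberately left out.

```python
from itertools import combinations


def theta_estimates(seqs, excluded="-?Nn"):
    """Return (S, k, S/a1) for a list of aligned sequences.

    Columns containing any character in `excluded` are dropped entirely,
    mirroring the complete exclusion of sites with gaps or missing data.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("at least four sequences are required")
    columns = [col for col in zip(*seqs)
               if not any(ch in excluded for ch in col)]
    # S: number of columns with more than one nucleotide variant.
    s = sum(1 for col in columns if len(set(col)) > 1)
    # k: average number of differences over all pairs of sequences.
    pairs = list(combinations(range(n), 2))
    total_diffs = sum(1 for i, j in pairs for col in columns
                      if col[i] != col[j])
    k = total_diffs / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    return s, k, s / a1


if __name__ == "__main__":
    s, k, theta_w = theta_estimates(["ATGCA", "ATGCA", "ATGTA", "A-GTT"])
    print(s, k, theta_w, k - theta_w)
```

The sign of k minus S/a1 is the sign of D; its magnitude is not comparable across data sets until it is divided by the standard deviation given in Tajima (1989).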
32. 983; Wakeley and Hey 1997. This command computes some measures of the extent of DNA divergence between populations, taking into account the effect of the DNA polymorphism. Data Files: For the present analysis, at least two sets of sequences (one for each population) must be defined (see the Define Sequence Sets command in the Data menu). Alignment gaps and missing data: Sites containing alignment gaps, or sites with missing data, in any population are not used (these sites are completely excluded). Analysis: The program estimates the following measures. For each individual population: the average number of nucleotide differences (Tajima 1983, equation A3); the nucleotide diversity, Pi (Nei 1987, equation 10.5); nucleotide diversity with Jukes and Cantor, Pi(JC) (Nei 1987, equations 10.19 and 5.3; Lynch and Crease 1990, equations 1-2); variance of Pi(JC) (Nei 1987, equation 10.7). The standard deviation (or standard error) is the square root of the variance. These estimates may be different from those obtained by the DNA Polymorphism command. This is because in the present analysis all sites with alignment gaps in population 1 or in population 2 are not considered. That is, the total number of analyzed sites considered in this command can be equal to or lower than the number taken into account in the DNA Polymorphism command. For the total data: the average number of nucleotide differences (Tajima 1983, equation A3); the n
33. Del length 3 nucleotides. Event 2 (Seq3): InDel length 2 nucleotides. Event 3 (Seq6, Seq7, Seq8, Seq10 and Seq11): InDel length 7 nucleotides. Event 4 (Seq9): InDel length 3 nucleotides. Option 1, Diallelic: Only InDel diallelic states (gap event / not gap) will be considered. That is, positions 10-16 will be excluded from the analysis, since InDel event 3 and event 4 overlap at positions 14-16. Output: Total number of InDel events analysed: 2 (event 1 and event 2). Average InDel length per event: 2.5 (the average length of event 1 and event 2). Average deletion length: 2.667 (2 sequences with 3 nucleotides deleted plus 1 sequence with 2 deleted nucleotides, divided by 3, the number of analysed sequences with gaps). DnaSP also computes: the number of InDel Haplotypes, 3; InDel Haplotype Diversity, 0.410; InDel Diversity, k(i), 0.436 (this is the analogue of k, the average number of nucleotide differences); InDel Diversity per site, Pi(i), 0.03963 (this is the analogue of Pi, the nucleotide diversity; Pi(i) is computed as k(i)/m, where m is the net number of positions analysed: 11, i.e. 18 minus the 7 positions with overlapping InDels); Theta (per sequence) from the number of InDel events, 0.644; Tajima's D, -0.9092. Additionally, DnaSP allows generating a NEXUS file with ONLY InDel events information. The data file will be recoded as: Seql AA Seco as Seq3 G Seq4 Sseqo Beqo ods Seq 6G Seq8 Seq9 Seql10 Seqlil S
34. GGCG TAGES ud ad eth do Rob dd This DnaSP module allows reconstructing the 10 sequences from the 5 individuals. DnaSP can then use the reconstructed data set (10 sequences of 16 nucleotides each) for further analysis. Haplotype Reconstruction: DnaSP can reconstruct the haplotype phases from unphased data. This haplotype reconstruction is conducted using the algorithms provided by PHASE (Stephens et al. 2001), fastPHASE (Stephens et al. 2005) and HAPAR (Wang and Xu 2003). PHASE 2.1 uses a coalescent approximation method to estimate haplotype frequencies under the assumption of Hardy-Weinberg equilibrium (random mating); the best solution for an individual is that with haplotypes at higher frequencies. It can also be used to estimate the recombination rate along the sequences. fastPHASE 1.1 modifies the PHASE algorithm and takes into account the locus spacing and the decay of linkage disequilibrium with physical distance. HAPAR uses a pure parsimony approach to estimate the haplotypes; the optimal solution is the one that requires the fewest haplotypes to resolve the genotypes. For positions not completely resolved, the user can choose between replacing unresolved positions with N and randomly assigning nucleotide variants. Note: fastPHASE and HAPAR cannot handle positions with more than two variants; in that case these methods cannot be used. Very important: See the PHASE, fastPHASE or HAPAR docume
35. GGGATA GTAGT AAAA GTGTGTGTGT ITTITTITCA CTIAXTGTCIGC CITTITCA GTGTGTGTG ITITICA GTGTGTGTG TITITCA MEGA format (Kumar et al. 1994). See Also: Input Data Files, Mega Format Example. References: Kumar et al. 1994. DnaSP can recognize interleaved and noninterleaved MEGA formats (DnaSP v. 1.0 only recognized noninterleaved MEGA formats). MEGA formats must contain the identifier #MEGA in the first line of the file. The second line must start with the word TITLE, followed by some comments (if any) on the data; comments within the sequences must be contained by a pair of double quotation marks ("comment"). The sequence data starts in the third line. The sequence name is the text after the # character until the first Blank space, Tab or Carriage return. The nucleotide sequence is written in one or more lines after the sequence name, until the next sequence name (which also starts with the # symbol); see the MEGA user manual. Special characters: Blank spaces, Tabs and Carriage returns are ignored (i.e. they can be used to separate blocks of nucleotides). By default DnaSP uses the following symbols: the hyphen character (-) to specify an alignment gap; the dot character (.) to specify that the nucleotide in this site is identical to that in the same site of the first sequence (i.e. identical site or matching symbol); the symbols N, n to designate missing data. Nevertheless
36. PIR format that can be read by DnaSP. DnaSP cannot make phylogenetic inferences or manipulate trees. There are many programs to do this, for example MacClade (Maddison and Maddison 1992), MEGA (Kumar et al. 1994), PHYLIP (Felsenstein 1993) and PAUP (Swofford 1991). Nevertheless, the input file formats used by DnaSP (FASTA, MEGA, NBRF/PIR, NEXUS and PHYLIP formats) are also recognized by some of them. DNA sequences cannot be edited or manipulated by DnaSP. You can do this by using, for example, the MacClade (Maddison and Maddison 1992) or SeqApp/SeqPup (Gilbert 1996) programs. DnaSP cannot directly analyze diploid genetic information (for instance, SNP data from diploid genomic regions). If you are using diploid unphased data, you can reconstruct the phase using the Open Unphase Genotype Data module. Introduction. Abstracts: DnaSP v. 1.0, DnaSP v. 2.0, DnaSP v. 3.0, DnaSP v. 4.0. Population genetics is a branch of evolutionary biology that tries to determine the level and distribution of genetic polymorphism in natural populations, and also to detect the evolutionary forces (mutation, migration, selection and drift) that could determine the pattern of genetic variation observed in natural populations. Ideally, the best way to quantify genetic variation in natural populations should be by comparison of DNA sequences (Kreitman 1983). However, although the methodology for DNA sequencing has been available since 1977 (Maxam and
37. S file format NEXUS This is an example of the new NEXUS file format N MacClade 3 05 or later File EX newl nex BEGIN TAXA DIMENSIONS NTAX 4 TAXLABELS seq 1 seq 2 seq 3 BEGIN CHARACTI m RS DIMENSIONS NCHAR 55 FORMAT DATATYPE DNA MISSING MATRIX seq l1 ATATACGGGGTTA TTAGA AAAATGTGTGTGTGT BOE 2 discus A d ess alode qul s sta BET Gaak amp dad 36 Le Ses TD eseelil da ea RR E DERE uude dois fiae Lau eem 6a Sas eee eae BOSE 2 staGae pig eiind cs BEGIN SETS TaxSet Barcelona 1 2 TaxSet Girona 3 TaxSet Catalunya 1 3 TaxSet Outgroup 4 END BEGIN CODONS CODONPOSSET UNTITLED Hs 1 x 626 BIe954 i 3 27 483 2 4 28 49X3 3 B 29 5033 CODESET UNTITLED Universal all BEGIN CODONUSAGE PREF UUC UCC UCG UAC UGC CUC CUG Che CAC CAG CEG AUC ACC AAC AAG AGC GUC GUG GCC GAC GAG GGC Ej D BEGIN DNASP CHROMOSOMALLOCATION Autosome GENOME Diploid END GAP MATCHCHAR EXUS version 1 INT ERL EAV PREFUNPREFCODONS GENETICCODE Universal Drosophila melanogaster This is the version used by Example 2 of NEXUS file format NEXUS This is an example of the Old NEXUS File Format used by MacClade 3 0 File EX old1 nex
38. UG AGG TCT CIG CAG ACT ATG AGA CUG CUG Codon 1 2 3: 3 mutations in site 3 (1 replacement, 2 synonymous). Codon 4 5 6: Monomorphic. Codon 7 8 9: Site 7 is replacement; Site 9 is synonymous. There are two possible evolutionary paths. Path 1: ATT (Ile) > CTT (Leu) > CTG (Leu); Site 7, Replacement; Site 9, Synonymous. Path 2: ATT (Ile) > ATG (Met) > CTG (Leu); Site 7, Replacement; Site 9, Replacement. DnaSP will choose path 1, the path that requires the smallest number of replacements. Codon 13 14 15: Site 14, 2 replacements; Site 15 is Synonymous. Here there are four possible paths. Path 1: ACT (Thr) > AAT (Asn) > AGT (Ser) > AGG (Arg); Site 14, 2 Replacements; Site 15, 1 Replacement. Path 2: ACT (Thr) > AAT (Asn) > AAG (Lys) > AGG (Arg); Site 14, 2 Replacements; Site 15, 1 Replacement. Path 3: AAT (Asn) > ACT (Thr) > AGT (Ser) > AGG (Arg); Site 14, 2 Replacements; Site 15, 1 Replacement. Path 4: AAT (Asn) > ACT (Thr) > ACG (Thr) > AGG (Arg); Site 14, 2 Replacements; Site 15, 1 Synonymous. DnaSP will choose path 4, the path that requires the smallest number of replacements. Codon 16 17 18: Site 16, 1 replacement; Site 18, 1 synonymous. Here there is a circular path: ATA (Ile) > A (Leu), ATG (Met), G (Leu). Let us suppose that the number of mutations were only two (one in site 16 and another in site 18). DnaSP must a
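The "fewest replacements" rule used in the Codon 7 8 9 example above can be illustrated with a short Python sketch. It is a simplified sketch, not DnaSP's algorithm: it assumes the standard (nuclear universal) genetic code, it handles only two codons joined by one change per differing site (so it does not cover cases such as Codon 13 14 15, where one site changes twice), it treats a step through a stop codon as a replacement, and the function names are hypothetical.

```python
from itertools import permutations

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def enumerate_paths(start, end):
    """All orders in which the differing sites can change, one at a time.

    Each path is a list of (site_index, from_codon, to_codon, kind) steps,
    where site_index is 0-based within the codon.
    """
    differing = [i for i in range(3) if start[i] != end[i]]
    paths = []
    for order in permutations(differing):
        codon, steps = start, []
        for site in order:
            nxt = codon[:site] + end[site] + codon[site + 1:]
            same_aa = CODE[codon] == CODE[nxt]
            steps.append((site, codon, nxt,
                          "synonymous" if same_aa else "replacement"))
            codon = nxt
        paths.append(steps)
    return paths


def most_conservative_path(start, end):
    """Path with the smallest number of replacement steps."""
    return min(enumerate_paths(start, end),
               key=lambda steps: sum(s[3] == "replacement" for s in steps))


if __name__ == "__main__":
    for step in most_conservative_path("ATT", "CTG"):
        print(step)
```

For ATT and CTG the sketch selects the path through CTT (one replacement, one synonymous change), which matches path 1 in the text.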
39. a Universitat de Barcelona: M. Aguadé, D. Alvarez, C. Arboledas, D. Balañà, A. Blanco-García, J. Braverman, S. Cirera, J. M. Comeron, D. De Lorenzo, T. Guebitz, S. Guirao, N. Khadem, S. O. Kolokotronis, H. Kuittinen, A. Llopart, J. M. Martín-Campos, A. Munté, A. Navarro-Sabaté, C. Nobrega, D. Orengo, J. Pérez Pires, H. Quesada, U. Ramírez, S. Ramos-Onsins, C. Romero-Ibáñez, A. Sánchez-Gracia, C. Segarra, F. G. Vieira, A. Vilella. Apart from those mentioned, special thanks are due to H. Akashi, A. Barbadilla, J. Bertranpetit, E. Betrán, C. H. Biermann, M. Blouin, F. Calafell, F. González-Candelas, D. Govindaraju, R. R. Hudson, P. de Knijff, T. Mes, A. Navarro, D. Posada, C. Robin, A. P. Rooney, S. Schaeffer, W. Stephan, S. Wells and R. Zardoya for their comments, suggestions and help. We also acknowledge D. R. Maddison for providing advice about the NEXUS file formats and for supplying us with precise instructions on this format. This work was supported by the Dirección General de Investigación Científica y Técnica of Spain to M. Aguadé (grants PB91-0245, PB94-0923, PB97-0918 and BMC2001-2906), and by the Comisión Interministerial de Ciencia y Tecnología of Spain to J. Rozas (grant TXT98-1802). Gene Conversion Detection. Genetics 146: 89-99 (1997). Ester Betrán, Julio Rozas, Arcadi Navarro and Antonio Barbadilla. The estimation of the number and the length distribution of gene conversion tracts from population DNA
40. adé's Test (HKA Test). See Also: Input Data Files, Output. References: Hudson et al. 1987; Kimura 1983; Nei 1987. This command performs the Hudson, Kreitman and Aguadé (1987) test (HKA test). The test is based on the prediction of the Neutral Theory of Molecular Evolution (Kimura 1983) that regions of the genome that evolve at high rates will also present high levels of polymorphism within species. The test requires data from one interspecific comparison of at least two regions of the genome, and also data on the intraspecific polymorphism in the same regions of at least one species. Data Files: For the present analysis, at least two sets of sequences (one with the intraspecific data and the other with the outgroup sequences) must be defined (see the Define Sequence Sets command in the Data menu). Minimum number of sequences in data files: The intraspecific data file must contain at least two sequences, while the interspecific data file can contain one or more sequences. Alignment gaps and missing data: Sites containing alignment gaps, or sites with missing data, in any data file are not used (these sites are completely excluded). Implementation: The test is performed considering intraspecific data from only one species (Hudson et al. 1987, equation 6). If there is more than one sequence in the interspecific data file, the intraspecific polymorphism will be ignored; however, this information will be considered in computing the interspecific divergence. The
41. age Table. The RSCU value of a codon is the observed frequency of that codon in the gene divided by that expected under the assumption of equal usage of synonymous codons. An RSCU value of 1 indicates that the frequency of that codon is that expected under equal codon usage; values less than 1 or more than 1 indicate that the codon is used less often or more often than expected. Codon Usage Table: DnaSP shows, for a given codon, the observed frequency and its RSCU value (in parentheses). For a given DNA sequence, the Codon Usage Table also shows the Scaled chi-square value. ENC, Effective Number of Codons (Wright 1990): This measure quantifies the effective number of codons that are used in a gene. For the nuclear universal genetic code, the value of ENC ranges from 20 (only one codon is used for each amino acid, i.e. the codon bias is maximum) to 61 (all synonymous codons for each amino acid are equally used, i.e. no codon bias). CBI, Codon Bias Index (Morton 1993): CBI is a measure of the deviation from the equal use of synonymous codons. CBI values range from 0 (uniform use of synonymous codons) to 1 (maximum codon bias). SChi2, Scaled Chi-Square (Shields et al. 1988): The scaled χ² (chi-square) is a measure based on the chi-square statistics, i.e. based on the difference between the observed number of codons and those expected from equal usage of codons. The sum of the chi-square values is divided by the total number of cod
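The RSCU definition given above (observed codon count divided by the count expected under equal usage of synonymous codons) can be sketched in Python as follows. The sketch assumes the standard (nuclear universal) genetic code, ignores stop codons and codons containing characters other than A, C, G, T/U, and uses a hypothetical function name; it is not DnaSP code.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def rscu(coding_sequence):
    """Return {codon: RSCU} for every amino acid present in the sequence."""
    seq = coding_sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if CODE.get(c, "*") != "*")

    # Group synonymous codons by the amino acid they encode.
    families = {}
    for codon, amino_acid in CODE.items():
        if amino_acid != "*":
            families.setdefault(amino_acid, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined
        expected = total / len(family)  # equal usage of synonymous codons
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


if __name__ == "__main__":
    print(rscu("ATGGCTGCTGCTGCC"))
```

Within each amino acid, the RSCU values sum to the number of synonymous codons, so a four-codon family used evenly gives 1 for every codon.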
42. an intron, etc. That definition allows DnaSP to conduct analyses on specific functional regions using the MultiDomain Analysis command. Domain set assignations can be stored in NEXUS data files (for that, use the save/export or update commands). Example: in the data file DmelOsRegions.nex (included in the DnaSP package) two genes are defined, with three and four exons. Each gene would correspond to a domain and each exon to a subdomain. Filter / Remove Positions. See Also: Input Data Files, Output. This command allows the user to remove some positions. The DnaSP module generates a NEXUS Data File including information about the polymorphic sites. Selected Positions: DnaSP can select the following sorts of positions: Coding and Noncoding positions; First, Second and Third codon positions; Zero-, Two- and Four-Fold Degenerate positions. Example (using the nuclear universal genetic code) of how DnaSP selects the X-fold degenerate positions: 3 6 9 ATA TTA AC ATA TTA GA ALA TIA Q Positions 1, 2, 5, 7 and 8 are zero-fold degenerate positions. Position 3 is a three-fold degenerate position. Positions 4 and 6 are two-fold degenerate positions. Position 9 could be (i) four-fold degenerate (codon ACT) or (ii) two-fold degenerate (codon GAT); DnaSP will not include that position among either the two-fold or the four-fold degenerate positions. Codons with missing information or alignment
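The degeneracy classes used in the example above can be reproduced with a small Python sketch. It assumes the standard (nuclear universal) genetic code and counts, for one codon position, how many of the four nucleotides leave the encoded amino acid unchanged; a position where only the observed nucleotide does so is reported as zero-fold. The function name is hypothetical, and the sketch classifies one codon at a time, so the rule for positions that differ among sequences (such as position 9) is not implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def fold_degeneracy(codon, position):
    """Degeneracy (0, 2, 3 or 4) of one position (0, 1 or 2) of a codon."""
    amino_acid = CODE[codon]
    synonymous = sum(
        1 for base in BASES
        if CODE[codon[:position] + base + codon[position + 1:]] == amino_acid
    )
    # Only the observed nucleotide keeps the amino acid: zero-fold.
    return 0 if synonymous == 1 else synonymous


if __name__ == "__main__":
    for codon in ("ATA", "TTA", "ACT", "GAT"):
        print(codon, [fold_degeneracy(codon, i) for i in range(3)])
```

Run on the codons of the example, the sketch gives zero-fold, zero-fold and three-fold for ATA, and two-fold, zero-fold and two-fold for TTA, in agreement with positions 1 to 6 in the text; the third position is four-fold in ACT and two-fold in GAT.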
43. and Miller 1990; Watterson 1975. This command computes some measures of the extent of DNA polymorphism and divergence in synonymous, nonsynonymous, silent and all sites. Unlike the Polymorphism and Divergence command, this command provides estimates of nucleotide diversity, divergence and the number of mutations in functional regions (i.e. separately for noncoding regions, exons, introns, etc.). Data Files: For the present analysis, at least one set of sequences must be defined (see the Define Sequence Sets command in the Data menu). Analysis using one sequence set: The sequence set must include intraspecific data information. DnaSP will estimate some measures of the extent of DNA polymorphism. Analysis using two sequence sets: One sequence set must contain the intraspecific data, while the other must contain sequences (one or more) from a different species or from a different population. DnaSP will estimate some measures of the extent of DNA polymorphism and of divergence. Alignment gaps and missing data: Sites (or codons) with alignment gaps or missing data in any data file are not used (i.e. these sites or codons are completely excluded). Output: The program estimates the following measures. From the intraspecific data file: Nucleotide diversity, Pi (π) (Nei 1987, equation 10.5); Nucleotide diversity with Jukes and Cantor correction, Pi(JC) (Lynch and Crease 1990, equations 1-2); Theta (per site) from Eta, η
44. and negative selection on the human genome. Genetics 158: 1227-1234. FELSENSTEIN J. 1993. Phylogeny Inference Package (PHYLIP). Version 3.5. University of Washington, Seattle. FU Y. X. 1995. Statistical properties of segregating sites. Theor. Pop. Biol. 48: 172-197. FU Y. X. 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915-925. FU Y. X. and W. H. LI 1993. Statistical tests of neutrality of mutations. Genetics 133: 693-709. GILBERT D. 1996. A biological sequence editor and analysis program. Indiana University. HEY J. 1991. The structure of genealogies and the distribution of fixed differences between DNA sequence samples from natural populations. Genetics 128: 831-840. HEY J. and J. WAKELEY 1997. A coalescent estimator of the population recombination rate. Genetics 145: 833-846. HARPENDING H. 1994. Signature of ancient population growth in a low-resolution mitochondrial DNA mismatch distribution. Human Biology 66: 591-600. HILL W. G. and A. ROBERTSON 1968. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226-231. HUDSON R. R. 1983. Properties of a neutral allele model with intragenic recombination. Theor. Pop. Biol. 23: 183-201. HUDSON R. R. 1987. Estimating the recombination parameter of a finite population model without selection. Genet. Res. 50: 245-250. HUDSON R. R. 1990. Gene genealogies and the c
45. aracter (Gene Flow and Genetic Differentiation commands). The results of these analyses could differ depending on how the alignment gaps are dealt with (see Data Menu: Gaps in Sequence Sets, or Include/Exclude Sequences commands). Suppose the following complete data file: Seq1 ATCTCTTAGGGTCGATTTGTTG GTATTITAA Seq2 AT TTATTITCGA TTGTTG GTATTTAA Seq3 ATCGCTTA TCGATTTGT TGTATTTAA Seq4 AT TCTTA CGATTTGTTG GIle e rTAA Seq5 AT TCTTA CGATTTGTTG GTATTTAA Assume that the active sequence set is constituted by sequences Seq1, Seq2 and Seq3. Sequences Seq4 and Seq5 would be excluded (Include/Exclude Sequences command) or, alternatively, Seq4 and Seq5 would be part of a different, non-active sequence set (Define Sequence Sets command). With the option "Sites with alignment gaps are excluded if they are present in the active subset", the analysis of the Seq1, Seq2 and Seq3 sequences will be conducted using information of all sites with gaps. However, with the option "Sites with alignment gaps in the original data file are excluded in all subsets", DnaSP will not use information of sites with gaps (or missing data) present in excluded sequences. Therefore, sites 3, 9, 10, 11, 28, 29, 30 (i.e. all sites with alignment gaps in the original data file) will not be used. On the contrary, sites with gaps in included sequences (sites 4, 5, 16, 21, 22, 23 and 24) will be used. Effect
46. art. X axis: Nucleotide distance. Y axis: D. Print Graph (Black and White): Use this command to print in black and white the contents of the window (the graph) at the default printer. Print Graph (color): Use this command to print the graph in color at the default printer. Save Graph (bmp): Use this command to save the graph in a file (bmp format). Copy Graph (clipboard): Use this command to copy the graph to the clipboard (i.e. you can paste it to other applications). Show Significant: This command displays the significant values in Linkage Disequilibrium analysis. Display in Black and White: Use this command to display the graph in black and white. Display Default Color: Use this command to display the graph in the default colors. Colors: You can use this command to change the default colors of the graph. Sliding Window. The sliding window method allows you to calculate some measures or parameters (for example, the nucleotide diversity) across a DNA region. In this method a window (segment of DNA) is moved along the sequences in steps. The parameter is calculated in each window, and the value is assigned to the nucleotide at the midpoint of the window. Both the window length and the step size (default values) can be changed by the user. DnaSP allows you to perform sliding window analyses in non-overlapping windows; for that analysis you must assign the same values to both the window length and the step size. The ou
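The sliding window procedure described above can be sketched in Python. The sketch makes several assumptions that the text does not settle: the per-site input is a list of numbers (for example 1 for a segregating site and 0 otherwise), the statistic is a simple mean per window, positions are 1-based, the midpoint is reported as a possibly fractional position, and trailing sites that do not fill a whole window are dropped. The function name is hypothetical.

```python
def sliding_window(per_site_values, window_length, step_size):
    """Return a list of (midpoint, window_mean) tuples.

    Setting step_size equal to window_length gives non-overlapping windows.
    """
    if window_length < 1 or step_size < 1:
        raise ValueError("window length and step size must be positive")
    results = []
    last_start = len(per_site_values) - window_length
    for start in range(0, last_start + 1, step_size):
        window = per_site_values[start:start + window_length]
        # The window covers 1-based positions start+1 .. start+window_length.
        midpoint = start + (window_length + 1) / 2
        results.append((midpoint, sum(window) / window_length))
    return results


if __name__ == "__main__":
    segregating = [0, 1, 0, 0, 1, 1, 0, 0]
    print(sliding_window(segregating, window_length=4, step_size=4))
    print(sliding_window(segregating, window_length=4, step_size=2))
```

With a step smaller than the window length, neighbouring windows share sites, so their values are not independent; equal window and step sizes avoid that at the cost of fewer points on the chart.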
47. ating sites by the Sliding Window method. The output of the sliding window analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph, the nucleotide diversity, theta or S (Y axis) is plotted against the nucleotide position (X axis). Pairwise Deletion option: The average number of nucleotide differences, k (Tajima 1983, equation A3), and the nucleotide diversity, Pi (π) (Nei 1987, equations 10.5 or 10.6), can also be calculated by the Pairwise Deletion option (DnaSP will not compute their variances). Using this option, only those gaps present in a particular pairwise comparison are ignored. Note: Pairwise sequence comparisons with 0 sites (after excluding the gaps) are ignored. Statistical significance by the coalescent: DnaSP can provide the confidence intervals of the number of haplotypes, the haplotype diversity and the nucleotide diversity by computer simulations using the coalescent algorithm (see Coalescent Simulations). Note: n.a., not applicable. When the proportion of differences is equal to or higher than 0.75, the Jukes and Cantor correction cannot be computed. Gaps in Pairwise Deletion / Gaps as the Fifth Character. DnaSP allows the analysis of the nucleotide diversity using the pairwise deletion option (DNA Polymorphism commands), and of the gene flow and genetic differentiation among populations using the pairwise deletion option and also using the gap as the fifth ch
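The Pairwise Deletion idea described above can be illustrated with a minimal Python sketch: for each pair of sequences, only the sites where neither sequence has a gap are compared, and pairs left with 0 comparable sites are skipped. The averaging used here (an unweighted mean of the per-pair proportions of differences) is an assumption made for illustration and may differ in detail from DnaSP's calculation; the function name is hypothetical.

```python
from itertools import combinations


def pi_pairwise_deletion(seqs, gap="-"):
    """Nucleotide diversity with gaps removed pair by pair."""
    proportions = []
    for first, second in combinations(seqs, 2):
        # Keep only the sites where neither sequence of the pair has a gap.
        sites = [(x, y) for x, y in zip(first, second)
                 if x != gap and y != gap]
        if not sites:
            continue  # comparisons with 0 sites are ignored
        differences = sum(1 for x, y in sites if x != y)
        proportions.append(differences / len(sites))
    if not proportions:
        raise ValueError("no pair of sequences shares an ungapped site")
    return sum(proportions) / len(proportions)


if __name__ == "__main__":
    print(pi_pairwise_deletion(["ACGT", "A-GA", "ACCT"]))
```

Compared with complete deletion, this keeps the information at site 2 for the pair of sequences that both have a nucleotide there, which is the practical reason for choosing the option when gaps are scattered across many sites.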
48. ation and the power of statistical tests of neutrality. Genet. Res. 74: 65-69. WANG L. S. and XU Y. 2003. Haplotype inference by maximum parsimony. Bioinformatics 19: 1773-1780. WATTERSON G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256-276. WEIR B. S. 1996. Genetic Data Analysis II. Sinauer Associates, Inc., Sunderland. WRIGHT S. 1951. The genetical structure of populations. Ann. Eugenics 15: 323-354. WRIGHT F. 1990. The effective number of codons used in a gene. Gene 87: 23-29.
49. atter graph. In the graph, D, D', R or R^2 can be plotted against the nucleotide distance (X axis). Recombination. See Also: Coalescent Simulations, Input Data Files, Output. References: Hudson 1987; Hudson and Kaplan 1985; Rozas et al. 2001. This command computes some estimates of the Recombination parameter R = 4Nr (for autosomal loci of diploid organisms), where N is the population size and r is the recombination rate per sequence (per gene). In the literature the recombination parameter is also indicated as C = 4Nc. For the present analysis, sites containing alignment gaps or missing data in the data files are not used (these sites are completely excluded). The program estimates the following measures. Recombination parameter R = 4Nr (Hudson 1987): The estimator is based on the variance of the average number of nucleotide differences between pairs of sequences, $S_k^2$ (Hudson 1987, equation 1). The estimator R is obtained after solving equation 4 (Hudson 1987). The solution of the function g(C, n) of equation 4 is obtained numerically (see the Appendix in Hudson 1987). The output: DnaSP shows the estimate of R = 4Nr per gene (r is the recombination rate per generation between the most distant sites) (Hudson 1987, from equation 4). DnaSP also calculates the estimate of R between adjacent sites: R (between adjacent sites) = R (per gene) / D, where D is the average nucleotide distance (in base pairs) of the analyzed region (the average
50. aved file formats Alignment gap symbol The symbol used to designate an alignment gap should be indicated by the subcommand GAP For example GAP indicates that the hyphen character should be used to specify an alignment gap Default symbol Identical site matching character symbol The symbol used to designate that the nucleotide in a site is identical to that in the same site of the first sequence should be indicated by the subcommand MATCHCHAR For example MATCHCHAR Default symbol Missing data symbol The symbol used to designate missing data should be indicated by the subcommand MISSING For example MISSING Default symbol Note the following symbols are not allowed in the subcommands GAP MISSING and MATCHCHAR The white space and 11 2 V lt gt see Maddison et al 1997 Moreover these subcommands cannot share the same symbol Sequence name There is no limit for the sequence name length nevertheless DnaSP will only display the first 20 characters Blank spaces and tabs are not allowed underlines should be used to indicate a blank space Interleaved format NEXUS files can contain nucleotide sequences with interleaved and non interleaved formats The former format must be indicated by the subcommand INTERLEAVE NEXUS blocks NEXUS blocks must end with the command END DnaSP will read the following NEXUS blocks see Maddison et al 1997 DATA TAXA CHARACTERS blocks These
51. blocks contain information about the taxa and the molecular sequence data. SETS block: This block allows the user to store information on groups of sequences, characters, taxa, etc. DnaSP only uses the TaxSet command, which contains information about groups of sequences. NOTE: See also Define Sequence Sets. CODONS block: This block contains information about the genetic code and about the regions of the sequence that are noncoding or protein coding. NOTE: See also Assign Coding Regions. CODONUSAGE block: This is a private NEXUS block that contains information about the specific table of Preferred and Unpreferred codons that will be used in the Preferred and Unpreferred Synonymous Substitutions analysis. There are 8 predefined tables; nevertheless, users can define their own table. Subcommands: Pref subcommand: includes the preferred codons. Unknown subcommand: includes codons of unknown preference nature. NOTE: See also the Data Menu. See also the NEXUS Format Example 1. DNASP block: This is a private NEXUS block that contains information about (i) the chromosomal location of the DNA region (CHROMOSOMALLOCATION command); there are 8 predefined chromosomal locations: Autosome, Xchromosome, Ychromosome, Zchromosome, Wchromosome, prokaryotic, mitochondrial, chloroplast; or (ii) the organism's genomic type (GENOME command); there are 2 predefined genomic types: Diploid, Haploid. NOTE: See also Data Menu. Example 1 of NEXU
52. by the Student's t test with n - 2 degrees of freedom (n is the total number of values, i.e. pairwise comparisons); this test is not included in DnaSP. But be careful: this test requires independent sample values, and that is certainly not the case for LD. Alternative: You could determine the confidence intervals of the ZZ test statistic by coalescent-based simulations (see Coalescent Simulations). Another alternative, not included in the present version of DnaSP: you might test the decay of LD with physical distance by a randomization (permutation) test, i.e. by random permutation of the polymorphic sites. Nucleotide distance: The nucleotide distance (Dist in the output), i.e. the distance in nucleotides between a given pair of polymorphic sites, is calculated as the average number of nucleotides that separate two particular polymorphic sites. For example, the nucleotide distance between polymorphic sites 1 and 18 (marked with asterisks in the following four sequences) is 13: seq 1 ATATACGGGGTTA TTAGA seq 2 CGATAC GG TA TAACA seq 3 AGATACGG GATA TAATA seq 4 ATAAACGGGGATA GTAGT Output: The output of the analysis is given in a grid (table). The columns Site1 and Site2 refer to the polymorphic sites analyzed (compared), Dist to the nucleotide distance between them, Fisher to the probability obtained by Fisher's exact test, and Chi-sq to the value of χ². The results are also presented graphically by a sc
53. c Code: To compute synonymous and nonsynonymous substitutions, DnaSP will use the genetic code assigned with Assign Genetic Code (the default is the Nuclear Universal). Notes and abbreviations: n.a., not applicable. When the proportion of differences is equal to or higher than 0.75, the Jukes and Cantor correction cannot be computed. Seq 1 and Seq 2, the two sequences compared. SynDif, the total number of synonymous differences. SynPos, the total number of synonymous sites. SilentDif, the total number of silent differences. SilentPos, the total number of silent sites. Ks, the number of synonymous (or silent) substitutions per synonymous (or silent) site. NSynDif, the total number of nonsynonymous differences. NSynPos, the total number of nonsynonymous sites. Ka, the number of nonsynonymous substitutions per nonsynonymous site. Codon Usage Bias. See Also: Input Data Files, Output, Window Menu. References: Morton 1993, Sharp et al. 1986, Shields et al. 1988, Wright 1990. This command computes some measures of the extent of the nonrandom usage of synonymous codons. Data Files: The present analysis requires only one data file. This command works only if the coding regions and the genetic code have been previously defined (more help in Assign Coding Regions and Assign Genetic Code). Codon Bias Measures: RSCU, Relative Synonymous Codon Usage (Sharp et al. 1986). For a given DNA sequence, DnaSP shows the RSCU value at each codon. Codon Us
54. ces between the two codons compared, two or six putative pathways exist. DnaSP considers all pathways with equal probability, but it excludes those pathways that go through stop codons. All nucleotide differences in noncoding positions are considered silent. Silent differences therefore include both the synonymous differences in coding regions and all differences in noncoding positions. Silent Substitutions Considered: Substitutions in Coding Regions: Only synonymous substitutions (coding region) will be considered. In Coding and Noncoding Regions: All silent substitutions will be considered (synonymous substitutions and changes in noncoding positions). If the data file does not contain assigned coding regions, all sites will be considered as noncoding positions, i.e. all substitutions will be considered as silent. Analysis: (i) The average number of nucleotide differences per site between two sequences, or nucleotide diversity, Pi (π) (Nei 1987, equations 10.5 or 10.6). (ii) The average number of nucleotide substitutions per site between two sequences, or nucleotide diversity, Pi (π), using the Jukes and Cantor (1969) correction (Lynch and Crease 1990, equations 1-2). The correction is performed in each pairwise comparison of two sequences (Nei and Gojobori 1986, equations 1-3); the Pi (π) estimates are obtained as the average of the values of all comparisons of Ks and Ka values; see also the DNA Polymorphism command
55. cess. However, a comprehensive computer program for the analysis is not currently available. Results: Here we present DnaSP version 2.0, a software package for Windows that performs extensive population genetics analyses on DNA sequence data. DnaSP estimates several measures of DNA sequence variation within and between populations, linkage disequilibrium, recombination, gene flow and gene conversion (a new algorithm to detect gene conversion tracts has been included). DnaSP can also carry out several tests of neutrality: those of Fu and Li; Hudson, Kreitman and Aguadé; and Tajima. The results of the analyses are displayed in tabular and graphic form. Availability: For academic uses, DnaSP is available via anonymous ftp (ftp.ebi.ac.uk, in the directory pub/software/dos). Contact: E-mail: julio@porthos.bio.ub.es. DnaSP version 3.0. Bioinformatics 15: 174-175 (1999). Julio Rozas and Ricardo Rozas. DnaSP version 3: an integrated program for molecular population genetics and molecular evolution analysis. Summary: DnaSP is a Windows integrated software for the analysis of the DNA polymorphism from nucleotide sequence data. DnaSP version 3 incorporates several methods for estimating the amount and pattern of DNA polymorphism and divergence, and for conducting neutrality tests. Availability: For academic uses, DnaSP is available free of charge from http://www.ub.es/dnasp. DnaSP version 4.0. Bioinformatics 19: 2496-2497 (2003). Julio Ro
56. d MEGA/PHYLIP data files with genetic distance information. These files can be read by the MEGA or PHYLIP software, which allows some phylogenetic analyses to be performed. Any word processor can also be used to read or edit MEGA or PHYLIP files (these files are just text files). MEGA v. 2: Molecular Evolutionary Genetics Analysis Software. The MEGA software is distributed free of charge from http://www.megasoftware.net. Linkage Disequilibrium. See Also: Coalescent Simulations, Graphs Window, Input Data Files, Output. References: Hill and Robertson 1968, Kelly 1997, Langley et al. 1974, Lewontin 1964, Lewontin and Kojima 1960, Rozas et al. 2001, Sokal and Rohlf 1981, Wall 1999, Weir 1996. Abstracts: Rozas et al. 2001. This command calculates the degree of linkage disequilibrium (LD), or nonrandom association, between nucleotide variants at different polymorphic sites. Sites containing alignment gaps, or polymorphic sites segregating for three or four nucleotides, are completely excluded from the analysis. The analysis can be performed with all polymorphic sites in the data, or only with parsimony informative sites (sites that segregate for only two nucleotides that are present at least twice). Linkage disequilibrium between nucleotide variants: The degree of LD is estimated by the following parameters: D (Lewontin and Kojima 1960), D' (Lewontin 1964), R and R^2 (Hill and Robertson 1968). DnaSP considers as coupling gametes those with the most or
57. e memory (> 3,000,000 nt). Maximum number of sequences: 32767. Other limitations: The grid control cannot display more than 16351 rows or 5448 columns. Therefore, for the sliding window option, the maximum number of rows of results is 16351. Hence, the maximum number of polymorphic sites (linkage disequilibrium module) or of sequences (synonymous and nonsynonymous module) that can be analyzed and displayed on the screen is 181; the total number of pairwise comparisons is (181 × 180) / 2 = 16290. Although DnaSP will not display the results of larger analyses on the screen, the results can be saved in a file. NOTE: These upper limits will be increased (I hope) in following DnaSP versions. Input Data Files: DnaSP can automatically read the following types of data file formats: FASTA; MEGA (Kumar et al. 1994); NBRF/PIR (Sidman et al. 1988); NEXUS (Maddison et al. 1997); PHYLIP (Felsenstein 1993). In all cases, one or more homologous nucleotide sequences should be included in just one file (ASCII file). The sequences must be aligned, i.e. the sequences must have the same length. Nucleotide sequences should be entered using the letters A, T (or U), C or G, in lower case, upper case, or any mixture of lower and upper case. DnaSP allows you to analyze a subset of sites of the data file; this option is useful for the analysis of particular regions of the data file, for example when analyzing exonic and intronic regions separa
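The limit of 181 items given under Other limitations follows from the 16351-row grid and the number of pairwise comparisons, n(n - 1)/2. The short check below reproduces that arithmetic; the function and constant names are illustrative only.

```python
MAX_GRID_ROWS = 16351  # row limit of the grid control stated in the manual

def pairwise_comparisons(n):
    """Number of distinct pairs among n items: n(n - 1) / 2."""
    return n * (n - 1) // 2

def max_items_for_grid(max_rows=MAX_GRID_ROWS):
    """Largest n whose pairwise comparisons still fit in max_rows rows."""
    n = 1
    while pairwise_comparisons(n + 1) <= max_rows:
        n += 1
    return n
```

181 items give 16290 comparisons, which fits in the grid; 182 items would give 16471, which does not.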
58. e number of monomorphic sites, the number of polymorphic sites (segregating for two, three or four nucleotides). DnaSP also indicates the total number of parsimony informative sites (sites that have a minimum of two nucleotides that are present at least twice) and non-informative sites (singleton sites). This command also displays information about the genetic code used for these data, and the regions that are protein coding and noncoding (if this information was included in the NEXUS file or has been defined using the Assign Coding Regions command in the Coding Region Menu). In this case, for the coding region, DnaSP also displays the number of synonymous and nonsynonymous (replacement) substitutions (see how DnaSP estimates the number of Synonymous and Nonsynonymous changes in a codon). Number of Synonymous and Nonsynonymous Mutations. How DnaSP estimates Synonymous and Nonsynonymous changes in a codon: In general, DnaSP uses a conservative criterion to decide if a particular change in a nucleotide site is synonymous or nonsynonymous (replacement); see the following examples. Nevertheless, the user should check the complex cases (those triplets of sites segregating for several codons, i.e. in highly variable regions). Example using the Nuclear Universal Genetic Code: 3 6 9 12 15 18 21 24 27 AGT TCT A CCC AAT A AGT UAU UAU AGC TCT A CCC AGG A AGT UAU UAU AGA TCT CTG CAG ACT TTG AGA CUG C
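The site categories reported by the Polymorphic Sites command can be illustrated with a small classifier that applies the definitions quoted above to one alignment column. It is a sketch under stated assumptions: the function name, the category labels, and the default gap ("-") and missing-data ("N", "?") symbols are chosen here for illustration and are not taken from DnaSP.

```python
from collections import Counter

def classify_site(column, gap="-", missing=("N", "?")):
    """Classify one alignment column (one character per sequence).

    Parsimony informative: at least two nucleotide variants that are
    each present at least twice. Singleton: polymorphic but not
    informative.
    """
    column = column.upper()
    if any(ch == gap or ch in missing for ch in column):
        return "gap_or_missing"
    counts = Counter(column)
    if len(counts) == 1:
        return "monomorphic"
    variants_seen_twice = sum(1 for c in counts.values() if c >= 2)
    if variants_seen_twice >= 2:
        return "parsimony_informative"
    return "singleton"
```

For instance, the column "AATT" is parsimony informative, whereas "AAAT" is a singleton site.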
59. ecipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge. RAMOS-ONSINS, S. E. and J. ROZAS, 2002. Statistical properties of new neutrality tests against population growth. Mol. Biol. Evol. 19: 2092-2100. RAND, D. M. and L. M. KANN, 1996. Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Mol. Biol. Evol. 13: 735-748. ROGERS, A. R., 1995. Genetic evidence for a Pleistocene population explosion. Evolution 49: 608-615. ROGERS, A. R. and H. HARPENDING, 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569. ROGERS, A. R., A. E. FRALEY, M. J. BAMSHAD, W. SCOTT WATKINS and L. B. JORDE, 1996. Mitochondrial mismatch analysis is insensitive to the mutational process. Mol. Biol. Evol. 13: 895-902. ROZAS, J. and M. AGUADE, 1993. Transfer of genetic information in the rp49 region of Drosophila subobscura between different chromosomal gene arrangements. Proc. Natl. Acad. Sci. USA 90: 8083-8087. ROZAS, J. and M. AGUADE, 1994. Gene conversion is involved in the transfer of genetic information between naturally occurring inversions of Drosophila. Proc. Natl. Acad. Sci. USA 91: 11517-11521. ROZAS, J. and R. ROZAS, 1995. DnaSP, DNA sequence polymorphism: an interactive program for estimating Population Genetics parameters from DNA sequence data. Comput. Applic. Biosci. 11: 621-625. ROZAS, J. and
60. ences: 0. Codon 13-14-15. Species 1: Site 14, 2 replacements; Site 15 is synonymous. Here there are four possible paths. Path 1: ACT (Thr) -> AAT (Asn) -> AGT (Ser) -> AGG (Arg); Site 14, 2 replacements; Site 15, 1 replacement. Path 2: ACT (Thr) -> AAT (Asn) -> AAG (Lys) -> AGG (Arg); Site 14, 2 replacements; Site 15, 1 replacement. Path 3: AAT (Asn) -> ACT (Thr) -> AGT (Ser) -> AGG (Arg); Site 14, 2 replacements; Site 15, 1 replacement. Path 4: AAT (Asn) -> ACT (Thr) -> ACG (Thr) -> AGG (Arg); Site 14, 2 replacements; Site 15, 1 synonymous. DnaSP will choose path 4, the path that requires the smallest number of replacements. Species 2: Monomorphic. Within species: 2 replacements (Site 14) and 1 synonymous (Site 15). Fixed differences: 2; Site 13 is replacement, Site 15 is synonymous. For computing fixed differences, DnaSP will check all paths between codons of the two species, and it will choose the path with the smallest number of changes. If there are several paths with the same number of differences, DnaSP will choose the path with the lower number of replacement changes. Codon 16-17-18. Species 1: Site 16, 1 replacement; Site 18, 1 synonymous. Here there is a circular path: ATA Ile -> A Leu ATG Met G Leu. Let us suppose that the number of mutations were only two, one in site 16 and another in site 18. DnaSP must assume one recombination event the recombinati
61. ene Flow and Genetic Differentiation; Linkage Disequilibrium; Recombination; Population Size Changes; Fu and Li's (and other) Tests; Fu and Li's (and other) Tests with an Outgroup; HKA (Hudson, Kreitman and Aguadé's) Test; McDonald and Kreitman's Test; Tajima's Test. Overview Menu: This menu has the following command. The command allows you to choose different options for the analysis: Intraspecific Data. Generate Menu: This menu has the following commands: Shuttle to DNA Slider; Polymorphic Sites File; Haplotype Data File; Translate to Protein Data File; Reverse Complement Data File; Prepare Submission for EMBL/GenBank Databases. Window Menu: Use this command to change the active window (windows with results, calculator, sequence data). The active window is the window that appears in the foreground. Help Menu: This menu is provided with the following commands. Contents: This command provides information for using DnaSP (the command opens the present help file). Search For Help on: This command displays Help's Search dialog box, where you can quickly find the information that you need by keywords. DnaSP Bug Reports: This command displays the DnaSP Bug Reports Web page. Citation: This command displays a dialog box with the suggested citation for DnaSP. DnaSP Home Page: This command displays the DnaSP Web page. About DnaSP: This command displays a
62. enome Research 12: 656-664. KIMURA, M., 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, Massachusetts. KREITMAN, M., 1983. Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304: 412-417. KUMAR, S., K. TAMURA and M. NEI, 1994. MEGA: Molecular Evolutionary Genetics Analysis software for microcomputers. Comput. Applic. Biosci. 10: 189-191. LANGLEY, C. H., Y. N. TOBARI and K. KOJIMA, 1974. Linkage disequilibrium in natural populations of Drosophila melanogaster. Genetics 78: 921-936. LEWONTIN, R. C., 1964. The interaction of selection and linkage. General considerations; heterotic models. Genetics 49: 49-67. LEWONTIN, R. C. and K. KOJIMA, 1960. The evolutionary dynamics of complex polymorphisms. Evolution 14: 458-472. LYNCH, M. and T. J. CREASE, 1990. The analysis of population survey data on DNA sequence variation. Mol. Biol. Evol. 7: 377-394. McDONALD, J. H., 1996. Detecting non-neutral heterogeneity across a region of DNA sequence in the ratio of polymorphism to divergence. Mol. Biol. Evol. 13: 253-260. McDONALD, J. H., 1998. Improved tests for heterogeneity across a region of DNA sequence in the ratio of polymorphism to divergence. Mol. Biol. Evol. 15: 377-384. MADDISON, W. P. and D. R. MADDISON, 1992. MacClade: Analysis of phylogeny and character evolution. Version 3. Sinauer Associates, Sunderland, Massachusetts. MADDISON W P D L
63. eq12 Seq13 where A and G represents the two InDel states no InDel InDel Option 2 Triallelic Only Diallelic and Triallelic InDel states will be considered In the example all positions will be used Total number of InDel events analysed 4 Average InDel length per event 3 75 Average deletion length 5 111 Number of InDel haplotypes 6 DnaSP will generate the following recoded NEXUS file Seql AAAA BEES xad Seq3 G Seq4 Sem Gig Seq6 Seq7 G Seg8 Seq9 G Seq10 Segll sss Seq12 Seq13 ooo Q Option 3 Tetrallelic Only Diallelic Triallelic and Tetrallelic InDel states will be considered Optionz 4 Multiallelic All InDel events will be considered Option 5 As Is DnaSP will no infer events from InDel information DnaSP will generate the following recoded NEXUS file Seql AAAAAAAAAAAA SOq6 seres GGGGGGG Seq GGG GGGGGGG SEGS press GGGGGGG Beg LisseR4d GGG Segl srera GGGGGGG Semi ssx GGGGGGG Sele 266i dg E SELS Gaa4k44 X Note Throughout this module nucleotide substitution polymorphism is not considered either in non InDel sites such as the nucleotide polymorphism at site 2 in the example data file or in InDel positions such as the nucleotide polymorphism at site 4 a DNA Divergence Between Populations See Also Gene Flow and Genetic Differentiation Graphs Window Input Data Files Output References Hey 1991 Jukes and Cantor 1969 Nei 1987 Tajima 1
64. estimate of D (the between-species divergence) is obtained as the average number of differences between DNA sequences from species 1 and 2; that is, D is estimated in the same way as Dxy (Nei 1987, equation 10.20), but per sequence. Regions (loci): DnaSP performs the HKA test on only two regions. These regions can be any two non-overlapping segments of sites of the data file. Chromosomal location: DnaSP assumes that the two regions (loci) are located in the same chromosome (the two compared regions are from the same data file), i.e. both are in autosomal chromosomes or in sex chromosomes. Even though the statistical significance of the HKA test will be the same in both cases, different estimates of the divergence time or theta are expected, so it is convenient to indicate the chromosome where the region is located. DnaSP considers that the expectation of π is 4Nu (for autosomal), 3Nu (for X-linked genes) and Nu (for Y-linked genes); we have slightly modified the equations of Begun and Aquadro (1991) for comparisons involving autosomal (or X-linked) with Y-linked genes. You can compare regions located in autosomes with regions in sex chromosomes using the module HKA test, Direct Mode. Substitutions Considered: All substitutions: All substitutions are used (excluding substitutions in sites with gaps or missing data). Silent substitutions: Only silent substitutions are used: synonymous substitutions and changes in noncoding positio
65. f you save/export the data file as a NEXUS file format. Preferred and Unpreferred Synonymous Substitutions. See Also: Codon Preference Table, Input Data Files, Output. References: Akashi 1995, Akashi 1999. This command determines the polarity status (ancestral -> derived) of the polymorphic or fixed substitutions, and it also estimates the number of preferred and unpreferred substitutions. Data Files: For the present analysis, at least two sets of sequences (one with the intraspecific data, and the other with the outgroup sequences) must be defined (see the Define Sequence Sets command in the Data menu). Alignment gaps and missing data: Sites (or codons) with alignment gaps or missing data in any group of sequences (sequence sets) are not used, i.e. these sites (or codons) are completely excluded. Analyze: One Species with an Outgroup: Analysis of the polarity status (ancestral -> derived) of the polymorphic substitutions. The outgroup allows that information to be inferred. One Species with two Outgroups: Analysis of the polarity status (ancestral -> derived) of the polymorphic substitutions (intraspecific data) and also of the fixed differences between the MRCA of the intraspecific data and the common ancestor of the close outgroup. The distant outgroup allows that information to be inferred. Pref/Unpref Tables: Use this command to assign the specific codon preference table to the data. Options: Non-Coding Positions: This opt
66. ferred. Gene Conversion. See Also: Graphs Window, Input Data Files, Output. References: Betrán et al. 1997, Rozas and Aguadé 1994. Abstracts: Betrán et al. 1997, Rozas and Aguadé 1994. DnaSP incorporates the algorithm developed by Betrán et al. (1997) to detect gene conversion tracts from two differentiated populations (referred to as subpopulations). These subpopulations could be, for example, two different chromosomal gene arrangements (Rozas and Aguadé 1994), or two sets of paralogous sequences. Data Files: For the present analysis, at least two sets of sequences (one for each population) must be defined (see the Define Sequence Sets command in the Data menu). Minimum number of sequences in each set: One sequence set must contain at least three sequences, and the other a minimum of five. Alignment gaps and missing data: Sites containing alignment gaps, or sites with missing data in any population, are not used; these sites are completely excluded. Implementation: DnaSP estimates the observed tract length (in nucleotides) as L = TR - TL + 1 - G, where TL (left) and TR (right) are the site positions of the outermost informative nucleotide sites of a congruent tract, and G is the number of alignment gaps (if any) between TL and TR in the particular sequence where the gene conversion tract is detected (see Betrán et al. 1997, equation A1). DnaSP also estimates the parameter ψ (Betrán et al. 1997, equation A4), which measures the probab
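The observed tract length described under Implementation can be sketched as below. The operators in the expression were lost in this copy of the manual, so the reading L = TR - TL + 1 - G is an assumption, as are the function name, the 1-based inclusive positions and the "-" gap symbol.

```python
def observed_tract_length(tl, tr, sequence, gap="-"):
    """Observed tract length, assuming L = TR - TL + 1 - G.

    tl, tr: 1-based positions of the outermost informative sites.
    G: gaps between TL and TR in the sequence carrying the tract.
    """
    if tl < 1 or tr < tl or tr > len(sequence):
        raise ValueError("expected 1 <= tl <= tr <= len(sequence)")
    segment = sequence[tl - 1:tr]  # sites TL..TR inclusive
    gaps = segment.count(gap)
    return tr - tl + 1 - gaps
```

For a hypothetical sequence "ACGT--ACGTAC" with TL = 3 and TR = 10, the span covers 8 sites, 2 of which are gaps, so the observed length is 6.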
67. genes. You must indicate the value of the per gene recombination parameter R. Free Recombination: Maximum theoretical value of the Recombination parameter R. Recombination parameter R: R is the recombination parameter, R = 4Nr (for autosomal loci of diploid organisms), where N is the effective population size and r is the recombination rate per gene (sequence), i.e. r is the recombination rate per generation between the most distant sites of the DNA sequence (see also the Recombination module and Effective Population size information). Theta value (per gene): Usually the theta value is unknown; in this case, that value can be estimated from the data (see DNA Polymorphism and Tajima's Test modules). Theta per gene can be estimated from (i) k, the average number of nucleotide differences; (ii) S / a1, where S is the total number of segregating sites, a1 = Σ (1/i) from i = 1 to n - 1, and n is the number of nucleotide sequences. Observed values: If the observed value is provided, DnaSP will estimate the probability of obtaining lower values than the ones observed. For example, for the Tajima's D test statistic, P(D <= D obs) = 0.01 means that the probability of obtaining D values (under the neutral coalescent process) equal to or lower than the observed is 0.01. Note: You can perform computer simulations fixing the number of segregating sites. In this case the estimated values of theta in different replicates will also be fixed becau
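The second way of estimating theta per gene mentioned above, S / a1, can be written out directly from the definition of a1. The function names are illustrative, and the numbers in the usage note are made up for the example.

```python
def harmonic_a1(n):
    """a1 = sum of 1/i for i = 1 .. n - 1, with n the number of sequences."""
    if n < 2:
        raise ValueError("at least two sequences are required")
    return sum(1.0 / i for i in range(1, n))

def theta_from_segregating_sites(s, n):
    """Theta per gene estimated as S / a1."""
    return s / harmonic_a1(n)
```

With a hypothetical sample of n = 4 sequences, a1 = 1 + 1/2 + 1/3, so S = 11 segregating sites gives a theta of about 6 per gene.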
68. he alcohol dehydrogenase region of Drosophila pseudoobscura. Genetics 135: 541-552. SCHNEIDER, S., ROESSLI, D. and EXCOFFIER, L., 2000. Arlequin: A software for population genetics data analysis. Ver 2.001. Genetics and Biometry Lab, Dept. of Anthropology, University of Geneva. SHARP, P. M., T. M. F. TUOHY and K. R. MOSURSKI, 1986. Codon usage in yeast: Cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 14: 5125-5143. SHIELDS, D. C., P. M. SHARP, D. G. HIGGINS and F. WRIGHT, 1988. Silent sites in Drosophila genes are not neutral: Evidence of selection among synonymous codons. Mol. Biol. Evol. 5: 704-716. SIDMAN, K. E., D. G. GEORGE, W. C. BARKER and L. T. HUNT, 1988. The protein identification resource (PIR). Nucleic Acids Res. 16: 1869-1871. SIMONSEN, K. L., G. A. CHURCHILL and C. F. AQUADRO, 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141: 413-429. SLATKIN, M. and R. R. HUDSON, 1991. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129: 555-562. SOKAL, R. R. and F. J. ROHLF, 1981. Biometry. Second Edition. W. H. Freeman and Company, New York. STEPHENS, M., SMITH, N. and DONNELLY, P., 2001. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68: 978-989. STEPHENS, M. and SCHEET, P., 2005. Accounting for Decay of Linkage Di
69. he opposite direction: (i) you should generate the reverse complement data file; (ii) define the coding regions; (iii) perform the appropriate analysis. Prepare Submission for EMBL/GenBank Databases. See Also: Input Data Files. This command generates a text file with the relevant information for a submission of DNA sequence information to the nucleotide sequence databases (EMBL/GenBank/DDBJ). This command is appropriate for researchers wishing to submit multiple related sequences (see the Bulk Submissions in the EMBL Nucleotide Sequence Database Information). If the coding regions have been previously defined (see Assign Coding Regions command), DnaSP will include information on the exonic/intronic regions. More information on the EMBL/GenBank Databases: http://www.ebi.ac.uk/embl/index.html. Coalescent Simulations. See Also: DNA Polymorphism, Linkage Disequilibrium, Population Size Changes, Recombination, Fu and Li's (and other) Tests, Fu and Li's (and other) Tests with an Outgroup, Tajima's Test. References: Depaulis and Veuille 1998, Fay and Wu 2000, Fu and Li 1993, Fu 1997, Harpending 1994, Hudson 1983, Hudson 1990, Hudson and Kaplan 1985, Kelly 1997, Nei 1987, Press 1992, Ramos-Onsins and Rozas 2002, Rozas et al. 2001, Simonsen et al. 1995, Tajima 1989, Wall 1999, Watterson 1975. DnaSP can generate the empirical distributions of some test statistics. From these distributions, DnaSP can provide the confidence limits for a gi
70. he type of line delimiter: IBM PC or compatible: CR+LF (ASCII 13 & ASCII 10); Macintosh: CR (ASCII 13); Unix systems: LF (ASCII 10). To indicate the version of NEXUS file format: Old version (used by MacClade 3.04 or older); New version (NEXUS version 1, used by MacClade 3.05 or later). To indicate the symbol used for missing data, alignment gap and identical site (matching character). Send All Output to File: Use this command to send all generated output (except graphs) to a file. The command displays the standard Windows directory dialog box, where you may choose where to place the file. Close Output File: Use this command if you wish to close the output file. Save Current Output: Use this command to save the output of the last analysis in a file. The command displays the standard Windows directory dialog box, where you may choose where to place the file. Page Setup: The command displays the standard Windows Page Setup dialog box, where you may change various printer settings, for example the default printer, paper size, orientation, etc. Print Output: Use this command to print the output on the default printer. File 1, 2, 3, 4: Lists the four most recently used Data Files. Exit: This command ends the current DnaSP session. File Menu Shortcut Keys (To choose: Press): Open Data File: CTRL+O; Close File: CTRL+W; Save Output: CTRL+S; Print Output: CTRL+P; Exit: CTRL+X. Display Menu: This menu has four comma
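The three line delimiter types listed above (CR+LF, CR, LF) can be told apart by inspecting the raw bytes of a text file. The helper below is a hypothetical sketch, not part of DnaSP; it reports the first convention that matches, checking CR+LF before the single-character delimiters.

```python
def detect_line_delimiter(data: bytes) -> str:
    """Guess the line delimiter convention used in raw file contents."""
    if b"\r\n" in data:
        return "CR+LF"  # IBM PC or compatible (ASCII 13 & ASCII 10)
    if b"\r" in data:
        return "CR"     # Macintosh (ASCII 13)
    if b"\n" in data:
        return "LF"     # Unix systems (ASCII 10)
    return "none"       # single line, no delimiter present
```

A file can be checked by reading it in binary mode, for example `detect_line_delimiter(open(path, "rb").read())`, where `path` is whatever data file you want to inspect.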
71. hen character to specify an alignment gap; the dot character to specify that the nucleotide in this site is identical to that in the same site of the first sequence (i.e. identical site or matching symbol); the symbols N, n to designate missing data. Sequence name: The sequence name can be up to 10 characters. Blank spaces are allowed. Example of PHYLIP format: 4 55 seq 1 ATATACGGGGTTA TTAGA AAAATGTGTGTGTG TTCA Secuencia2ATATAC GGATA ACA AGAATCTATGTCTGC e TTCA DmelanogasATATACGGGGATA TTATA AGAATGTGTGTGTG TTCA seq_4 ATATACGGGGATA GTAGT AAAATGTGTGTGTG TTCA Open Multiple Data Files: DnaSP can automatically read several data file formats (see Input Data Files). This module also allows you to analyse multiple files at once, sequentially (as a Batch mode). These data files can contain different numbers of sequences, or different genomic regions. Analysis: This command can analyse several data files sequentially. It can compute a number of measures of the extent of DNA polymorphism and can also perform some common neutrality tests. Haplotype/Nucleotide Diversity: The number of Segregating Sites, S. The total number of mutations, Eta. The number of haplotypes, NHap (Nei 1987, p. 259). Haplotype (gene) diversity and its sampling variance (Nei 1987). Nucleotide diversity, Pi (π) (Nei 1987), and its samp
72. his command calculates the statistical tests D and F proposed by Fu and Li (1993) for testing the hypothesis that all mutations are selectively neutral (Kimura 1983). These tests require data on the intraspecific variation (polymorphism) and data from an outgroup (one or more sequences from a related species). Data Files: For the present analysis, at least two sets of sequences (one with the intraspecific data, and the other with the outgroup sequences) must be defined (see the Define Sequence Sets command in the Data menu). Minimum number of sequences in data files: The intraspecific data file must contain at least four sequences. The outgroup can contain more than one sequence, but the analysis will be performed using only the first one. Nevertheless, if there is more than one sequence in the outgroup, sites with alignment gaps or with missing data in any of the outgroup sequences will not be used (see below). Alignment gaps and missing data: Sites containing alignment gaps, or sites with missing data in any data file, are not used; these sites are completely excluded. Ambiguous information: In some cases the polarity of some substitutions cannot be unambiguously determined, for example: Intraspecific Data 10 seq1 CTTAACCTTC seq2 CATTA AC seq3 CTATATTCCC seq4 A AAACCTAC Outgroup Seq5 CT AAGGGAC Seq6 CTA AGCTAC Site 1: The A in seq 4 is an external mutation and derived substitution. Site 2: No
73. i and Miller 1990, equation 25. That is, the correction of Pi and of K is performed directly on the uncorrected value, and not in each pairwise comparison of two sequences. Nevertheless, for low levels of polymorphism and of divergence both methods give similar estimates. For high polymorphism and divergence levels, the use of the DNA Polymorphism and Synonymous and Nonsynonymous Substitutions commands might be desirable. The total number of synonymous and nonsynonymous sites for a set of sequences is estimated as the average of the number of synonymous and nonsynonymous sites of all sequences (these values are used for all sequences). Note that in the Synonymous and Nonsynonymous Substitutions command the total number of synonymous and nonsynonymous sites is estimated for every pairwise comparison. As a result, nucleotide diversity estimates in synonymous, nonsynonymous and silent sites based on the present command and on the Synonymous and Nonsynonymous Substitutions command could be slightly different. Sites Considered: Silent (synonymous sites and noncoding positions): Only silent sites (both synonymous sites and noncoding positions) are used. Noncoding Positions: Only noncoding positions are used. Only synonymous sites: Only synonymous sites are used (substitutions in the coding region that cause no amino acid changes). This option works only if the data file contains sequences with assigned coding regions (more help in Assign Codi
74. ies 1: 2 replacements (Site 22 and Site 23) and 1 synonymous (Site 24). species 2: 1 synonymous (Site 24). within species: 2 replacements (Site 22 and Site 23) and 1 synonymous (Site 24). fixed differences: 0. Output. Codons not analyzed: DnaSP does not estimate synonymous and replacement changes in some complex cases (ambiguous or complex codons, i.e. sites segregating for several codons, as in highly variable regions); the user should analyze these cases manually. DnaSP also does not estimate synonymous and replacement changes in codons with alignment gaps. Neutrality Index: Indicates the extent to which the levels of amino acid polymorphism depart from those expected under the neutral model (Rand and Kann 1996). Alpha value (α): Indicates the proportion of amino acid substitutions driven by positive selection (Fay et al. 2001). Statistical significance: Both the two-tailed Fisher's exact test and the G-test of independence are computed to determine whether the ratio of replacement to synonymous fixed substitutions between species differs significantly from the ratio of replacement to synonymous polymorphisms within species. DnaSP obtains the probability associated with the G value (with 1 degree of freedom) by the trapezoidal method of numerical integration. Tajima's Test. See Also: Coalescent Simulations, Graphs Window, Input Data Files, Output. References: Kimura 1983, Tajima 1983, Tajima 1989. This command calculates the D test statistic proposed by Tajima 1989 equation 3
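The Neutrality Index, the alpha value and the G statistic described in item 74 can be illustrated with a short Python sketch. It is not DnaSP's code: it assumes the commonly used definitions NI = (Pn/Ps)/(Dn/Ds) and α = 1 − NI, and a plain G-test of independence without continuity correction. The function name `mk_summary` and the counts in the test are hypothetical.

```python
import math

def mk_summary(dn, ds, pn, ps):
    """Summarise a 2x2 table of replacement/synonymous changes.

    dn, ds: fixed replacement / synonymous differences between species
    pn, ps: replacement / synonymous polymorphisms within species
    Returns (neutrality_index, alpha, g_statistic).
    """
    if min(dn, ds, pn, ps) <= 0:
        # Assumption of this sketch: zero cells are not handled.
        raise ValueError("all four counts must be positive")
    neutrality_index = (pn / ps) / (dn / ds)
    alpha = 1.0 - neutrality_index
    # G = 2 * sum(observed * ln(observed / expected)) over the four cells
    total = dn + ds + pn + ps
    column_sums = (dn + pn, ds + ps)
    g = 0.0
    for row in ((dn, ds), (pn, ps)):
        row_sum = sum(row)
        for observed, column_sum in zip(row, column_sums):
            expected = row_sum * column_sum / total
            g += 2.0 * observed * math.log(observed / expected)
    return neutrality_index, alpha, g
```

A value of NI below 1 (α above 0) corresponds to an excess of fixed replacement differences relative to the neutral expectation; the G value would then be referred to a chi-square distribution with 1 degree of freedom.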
75. ile DnaSP will not use sites 3 9 10 11 16 21 22 23 24 for further analysis If you exclude 2 sequences for example Seq2 and Seq4 from the previous original data file the active data will be composed of Segl ATCTCTTAGGGTICGATTTGTTG GTATTTAA Seq3 ATCGCTTA TCGATTTGT TGTATTTAA Seq5 ATCTCTTA TCGATTTGTTG GTATITAA With the Sites with alignment gaps are excluded if they are present in the active subset option default option DnaSP will not use information of sites 9 10 11 21 22 23 24 With the option Sites with alignment gaps in the original data file are excluded in all subsets DnaSP will not use information of sites 3 9 10 11 16 21 22 23 24 i e all sites with alignment gaps in the original data file This option is appropriate to analyze exactly the same sites in different subsets of sequences Note Both options generate the same estimates of the nucleotide distance see the Linkage Disequilibrium command Include Exclude Populations Use this command to include or exclude a particular population a group of sequences i e a sequence set from the analysis In any case populations with a single included sequence will not be used a Polymorphic Sites See Also Input Data Files Output This command displays some general information about the polymorphisms on the data file the number of sites with alignment gaps or missing data th
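The two gap-handling options in item 75 differ only in which set of sequences is scanned for gaps. The sketch below shows the idea on a small made-up alignment (the sequences, the `gap_sites` name and the use of `-` as the gap symbol are assumptions for illustration, not taken from DnaSP).

```python
def gap_sites(sequences, gap="-"):
    """Return the 1-based positions at which any sequence has a gap."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i + 1 for i in range(length)
            if any(seq[i] == gap for seq in sequences)]

# Hypothetical original data file and an active subset without Seq2.
original = {"Seq1": "ATC-CTTAGG", "Seq2": "AT--CTTAGG", "Seq3": "ATCGCTT-GG"}
active = [original[name] for name in ("Seq1", "Seq3")]

# Default option: exclude sites with gaps in the active subset only.
excluded_in_subset = gap_sites(active)
# Alternative option: exclude every site with a gap in the original file.
excluded_in_original = gap_sites(list(original.values()))
```

With the second option, every subset is analyzed over exactly the same sites, which is the property the manual highlights.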
76. ility per site of detecting a conversion event between two subpopulations. From this information it is possible to estimate the true number and length of the gene conversion tracts. Sliding window option: This option computes the parameter ψ (psi) by the Sliding Window method. The output of the analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph, the parameter ψ (Y axis) can be plotted against the nucleotide position (X axis). Gene Flow and Genetic Differentiation. See Also: Define Sequence Sets (Define Populations), Input Data Files, Output. References: Hudson et al. 1992a, Hudson et al. 1992b, Hudson 2000, Lynch and Crease 1990, Nei 1973, Nei 1982, Nei 1987, Tajima 1983, Wright 1951. This command computes some measures of the extent of DNA divergence among populations, and from these measures it computes the average level of gene flow. Additionally, DnaSP allows testing for population subdivision. Data Files: For the present analysis, at least two sets of sequences (one set for each population) must be previously defined (see the Define Sequence Sets command in the Data menu). Missing data: Sites containing missing data in any population are not used (these sites are completely excluded). Include / Exclude Populations (sets of sequences): Use this command to include or exclude a particular population from the analysis. In any case, populations with an unique included sequence will no
77. ion allows analyzing the polarity of changes in noncoding positions Statistical significance DnaSP conducts the Mann Whitney test to determine if the frequency distribution of preferred and unpreferred substitutions are significantly different DnaSP can carry out the fdMWU test that uses information of only polymorphism data Akashi 1999 or the fddMWU test that uses information of both polymorphic substitutions and fixed differences Akashi 1999 Ambiguous information In some cases the polarity of some substitutions could not be unambiguously determined see below There are several sources of ambiguity ancestral polymorphism multiple substitutions alignment gaps missing data etc In that cases DnaSP will list the ambiguous sites or codons How DnaSP polarizes the nucleotide changes and assigns the preferred and unpreferred status coding region DnaSP uses a conservative parsimony criterion to infer the ancestral nucleotide state only unambiguous cases are used for the analysis see the following examples Once the polarity has been established DnaSP will use the codon preference table to assign codons or changes as preferred or unpreferred Some examples using the Nuclear Universal Genetic Code with the D melanogaster Akashi 1995 codon preference table Intraspecific Data 3 6 g 13 15 18 21 22 29 30 33 CTT AAC CTT CTA AAT TTA
78. ite 30 noncoding the rest of the sites Assuming a Universal Nuclear Genetic Code 19 20 30 ATCTCTTATCGTCGATTITGTIGITIGIATITAAT LeuSerSerIl eCysIle You have to do two codon assignments i In the dialog box indicate as selected region 6 16 Set the codon position of the first site as 1 First position ii In the dialog box indicate as selected region 24 30 Set the codon position of the first site as 3 Third position You can see the current assignation using the View Data command You will see the following NNNNNLeuSerSerIlNNNNNNNeCysIleNNNN ATCICITADOGICGAILIGITGITLIGIAITIAAI N noncoding Examples DnaSP assigns codons in the following way assuming the Universal Nuclear Genetic Code Examples 3 and 4 show how DnaSP assigns codons in case of misassignations or alignment gaps Example 1 NNNNNLeuSerSerIleNNNNNNCysI1leNNNN ATCTCTTAICGTCGATITGTTGTTGTATTTAAT Example 2 NNNNNLeuSerSerIlNNNNNNNeCysIleNNNN BICTCTTATICGTCGATTIGTIGTITTGTATTITAAT Example 3 123123112312 3123123 NNNNNLeuSerSfterIlNNNNNNNeCysIleNNNN AICICITAICGUIGUGAIDIITGITGLIITGIAIITAAI 4 wrong assignation Example 4 12312312312 3123123 NNNNNLeu SerIlNNNNNNNeCysIleNNNN ATCTCTTA TCGATTIGTTGTTITGTATTITAAT Leu SerIl eCysIle alignment gaps in nucleotide sequence Note The amino acid assignation corresponds to the first nucleotide sequence This information will be stored i
79. ive Population Sizes. The mutation parameter θ (theta) is defined as 4Nu for autosomal loci of diploid organisms, where N is the effective population size (diploid individuals) and u is the neutral mutation rate per gene (or per base pair) per generation. Assuming equal population sizes of males and females, the parameter θ is 3Nu for X-linked (or Z-linked) loci of diploid organisms. In the same way, the parameter θ is Nu for Y-linked (or W-linked) loci of diploid organisms. In both cases N is the effective population size considering both males and females (diploid individuals). For Y-linked loci the parameter θ would be 2Nmu, where Nm is the male effective population size. For mitochondrial DNA (or haploid individuals) θ is 2Nu, where N is the effective population size of females. Likewise, the recombination parameter C (or R) is 4Nc for autosomal loci of diploid organisms, where N is the effective population size and c is the recombination rate per generation; C = 3Nc and C = Nc for X-linked and Y-linked loci, respectively. Graphs Window. This window displays graphs from results given in a grid (table). It has the following commands. Select Graph: Use this command to select the kind of graph. The following graphs are available. DNA Polymorphism command. Graph: Line chart. X axis: Nucleotide position; Y axis: Pi (π). X axis: Nucleotide position; Y axis: Theta (per site). X axis: Nucleotide position; Y axis: S. DNA Divergence between Popu
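The scaling of θ with chromosome type in item 79 amounts to a lookup of the multiplier applied to N·u. A minimal sketch follows; the dictionary keys and the function name `theta` are hypothetical labels, and the caller is assumed to pass the N that the manual specifies for each case (for example, the female effective size for mitochondrial DNA).

```python
# Multipliers of N*u stated in the text for each kind of locus.
THETA_FACTOR = {
    "autosomal": 4,      # diploid organisms
    "x_linked": 3,       # also Z-linked, equal numbers of males and females
    "y_linked": 1,       # also W-linked, N counts both sexes
    "mitochondrial": 2,  # also haploid individuals, N = female effective size
}

def theta(locus_type, effective_size, mutation_rate):
    """Return theta = factor * N * u for the given kind of locus."""
    return THETA_FACTOR[locus_type] * effective_size * mutation_rate
```

For the same N and u, an X-linked locus is therefore expected to show three quarters of the autosomal θ.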
80. ize and the mutation rate per nucleotide site per generation, respectively. Theta (θ) per site from Pi (π): Tajima 1996, equation 9. Theta (θ) per site from S: Tajima 1996, equation 10. Theta (θ) per site from Eta (η): Tajima 1996, equation 16. The average number of nucleotide differences, k: Tajima 1983, equation A3. Stochastic variance of k (no recombination), Vst(k): Tajima 1993, equation 14. Sampling variance of k (no recombination), Vs(k): Tajima 1993, equation 15. Total variance of k (no recombination), V(k): Tajima 1993, equation 13. Stochastic variance of k (free recombination), Vst(k): Tajima 1993, equation 17. Sampling variance of k (free recombination), Vs(k): Tajima 1993, equation 18. Total variance of k (free recombination), V(k): Tajima 1993, equation 16. Theta (per DNA sequence) from S (Watterson estimator): θ = 4Nu for an autosomal gene of a diploid organism, where N and u are the effective population size and the mutation rate per DNA sequence per generation, respectively (Tajima 1993, equation 3). Variance of θ (no recombination): Tajima 1993, equation 4. Variance of θ (free recombination): Tajima 1993, equation 8. Note: Tajima (1993) uses M to indicate θ per DNA sequence and v to indicate the mutation rate per DNA sequence per generation. Effective Population size. Sliding window option: This option allows you to calculate the nucleotide diversity, theta (per site) and S (the number of segreg
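As an illustration of the "theta from S" entries in item 80, the standard Watterson estimator divides the number of segregating sites S by the harmonic sum a1 = 1 + 1/2 + ... + 1/(n − 1), where n is the number of sequences. The sketch below uses that textbook formula; the function name and the optional `n_sites` argument (used to obtain the per-site value) are assumptions of this example, not DnaSP's interface.

```python
def watterson_theta(segregating_sites, n_sequences, n_sites=None):
    """Watterson's estimate of theta from the number of segregating sites.

    Returns theta per DNA sequence, or per site when n_sites is given.
    """
    if n_sequences < 2:
        raise ValueError("at least two sequences are required")
    a1 = sum(1.0 / i for i in range(1, n_sequences))
    theta_per_sequence = segregating_sites / a1
    if n_sites is None:
        return theta_per_sequence
    return theta_per_sequence / n_sites
```

For example, 11 segregating sites in a sample of 4 sequences give a1 = 11/6 and θ = 6 per sequence.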
81. lations command Graph Line chart X axis Nucleotide position Y axis Pi 1 pop 1 X axis Nucleotide position Y axis Pi 2 pop 2 X axis Nucleotide position Y axis Dxy X axis Nucleotide position Y axis Da X axis Nucleotide position Y axis Pi 1 and Pi 2 X axis Nucleotide position Y axis Dxy and Da X axis Nucleotide position Y axis Pi 1 Pi 2 and Dxy X axis Nucleotide position Y axis Pi 1 Pi 2 and Da Polymorphism and Divergence command Graph Line chart X axis Nucleotide position Y axis Pi z and K Gene Conversion command Graph Line chart X axis Nucleotide distance Y axis Psi y Linkage Disequilibrium command Graph Scatter graph X axis Nucleotide distance Y axis D X axis Nucleotide distance Y axis D X axis Nucleotide distance Y axis D X axis Nucleotide distance Y axis D X axis Nucleotide distance Y axis R X axis Nucleotide distance Y axis R 2 Population Size Change command Pairwise Number of Differences Graph Line chart X axis Pairwise differences Y axis Frequency Segregating sites Graph Line and Bar chart X axis Number of nucleotide variants in a site Y axis Frequency X axis Sample size Y axis Segregating sites Fu and Li s tests command Graph Line chart X axis Nucleotide distance Y axis D X axis Nucleotide distance Y axis F X axis Nucleotide distance Y axis D X axis Nucleotide distance Y axis F Tajima s test command Graph Line ch
82. lations 1 and 2 Dxy and Da by the Sliding Window method The output of the analysis is given in a grid table The results can also be presented graphically by a line chart In the graph the nucleotide diversity Dxy or Da Y axis can be plotted against the nucleotide position X axis DNA Divergence among Populations You can perform some analyses of the DNA divergence among populations by using the Gene Flow and Genetic Differentiation command zx Polymorphism and Divergence See Also Graphs Window Input Data Files Output References Jukes and Cantor 1969 Lynch and Crease 1990 Nei 1987 Nei and Gojobori 1986 Nei and Miller 1990 Watterson 1975 This command computes some measures of the extent of DNA polymorphism and divergence in synonymous nonsynonymous silent and in all sites Data Files For the present analysis at least one set of sequences must be defined see Data Define Sequence Sets command Analysis using one sequence set The sequence set must include intraspecific data information DnaSP will estimate some measures of the extent of DNA polymorphism Analysis using two sequence sets One Sequence Set must contain the intraspecific data while the other must contain sequences one or more from a different species or from a different population DnaSP will estimate some measures of the extent of DNA polymorphism and of divergence Alignment gaps and missing data Sites or codons with
83. ling variance Nei 1987 equation 10 7 The average number of nucleotide differences k Tajima 1983 Theta per gene or per site from Eta h or from S Watterson 1975 Nei 1987 Neutrality tests Tajima s D Tajima 1989 and its statistical significance Fu and Li s D Fu and Li 1993 and its statistical significance Fu and Li s F Fu and Li 1993 and its statistical significance Fu s Fs Fu 1997 G Cn G C content at noncoding positions G Cc G C content at coding positions More Information in the specific modules Codon Usage Bias DNA Polymorphism Fu and Li s and other Tests Tajima s Test a Open Unphase Genotype Data Files References Stephens 2001 Stephens 2005 Wang and Xu 2005 DnaSP can automatically read Unphase or Genotype data files diploid individuals in FASTA format see FASTA This format is the standard FASTA format but including the IUPAC nucleotide ambiguity codes to represent heterozyous sites Suppose a data set containing 5 diploid individuals therefore a total of 10 sequences with 16 positions each Indl TRCAAGACCGGAGGCG IS aA hy cs or damm THOS AssMsscgaes es DAS ghee os ab os C IDIOMA Dem NL E For instance as the second site of Ind1 is heterozyous R Purine A and G Ind1 includes the following two sequences Indle TACAAGACCGGAGGCG Indi 2 m As there are not heterozyous site in Ind2 then the two composing sequences are Ind2 1 TACCAG CGGA
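The way a diploid genotype line with IUPAC codes expands into two sequences, as in the Ind1 example of item 83, can be sketched as follows. This is only an illustration: the function name is hypothetical, only the six two-base codes are treated as heterozygous, and the alleles are assigned to the two output sequences in a fixed alphabetical order, so no phase reconstruction is attempted when an individual has several heterozygous sites.

```python
# Two-base IUPAC ambiguity codes and the pair of nucleotides they stand for.
HETEROZYGOUS = {"M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT"}

def split_genotype(genotype):
    """Split one unphased diploid genotype into two sequences.

    Homozygous symbols (and any other character, such as a gap)
    are copied unchanged to both sequences.
    """
    first, second = [], []
    for symbol in genotype.upper():
        allele_1, allele_2 = HETEROZYGOUS.get(symbol, symbol * 2)
        first.append(allele_1)
        second.append(allele_2)
    return "".join(first), "".join(second)
```

Applied to `TRCAAGACCGGAGGCG`, the R at the second site yields one sequence with A and one with G at that position.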
84. luded in the data file Data from interspecific divergence and data on levels of intraspecific polymorphism must be entered in the dialog box If you want that DnaSP obtains the information necessary to perform the HKA test directly from your sequences you must use the HKA test module Output Estimates of the Time of divergence measured in 2N generations where N is the effective population size Estimates of theta 0 per nucleotide in region locus 1 Estimates of theta 0 per nucleotide in region locus 2 The X square value and the statistical significance Statistical significance The statistical significance is obtained assuming a x square distribution with one degree of freedom DnaSP obtains the probability associated with a particular chi square value with 1 degree of freedom by the trapezoidal method of numerical integration P lt 0 10 P lt 0 05 P lt 0 01 P lt 0 001 Tools Menu References Ewens 1972 Fu 1997 Jukes and Cantor 1969 Sokal and Rohlf 1981 Strobeck 1987 Tajima 1989 Watterson 1975 This menu has the following commands Coalescent Simulations HKA Test Direct Mode Discrete Distributions Use this command to calculate probabilities the expected value and the variance of some distributions Binomial Hypergeometric and Poisson The Ewens option allows computing the Strobeck s S statistic Strobeck 1987 see also Fu 1997 the Fu s Fs statistic Fu 1997 and the
85. ma 1996, Watterson 1975. This command computes several measures of the extent of DNA polymorphism and their variances. Alignment gaps and missing data: Sites with alignment gaps or missing data are not used (these sites are completely excluded). Analysis: DnaSP computes the following measures. Haplotype (gene) diversity and its sampling variance (Nei 1987, equations 8.4 and 8.12, but replacing 2n by n). The standard deviation (or standard error) is the square root of the variance. Nucleotide diversity, Pi (π), the average number of nucleotide differences per site between two sequences (Nei 1987, equations 10.5 or 10.6; see also Nei and Miller 1990), and its sampling variance (Nei 1987, equation 10.7). The standard deviation (or standard error) is the square root of the variance. Nucleotide diversity (Jukes and Cantor), Pi (JC), the average number of nucleotide substitutions per site between two sequences (Lynch and Crease 1990, equations 1-2). Unlike the previous estimates (Nei 1987, equations 10.5 or 10.6), this one is obtained using the Jukes and Cantor (1969) correction. The correction is performed in each pairwise comparison, and the Pi (π) estimate is obtained as the average of the values for all comparisons. Note that DnaSP does not use the simplification indicated in Nei and Miller 1990, equation 25 (i.e. performing the Jukes and Cantor 1969 correction directly on Pi (π), Nei 1987 equations 10.5). Neve
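The difference between uncorrected nucleotide diversity and the Jukes and Cantor version described in item 85 (correction applied to each pairwise comparison, then averaged) can be sketched in a few lines. The sketch assumes gap-free aligned sequences of equal length and uses the standard Jukes and Cantor formula d = −(3/4) ln(1 − 4p/3); the function names are hypothetical and this is not DnaSP's implementation.

```python
import math
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def jukes_cantor(p):
    """Jukes and Cantor corrected distance (requires p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def nucleotide_diversity(sequences):
    """Return (pi, pi_jc): means over all pairwise comparisons.

    pi_jc applies the correction to each pair before averaging.
    """
    distances = [p_distance(a, b) for a, b in combinations(sequences, 2)]
    pi = sum(distances) / len(distances)
    pi_jc = sum(jukes_cantor(p) for p in distances) / len(distances)
    return pi, pi_jc
```

Because the correction is convex in p, the per-pair corrected mean is never smaller than the uncorrected π, and the two are close when polymorphism is low.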
86. n the coding region that not result in amino acid changes This option works only if the data file contains sequences with assigned coding regions more help in Assign Coding Regions and Assign Genetic Code Note See how DnaSP estimates the number of Synonymous and Nonsynonymous changes in a codon Note DnaSP does not perform the tests described in McDonald 1996 1998 but it can create the data file with the relevant information for the test This data file can be read by the DNA Slider program DNA Slider program It is a Macintosh program that performs the heterogeneity tests described in McDonald 1996 1998 You can download the program from the John McDonald Web Page http udel edu mcdonald http udel edu mcdonald aboutdnaslider html a ms Dick Hudson Data File Format See Also Input Data Files Output References Hudson 2002 This command allows exporting the data intraspecific data on the ms file format Hudson 2002 This format can be read among others by some of Hudson see Hudson 2002 and Kim programs Kim Y amp Stephan W 2003 Selective sweeps in the presence of interference among partially linked loci Genetics 164 389 398 Kim Y amp Stephan W 2002 Detecting a local signature of genetic hitchhiking along a recombining chromosome Genetics 160 765 777 Data Files For the present analysis at least one set of sequences must be defined see Data Define Sequence Set
87. ndividual test is obtained from $\alpha' = 1 - (1 - \alpha)^{1/L}$, where L is the number of tests performed. DnaSP obtains the probability associated with a particular chi-square value (with 1 degree of freedom) by the trapezoidal method of numerical integration. Significant disequilibrium by the Bonferroni procedure (for α' = 0.05) is indicated by the letter B. NOTE: The Bonferroni correction applied to non-independent tests (as in the LD tests) would be highly conservative. Statistical significance by the coalescent: DnaSP can also provide the confidence intervals of the B, Q, ZnS, Za and ZZ statistics by computer simulations using the coalescent algorithm (see Coalescent Simulations). LINKAGE DISEQUILIBRIUM AND PHYSICAL DISTANCE. DnaSP estimates the relationship of linkage disequilibrium with physical distance by regression analysis (Sokal and Rohlf 1981). DnaSP estimates the linear regression equation Y = a + bX, where Y is the LD value and X is the nucleotide distance measured in kilobases (kb). The regression is computed for D, |D| (absolute value of D), D', |D'| (absolute value of D') and R^2 values. For |D'| values DnaSP gives two regression equations: i) for all |D'| values (blue line in the graph; default colour); ii) for all |D'| values excluding values of |D'| = 1 (-1 and 1) (black line in the graph; default colour). Statistical significance: The statistical significance of the regression coefficient could be conducted
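Two calculations from item 87 lend themselves to a short sketch: the per-test significance level, read here as α' = 1 − (1 − α)^(1/L), and the least-squares fit of Y = a + bX. Both functions are generic textbook versions written for illustration; their names are hypothetical and nothing here reproduces DnaSP's own code.

```python
def per_test_alpha(alpha, n_tests):
    """Significance level for each of n_tests so the overall level is alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

def linear_regression(x_values, y_values):
    """Ordinary least-squares estimates (a, b) for Y = a + b*X."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

In the LD setting, X would be the distance between a pair of sites in kilobases and Y the LD value for that pair; a negative slope indicates that disequilibrium decays with distance.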
88. nds Graphs This command opens the Graphs Window where graphs from results given in the grid table can be displayed Data I nfo This command displays a summary of the data file The number of sequences the number of sites the data file format the Genetic Code assigned the organism s genomic type diploid haploid the chromosome type where the nucleotide region is located autosomal X chromosome etc View Data This command displays a window with the sequence data of the active data file In this window you can get information about Coding and noncoding regions The status of a selected site monomorphic polymorphic informative synonymous nonsynonymous etc View Data Options You can use this command to specify some options about displaying the nucleotide sequences To indicate by the dot symbol a nucleotide with identical nucleotide variant to the one in the first sequence To show polymorphic sites in Lower Case E Analysis Menu This menu is provided with the following commands Each command starts with a dialog box that allows you to choose different options for the analysis Polymorphic Sites DNA Polymorphism DNA Divergence Between Populations Polymorphism and Divergence Polymorphism and Divergence in Functional Regions Synonymous and Nonsynonymous Substitutions Codon Usage Bias Preferred and Unpreferred Synonymous Substitutions Gene Conversion G
89. ng Regions and Assign Genetic Code). Only Nonsynonymous sites: Only nonsynonymous sites are used (substitutions in the coding region that cause amino acid changes). This option works only if the data file contains sequences with assigned coding regions (more help in Assign Coding Regions and Assign Genetic Code). Pi(a)/Pi(s) and Ka/Ks ratios: DnaSP will compute ω ratios (ω = Ka/Ks, also known as ω = dN/dS) for the intraspecific and interspecific (if available) data sets. This option works only if the data file contains sequences with assigned coding regions (more help in Assign Coding Regions and Assign Genetic Code). All sites: All sites are used (excluding substitutions in sites with gaps or missing data). Note: See how DnaSP estimates the number of Synonymous and Nonsynonymous changes in a codon. Sliding window option: This option computes the nucleotide diversity (intraspecific data file) and the divergence (between both data files) by the Sliding Window method. The output of the analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph, Pi (the nucleotide diversity) and K (divergence) (Y axis) can be plotted against the nucleotide position (X axis). Abbreviations: n.a., not available. Polymorphism and Divergence in Functional Regions. See Also: Input Data Files, Output. References: Jukes and Cantor 1969, Lynch and Crease 1990, Nei 1987, Nei and Gojobori 1986, Nei
90. ns. If the data file does not contain assigned coding regions, all sites will be considered as noncoding positions (i.e. all substitutions will be considered as silent). Synonymous substitutions: Only synonymous substitutions are used (substitutions in the coding region that do not result in amino acid changes). This option works only if the data file contains sequences with assigned coding regions (more help in Assign Coding Regions and Assign Genetic Code). Note: See how DnaSP estimates Synonymous and Nonsynonymous changes in a codon. Effective Population size. Output: The present module displays the following output. Estimates of the time of divergence (measured in 2N generations, where N is the effective population size). Estimates of theta (θ) per nucleotide in region (locus) 1. Estimates of theta (θ) per nucleotide in region (locus) 2. The chi-square (χ²) value and the statistical significance. Statistical significance: The statistical significance is obtained assuming a chi-square distribution with one degree of freedom. DnaSP obtains the probability associated with a particular chi-square value (with 1 degree of freedom) by the trapezoidal method of numerical integration. P < 0.10; P < 0.05; P < 0.01; P < 0.001. See also: To compare autosomal and sex linked regions or to perform the HKA test with polymorphism data with different number of sequences in the two regions or with different number of sites for
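Item 90 states that the probability for a chi-square value with 1 degree of freedom is obtained by trapezoidal numerical integration. How DnaSP sets up that integral is not described, so the sketch below is only one possible way to do it: it uses the identity P(χ² > x) = 2·P(Z > √x) for a standard normal Z and integrates the normal density with the trapezoidal rule. The function name, step count and upper integration limit are assumptions of this example.

```python
import math

def chi2_1df_pvalue(x, steps=20000, upper_z=12.0):
    """Upper-tail probability of a chi-square value with 1 degree of freedom.

    Integrates the standard normal density from sqrt(x) to upper_z
    with the trapezoidal rule and doubles the result.
    """
    if x < 0:
        raise ValueError("chi-square values cannot be negative")
    lower_z = math.sqrt(x)
    if lower_z >= upper_z:
        return 0.0
    step = (upper_z - lower_z) / steps

    def density(z):
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    area = 0.5 * (density(lower_z) + density(upper_z))
    area += sum(density(lower_z + i * step) for i in range(1, steps))
    return 2.0 * area * step
```

The result can be checked against the closed form `math.erfc(math.sqrt(x / 2))`, which gives the same tail probability.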
91. ntation for more information and details.

IUPAC nucleotide ambiguity codes

Symbol  Meaning           Nucleic Acid
A       A                 Adenine
C       C                 Cytosine
G       G                 Guanine
T       T                 Thymine
U       U                 Uracil
M       A or C
R       A or G
W       A or T
S       C or G
Y       C or T
K       G or T
V       A or C or G
H       A or C or T
D       A or G or T
B       C or G or T
X       G or A or T or C
N       G or A or T or C

Output. See Also: Graph Window. The output is displayed in three kinds of windows: text, table (or grid; the output data are laid out in rows and columns, as in a spreadsheet) and graphic (scatter graph and line chart). All commands produce an output text window; moreover, some of them also produce a grid (table) window. Data in the grid can be used to create a graph (Graphs command in the Display menu). The data generated by DnaSP can be saved as an ASCII text file. The grid output data file can be easily used by other applications, such as spreadsheets or statistical or graphics applications (by simply removing the header). UCSC Browser. References: Kent 2002. DnaSP lets you visualize DNA sequence data and sliding window results integrated with available genome annotations using the UCSC browser. To display the genome annotations, DnaSP requires that the genomic position of the data (chromosome and physical position) be defined. DnaSP allows searching available genomes in UCSC. To define the genomic position of your data, choose the appropriate genome and specify
92. oalescent process Oxf Surv Evol Biol 7 1 44 HUDSON R R 2000 A new statistic for detecting genetic differentiation Genetics 155 2011 2014 HUDSON R R 2002 Generating samples under a Wright Fisher neutral model of genetic variation Bioinformatics 18 337 338 HUDSON R R and N L KAPLAN 1985 Statistical properties of the numberof recombination events in the history of a sample of DNA sequences Genetics 111 147 164 HUDSON R R M KREITMAN and M AGUADE 1987 A test of neutral molecularevolution based on nucleotide data Genetics 116 153 159 HUDSON R R BOOS D D and N L KAPLAN 1992 A statistical test for detecting population subdivision Mol Biol Evol 9 138 151 HUDSON R R M SLATKIN and W P MADDISON 1992 Estimation of levels of gene flow from DNA sequence data Genetics 132 583 589 JUKES T H and C R CANTOR 1969 Evolution of protein molecules pp 21 132 In H N Munro ed Mammalian Protein Metabolism Academic Press New York KANAYA S Y YAMADA Y KUDO and T IKEMURA 1999 Studies of codon usageand tRNA genes of 18 unicellular organisms and quantification of Bacillussubtilis tRNAs gene expression level and species specific diversity of codon usagebased on multivariate analysis Gene 238 143 155 KELLY J K 1997 A test of neutrality based on interlocus associations Genetics 146 1197 1206 KENT W J 2002 BLAT The Blast like alignment tool G
93. odon 1-2-3. species 1: 3 mutations in site 3 (1 replacement, 2 synonymous). species 2: Monomorphic. within species: 1 replacement and 2 synonymous (site 3). fixed differences: 0. Codon 4-5-6. species 1: Monomorphic. species 2: Monomorphic. within species: 0. fixed differences: 1 replacement (site 4). Codon 7-8-9. species 1: Site 7 is replacement; Site 9 is synonymous. There are two possible paths. Path 1: ATT (Ile) -> CTT (Leu) -> CTG (Leu); Site 7: Replacement; Site 9: Synonymous. Path 2: ATT (Ile) -> ATG (Met) -> CTG (Leu); Site 7: Replacement; Site 9: Replacement. DnaSP will choose path 1, the path that requires the smallest number of replacements (however, see the next codon). species 2: Monomorphic. within species: 1 replacement, 1 synonymous. fixed differences: 0. Codon 10-11-12. species 1: Site 11 is replacement; Site 12 is replacement. Here there are also two possible paths. Path 1: CCC (Pro) -> CCG (Pro) -> CAG (Gln); Site 11: Replacement; Site 12: Synonymous. Path 2: CCC (Pro) -> CAC (His) -> CAG (Gln); Site 11: Replacement; Site 12: Replacement. However, DnaSP will choose path 2. If there are two possible paths and one of the non-extant codons (e.g. CAC in this case) is found in the other species, DnaSP assumes that the true evolutionary path is the path with that codon (i.e. path 2 in the present example). species 2: Site 11 is replacement. within species: 2 replacements (site 11 and site 12). fixed differ
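The path comparison in item 93 can be reproduced with a small enumeration: list every order in which the differing sites of two codons can mutate, classify each step as synonymous or replacement, and keep the path with the fewest replacements. The sketch below assumes the standard (universal nuclear) genetic code, drops paths that pass through a stop codon, and implements only the minimum-replacement rule; the exception described above (preferring a path whose intermediate codon is present in the other species) is not implemented. All names are hypothetical.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids.
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codon_paths(start, end):
    """List mutational paths from start to end, one site changed per step.

    Each path is a list of (codon_position, kind) tuples, where kind is
    'synonymous' or 'replacement'. Paths through stop codons are dropped.
    """
    differing = [i for i in range(3) if start[i] != end[i]]
    paths = []
    for order in permutations(differing):
        codon, steps = start, []
        for position in order:
            mutated = codon[:position] + end[position] + codon[position + 1:]
            if GENETIC_CODE[mutated] == "*":
                steps = None
                break
            same = GENETIC_CODE[mutated] == GENETIC_CODE[codon]
            steps.append((position + 1, "synonymous" if same else "replacement"))
            codon = mutated
        if steps is not None:
            paths.append(steps)
    return paths

def fewest_replacements(start, end):
    """Return the path with the smallest number of replacement steps."""
    return min(codon_paths(start, end),
               key=lambda path: sum(kind == "replacement" for _, kind in path))
```

For ATT and CTG this selects the path through CTT (one replacement, one synonymous change), matching path 1 of codon 7-8-9; positions in the output are positions within the codon, not site numbers in the alignment.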
94. ols Window Help Each menu contains a set of related commands which are displayed when the menu is pulled down i File Menu See Also Shortcut Keys Input Data Files Output This menu has the following commands Open Data File This command allows you to open the data file The command displays the standard Windows directory dialog box in which you may locate files Close Data File Use this command if you wish to close the active data file Save Export Data As Use this command to save the changes made in the active data file or to export translate the active data file from one file format to another note the data file exported will not contain the excluded sequences see the Include Exclude Sequences command The command displays the standard Windows directory dialog box where you may choose where to place the file This command also allows you to generate an Arlequin project file or a Roehl Data File see the Haplotype Data File command Update NEXUS Data File Use this command to update the information of the opened NEXUS Data File The command is enabled for non NEXUS Data Files or if there are some excluded sequences Options for Saving NEXUS format You can use this command to specify some options about saving or exporting NEXUS files Saving in an interleaved format The number of nucleotides of each interleaved block To indicate the type of nucleotide sequences DNA or RNA To indicate t
95. on event that requires the lower number of replacement substitutions: TTG (Leu) TTA (Leu) -> recomb ATG (Met) ATA (Ile). species 2: Monomorphic. within species: Site 16, 1 replacement; Site 18, 1 synonymous. fixed differences: 1 (Site 18 is replacement). Note: Codons of this kind are analyzed only for Nuclear Genetic Codes. Codon 19-20-21. species 1: 1 replacement (site 21). species 2: 1 synonymous (site 21). within species: 1 replacement (site 21). If there is discordance between replacement and synonymous changes within species for the same nucleotide variants, DnaSP will choose the case with more replacement substitutions. fixed differences: 1 replacement (site 19). Codon 22-23-24. species 1: There are 3 changes between codons, so there are 6 putative evolutionary paths (in this particular example there are only 4, because paths that go through stop codons are excluded). DnaSP will choose one of the following paths: 2 replacements (Site 22 and Site 23) and 1 synonymous (Site 24); or 2 replacements (Site 23 and Site 24) and 1 synonymous (Site 22) (however, see also the next codon). species 2: Monomorphic. within species: the same as for species 1. fixed differences: 0. Codon 25-26-27. The present example is similar to the previous (codon 22-23-24) example. Here, however, there is variation in species 2. In this case DnaSP will check the codons in species 2 to decide the assignation of species 1. spec
96. on where all mutations result in synonymous substitutions no amino acid changes Nonsynonymous Sites Indicates sites in the coding region where all mutations cause amino acid changes The analysis is restricted to nonsynonymous sites Abbreviations Tot Total analysis in total all sites Sil analysis in silent Synonymous and noncoding sites Syn analysis in synonymous coding region only sites NoSyn analysis in nonsynonymous sites SilSites the total number of silent sites NSynSites the total number of nonsynonymous sites SilMut the total number of silent mutations intraspecific data file NSynMut the total number of nonsynonymous mutations intraspecific data file n a not available zx Synonymous and Nonsynonymous Substitutions See Also Assign Coding Regions Input Data Files Output References Jukes and Cantor 1969 Lynch and Crease 1990 Nei 1987 Nei and Gojobori 1986 Nei and Miller 1990 Osawa et al 1992 Watterson 1975 This command estimates Ka the number of nonsynonymous substitutions per nonsynonymous site and Ks the number of synonymous substitutions per synonymous site for any pair of sequences Nei and Gojobori 1986 equations 1 3 it also computes several measures of the extent of DNA polymorphism in protein coding regions noncoding regions or in regions with both protein coding and noncoding regions i e regions with both exons and introns or exons and flanking regions One inte
97. ons in the gene, excluding codons for amino acids encoded by a single codon (i.e. all codons excluding the Trp and Met codons; nuclear universal genetic code). DnaSP can compute the scaled chi-square with Yates' correction, and also assuming a given G+C content (by default the G+C content is 50%). G+C content. G+Cn: G+C content at noncoding positions. G+C2: G+C content at second coding positions. G+C3s: G+C content at synonymous third coding positions (i.e. the G+C content in the third codon positions, excluding the Trp and Met codons; nuclear universal genetic code) (Wright 1990). G+Cc: G+C content at coding positions. G+C: G+C content in the whole genomic region. Assign Coding Regions. This command allows you to assign noncoding and protein-coding regions to the data file. This information might be needed for several analyses. The meaning of a specific codon will depend on the Genetic Code assigned. There is no maximum number of protein-coding regions (exons). You can assign a specific region as a noncoding region or as a coding region. In the latter case you have to indicate the codon position of the first site selected (first, second or third); from this site, DnaSP will assign codons to the remaining sites following the reading frame. Example: Assume that you have a data file including DNA sequences 34 nucleotides long and you would like to assign exon 1 from site 6 to site 16, exon 2 from site 24 to s
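The reading-frame rule in item 97 (give the codon position of the first site of each coding region, then continue in frame) can be sketched as a small function. The example regions below follow the manual's 34-nucleotide case, with exon 1 at sites 6 to 16 starting at first position and exon 2 at sites 24 to 30 starting at third position; the function name and the use of 0 for noncoding sites are conventions of this sketch only.

```python
def assign_codon_positions(n_sites, regions):
    """Return the codon position (1, 2 or 3) of every site, 0 if noncoding.

    regions: iterable of (first_site, last_site, position_of_first_site),
    with 1-based, inclusive site numbers.
    """
    positions = [0] * n_sites
    for first_site, last_site, start_position in regions:
        for site in range(first_site, last_site + 1):
            offset = site - first_site
            positions[site - 1] = (start_position - 1 + offset) % 3 + 1
    return positions

codon_positions = assign_codon_positions(34, [(6, 16, 1), (24, 30, 3)])
```

Site 16 ends up at second codon position and site 24 at third position, so the codon interrupted by the noncoding stretch is completed by the first site of exon 2.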
98. orphism and divergence data: Statistical power to detect directional selection under stationarity and free recombination. Genetics 151: 221-238. AKASHI, H., and W. SCHAEFFER, 1997. Natural selection and the frequency distributions of silent DNA polymorphisms in Drosophila. Genetics 146: 295-307. BANDELT, H.-J., P. FORSTER, and A. RÖHL, 1999. Median-joining networks for inferring intraspecific phylogenies. Mol. Biol. Evol. 16: 37-48. BEGUN, D. J., and C. F. AQUADRO, 1991. Molecular population genetics of the distal portion of the X chromosome in Drosophila: Evidence for genetic hitchhiking of the yellow-achaete region. Genetics 129: 1147-1158. BETRÁN, E., J. ROZAS, A. NAVARRO, and A. BARBADILLA, 1997. The estimation of the number and the length distribution of gene conversion tracts from population DNA sequence data. Genetics 146: 89-99. DEPAULIS, F., and M. VEUILLE, 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15: 1788-1790. DURET, L., and D. MOUCHIROUD, 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96: 4482-4487. EWENS, W. J., 1972. The sampling theory of selectively neutral alleles. Theor. Pop. Biol. 3: 87-112. FAY, J. C., and C.-I. WU, 2000. Hitchhiking under positive Darwinian selection. Genetics 155: 1405-1413. FAY, J., G. J. WYCKOFF, and C.-I. WU, 2001. Positive
99. ose of η. Effective Population size. Statistical significance: The confidence limits of D (two-tailed test) are obtained assuming that D follows the beta distribution (Tajima 1989, equation 47), i.e., the confidence limits given in Table 2 of Tajima (1989). Note that the critical values will not be determined for sample sizes larger than 1000. n.d., not determined; P < 0.10; P < 0.05; P < 0.01; P < 0.001. Statistical significance by the coalescent: DnaSP can also provide the confidence intervals of Tajima's D by computer simulations using the coalescent algorithm (see Coalescent Simulations). Sliding window option: This option computes the D test statistic and the confidence limits of D by the Sliding Window method. The output of the analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph the D value can be plotted against the nucleotide position (X axis). Intraspecific Data. See Also: Input Data Files, Output, Window Menu. References: Ewens 1972; Fu and Li 1993; Fu 1997; Nei 1987; Strobeck 1987; Tajima 1993; Tajima 1989; Watterson 1975. This command computes a number of measures of the extent of DNA polymorphism and also performs some common neutrality tests. Use this command to obtain a summary of the data analysis. Data Files: The present analysis requires only one data file. Analysis: DnaSP computes the following measures
100. ost powerful tests for detecting population growth are Fu's FS test and the newly developed R2 test. The behavior of the R2 test is superior for small sample sizes, whereas FS is better for large sample sizes. We also show that some popular statistics based on the mismatch distribution are very conservative. Evidence for Gene Conversion. Proc. Natl. Acad. Sci. USA 91: 11517-11521 (1994). Julio Rozas and Montserrat Aguadé. Gene conversion is involved in the transfer of genetic information between naturally occurring inversions of Drosophila. Abstract: The DNA sequences of the ribosomal protein 49 (rp49) region were determined for 34 isochromosomal strains of Drosophila subobscura representing two chromosomal arrangements, the OST and the O3+4 gene arrangements, which differ by two overlapping inversions. The data reveal that gene conversion is a mechanism responsible for the transfer of genetic information between naturally occurring inversions of Drosophila. The estimated rate of gene transfer by gene conversion at this region, which is close to an inversion breakpoint, is lower than previous estimates obtained experimentally at the rosy (ry) gene in Drosophila melanogaster. Our data indicate that gene arrangements OST and O3+4 are monophyletic and rather old (0.58 and 0.73 million years old, respectively). Haplotype Structure and Linkage Disequilibrium. Genetics 158: 1147-1155 (2001). Julio Rozas, Myriam Gullaud, Gaelle Blandin and
101. otide differences between pairs of sequences (Fu and Li 1993, p. 702; see also Simonsen et al. 1995, equation 10). Fu's Fs statistic: The Fs test statistic (Fu 1997, equation 1) is based on the haplotype (gene) frequency distribution conditional on the value of θ (Ewens 1972, equations 19-21). Strobeck's S statistic: Strobeck's S test statistic (Strobeck 1987; see also Fu 1997) is also based on the haplotype (gene) frequency distribution conditional on the value of θ (Ewens 1972, equations 19-21). The S statistic gives the probability of obtaining a sample with a number of haplotypes equal to or smaller than the observed. DnaSP also provides the probability of obtaining a sample with a number of haplotypes equal to the observed. See also the Discrete Distributions command in the Tools Menu. Total number of mutations vs. number of segregating sites: The D* and F* test statistics can also be computed using S, the number of segregating sites, instead of η, the total number of mutations (Simonsen et al. 1995, equations 9-10). Under the infinite-site model (with two different nucleotides per site) both D* and F* values should be the same (S and η have the same value). However, if there are sites segregating for more than two nucleotides, values of S will be lower than those of η. Effective Population size. Statistical significance: DnaSP uses the critical values obtained by Fu and Li (1993; two-tailed test, Tables 2 and 4) to determine the statistical
102. resting feature of DnaSP is that both protein-coding and noncoding regions can be included in the data file. DnaSP can thus estimate the nucleotide diversity for synonymous, nonsynonymous and silent (both synonymous and noncoding positions) sites. Four predefined genetic codes can be used: the universal nuclear code, and the mitochondrial codes of Drosophila, mammals and yeast. Alignment gaps and missing data: Sites (or codons) with alignment gaps or missing data are not used, i.e., these sites (or codons) are completely excluded. Implementation: DnaSP can compute the nucleotide diversity in synonymous, nonsynonymous and silent sites. The total number of synonymous and nonsynonymous sites is computed as in Nei and Gojobori (1986). By silent sites we refer both to the synonymous sites and the noncoding positions. Synonymous sites are those sites in a codon where nucleotide changes result in synonymous substitutions. For computing synonymous and nonsynonymous sites DnaSP will exclude all pathways that go through stop codons. No stop codons should be found in the middle of coding regions; however, if DnaSP finds stop codons in the middle of coding regions, they will be treated as if they coded for a new amino acid (amino acid 21; for example, Selenocysteine, Secys; Osawa et al. 1992). DnaSP computes the synonymous and nonsynonymous differences between a pair of sequences as in Nei and Gojobori (1986). When there are two or three nucleotide differen
103. rtheless, for low levels of polymorphism both methods give similar estimates. • Theta (per site) from Eta (η) or from S, i.e., the Watterson estimator (Watterson 1975, equation 1.4a) but on a base-pair basis (Nei 1987, equation 10.3). Theta (θ) = 4Nμ for an autosomal gene of a diploid organism (N and μ are the effective population size and the mutation rate per nucleotide site per generation, respectively), Eta (η) is the total number of mutations, and S is the number of segregating (polymorphic) sites. The variance of this estimator depends on the recombination between sites. The variances for no recombination and for free recombination are estimated from equations 4 and 8 of Tajima (1993), respectively. These variances are computed on a per nucleotide site basis: Variance (per nucleotide site) = Variance (per DNA sequence) / (m × m), where m is the total number of nucleotides studied. The standard deviation (or standard error) is the square root of the variance. Note: for no recombination, estimates of the variance of theta can be different from those obtained from equation 10.2 of Nei (1987); see Tajima 1989 (equations 33 and 34) and Tajima 1993 (equations 4 and 8). • Finite Sites Model (four possible nucleotides per site): The total number of mutations, Eta (η) (Fu and Li 1993), also referred to as the minimum number of mutations (Tajima 1996). Estimates of theta (θ) per site (θ = 4Nμ for an autosomal gene of a diploid organism; N and μ are the effective population s
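The per-site Watterson estimate and the variance rescaling described in this fragment can be sketched as follows. This is a minimal illustration, not DnaSP's own code: the function names are hypothetical, and the estimator is written in its standard form, S / (a1 × m), with a1 the sum of 1/i for i = 1 to n - 1 (n sequences, m sites studied).

```python
def harmonic_a1(n):
    """a1 = sum of 1/i for i = 1 .. n-1, where n is the number of sequences."""
    if n < 2:
        raise ValueError("at least two sequences are required")
    return sum(1.0 / i for i in range(1, n))


def watterson_theta_per_site(segregating_sites, n, m):
    """Watterson estimate of theta on a per-site basis: S / (a1 * m)."""
    return segregating_sites / (harmonic_a1(n) * m)


def variance_per_site(variance_per_sequence, m):
    """Rescale a per-sequence variance to a per-site variance (divide by m * m)."""
    return variance_per_sequence / (m * m)


# Hypothetical example: 4 sequences, 100 sites, 11 segregating sites.
theta = watterson_theta_per_site(11, 4, 100)
```

Passing the total number of mutations (η) instead of S gives the η-based variant mentioned in the text; the two coincide when every polymorphic site carries exactly two nucleotides.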
104. s command. Analysis using one sequence set: The sequence set must include intraspecific data information. DnaSP will estimate some measures of the extent of DNA polymorphism. Analysis using two sequence sets (interspecific data): One sequence set must contain the intraspecific data, while the other must contain sequences (one or more) from an outgroup (DNA data from a different species or a different population); in this case DnaSP will use the first sequence to polarize substitutions. Alignment gaps and missing data: Sites with alignment gaps or missing data (in any data file) are not used, i.e., these sites are completely excluded. Implementation: Sites with three or four nucleotide variants will not be used. Sites with ambiguous information in the outgroup (if this option is chosen) will also not be used. Polymorphic Sites File. See Also: Input Data Files, Output. This module generates a NEXUS Data File including polymorphic sites information. Sites with Alignment Gaps option: Excluded: These sites are removed. Included: These sites are included in the file. Included if there is a polymorphism: These sites are included if there is a polymorphism. Substitutions Considered: All Substitutions: All polymorphic sites will be included. Silent (synonymous, coding region, and noncoding region): Only silent polymorphic sites (i.e., noncoding positions plus synonymous sites in the coding region) will be included. Only Synonymou
105. s Arlequin software (Schneider et al. 2000): Arlequin is a software package for population genetics analysis, and it is distributed from http://lgb.unige.ch/arlequin/. Network software (Bandelt et al. 1999): The Phylogenetic Network Analysis software was written by Arne Röhl, and it is distributed for free from http://www.fluxus-engineering.com. Translate to Protein Data File. See Also: Input Data Files. In this module DnaSP will translate the nucleotide sequence into an amino acid sequence and generate a NEXUS Data File with that information. This command works only if the coding regions and the genetic code have been previously defined (more help in Assign Coding Regions and Assign Genetic Code). Data File: The present analysis requires only one data file. Note: DnaSP cannot read NEXUS data files with protein information. You can read that information with MacClade or with any word processor. Reverse Complement Data File. See Also: Input Data Files. This module generates a NEXUS Data File including the sequence data in the reverse complement direction. This option is useful for the analysis of synonymous and nonsynonymous substitutions in data files with coding regions transcribed in both directions. DnaSP can only analyze nucleotide variation in synonymous and nonsynonymous sites if the coding regions in the data file are in the 5' -> 3' direction. If the coding regions are transcribed in t
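As an illustration of what a reverse-complement step does to aligned sequence data, here is a minimal sketch. It is not DnaSP's implementation: the function name is hypothetical, and it assumes every sequence is written out in full (no dot matching symbols), with gap (-) and missing-data (N) characters left unchanged.

```python
# Translation table for the four nucleotides (plus U), in both cases.
# Characters not listed here, such as '-' and 'N', pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of an aligned nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical aligned fragment containing a gap and a missing-data symbol.
flipped = reverse_complement("ATGC-N")  # "N-GCAT"
```

Because every sequence in the alignment is reversed in the same way, alignment columns stay aligned; only the site numbering runs in the opposite direction.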
106. s: Only synonymous polymorphic sites at the coding region will be included. Only Nonsynonymous: Only nonsynonymous polymorphic sites will be included. Haplotype Data File. See Also: Input Data Files, Output. References: Bandelt et al. 1999; Hudson et al. 1992; Schneider et al. 2000. This module generates Data Files with information on haplotype data. Results can be saved in NEXUS or Roehl Data Files. Sites with Alignment Gaps option: Not considered: These sites are ignored (complete deletion). Considered: Gaps are considered just like another nucleotide variant (fifth state). Only gaps are considered: Only gap information is used to build haplotypes. Invariable Sites option: Removed: Invariable (monomorphic) sites will not be included in the output file. Included: Invariable (monomorphic) sites will be included in the output file. Generate option: NEXUS Data File: Haplotype information will be stored in a NEXUS data file. Later, this file can be opened by DnaSP and exported to another data file format. Arlequin Project File: DnaSP will create an Arlequin project file (.arp) with haplotype information. This is the format accepted by the Arlequin software. Roehl Data File: Haplotype information will be stored in a Roehl (Röhl) Data File (multistate data). This is the format accepted by the Network software. That program allows reconstructing intraspecific phylogenies (network analysi
107. se theta is estimated from the number of segregating sites. Abbreviations: obs, observed value. Hudson, Kreitman and Aguadé's Test (HKA Test), Direct Mode. See Also: Input Data Files, Output. References: Begun and Aquadro 1991; Hudson et al. 1987; Kimura 1983. This command performs the Hudson, Kreitman and Aguadé (1987) test (HKA test). The test is based on the Neutral Theory of Molecular Evolution (Kimura 1983) prediction that regions of the genome that evolve at high rates will also present high levels of polymorphism within species. The test requires data from an interspecific comparison of at least two regions of the genome, and also data on the intraspecific polymorphism in the same regions of at least one species. In the present module DnaSP allows you to perform the HKA test when comparing autosomal and sex-linked regions (Begun and Aquadro 1991), or to perform the HKA test with polymorphism data with different numbers of sequences in the two regions, or with different numbers of sites for the intraspecific and interspecific comparisons. DnaSP assumes that the expectation of π is 4Nμ for autosomal, 3Nμ for X-linked and Nμ for Y-linked genes, so the equations of Begun and Aquadro (1991) have been slightly modified for comparisons involving autosomal (or X-linked) with Y-linked genes. Effective Population size. Data: The present module does not perform the HKA test from information of the DNA sequences inc
108. sequence data Abstract A growing number of DNA sequence data studies report the transfer of small continuous segments of DNA among different sequences within and among populations This has been attributed to meiotic gene conversion events Here we provide a an algorithm to detect gene conversion tracts and b a statistical model to estimate the number and the length distribution of conversion tracts for population DNA sequence data Two length distributions are defined in the model 1 that of the observed tract lengths and 2 that of the true tract lengths Assuming that the latter follows a geometric distribution we obtain the relationship between both distributions which depends on two basic parameters y that measures the probability of detecting a converted site and the parameter of the geometric distribution from which the average true tract length 1 1 can be estimated Expressions are provided for estimating by the method of the moments and that of the maximum likelihood The robustness of the model to different types of sequence variation is examined by computer simulation We also derive the expression for the probability of detecting a given conversion event which allows the estimation of the rate of conversion events per generation We show that for a wide range of y and values only a small percentage of extant conversion events is detected The present methods have been applied to the published rp49 sequences
109. sequilibrium in Haplotype Inference and Missing Data Imputation. American Journal of Human Genetics 76: 449-462. STROBECK, C., 1987. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117: 149-153. SWOFFORD, D. L., 1991. PAUP: phylogenetic analysis using parsimony, version 3.0. Illinois Natural History Survey, Champaign. TAJIMA, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460. TAJIMA, F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595. TAJIMA, F., 1989. The effect of change in population size on DNA polymorphism. Genetics 123: 597-601. TAJIMA, F., 1993. Measurement of DNA polymorphism, pp. 37-59 in Mechanisms of Molecular Evolution, edited by N. Takahata and A. G. Clark. Sinauer Associates Inc., Sunderland, Massachusetts. TAJIMA, F., 1996. The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics 143: 1457-1465. THOMPSON, J. D., D. G. HIGGINS, and T. J. GIBSON, 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680. WAKELEY, J., and J. HEY, 1997. Estimating ancestral population parameters. Genetics 145: 847-855. WALL, J. D., 1999. Recombin
110. single data files. Assumptions: All files must have the same number of sequences, in the same order. DnaSP generates the concatenated data file by consecutively adding single data files to the right. Individual Data Files Option: Real length: For a single data file, DnaSP will use the DNA sequence information selected in the Region to Analyse box. Fixed length: All data files will contribute a fixed number of sites (X nucs). If the current single data file has fewer than X sites, DnaSP will pad it with missing information. On the contrary, if the current data file has more than X sites, DnaSP will use only the first X sites. Notes: Any codon assignation present in single data files will be saved in the concatenated file. The concatenated file will also save the population set information present in only the first single data file. Shuttle to DNA Slider. See Also: Input Data Files, Output. References: Kimura 1983; McDonald 1996; McDonald 1998. The neutral theory of molecular evolution predicts that the levels of polymorphism will be correlated with levels of divergence between species (Kimura 1983; see also HKA test). McDonald (1996, 1998) has proposed some tests to detect heterogeneity in the polymorphism-to-divergence ratio across a region of DNA. These tests are based on the distribution of polymorphic sites and fixed differences across a DNA region. DnaSP searches for polymorphic sites and fixed differences and can genera
111. ssume one recombination event the recombination event that requires the lower number of replacement substitutions TTG Leu TTA Leu gt recomb ATG Met ATA Ile Note This kind of codons will be analyzed only for Nuclear Genetic Codes Codon 19 20 21 1 replacement site 21 Codon 22 23 24 There are 3 changes among codons So that there are 6 putative evolutionary paths in this particular example there are only 4 because we exclude paths that go through stop codons DnaSP will choose randomly between 2 replacements Site 22 and Site 23 and 1 synonymous Site 24 and 2 replacements Site 23 and Site 24 and 1 synonymous Site 22 Codons not analyzed DnaSP does not estimate synonymous and replacement changes in some complex cases ambiguous complex codons those sites segregating for several codons i e in highly variable regions The user should do manually DnaSP does not estimate synonymous and replacement changes in codons with alignment gaps NOTE Estimates of the number of synonymous and nonsynonymous substitutions might be different than the number of the synonymous and nonsynonymous differences see the Synonymous and Nonsynonymous Substitutions module a DNA Polymorphism See Also Coalescent Simulations Graphs Window Input Data Files Output References Jukes and Cantor 1969 Lynch and Crease 1990 Nei 1987 Nei and Miller 1990 Tajima 1983 Tajima 1989 Tajima 1993 Taji
112. t be used. Sites with Alignment Gaps option: (1) Excluded: Sites with gaps in any population will be completely excluded from the analysis. (2) Considered (gap as the fifth state): Gaps will be used. They will be considered as a different nucleotide variant. (3) Excluded only in pairwise comparisons: Using this option, gaps will be ignored only if they are present in a particular pairwise comparison. Note: this option does not work for estimating haplotype-based statistics; in that case DnaSP will consider the gap as a fifth state. Genetic Diversity Analysis: The program estimates the following measures. For each individual population: • The number of haplotypes, h. • The haplotype diversity, Hd (Nei 1987, equation 8.4). • The average number of nucleotide differences, K (Tajima 1983, equation A3). • The nucleotide diversity, Pi (π) (Nei 1987, equation 10.5). • Nucleotide diversity with the Jukes and Cantor correction, Pi (JC) (Lynch and Crease 1990, equations 1-2). Present estimates may differ from those obtained by the DNA Polymorphism command. This is because in the present analysis all sites with alignment gaps in any population are excluded (if you are using the Excluded option in Sites with Alignment Gaps). That is, the total number of analyzed sites in this command can be equal to or lower than that taken into account in the DNA Polymorphism command. For the total data: • The average number of nucleo
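The per-population measures listed in this fragment (h, Hd, K and π) can be sketched as below. This is only an illustration under simplifying assumptions: the sequences are equal-length strings with gapped sites already removed (the Excluded option), the function names are hypothetical, and the textbook forms are used, Hd = n/(n - 1) × (1 - Σ p_i²) and π = K / m, with no Jukes and Cantor correction.

```python
from collections import Counter
from itertools import combinations


def haplotype_count(seqs):
    """Number of distinct haplotypes, h."""
    return len(set(seqs))


def haplotype_diversity(seqs):
    """Hd = n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))


def average_pairwise_differences(seqs):
    """K: mean number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def nucleotide_diversity(seqs):
    """Pi per site: K divided by the number of sites analysed."""
    return average_pairwise_differences(seqs) / len(seqs[0])


# Hypothetical sample: 4 sequences of 4 sites.
sample = ["AAAA", "AAAT", "AATT", "AAAA"]
summary = (haplotype_count(sample), haplotype_diversity(sample),
           average_pairwise_differences(sample), nucleotide_diversity(sample))
```

For this sample there are 3 haplotypes, and the 6 pairwise comparisons differ at 7 sites in total, so K = 7/6.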
113. t used (alignment gaps in intraspecific data). Site 3: Not used (alignment gaps in the used outgroup). Site 4: Not used (alignment gaps in one sequence of the outgroup data file). Sites 6 8: Not used (ambiguous information in outgroup). Site 9: The T and the C in the intraspecific data file are singleton and also external mutations. Ambiguous positions will not be used (DnaSP will list them). Analysis: These tests are based on the neutral model prediction that estimates of η/a1, ηe, and of k are unbiased estimates of θ, where η is the total number of mutations; a1 = Σ (1/i) from i = 1 to n - 1; n is the number of nucleotide sequences; ηe is the total number of mutations in external branches of the genealogy; and k is the average number of nucleotide differences between pairs of sequences (Tajima 1983, equation A3). Note that Fu and Li use Πn to indicate k. θ = 4Nμ (for diploid autosomal); N and μ are the effective population size and the mutation rate per DNA sequence per generation, respectively. D test statistic: The D test statistic is based on the differences between ηe, the total number of mutations in external branches of the genealogy, and η, the total number of mutations (Fu and Li 1993, equation 32). F test statistic: The F test statistic is based on the differences between ηe, the total number of mutations in external branches of the genealogy, and k, the average number of nucleotide differences between pairs of sequences (Fu and
114. te a Data File that can be read by the DNA Slider program (McDonald 1998). The DNA Slider program will perform the tests described in McDonald (1996, 1998). Data Files: For the present analysis, at least two sets of sequences (one with the intraspecific data and the other with the outgroup sequences) must be defined (see the Define Sequence Sets command in the Data menu). Alignment gaps and missing data: Sites containing alignment gaps, or sites with missing data, in any population are not used (these sites are completely excluded). Implementation: If there is more than one sequence in the interspecific data file, DnaSP will assign one substitution as a fixed difference if, at a particular site, all nucleotide variants from file 1 differ from those of file 2. Sites with three or four nucleotide variants are treated as if they were at adjacent sites, and polymorphisms and fixed differences are put in the order that maximizes the number of runs (see McDonald 1996). Substitutions Considered: All substitutions: All substitutions are used (excluding substitutions in sites with gaps or missing data). Silent substitutions: Only silent substitutions are used (synonymous substitutions and changes in noncoding positions). If the data file does not contain assigned coding regions, all sites will be considered as noncoding positions, i.e., all substitutions will be considered as silent. Synonymous substitutions: Only synonymous substitutions are used (substitutions i
115. tely, or to carry out analyses in a subset of sequences of the data file (see the Include/Exclude Sequences command). FASTA format. See Also: Input Data Files, FASTA Format Example. DnaSP can recognize FASTA data file formats (also called Pearson format). A FASTA file must begin with the symbol > in the first line of the file; the sequence name is the first word after that symbol. Additional characters in this line are considered to be comments. The sequence data starts in the second line. Nucleotide data can be written in one or more lines. DnaSP only recognizes noninterleaved FASTA data files. Special characters: Blank spaces, tabs and carriage returns are ignored, i.e., they can be used to separate blocks of nucleotides. By default DnaSP uses the following symbols: the hyphen character (-) to specify an alignment gap; the dot character (.) to specify that the nucleotide in this site is identical to that in the same site of the first sequence (i.e., identical site or matching symbol); the symbols N, n to designate missing data. Sequence name: The sequence name can be up to 20 characters. Blank spaces and tabs are not allowed (underlines should be used to indicate a blank space). Example of FASTA format: >seq_1 comment (optional) ATATACGGGGTTA TTAGA AAAA >seq_2 comment (optional) ATATAC GGATA ACA AGAA >seq_3 ATATACGGGGATA TTATA AGAA >seq_4 ATATACG
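The FASTA rules in this fragment (a > header whose first word is the sequence name, the rest of the header treated as a comment, sequence data on one or more lines, blanks and tabs ignored) can be sketched as a small reader. This is an illustrative parser, not DnaSP's; the function name is hypothetical, and it does not enforce the 20-character name limit or expand dot matching symbols.

```python
def parse_fasta(text):
    """Parse noninterleaved FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""  # first word; the rest is a comment
            chunks = []
        elif name is not None:
            # Blank spaces and tabs inside sequence lines are ignored.
            chunks.append("".join(line.split()))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Hypothetical input using '-' for gaps, 'N' for missing data, '.' as matching symbol.
example = ">seq_1 comment\nATAT ACGG\nGG-TA\n>seq_2\nATATNN..\n"
records = parse_fasta(example)
```

Sequence characters are kept exactly as written, so interpreting the dot against the first sequence is left to a later step.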
116. the intraspecific and interspecific comparison, the module HKA test, Direct Mode should be used. McDonald and Kreitman's Test. See Also: Input Data Files, Output. References: Fay et al. 2001; Kimura 1983; McDonald and Kreitman 1991; Rand and Kann 1996. This command conducts the test of the neutral hypothesis (Kimura 1983) proposed by McDonald and Kreitman (1991). The test is based on a comparison of synonymous and nonsynonymous (replacement) variation within and between species. Under neutrality, the ratio of replacement to synonymous fixed substitutions (differences) between species should be the same as the ratio of replacement to synonymous polymorphisms within species. Alignment gaps and missing data: Codons containing alignment gaps, or codons with missing data, in any species are not used (these codons are completely excluded). Data Files: For the present analysis, at least two sets of sequences (one for each species) must be defined (see the Define Sequence Sets command in the Data menu). DnaSP performs the McDonald and Kreitman test from sequence information included in data files. DnaSP calculates: the number of polymorphic synonymous substitutions within species; the number of polymorphic nonsynonymous (replacement) changes within species; the number of synonymous substitutions fixed between species; the number of nonsynonymous (replacement) differences fixed between species; and from this information computes the 2 x 2 contingency table
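The four counts listed in this fragment form a 2 x 2 table (synonymous vs. nonsynonymous, fixed vs. polymorphic). The fragment is cut off before naming the test applied to that table, so the sketch below assumes a two-tailed Fisher's exact test as one common choice; the function names and the table layout are illustrative, not taken from DnaSP.

```python
from math import comb


def mk_table(syn_fixed, syn_poly, nonsyn_fixed, nonsyn_poly):
    """Arrange the four counts as rows (synonymous, nonsynonymous) x columns (fixed, polymorphic)."""
    return ((syn_fixed, syn_poly), (nonsyn_fixed, nonsyn_poly))


def fisher_exact_two_tailed(table):
    """Two-tailed Fisher's exact test P-value for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: 3 synonymous fixed, 0 synonymous polymorphic,
# 0 nonsynonymous fixed, 3 nonsynonymous polymorphic.
p_value = fisher_exact_two_tailed(mk_table(3, 0, 0, 3))
```

With these toy counts only the two most extreme tables are as unlikely as the observed one, each with probability 1/20, so the two-tailed P-value is 0.1.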
117. the least common variants Langley et al 1974 Linkage disequilibrium for the whole data For the whole data DnaSP computes ZnS statistic Kelly 1997 equation 3 ZnS is the average of R 2 Hill and Robertson 1968 over all pairwise comparisons Za and ZZ statistics Rozas et al 2001 Za is the average of R 2 Hill and Robertson 1968 over all pairwise comparisons between adjacent polymorphic sites ZZ Za ZnS ZZ statistic could be used for detecting intragenic recombination see Recombination DnaSP can compute the Confidence Intervals of ZnS Za ZZ by coalescent based simulations see Coalescent Simulations Association among nucleotide variants DnaSP also computes the B and Q statistics Wall 1999 Statistical significance of LD Both the two tailed Fisher s exact test and the chi square test are computed to determine whether the associations between polymorphic sites are or are not significant see Sokal and Rohlf 1981 P 0 05 P lt 0 01 P lt 0 001 DnaSP also performs the Bonferroni correction for multiple tests see Weir 1996 The Bonferroni procedure tries to avoid spurious rejections of the null hypothesis in multiple tests assuming that all tests are independent For an overall a o is the probability that at least one test causes the rejection of a true null hypothesis a a is the probability that an individual test causes the rejection of a true null hypothesis i e type error of an i
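The whole-data statistics named in this fragment (ZnS as the average R² over all pairs of polymorphic sites, Za as the average over adjacent pairs, and ZZ = Za - ZnS) can be sketched as follows. This is a simplified illustration, not DnaSP's code: each site is given as a string with one character per sequence, every site is assumed to be biallelic and free of gaps, and the function names are hypothetical.

```python
from itertools import combinations


def r_squared(site_a, site_b):
    """Squared allelic correlation (R^2) between two biallelic sites."""
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]  # use the first sequence's allele as reference
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def zns_za_zz(sites):
    """Return (ZnS, Za, ZZ) for a list of at least two polymorphic sites in map order."""
    pairs = list(combinations(range(len(sites)), 2))
    zns = sum(r_squared(sites[i], sites[j]) for i, j in pairs) / len(pairs)
    adjacent = [(i, i + 1) for i in range(len(sites) - 1)]
    za = sum(r_squared(sites[i], sites[j]) for i, j in adjacent) / len(adjacent)
    return zns, za, za - zns


# Hypothetical data: 3 polymorphic sites scored in 4 sequences.
zns, za, zz = zns_za_zz(["AATT", "AATT", "ATAT"])
```

In this toy data set the first two sites are perfectly associated and the third is independent of both, so ZnS = 1/3, Za = 1/2 and ZZ = 1/6; a positive ZZ reflects stronger association between adjacent sites than between sites overall.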
118. these symbols can be changed in the dialog box that appears when opening a data file Sequence name The sequence name can be up to 20 characters Blank spaces and tabs are not allowed underlines should be used to indicate a blank space Example of MEGA format MEGA TITLE 4 sequences 55 nucleotides File EX N1 MEG seq_1 ATATACGGGGTTA TTAGA AAAATGTGTGTGTGTTTTTTTTTTCATGTG seq_2 Sey eho aqqadx Ree oh T Greek Caw ee eS seq_3 ee ee PbeewsEESLee4espR long dG RP PPREGd eee E Ir Sone eae E RS seq_4 oe NBRF PIR format Sidman et al 1988 See Also Input Data Files NBRF PIR Format Example References Sidman et al 1988 In the NBRF PIR files the sequence names are placed immediately after the identifier 2 DL The next line is used for comments The nucleotide sequence is written in the next line in one or more lines and is ended with the symbol The file must contain nucleotide sequences in a noninterleaved form Sequence data Blank spaces Tabs and Carriage returns are ignored i e they can be used to separate blocks of nucleotides The hyphen character must be used to specify an alignment gap The dot character can be used to specify that the nucleotide in this site is identical to that in the same site of the first sequence The symbols N n could be used to designate missing data No other symbols are allowed Sequence name The sequence name can be up
119. tide differences, k (Tajima 1983, equation A3). • The nucleotide diversity, Pi (π) (Nei 1987, equation 10.5). • The average number of nucleotide substitutions per site between populations, Dxy (Nei 1987, equation 10.20). • The number of net nucleotide substitutions per site between populations, Da (Nei 1987, equation 10.21). Genetic Differentiation Analysis: DnaSP conducts the following analyses. Haplotype-based statistics: Hs (Hudson et al. 1992a, eq. 3a); Hst (Hudson et al. 1992a, eq. 2). Nucleotide sequence-based statistics: Ks (Hudson et al. 1992a, eq. 10); Kst (Hudson et al. 1992a, eq. 9); Ks* and Kst* (Hudson et al. 1992a, eq. 11); Z (Hudson et al. 1992a); Z* (Hudson et al. 1992a); Snn (Hudson 2000). Statistical tests: Chi-square test, haplotype data (Nei 1987; Hudson et al. 1992a, eq. 1); PM, permutation (randomization) test (Hudson et al. 1992a). Population Size Weighting factor (see Hudson et al. 1992a, p. 144): DnaSP uses the weighting factors recommended in Hudson et al. (1992a). Export Genetic Distances: Use this command to export genetic distances into MEGA or PHYLIP format files. These files allow subsequent phylogenetic analyses to be performed with the MEGA or PHYLIP software. Gene Flow Analysis: The gene flow estimates are computed using information about the organism's genomic type (haploid, diploid) indicated in the Data Menu. DnaSP computes the following measures: From haplotype data information
120. tput of the sliding window analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph, the parameter (Y axis) is plotted against the nucleotide position (X axis). Gaps in Sliding Window. Sites with alignment gaps are not considered in the length of the windows (i.e., all windows have the same number of net nucleotides): Windows with a fixed number of net nucleotides. All windows will have the same number of net nucleotides, i.e., the number of nucleotides excluding sites with alignment gaps. In the same way, the step size will also have the same number of net nucleotides. Sites with alignment gaps are considered: Windows with a fixed number of total nucleotides. All windows will have exactly the same number of nucleotides. For example, if we choose a window length of 50 nucleotides and, in a particular window, the DNA region contains 4 sites with gaps, the analysis will be performed in only 46 sites. Likewise, the step size will also have the same total number of nucleotides. InDel (Insertion-Deletion) Polymorphism. This module allows estimating several measures of the level of Insertion-Deletion (InDel) polymorphism (DIPs). In particular, DnaSP will infer the number of InDel events from the data. Let us suppose the following example (data file: 13 sequences with 18 positions each): Seq1 AAAAAAGGGGGGGGGGGG In this data file we can identify 4 InDel events. Event 1 (Seq5 and Seq7): In
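The "fixed number of net nucleotides" option described in this fragment can be sketched as a routine that builds window boundaries from a per-site gap mask. This is an illustration only: the function name and the mask representation are hypothetical, and dropping a trailing window that cannot be filled completely is an assumption, since the fragment does not say how DnaSP treats it.

```python
def net_nucleotide_windows(has_gap, window, step):
    """Return (first_site, last_site) pairs, 1-based, for windows that each
    contain exactly `window` gap-free sites, advancing by `step` gap-free sites."""
    net_sites = [pos for pos, gap in enumerate(has_gap, start=1) if not gap]
    windows = []
    for first in range(0, len(net_sites) - window + 1, step):
        chunk = net_sites[first:first + window]
        windows.append((chunk[0], chunk[-1]))
    return windows


# Hypothetical alignment of 8 sites with gaps at sites 3 and 4.
mask = [False, False, True, True, False, False, False, False]
coords = net_nucleotide_windows(mask, window=3, step=3)
```

Here the first window spans sites 1 to 5 because it has to skip two gapped sites to collect three net nucleotides. Under the alternative option (fixed total nucleotides) the same window would cover sites 1 to 3 and be analysed on only two sites.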
121. ...nucleotide diversity, Pi total (Nei 1987, equation 10.5).
     Between populations:
     • The number of fixed differences between populations (nucleotide sites at which all of the sequences in one population are different from all of the sequences in the second population; Hey 1991).
     • Mutations that are polymorphic in population 1 but monomorphic in population 2.
     • Mutations that are polymorphic in population 2 but monomorphic in population 1.
     • The total number of shared mutations.
     • The average number of nucleotide differences between populations.
     • The average number of nucleotide substitutions per site between populations, Dxy (Nei 1987, equation 10.20).
     • Dxy with Jukes and Cantor (Nei 1987, equation 10.20, using the Jukes and Cantor correction).
     • The number of net nucleotide substitutions per site between populations, Da (Nei 1987, equation 10.21).
     • Da with Jukes and Cantor (Nei 1987, equation 10.21, using the Jukes and Cantor correction).
     Variance of Dxy (JC) (Nei 1987, equation 10.24). The standard deviation (or standard error) is the square root of the variance.
     Variance of Da (JC) (Nei 1987, equation 10.23). The standard deviation (or standard error) is the square root of the variance.
     This information can be used to estimate the 4 parameters (thetaA, theta1, theta2 and tau) that describe the isolation model (see Wakeley and Hey 1997, equations 1-3).
     Sliding window option
     This option computes the nucleotide diversity for popu...
122. ...ura 1983). In this command DnaSP also computes Fu's Fs and Strobeck's S statistics. These tests require data only on molecular polymorphism.
     Alignment gaps and missing data
     Sites containing alignment gaps, or sites with missing data, are not used (these sites are completely excluded).
     Minimum number of sequences in data files
     The data file must contain at least four sequences.
     Analysis
     The D* and F* tests are based on the neutral model prediction that η/a1, (n - 1)ηs/n and k are unbiased estimates of θ, where:
     η is the total number of mutations;
     a1 = Σ (1/i), from i = 1 to n - 1;
     n is the number of nucleotide sequences;
     ηs is the total number of singletons (mutations appearing only once among the sequences);
     k is the average number of nucleotide differences between pairs of sequences (Tajima 1983, equation A3). Note that Fu and Li use Πn to indicate k;
     θ = 4Nμ (for diploid autosomal loci; N and μ are the effective population size and the mutation rate per DNA sequence per generation, respectively).
     D* test statistic
     The D* test statistic is based on the differences between ηs, the number of singletons (mutations appearing only once among the sequences), and η, the total number of mutations (Fu and Li 1993, p. 700, bottom).
     F* test statistic
     The F* test statistic is based on the differences between ηs, the number of singletons (mutations appearing only once among the sequences), and k, the average number of nucle
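The three estimates of θ named in this excerpt can be computed directly from a gap-free alignment. The sketch below is a minimal illustration under simplifying assumptions: η at a site is taken as the number of distinct variants minus one, and a singleton is any variant carried by exactly one sequence. The function name `theta_estimates` is hypothetical, and the full test statistics also need variance terms that are not shown here.

```python
from collections import Counter
from itertools import combinations

def theta_estimates(seqs):
    """Return (eta / a1, (n - 1) * eta_s / n, k) for aligned, gap-free sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("at least four sequences are required")
    a1 = sum(1 / i for i in range(1, n))        # a1 = sum of 1/i, i = 1 .. n-1
    eta = 0      # total number of mutations
    eta_s = 0    # number of singletons
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            eta += len(counts) - 1               # assumed minimum-mutation count
            eta_s += sum(1 for c in counts.values() if c == 1)
    pairs = list(combinations(seqs, 2))
    # k: average number of nucleotide differences between pairs of sequences
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    return eta / a1, (n - 1) * eta_s / n, k

seqs = ["AAAA", "AAAT", "AATA", "AATA"]   # toy data, not from the manual
print(theta_estimates(seqs))
```

Under neutrality the three values should be similar on average; the tests ask whether their differences are larger than expected by chance.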
123. ...given interval. Both one-sided and two-sided tests can be conducted.
     Statistics analysed on per gene basis
     DnaSP can generate the empirical distribution of the following statistics:
     • Haplotype diversity, Hd (Nei 1987, equation 8.4, but replacing 2n by n). See also Depaulis and Veuille 1998, eq. 1. Be careful: the H test defined in Depaulis and Veuille 1998, eq. 1, corresponds to H = Hd (n - 1)/n.
     • Number of haplotypes, h (Nei 1987, p. 259; see also Depaulis and Veuille 1998).
     • Nucleotide diversity, Pi (π) (Nei 1987, equations 10.5 or 10.6), but on per gene basis, that is, the average number of nucleotide differences.
     • Theta (θ) (Watterson 1975, equation 1.4a).
     • Linkage disequilibrium, ZnS statistic (Kelly 1997, equation 3).
     • Linkage disequilibrium, Za statistic (Rozas et al. 2001, equation 2).
     • Linkage disequilibrium, ZZ (Za - ZnS) statistic (Rozas et al. 2001, equation 1).
     • Recombination, Rm, the minimum number of recombination events (Hudson and Kaplan 1985, Appendix 2).
     • Tajima's D (Tajima 1989, equation 38).
     • Fu and Li's D* (Fu and Li 1993, p. 700, bottom).
     • Fu and Li's F* (Fu and Li 1993, p. 702; see also Simonsen et al. 1995, equation 10).
     • Fu and Li's D (Fu and Li 1993, equation 32).
     • Fu and Li's F (Fu and Li 1993, p. 702, top).
     • Fay and Wu's H (Fay and Wu 2000, equations 1-3).
     • Fu's Fs (Fu 1997, equation 1).
     • Wall's B (Wall 1999).
     • Wall's Q (Wall 1999).
     • Raggedness, r (Harpending 1994, equation 1).
     • Ramos-Onsins and Rozas R2 (Ramos-Onsins and Rozas 200
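The relation between Hd and the H of Depaulis and Veuille noted in this excerpt can be checked numerically. The sketch below assumes each distinct sequence string is one haplotype and uses the sample-size corrected form Hd = n/(n - 1) × (1 - Σ p²). The function name is hypothetical.

```python
from collections import Counter

def haplotype_stats(seqs):
    """Return (h, Hd, H): number of haplotypes, haplotype diversity,
    and the uncorrected H = 1 - sum(p_i^2)."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    h_uncorrected = 1 - sum(p * p for p in freqs)
    hd = h_uncorrected * n / (n - 1)
    return len(freqs), hd, h_uncorrected

seqs = ["AAAA", "AAAA", "AAAT", "AATA"]   # toy data, not from the manual
h, hd, big_h = haplotype_stats(seqs)
print(h, hd, big_h)
# The identity H = Hd * (n - 1) / n holds by construction.
```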
124. ...wer than those of n. Effective Population size.
     Statistical significance
     DnaSP uses the critical values obtained by Fu and Li (1993; two-tailed test, Tables 2 and 4) to determine the statistical significance of the D and F test statistics. Note that these values were obtained by computer simulations assuming that the true value of θ falls in the interval [2, 20], so the critical values are not applicable when the true value of θ is outside that interval. DnaSP will not determine the critical values for sample sizes larger than 300. For sample sizes between 100 and 300, DnaSP uses the same critical values as for n = 100; the reason is that the critical values increase (or decrease) with ln n, so that when n is large the curve of critical values becomes flat (Fu, personal communication).
     n.d., not determined. Significance levels: P 0.10; P < 0.05; P < 0.02.
     Statistical significance by the coalescent
     DnaSP can also provide the confidence intervals of Fu and Li's D and F, and of Fay and Wu's H, by computer simulations using the coalescent algorithm (see Coalescent Simulations).
     Sliding window option
     This option computes both D and F values, and their statistical significance, by the Sliding Window method. The output of the analysis is given in a grid (table). The results can also be presented graphically (by a line chart). In the graph, the D and F values (Y axis) can be plotted against the nucleotide position (X axis).
     Hudson, Kreitman and Agu
125. ...pairwise nucleotide site differences (also called mismatch distribution), and the expected values (for no recombination) in growing and declining populations (Rogers and Harpending 1992, equation 4). The model is based on three parameters: Theta Initial (theta before the population growth or decline), Theta Final (theta after the population growth or decline), and Tau, the date of the growth or decline measured in units of mutational time (Tau = 2ut; t is the time in generations, and u is the mutation rate per sequence and per generation) (Rogers and Harpending 1992). By setting Theta Final to infinity, it is possible to estimate Theta Initial and Tau = 2ut from the data (Rogers 1995). DnaSP gives these estimates, which can be used to obtain the expected values.
     DnaSP also estimates the raggedness statistic, r (Harpending 1994, equation 1). This statistic quantifies the smoothness of the observed pairwise differences distribution. DnaSP can provide the confidence intervals of this statistic by computer simulations using the coalescent algorithm (see Computer Simulations). Nevertheless, the raggedness statistic has low statistical power for detecting population expansion. It is therefore better to use more powerful statistics, such as Fu's Fs (see Fu and Li's (and other) Tests) and Ramos-Onsins and Rozas's R2. DnaSP can also provide, by computer simulations using the coalescent (see Coalescent Simulations), the confidence intervals of the
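The observed mismatch distribution and the raggedness statistic can be sketched in a few lines. This assumes gap-free aligned sequences and the usual form of Harpending's statistic, r = Σ (x_i - x_(i-1))² for i = 1 to d + 1, where x_i is the relative frequency of pairs differing at i sites, d is the largest observed number of differences and x_(d+1) = 0. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(seqs):
    """Relative frequency of sequence pairs differing at 0, 1, ..., d sites."""
    diffs = [sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)]
    counts = Counter(diffs)
    return [counts.get(i, 0) / len(diffs) for i in range(max(diffs) + 1)]

def raggedness(dist):
    """Sum of squared differences between adjacent classes, padded with x_(d+1) = 0."""
    padded = list(dist) + [0.0]
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))

seqs = ["AAAA", "AAAT", "AATA", "AATA"]   # toy data, not from the manual
dist = mismatch_distribution(seqs)
print(dist, raggedness(dist))
```

A smooth, unimodal distribution gives a small r, while a ragged, multimodal one gives a larger value.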
126. ...y 1997); this method, however, is not included in the DnaSP software. That method is implemented in the SITES computer program distributed by Jody Hey (Jody Hey Web Page: http://heylab.rutgers.edu).
     Population Size Changes
     See Also: Coalescent Simulations; Graphs Window; Input Data Files; Output.
     References: Harpending 1994; Ramos-Onsins and Rozas 2002; Rogers 1995; Rogers and Harpending 1992; Rogers et al. 1996; Slatkin and Hudson 1991; Tajima 1989a; Tajima 1989b; Watterson 1975.
     Abstracts: Ramos-Onsins and Rozas 2002.
     This command analyzes the frequency spectrum for segregating sites and the pairwise number of differences. DnaSP performs these analyses for constant size and for growing size populations (Rozas et al. 2001).
     Alignment gaps and missing data
     Sites containing alignment gaps, or sites with missing data, in the data file are not used (these sites are completely excluded).
     1. Pairwise Number of Differences
     1.1 Constant Population Size
     DnaSP shows, in tabular and graphic form, the distribution of the observed pairwise nucleotide site differences (also called mismatch distribution), and the expected values (at equilibrium, for no recombination) in a stable population, i.e. a population with constant population size (Watterson 1975; Slatkin and Hudson 1991, equation 1; Rogers and Harpending 1992, equation 3).
     1.2 Population Growth-Decline
     DnaSP shows, in tabular and graphic form, the distribution of the observed pair
127. ...zas, Juan Carlos Sánchez-DelBarrio, Xavier Messeguer and Ricardo Rozas. DnaSP, DNA polymorphism analyses by the coalescent and other methods.
     Summary: DnaSP is a software package for the analysis of DNA polymorphism data. Present version introduces several new modules and features which, among other options, allow: (1) handling big data sets (5 Mbp per sequence); (2) conducting a large number of coalescent-based tests by Monte Carlo computer simulations; (3) extensive analyses of the genetic differentiation and gene flow among populations; (4) analysing the evolutionary pattern of preferred and unpreferred codons; (5) generating graphical outputs for an easy visualization of results.
     Availability: The software package, including complete documentation and examples, is freely available to academic users from http://www.ub.es/dnasp
     • Statistical Tests for Detecting Population Growth
     Mol. Biol. Evol. 19: 2092-2100 (2002). Statistical Properties of New Neutrality Tests Against Population Growth. Sebastian E. Ramos-Onsins and Julio Rozas.
     Abstract: A number of statistical tests for detecting population growth are described. We compared the statistical power of these tests with that of others available in the literature. The tests evaluated fall into three categories: those tests based on the distribution of the mutation frequencies, on the haplotype distribution, and on the mismatch distribution. We found that, for an extensive variety of cases, the m
