
Arlequin User Manual

1. [Garbled equations: the Tamura and Nei distance and its gamma-corrected form could not be recovered from the extraction.] Manual Arlequin ver 3.1, Methodological outlines, 103. References: Tamura and Nei 1994, Kumar et al. 1993. 7.1.2.6 Estimation of genetic distances between RFLP haplotypes. 7.1.2.6.1 Number of pairwise difference: We simply count the number of different alleles between two RFLP haplotypes, $\hat{d}_{xy} = \sum_{i=1}^{L} \delta_{xy}(i)$, where $L$ is the number of loci and $\delta_{xy}(i)$ is an indicator function equal to 1 if the alleles of the $i$-th locus differ between the two haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice amounts to estimating weighted $F_{ST}$ statistics over all loci (Weir and Cockerham 1984, Michalakis and Excoffier 1996). 7.1.2.6.2 Proportion of difference: We simply count the proportion of loci that are different between two RFLP haplotypes, $\hat{d}_{xy} = \frac{1}{L} \sum_{i=1}^{L} \delta_{xy}(i)$, where $\delta_{xy}(i)$ is again equal to 1 if the alleles of the $i$-th locus differ between the two haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice will lead to exactly the same results as the number of pairwise differences. 7.1.2.7 Estimation of distances between Microsatellite ha
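A minimal Python sketch of the two RFLP distances described in item 1, assuming each haplotype is given as an equal-length sequence of allele codes (one per locus). The function names and the example haplotypes are illustrative and are not part of Arlequin.

```python
from typing import Sequence


def pairwise_differences(hap_x: Sequence[str], hap_y: Sequence[str]) -> int:
    """Number of loci at which the two haplotypes carry different alleles."""
    if len(hap_x) != len(hap_y):
        raise ValueError("haplotypes must be scored at the same loci")
    return sum(1 for a, b in zip(hap_x, hap_y) if a != b)


def proportion_of_differences(hap_x: Sequence[str], hap_y: Sequence[str]) -> float:
    """Pairwise differences divided by the number of loci L."""
    if len(hap_x) == 0:
        raise ValueError("at least one locus is required")
    return pairwise_differences(hap_x, hap_y) / len(hap_x)


# Two hypothetical haplotypes scored as presence (1) / absence (0) of 5 sites.
print(pairwise_differences("10110", "10011"))       # 2
print(proportion_of_differences("10110", "10011"))  # 0.4
```

Because the second distance is the first one divided by the constant L, both rank haplotype pairs identically, which is in line with the statement that they give the same genetic structure results.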
2. [Garbled equation: the ratio R of the probabilities of the two contingency tables could not be recovered from the extraction.] As usual, δ denotes the Kronecker function. R is just the ratio of the probabilities of the two tables. The switch to the new table is accepted if R is larger than 1. The P-value of the test is the proportion of the visited tables having a probability smaller than or equal to that of the observed (initial) contingency table. The standard error on the P-value is estimated, as in the case of linkage disequilibrium, using a system of batches (see section 7.1.4.1). Reference: Guo and Thompson 1992. 7.1.6 Neutrality tests. 7.1.6.1 Ewens-Watterson homozygosity test: This test is based on Ewens' (1972) sampling theory of neutral alleles. Watterson (1978) has shown that the distribution of selectively neutral haplotype frequencies could be conveniently summarized by the sum of squared haplotype (allele) frequencies (F), equivalent to the expected homozygosity for diploids. This test can be performed equally well on diploid or haploid data, as the test statistic is not used for its biological meaning, but just as a way to summarize the allelic frequency distribution. The null distribution of F is generated by simulating random neutral samples having the same number of genes and the same number of haplotypes, using the algorithm of Stewart (1977). The probability of observing random samples with F values identical
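As an illustration of the test statistic in item 2, the following Python sketch computes F as the sum of squared haplotype frequencies from observed haplotype counts. It covers only the statistic; the neutral simulations (Stewart's algorithm) are not reproduced here, and the function name is hypothetical.

```python
from typing import Sequence


def homozygosity_statistic(counts: Sequence[int]) -> float:
    """Sum of squared haplotype frequencies (Watterson's F)."""
    n_genes = sum(counts)
    if n_genes <= 0:
        raise ValueError("the sample must contain at least one gene copy")
    return sum((c / n_genes) ** 2 for c in counts)


# Hypothetical sample of 10 gene copies carrying 3 distinct haplotypes.
f_obs = homozygosity_statistic([5, 3, 2])
print(round(f_obs, 4))
```

With k equally frequent haplotypes the statistic reaches its minimum, 1/k; a sample dominated by a single haplotype pushes it towards 1.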
3. On the left tree pane, you can see the project files listed in the batch file. Settings choice: You can either use the same options for all project files by selecting Use interface settings, or use the setting file associated with each project file by selecting Use associated settings. In the first case, the same analyses will be performed on all project files listed in the batch file. In the second case, you can perform different computations on each project file listed in the batch file, giving you much more flexibility on what should be done. However, it implies that setting files have been prepared previously, recording the analyses to be performed on the data, as well as the options of these analyses. Results to summarize: Some results can be collected from the analysis of each batch file, and put into summary files. See section Batch files (6.3.7) for additional information. If the associated project file does not exist, the current settings are used. Note that the batch file, the project files, and the setting files should all be in the same folder. 4 OUTPUT FILES. The result files are all output in a special sub-directory, having the same name as your project, but with the .res extension. This has been done to structure your result files according to different projects. For instance, if your project file is called my_file.arp, then the result files will be in a sub-directory called my_file
4. [Garbled equations: the Tamura distance formula could not be recovered from the extraction.] References: Tamura 1992, Kumar et al. 1993. 7.1.2.5.6 Tajima and Nei: Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction is an extension of the Jukes and Cantor method, allowing for unequal nucleotide frequencies. The overall nucleotide frequencies are computed from the data. $b = \frac{1}{2}\left(1 - \sum_{i=1}^{4} g_i^2 + \frac{\hat{p}^2}{c}\right)$, $c = \sum_{i=1}^{3} \sum_{j=i+1}^{4} \frac{x_{ij}^2}{2 g_i g_j}$, where $\hat{p}$ is the observed proportion of differing nucleotides, the $g$'s are the four nucleotide frequencies, and $x_{ij}$ is the relative frequency of the nucleotide pair $i$ and $j$; the distance is $\hat{d} = -b \log\left(1 - \frac{\hat{p}}{b}\right)$ [an accompanying garbled expression, apparently the variance of $\hat{d}$, is not recoverable]. References: Tajima and Nei 1984, Kumar et al. 1993. 7.1.2.5.7 Tamura and Nei: Outputs a corrected percentage of nucleotides for which two haplotypes are different. Like the Kimura 2-parameter and Tajima and Nei distances, the correction allows for different transversion and transition rates, but a distinction is also made between transition rates between purines and between pyrimidines. [Garbled equations: the Tamura and Nei distance formula could not be recovered from the extraction.]
5. [Garbled, truncated equation: the tail of the phase-update probability with mutation weighting.] where $n_{ij}^{+1}$ is the sample count of haplotypes that are close to $h_{ij}$ within the current window. Since ε is a parameter reflecting the effect of mutation, it should, for example, be larger for STR than for SNP or DNA data. By simulation, we have found that a value of 0.1 gave good results for STR (microsatellite) data, and that a value of 0.01 worked well for other data types. 7.1.3.2.3.4 Sliding window size updates: The value of $R = \max(r, 1/r)$, where $r = P_{11}P_{22}/(P_{12}P_{21})$, gives a measure of linkage disequilibrium (LD) within the window. Broadly speaking, at each choice between two windows, we would generally prefer the window that gives the largest value of R. Based on (2), a natural estimate of r is $\hat{r} = \frac{(n_{11}+\alpha+\varepsilon n_{11}^{+1})(n_{22}+\alpha+\varepsilon n_{22}^{+1})}{(n_{12}+\alpha+\varepsilon n_{12}^{+1})(n_{21}+\alpha+\varepsilon n_{21}^{+1})}$, but this estimate leads to difficulties, since larger windows tend to have smaller counts and hence more extreme estimates, amounting to a bias towards larger windows. This bias could be counteracted by increasing α, but we prefer to adjust α to optimize the phase-update probability (2). Instead, we add a constant γ to both numerator and denominator, leading to $\hat{r} = \frac{(n_{11}+\alpha+\varepsilon n_{11}^{+1})(n_{22}+\alpha+\varepsilon n_{22}^{+1})+\gamma}{(n_{12}+\alpha+\varepsilon n_{12}^{+1})(n_{21}+\alpha+\varepsilon n_{21}^{+1})+\gamma}$ (3). Thus, at each attempt to update the length of a window in step 3 above, we choose between windows according to their $R = \max(\hat{r}, 1/\hat{r})$ values: window 2 repl
6. DNA sequences. Transition weight [f]: The weight given to transitions when comparing DNA sequences. Deletion weight [f]: The weight given to deletions when comparing DNA or RFLP sequences. Haplotype definition: Use original definition [m]: Haplotypes are identified according to their original identifier, without considering the fact that their molecular definition could be identical. Infer from distance matrix [m]: Similar haplotypes will be identified by computing a distance matrix based on the settings chosen above. When this option is activated, a search for shared haplotypes is automatically performed at the beginning of each run, and new haplotype definitions and frequencies are computed for each population. 6.3.8.2 Diversity indices. [Screenshot: Arlequin Settings tab, "Molecular diversity indices" dialog, with options for standard diversity indices, molecular diversity, the molecular distance, and printing the distance matrix between haplotypes.]
7. [Screenshot: Arlequin Settings tab, Mantel test dialog, with "Compute correlation between distance matrices" and "No. of permutations for Mantel test: 1000".] Compute correlation between distance matrices: Test the correlation or the partial correlations between 2 or 3 matrices by a permutation procedure (Mantel 1967, Smouse et al. 1986). Number of permutations: Sets the number of permutations for the Mantel test. 7 METHODOLOGICAL OUTLINES. The following table gives a rapid overview of the methods implemented in Arlequin. A ✓ indicates that the task corresponding to the table entry is possible. Some tasks are only possible or meaningful if there is no recessive data, and those cases are marked with an X. [Garbled table: types of computations (Standard indices, Molecular diversity, Mismatch dis...) against data types (DNA & RFLP, Microsat, Standard, Frequency) in columns labelled G and H; cell values not recoverable.]
8. haplotypes present in the samples. For instance, it can be tedious to write a full sequence of several hundreds of nucleotides next to each haplotype in each sample. It is much easier to assign an identifier to a given DNA sequence in the haplotype list, and then use this identifier in the sample data section. This way Arlequin will know exactly the DNA sequences associated with each haplotype. However, this section is optional. The haplotypes can be fully defined in the sample data section. An identifier and a combination of alleles at different loci (one or more) describe a given haplotype. The locus separator defined in the profile section must separate adjacent alleles from each other. It is also possible to have the definition of the haplotypes in an external file. Use the keyword EXTERN followed by the name of the file containing the definition of the haplotypes. Read Example 2 to see how to proceed. If the file hap_file.hap contains exactly what is between the braces of Example 1, the two haplotype lists are equivalent.
Example 1:
    [[HaplotypeDefinition]]   #start the section of Haplotype definition
    HaplListName="list1"      #give any name you wish to this list
    HaplList={
    h1 AT   #on each line, the name of the haplotype is
    h2 GC   #followed by its definition
    h3 AG
    h4 AA
    h5 GG
    }
Example 2:
    [[HaplotypeDefinition]]   #start the section o
9. we choose to change the phase of all sites located either on the right or on the left of the focal site. The proportion of updates being recombination steps can be set in the ELB tab dialog, as shown in section 6.3.8.4.2.1. A small value is in order (less than 5%), since it implies a large change, which may often be rejected and cause the chain not to mix properly. The rationale for this kind of update, initially not described in Excoffier et al. (2003), is to explore the set of possible gametic phases more widely by provoking a radical change from time to time. 7.1.3.2.3.3 Handling mutations: Increasing α thus allows more flexibility to choose new haplotypes, but this is a noisy solution: all unobserved haplotypes are treated the same. However, a recent mutation event can create haplotypes that are rare but similar to a more common haplotype, whereas haplotypes that are very dissimilar to all observed haplotypes are highly implausible. This phenomenon is particularly prevalent for STR loci, with their relatively high mutation rates. To encapsulate the effect of mutation when making a phase assignment, we give additional weight to an unobserved haplotype for each observed haplotype that is close to it. Here, we define "close to" to mean "differs at one locus", and in the phase update we choose $(h_{11}, h_{22})$ rather than $(h_{12}, h_{21})$ with probability [garbled, truncated equation]
10. 1.13.3 Version 3.1 compared to version 3.01; 1.14 Forthcoming developments; 1.15 Reporting bugs and comments; 1.16 Remaining problems; 2 Getting started; 2.1 Arlequin configuration; 2.2 Preparing input files; 2.2.1 Defining the Genetic Structure to be tested; 2.3 Loading project files into Arlequin; 2.4 Selecting analyses to be performed on your data; 2.5 Creating and using Setting Files; 2.6 Performing the analyses; 2.7 Interrupting the computations; 2.8 Consulting the results; 3 Input files; 3.1 Format of Arlequin input files; 3.2 Project file structure; 3.2.1 Profile section; 3.2.2 Data section; 3.2.2.1 Haplotype list (optional); 3.2.2.2 Distance matrix (optional); 3.2.2.3 Samples; 3.2.2.4 Genetic structure; 1.1.1.4 Mantel test settings; 3.3 Example of an input file; 3.4 Automatically creating the outline of a project file; 3.5 Conversion of data files; 3.6 Arlequin batch files; 4 Output files; 4.1 Result file; 4.2 Arlequin log file; 4.3 Linkage disequilibrium result file; 4.4 View your results in HTML browser; 5 Examples of input files; 5.1 Example of allele frequency data; 5.2 Example of standard data (Genotypic data, unknown gametic phase, recessive alleles); 5.3 Example of DNA sequence data (Haplotypic); 5.4 Exam
11. [Garbled remainder of the example input file: a two-column listing of allele codes for the Algerian, Bulgarian, Egyptian and French samples, followed by the genetic structure named "My population structure"; values not recoverable.] 3.4 Automatically creating the outline of a project file: To help you quickly set up a project file, Arlequin can create the outline of a project file for you. To do this, use the Project wizard tab. [Screenshot: Project wizard tab, with fields for the new project file name, Data type (STANDARD), Genotypic data, Known gametic phase, Recessive data, No. of samples (1), Locus separator (WHITESPACE), Missing data, and the optional sections (haplotype list, distance matrix, genetic structure).] See section Project Wizard (6.3.4) for more information on how to set up the different parameters. 3.5 Conversion of data files: Selecting the Import Data tab
12. Possible values: any integer number between 1 and 1000. Example: NbSamples=3. The type of data to be analyzed (only one type of data is allowed per project). Notation: DataType=. Possible values: DNA, RFLP, MICROSAT, STANDARD and FREQUENCY. Example: DataType=DNA. Whether the current project deals with haplotypic or genotypic data. Notation: GenotypicData=. Possible values: 0 (haplotypic data), 1 (genotypic data). Example: GenotypicData=0. One can also optionally specify: The character used to separate the alleles at different loci (the locus separator). Notation: LocusSeparator=. Possible values: WHITESPACE, TAB, NONE, or any character other than "#" or the character specifying missing data. Example: LocusSeparator=TAB. Default value: WHITESPACE. Whether the gametic phase of genotypes is known. Notation: GameticPhase=. Possible values: 0 (gametic phase not known), 1 (known gametic phase). Example: GameticPhase=1. Default value: 1. Whether the genotypic data present a recessive allele. Notation: RecessiveData=. Possible values: 0 (co-dominant data), 1 (recessive data). Example: RecessiveData=1. Default value: 0. The code for the recessive allele. Notation: RecessiveAllele=. Possible values: any string of characters within double quotes. This string can be explicitly used in the input file to indicate the occurrence of a recessive homozygote at one or several loci. Example: RecessiveAllele="xxx". Default value: "null". The charac
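The constraints listed in item 12 can be checked mechanically before a project file is written. The sketch below is a hypothetical helper, not an Arlequin API: it assumes the profile has already been parsed into a Python dictionary, fills in the documented defaults, and reports violations of the documented value ranges.

```python
from typing import Dict, List, Tuple

ALLOWED_DATA_TYPES = {"DNA", "RFLP", "MICROSAT", "STANDARD", "FREQUENCY"}
# Defaults taken from the keyword descriptions above.
DEFAULTS = {"LocusSeparator": "WHITESPACE", "GameticPhase": 1, "RecessiveData": 0}


def check_profile(profile: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Return the profile completed with defaults, plus a list of problems."""
    settings = {**DEFAULTS, **profile}
    errors: List[str] = []

    nb_samples = settings.get("NbSamples")
    if not isinstance(nb_samples, int) or not 1 <= nb_samples <= 1000:
        errors.append("NbSamples must be an integer between 1 and 1000")

    if settings.get("DataType") not in ALLOWED_DATA_TYPES:
        errors.append("DataType must be one of " + ", ".join(sorted(ALLOWED_DATA_TYPES)))

    # These three keywords are 0/1 switches.
    for key in ("GenotypicData", "GameticPhase", "RecessiveData"):
        if settings.get(key) not in (0, 1):
            errors.append(f"{key} must be 0 or 1")

    return settings, errors


settings, errors = check_profile({"NbSamples": 3, "DataType": "DNA", "GenotypicData": 0})
print(settings["GameticPhase"], errors)  # 1 []
```

Keeping the defaults in one dictionary mirrors the way the manual lists them: a keyword that is omitted simply takes its documented default.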
13. 4 Linkage disequilibrium between pairs of loci; 7.1.4.1 Exact test of linkage disequilibrium (haplotypic data); 7.1.4.2 Likelihood ratio test of linkage disequilibrium (genotypic data, gametic phase unknown); 7.1.4.3 Measures of gametic disequilibrium (haplotypic data); 7.1.5 Hardy-Weinberg equilibrium; 7.1.6 Neutrality tests; 7.1.6.1 Ewens-Watterson homozygosity test; 7.1.6.2 Ewens-Watterson-Slatkin exact test; 7.1.6.3 Chakraborty's test of population amalgamation; 7.1.6.4 Tajima's test of selective neutrality; 7.1.6.5 Fu's FS test of selective neutrality; 7.2 Inter-population level methods; 7.2.1 Population genetic structure inferred by analysis of variance (AMOVA); 7.2.1.1 Haplotypic data, one group of populations; 7.2.1.2 Haplotypic data, several groups of populations; 7.2.1.3 Genotypic data, one group of populations, no within-individual level; 7.2.1.4 Genotypic data, several groups of populations, no within-individual level; 7.2.1.5 Genotypic data, one population, within-individual level; 7.2.1.6 Genotypic data, one group of populations, within-individual level; 7.2.1.7 Genotypic data, several groups of populations, within-individual level; 7.2.2 Minimum Spanning Network (MSN) among haplotypes; 7.2.3 Locus-by-locus AMOVA; 7.2.4 Population specific FST indices; 7.2.5 Population pairwise genetic distances; 7.2.5.1 Reynolds' distance (Reynolds et al. 1983)
14. [Screenshot: Options menu, with the entries Append results, Use associated settings, Keep Amova null distributions, and Prompt for handling unphased multi-locus data.] Append results: If checked, the results of a new analysis are added at the end of the current result file. Otherwise, previous results are deleted before the new results are added. Use associated settings: Check this box if you want Arlequin to automatically load the settings associated with each project. If this box is unchecked, the same settings will be used for different projects (see section 6.3.2). Keep Amova null distributions: If checked, the null distributions of variance components are written in specific files (see section 6.3.2). Prompt for handling unphased multi-locus data: If checked, you will have the option of estimating the gametic phase of unphased genotype data with the ELB algorithm (see section 6.3.8.4.2.1). 6.1.4 Help Menu. [Screenshot: Help menu, with the entries Arlequin PDF Help file, Arlequin web site, and About Arlequin.] The menu to get access to the Help File System. Arlequin PDF Help file: Opens the Arlequin help file. Actually, it tries to open the file arlequin.pdf. You thus need to have installed the Adobe Acrobat extensions in your web browser. Arlequin web site: Link to the Arlequin web site, http://cmpg.unibe.ch/software/arlequin3. About Arl
15. [Screenshot: Settings tab, "Choose a gametic phase inference algorithm to set up", offering settings for the EM algorithm and settings for the ELB algorithm.] 6.3.8.4.2.1 Settings for the ELB algorithm: The ELB algorithm has been described recently in Excoffier et al. (2003). [Screenshot: Settings tab, "Haplotype inference via ELB algorithm", with fields for Dirichlet prior alpha value, Epsilon value, Het. site influence zone, Gamma value, Sampling interval (500), No. of samples (2000), Burnin steps (100000), and Recombination steps.]
16. [Screenshot: file-selection dialog for Arlequin project files.] The Arlequin project files must have the .arp extension. If your project file is valid, its main properties will be shown in the Project tab. [Screenshot: Project tab showing the project information for mtDNAHV1.arp (project title "mtDNA sequences in the Senegalese Mandenka (hypervariable region 1)"), with the Ploidy, Gametic Phase, DataType, Dominance, Recessive allele, Locus separator and Missing data fields.] 2.4 Selecting analyses to be performed on your data: Different analyses can be selected and their parameters tuned in the Settings tab. [Screenshot: Settings tab, with the tree of computations that can be set up (General settings, Calculation settings, Genetic structure, AMOVA, Population comparisons, Population differentiation, Genotype assignment, Haplotype inf...).]
17. [Screenshot: Project tab for mtDNAHV1.arp, with the sample tree (Mandenka, no groups) and the project information fields.] Once a project has been loaded, the Project tab dialog becomes active. It shows a brief outline of the project in an explorable tree pane, and some information on the data type. The project can be edited by pressing the View Project button on the Toolbar, which will launch the text editor currently specified in the Arlequin Configuration tab. All the information shown under the project profile section is read-only. In order to modify it, you need to edit the project file with your text editor and reload the project with the File Recent projects menu. File name [r]: The location and the name of the current project. Project title [r]: The title of the project as entered in the input file. Ploidy [r]: Specifies whether input data consist of diploid ge
18. [Screenshot: Genetic Structure Editor showing the resulting structure, with groups of populations including Tharu, Oriental, Maya, Pima, Wolof, Peul, Finnish, Israeli Arab and Israeli Jew.] By pressing the Update Project button, this new structure will be added to the project file, a backup copy of the old project will be created with the extension .arp.bak, and the new revised project will be reloaded into Arlequin. 2.3 Loading project files into Arlequin: Once the project file is built, you must load it into Arlequin. You can do this by activating the menu File Open project, by clicking on the Open project button on the toolbar, or by activating the File Recent projects menu. [Screenshot: About tab: Arlequin ver 3.1, (c) Laurent Excoffier 1998-2006, Computational and Molecular Population Genetics Lab (CMPG), Zoological Institute, University of Berne, http://cmpg.unibe.ch/software/arlequin3.] A dialog box should open to allow the selection of an existing project you want to work on. [Screenshot: "Open Arlequin project or batch file" dialog.]
19. This is the method we now use to estimate the parameters of the demographic expansion, τ, θ0 and θ1. Approximate confidence intervals for those parameters are obtained by a parametric bootstrap approach. The principle is the following: We computed approximate confidence intervals for the estimated parameters τ, θ0 and θ1 using a parametric bootstrap approach (Schneider and Excoffier 1999), generating percentile confidence intervals (see e.g. Efron 1993, p. 53 and chap. 13). (1) We generate a large number (B) of random samples according to the estimated demography, using a coalescent algorithm modified from Hudson (1990). (2) For each of the B simulated data sets, we re-estimate τ, θ0 and θ1 to get B bootstrapped values θ0*, θ1* and τ*. (3) For a given confidence level α, the approximate limits of the confidence interval were obtained as the α/2 and 1-α/2 percentile values (Efron 1993, p. 168). It is important to underline that this form of parametric bootstrap assumes that the data are distributed according to the sudden expansion model. In Schneider and Excoffier (1999), we showed by simulation that only the confidence interval (CI) for τ has a good coverage (i.e. that the true value of the parameter is included in a 100×(1-α)% CI with a probability very close to 1-α). The CIs of the other two parameters are overly large (the true value of the parameter was almost always included in the CI)
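The percentile step of the bootstrap in item 19 can be sketched as follows in Python. The coalescent simulation and the re-estimation of the parameters are outside the scope of this sketch, so the bootstrapped values are taken as given; the rule used to turn a percentile into an index (floor for the lower limit, ceiling for the upper limit) is an assumption, since the excerpt does not state which convention is applied.

```python
import math
from typing import Sequence, Tuple


def percentile_interval(boot_values: Sequence[float], alpha: float = 0.05) -> Tuple[float, float]:
    """Limits of the 100*(1 - alpha)% percentile confidence interval."""
    if not boot_values:
        raise ValueError("at least one bootstrapped value is required")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(boot_values)
    last = len(ordered) - 1
    # Assumed convention: widen rather than narrow the interval when rounding.
    lower_index = math.floor((alpha / 2) * last)
    upper_index = math.ceil((1 - alpha / 2) * last)
    return ordered[lower_index], ordered[upper_index]


# Five hypothetical bootstrapped values of tau and a deliberately wide alpha.
print(percentile_interval([5.0, 1.0, 4.0, 2.0, 3.0], alpha=0.5))  # (2.0, 4.0)
```

In practice B would be large (hundreds or thousands of simulated samples), so the choice of rounding convention has little effect on the limits.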
20. and thus too conservative. The validity of the estimated stepwise expansion model is tested using the same parametric bootstrap approach as described above. We used here the sum of square deviations (SSD) between the observed and the expected mismatch as a test statistic. We obtained its distribution under the hypothesis that the estimated parameters are the true ones, by simulating B samples around the estimated parameters. As before, we re-estimated each time new parameters θ0*, θ1* and τ*, and computed their associated sums of squares SSD_sim. The P-value of the test is therefore approximated by $P = \frac{\text{number of } SSD_{sim} \text{ larger than or equal to } SSD_{obs}}{B}$. For convenience, we also compute the raggedness index of the observed distribution, defined by Harpending (1994) as $r = \sum_{i=1}^{d+1} (x_i - x_{i-1})^2$, where d is the maximum number of observed differences between haplotypes, and the x's are the observed relative frequencies of the mismatch classes. This index takes larger values for multimodal distributions, commonly found in a stationary population, than for unimodal and smoother distributions, typical of expanding populations. Its significance is tested similarly to that of SSD. References: Rogers and Harpending 1991, Rogers 1995, Schneider and Excoffier 1999, Excoffier 2004. 7.1.2.4.2 Spatial expansion: A population spatial expansion generally occurs if the range of a population
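The two statistics of item 20 are simple to compute once the simulated values are available. The Python sketch below assumes that the mismatch frequencies are supplied for the classes 0, 1, ..., d differences and that the class d+1 has frequency zero; both function names are illustrative.

```python
from typing import Sequence


def ssd_p_value(ssd_obs: float, ssd_sim: Sequence[float]) -> float:
    """Proportion of simulated SSD values larger than or equal to the observed one."""
    if not ssd_sim:
        raise ValueError("at least one simulated SSD value is required")
    return sum(1 for s in ssd_sim if s >= ssd_obs) / len(ssd_sim)


def raggedness_index(freqs: Sequence[float]) -> float:
    """Sum of squared differences between successive mismatch class frequencies.

    freqs[i] is the relative frequency of haplotype pairs showing i differences
    (i = 0..d); a final class d+1 with frequency 0 is appended (assumption).
    """
    padded = list(freqs) + [0.0]
    return sum((padded[i] - padded[i - 1]) ** 2 for i in range(1, len(padded)))


# Hypothetical values: 4 simulated SSDs and a 3-class mismatch distribution.
print(ssd_p_value(0.5, [0.1, 0.5, 0.9, 0.2]))  # 0.5
print(round(raggedness_index([0.2, 0.5, 0.3]), 4))
```

The same tail-proportion function can be reused for the raggedness index, since its significance is tested in the same way as that of the SSD.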
21. be used. This parameter is mostly useful for inferring the gametic phase of DNA sequences where there are only a few heterozygous sites among long stretches of homozygous sites (see section 7.1.3.2.3 for details). Gamma value [f]: This parameter prevents the adaptive windows in which the gametic phase is estimated from growing too much. It can be set to zero for microsatellite data, and to a small value, like 0.01, for other data sets (see section 7.1.3.2.3 for details). Sampling interval [i]: It is the number of steps in the Gibbs chain between two consecutive samples of gametic phases. Number of samples [i]: It represents the number of samples of gametic phases one wants to draw in the Gibbs chain to get the posterior distribution of gametic phases (and haplotype frequencies) for each individual (see section 7.1.3.2.3 for details). Burnin steps [i]: It is the number of steps to perform in the Gibbs chain before sampling gametic phases. The total number of steps in the chain will thus be: Burnin steps + (Number of samples × Sampling interval) (see section 7.1.3.2.3 for details). Recombination steps [i]: It is the proportion of steps in the Gibbs chain that consist of a pseudo-recombination phase update, instead of a simple phase switch (corresponding to a double recombination around a heterozygous site) (see section 7.1.3.2.3 for details). Output phase distribution files [b]: Controls if one wants to output Arlequin files with the gametic phase of each sa
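The chain length implied by the settings in item 21 follows directly from the stated relation. The short Python sketch below evaluates it for the values visible in the settings screenshot of item 15 (burn-in 100000, 2000 samples, sampling interval 500), which are treated here simply as example inputs; the function name is illustrative.

```python
def total_gibbs_steps(burnin_steps: int, n_samples: int, sampling_interval: int) -> int:
    """Burnin steps + Number of samples x Sampling interval."""
    if min(burnin_steps, n_samples, sampling_interval) < 0:
        raise ValueError("all settings must be non-negative")
    return burnin_steps + n_samples * sampling_interval


print(total_gibbs_steps(100000, 2000, 500))  # 1100000
```

With these values the burn-in accounts for less than a tenth of the chain, so the run time is dominated by the product of the number of samples and the sampling interval.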
22. consuming when the number of populations is large. Significance level [f]: The level at which the test of differentiation is considered significant for the output table. If the P-value is smaller than the Significance level, then the two populations are considered as significantly different. Choice of Euclidian distance [m]: Select a distance method to compute the distances between haplotypes. Different square Euclidean distances can be used depending on the type of data analyzed. Use project distance matrix [m]: Use the distance matrix defined in the project file (if available). Compute distance matrix [m]: Compute a given distance matrix based on a method defined below. With this setting selected, the distance matrix potentially defined in the project file will be ignored. This matrix can be generated either for haplotypic data or genotypic data (Michalakis and Excoffier 1996). Gamma a value [f]: Set the value for the shape parameter a of the gamma function, when selecting a distance allowing for unequal mutation rates among sites. See the Molecular diversity section (7.1.2.5). This parameter only applies to DNA data. Use conventional F-statistics [m]: With this setting activated, we will use a lower-diagonal distance matrix, with zeroes on the diagonal and ones as off-diagonal elements. It means that all distances between non-identical haplotypes will be considered as ident
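The distance matrix described under "Use conventional F-statistics" in item 22 can be written down explicitly. The Python sketch below builds a lower-triangular matrix (row i holds the distances to haplotypes 0..i) with zeroes on the diagonal and ones elsewhere; the row-list representation is an assumption made for illustration and says nothing about how Arlequin stores the matrix internally.

```python
from typing import List


def conventional_distance_matrix(n_haplotypes: int) -> List[List[int]]:
    """Lower-triangular matrix: 0 on the diagonal, 1 for every distinct pair."""
    if n_haplotypes < 0:
        raise ValueError("the number of haplotypes cannot be negative")
    return [[0 if i == j else 1 for j in range(i + 1)] for i in range(n_haplotypes)]


for row in conventional_distance_matrix(3):
    print(row)
```

Under this matrix only the identity of two haplotypes matters, not how many mutational steps separate them.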
23. correlations can be obtained similarly (see e.g. Sokal and Rohlf 1981). The significance of the partial correlations is tested by keeping one matrix constant and permuting the rows and columns of the other two matrices, recomputing each time the new partial correlations and comparing them to the observed values (Smouse et al. 1986). Applications of the Mantel test in anthropology and genetics can be found in Smouse and Long (1992). 8 REFERENCES. Abramovitz, M., and I. A. Stegun, 1970. Handbook of Mathematical Functions. Dover, New York. Aris-Brosou, S., and L. Excoffier, 1996. The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism. Mol. Biol. Evol. 13: 494-504. Cavalli-Sforza, L. L., and W. F. Bodmer, 1971. The Genetics of Human Populations. W. H. Freeman and Co., San Francisco, CA. Chakraborty, R., 1990. Mitochondrial DNA polymorphism reveals hidden heterogeneity within some Asian populations. Am. J. Hum. Genet. 47: 87-94. Chakraborty, R., and K. M. Weiss, 1991. Genetic variation of the mitochondrial DNA genome in American Indians is at mutation-drift equilibrium. Am. J. Hum. Genet. 86: 497-506. Cockerham, C. C., 1969. Variance of gene frequencies. Evolution 23: 72-83. Cockerham, C. C., 1973. Analysis of gene frequencies. Genetics 74: 679-700. Davies, N., F. X. Villablanca, and G. K. Roderick, 1999. Determining the source of individuals: multilocus genotyping in no
24. data. The raw data consist here of the allelic state of one or an arbitrary number of microsatellite loci. For each locus, one should provide the number of repeats of the microsatellite motif as the allelic definition, if the data are to be analyzed according to the stepwise mutation model (for the analysis of genetic structure). It may occur that the absolute number of repeats is unknown. If the difference in length between amplified products is the direct consequence of changes in repeat numbers, then the minimum length of the amplified product could serve as a reference, allowing the other alleles to be coded in terms of additional repeats as compared to this reference. If this strategy is impossible, then any other number could be used as an allelic code, but the stepwise mutation model could not be assumed for these data. 1.4.4 Standard data: Data for which the molecular basis of the polymorphism is not particularly defined, or for which different alleles are considered as mutationally equidistant from each other. Standard data haplotypes are thus compared for their content at each locus, without taking special care about the nature of the alleles, which can be either similar or different. For instance, HLA data (human MHC) enter the category of standard data. 1.4.5 Allele frequency data: The raw data consist of only allele frequencies (single-locus treatment only), so that no haplotypic informa
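The recoding strategy for microsatellite alleles of unknown absolute repeat number, described in item 24, can be sketched as follows in Python. The sketch assumes that fragment lengths are given in base pairs and that every length difference is an exact multiple of the repeat motif length; the function name and the example lengths are hypothetical.

```python
from typing import List, Sequence


def code_alleles_as_repeats(lengths_bp: Sequence[int], motif_length_bp: int) -> List[int]:
    """Code each allele as its number of additional repeats over the shortest product."""
    if motif_length_bp <= 0:
        raise ValueError("the motif length must be positive")
    if not lengths_bp:
        return []
    reference = min(lengths_bp)  # the shortest amplified product is the reference
    coded = []
    for length in lengths_bp:
        extra_repeats, remainder = divmod(length - reference, motif_length_bp)
        if remainder != 0:
            # Length differences not explained by whole repeats: the stepwise
            # coding is not applicable to this locus.
            raise ValueError(f"length {length} is not a whole number of repeats from {reference}")
        coded.append(extra_repeats)
    return coded


# Hypothetical tetranucleotide locus with products of 120, 124 and 128 bp.
print(code_alleles_as_repeats([120, 124, 128, 120], 4))  # [0, 1, 2, 0]
```

Raising an error on a non-integral difference mirrors the caveat in the text: when lengths cannot be expressed as repeat counts, arbitrary allelic codes can still be used, but the stepwise mutation model can no longer be assumed.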
25. distance matrix based on the settings chosen above. When this option is activated, a search for shared haplotypes is automatically performed at the beginning of each run, and new haplotype definitions and frequencies are computed for each population. 6.3.8.7 Genetic structure 6.3.8.7.1 AMOVA 6.3.8.7.1.1 AMOVA with haplotypic data [Screenshot: Arlequin 3.1 Settings tab, AMOVA settings for the haplotypic format, showing the options Standard AMOVA computations, Locus by locus AMOVA, Compute population specific FST's, No. of permutations (1000), Compute Minimum Spanning Network (MSN) among haplotypes, Compute distance matrix (Pairwise difference) and Print distance matrix] • Standard AMOVA [b]: Analysis of MOlecular VAriance framework and computation of a Minimum Spanning Network among haplotypes.
26. distance matrix between haplotypes [b]: If checked, the inter-haplotypic distance matrix used to evaluate the molecular diversity is printed in the result file. • Theta(Hom) [b]: An estimation of θ obtained from the observed homozygosity H (see section 7.1.2.3.1). • Theta(S) [b]: An estimation of θ obtained from the observed number of segregating sites S (see section 7.1.2.3.2). • Theta(k) [b]: An estimation of θ obtained from the observed number of alleles k (see section 7.1.2.3.3). • Theta(π) [b]: An estimation of θ obtained from the mean number of pairwise differences π (see section 7.1.2.3.4). 6.3.8.3 Mismatch distribution Compute the distribution of the observed number of differences between pairs of haplotypes in the sample (see section 7.1.2.4). It also estimates parameters of a sudden demographic (or spatial) expansion using a generalized least-square approach, as described in Schneider and Excoffier (1999) (see section 7.1.2.4). [Screenshot: Arlequin Settings tab, Mismatch distribution analysis, showing the options Estimate parameters of demographic expansion and Estimate parameters of spati
27. h22) and (h12, h21) with probabilities proportional to their joint population frequencies. These are unknown, and in practice they are too small for direct estimation to be feasible. To overcome the latter problem we assume HWE, so that we now seek to choose between (h11, h22) and (h12, h21) with probabilities proportional to p11 p22 and p12 p21, where p_ij (i, j = 1, 2) denotes the population frequency of h_ij. Although the p_ij are also unknown, we can estimate them using the n_ij, the haplotype counts among the other n - 1 individuals in the sample, given their current phase assignments within the window. Adopting a Bayesian posterior mean estimate of p_ij based on a symmetric Dirichlet prior distribution for the p_ij with parameter α > 0, we propose $\Pr(h_{11}, h_{22}) = \frac{(n_{11}+\alpha)(n_{22}+\alpha)}{(n_{11}+\alpha)(n_{22}+\alpha) + (n_{12}+\alpha)(n_{21}+\alpha)}$. Larger values of α imply a greater chance of choosing a haplotype pair that includes an unobserved haplotype. A small value of α (0.01) has been shown to perform well by simulation in most circumstances. Current phase in selected window: ACCTCGCCT / GCTATCTAG. Switch phase update: ACCTTGCCT / GCTACCTAG. 7.1.3.2.3.2 Recombination update Instead of performing a switch update as before, we can also update the phase using a recombination update. Current phase in selected window: ACCTCGCCT / GCTATCTAG. Right recombination phase update: ACCTTCTAG / GCTACGCCT. In that case
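The two phase updates and the choice between the two haplotype pairs can be sketched as below. This is a minimal illustration, not Arlequin's implementation: the function names are hypothetical, and the probability formula is an assumption, namely that the two products (n11 + α)(n22 + α) and (n12 + α)(n21 + α) are simply normalized to sum to one, which is how the posterior mean estimates described above would combine.

```python
def pair_probability(n11, n22, n12, n21, alpha=0.01):
    """Probability of choosing the pair (h11, h22) rather than (h12, h21).

    n11..n21 are haplotype counts among the other individuals; alpha is the
    parameter of the symmetric Dirichlet prior (assumed normalization).
    """
    keep = (n11 + alpha) * (n22 + alpha)
    alternative = (n12 + alpha) * (n21 + alpha)
    return keep / (keep + alternative)


def switch_update(hap_a, hap_b, site):
    """Exchange the alleles of the two haplotypes at a single site (0-based)."""
    new_a = hap_a[:site] + hap_b[site] + hap_a[site + 1:]
    new_b = hap_b[:site] + hap_a[site] + hap_b[site + 1:]
    return new_a, new_b


def right_recombination_update(hap_a, hap_b, breakpoint):
    """Exchange everything to the right of the breakpoint (0-based index)."""
    new_a = hap_a[:breakpoint] + hap_b[breakpoint:]
    new_b = hap_b[:breakpoint] + hap_a[breakpoint:]
    return new_a, new_b


# The two worked examples from the text, using the fifth site of the window.
print(switch_update("ACCTCGCCT", "GCTATCTAG", 4))
print(right_recombination_update("ACCTCGCCT", "GCTATCTAG", 4))
```

With equal counts for all four haplotypes the two pairs are equally likely; when the alternative haplotypes are unobserved, the small α keeps their probability above zero without making them competitive.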
28. [Screenshot: Arlequin settings panel for molecular diversity indices, with check boxes for Theta(Hom), Theta(S), Theta(k) and Theta(Pi)] • Standard diversity indices [b]: Compute several common indices of diversity, like the number of alleles, the number of segregating loci, the heterozygosity level, etc. (see section 7.1.1). • Molecular diversity indices [b]: Check box for computing several indices of diversity at the molecular level. • Compute minimum spanning tree among haplotypes [b]: Computes a minimum spanning tree and a minimum spanning network among the haplotypes found in each population sample (see section 7.1.2.9). This option is only valid for haplotypic data. • Molecular distance [l]: Choose the type of distance used when comparing haplotypes (see section 7.1.2.5 and below). o Gamma a value [f]: Set the value of the shape parameter of the gamma function when selecting a distance allowing for unequal mutation rates among sites. This option is only valid for DNA data, and only for some of the distances computed between DNA sequences. Note that a value of zero deactivates the Gamma correction of these distances here, whereas in reality it is a value of infinity that would deactivate the Gamma correction procedure. • Print
29. identifier is then used at the population samples level Note that the list of haplotypes can include haplotypes that are not listed in the population samples The genetic diversity of the samples is then simply described as a list of haplotypes found in each population as well as their sample frequencies Profile Title A small example of RFLP data 3 populations NbSamples 3 GenotypicData 0 DataType RFLP LocusSeparator WHITESPACE Manual Arlequin ver 3 1 Methodological outlines We tell Arlequin to compute Euclidian square distances between the haplotypes listed below MissingData Data HaplotypeDefinition HaplListName A fictive list of RFLP haplotypes HaplList 1 000011100111 2 100011100111 6 000011100111 7 100011100111 8 000011100111 T1 000001100111 12 000011100111 17 000011100111 22 000011100111 36 000011100111 Si 000011100111 38 000111100111 40 000011100111 47 000011100111 139 000011100111 140 000011100111 141 000011100111 10011 10011 1001 10011 10011 11011 10011 10011 11011 10011 11011 10011 00011 10011 10011 10011 1001 1001001 1001001 1001001 1001001 1001001 1001001 1001101 1001001 1001001 1001001 1001001 1001001 1001001 1001001 1001001 1001001 1001000 110011 110011 110011 10011 10011 110011 110011 110011 110011 100011 110011 110011 110011 110011 110011 110011 110011 1001 1001 1001 1001 1001 1001 1001 1001 100
30. labels will be entered consecutively on one or several lines within the MatrixData segment before the distance matrix elements COLUMN the haplotype labels will be entered as the first column of each row of the distance matrix itself The matrix data will be entered as a format free lower diagonal matrix The haplotype labels can be either entered consecutively on one or several lines if LabelPosition ROW or entered at the Manual Arlequin ver 3 1 Appendix 143 first column of each row if labelPosition COLUMN The special keyword EXTERN may be used followed by a file name within quotation marks stating that the data must be read in an another file Keywords Description Possible values Data Samples SampleName SampleSize SampleData The name of the sample This keyword is used to mark the beginning of a sample definition Specifies the sample size The sample data listed within braces A string within quotation marks An integer larger than zero For haplotypic data it must specify the number of gene copies in the sample For genotypic data it must specify the number of individuals in the sample The keyword EXTERN may be used followed by a file name within quotation marks stating that the data must be read in a separate file The SampleData keyword ends a sample definition Manual Arlequin ver 3 1 Appendix 144 Keywords Descript
31. like gametic phase of multi-locus genotypes using a pseudo-Bayesian approach (ELB algorithm). Test of non-random association of alleles at different loci. Test of non-random association of alleles within diploid individuals. Test of the selective neutrality of a random sample of DNA sequences or RFLP haplotypes under the infinite-site model. Test of the selective neutrality of a random sample of DNA sequences or RFLP haplotypes under the infinite-site model. Tests of selective neutrality based on Ewens sampling theory under the infinite-alleles model. A test of selective neutrality and population homogeneity; this test can be used when sample heterogeneity is suspected. Computes a Minimum Spanning Tree (MST) and Network (MSN) among haplotypes; this tree can also be computed for all the haplotypes found in different populations if activated under the AMOVA section. Inter-population methods (method: short description) Search for shared haplotypes: Comparison of population samples for their haplotypic content. All the results are then summarized in a table. AMOVA: Different hierarchical Analyses of Molecular Variance to evaluate the amount of population genetic structure. Pairwise genetic distances: FST-based genetic distances for short divergence time. Exact test of population differentiation: Test of non-random distribution of haplotypes into population samples.
32. of daughter populations are indeed unequal. According to our simulations (Table 4 in Gaggiotti and Excoffier 2000), conventional methods such as those described above lead to better results for equal population sizes (k = 0.5) and short divergence times (T/N0 < 0.5). However, the fact that the present method leads to clearly aberrant results in some cases is not necessarily a drawback: it draws the user's attention to the fact that some care has to be taken with the interpretation of the results. Other estimators that would be grossly biased, but whose values would be kept within reasonable bounds, would often lead to misinterpretations. Note that the numerical method we have used to solve the system of equations may sometimes fail to converge. An asterisk in the result file indicates those cases, which should be discarded because of convergence failure. 7.2.6 Exact tests of population differentiation We test the hypothesis of a random distribution of k different haplotypes or genotypes among r populations, as described in Raymond and Rousset (1995). This test is analogous to Fisher's exact test on a 2x2 contingency table, extended to an r x k contingency table. All potential states of the contingency table are explored with a Markov chain similar to that described for the linkage disequilibrium test (section 7.1.4.1). During this random walk between the states of th
33. on the number of segregating sites in the sample, and the other being based on the mean number of pairwise differences between haplotypes. Under the infinite-site model, both estimators should estimate the same quantity, but differences can arise under selection, population non-stationarity, or heterogeneity of mutation rates among sites (see section 7.1.6.4). • Fu's Fs [b]: This test, described by Fu (1997), is based on the probability of observing k or more alleles in a sample of a given size, conditioned on the observed average number of pairwise differences. The distribution of the statistic is obtained by simulating samples according to a given θ value, taken as the average number of pairwise differences. This test has been shown to be especially sensitive to departure from population equilibrium, as in the case of a population expansion (see section 7.1.6.4). • Haplotype definition: The way haplotypes are defined is important here, since some tests (Chakraborty's test, Ewens-Watterson and Fu's Fs) are based on the number of alleles in the samples, and it is therefore better to re-evaluate this quantity before doing these tests. Use original definition [m]: Haplotypes are identified according to their original identifier, without considering the fact that their molecular definition could be identical. Infer from distance matrix [m]: Similar haplotypes will be identified by computing a
34. or any character within quotes other specify the code for than those previously used Manual Arlequin ver 3 1 Appendix missing data 142 Default 2 Frequency Specifies the format of ABS absolute values haplotype frequencies REL relative values absolute values will be found by multiplying the relative frequencies by the sample sizes Default ABS Keywords Description Possible values Data HaplotypeDefinition HaplIListName HaplList facultative section The name of a haplotype definition list The list of haplotypes listed within braces 5 A string within quotation marks A series of haplotype definitions given on separate lines for each haplotype Each haplotype is defined by a haplotype label and a combination of alleles at different loci The Keyword EXTERN followed by a string within quotation marks may be used to specify that a given haplotype list is in a different file Keywords Description Possible values Data DistanceMatrix MatrixName MatrixSize LabelPosition MatrixData facultative section The name of the distance matrix The size of the matrix Specifies whether haplotypes labels are entered by row or by column The matrix data itself listed within braces CERS A string within quotation marks A positive integer larger than zero corresponding to the number of haplotypes listed in the haplotype list ROW the haplotype
35. res 4.1 Result file The file containing all the results of the analyses just performed. By default, it has the same name as the Arlequin input file, with the extension htm. This file is opened in the right frame of the html browser at the end of each run. If the option Append Results of the Arlequin Configuration tab is checked, the results of the current computations are appended to those of previous calculations; otherwise the results of previous analyses are erased and only the last results are written to the result file. 4.2 Arlequin log file A file where run-time WARNINGS and ERRORS encountered during any phase of the current Arlequin session are issued. The file is named Arlequin_log.txt and is located in the result directory of the opened project. You should consult this file if you observe any warning or error message in your result file. If Arlequin has crashed, consult Arlequin_log.txt before running Arlequin again; it will probably help you find where the problem occurred. A link to the log file is provided in the left pane of the html result file and can be activated in your web browser. The log file of the current project can also be viewed by pressing the View Log File button on the Toolbar. 4.3 Linkage disequilibrium result file This file contains the results of pairwise linkage disequilibrium tests between all pairs of loci. By default, it has the name LD_DIS.XL. As suggested by its exten
36. the pairwise FST distances. In case of YMatrix = custom, the labels can be chosen by the user. These labels will be used to select the sub-matrices on which the correlation (or partial correlation) is computed. Notation: YMatrixLabels = Possible values: a list containing the names of the labels belonging to the group, entered within braces. Example: YMatrixLabels = { "Population1" "Population4" "Population2" "Population8" "Population5" } • A keyword that allows one to define a matrix with which the correlation with the YMatrix is computed. Notation: DistMatMantel = Example: DistMatMantel = { 0.00 3.20 0.00 0.47 0.76 0.00 0.00 1.23 0.37 0.00 0.22 0.37 0.21 0.38 0.00 } • Labels defining the sub-matrix on which the correlation is computed. Notation: UsedYMatrixLabels = Possible values: a list containing the names of the labels belonging to the group, entered within braces. Example: UsedYMatrixLabels = { "Population1" "Population5" "Population8" } Note: If you want to compute the correlation between entirely user-specified matrices, you need to list a dummy population sample in the Sample section, in order to allow for a proper reading of the Arlequin project. We hope to remove this odd limitation, but it is the way it works for now. Two complete examples: Example 1: We compute the partial correlation between the YMatrix and two other matrices X1 and X2. The YMa
37. v selected by the user to perform some calculation with Arlequin NOT TO BE MODIFIED BY HAND Arl_run txt A file containing information about Arlequin v working directory and path to working project file NOT TO BE MODIFIED BY HAND Arlecore3 exe A console application that can perform all v computations selected by the graphical interface for advanced users wanting to write scripts to analyse many data sets Arlecore3 exe needs the three files Arlequin ini arl_run ars and arl_run txt to perform correctly recent_pro txt A file containing the list of up to the last ten v projects loaded into Arlequin NOT TO BE MODIFIED BY HAND ua js And ftiens4 js ua js and ftiens4 js contain the J ava scripts v that allows the browsing of the result HTML files This script needs gif files 14 gif files These gif files are used by the java scripts v for graphical display in the main result html file Qtinf dil A dynamic link library necessary for the v display of graphical components of the application Arlequin3 pdf Arlequin 3 user manual in pfd format Readme30 txt A text file containing a short description of the main features of Arlequin Manual Arlequin ver 3 1 Introduction 14 Example files in subdirectory datafiles Amova amovahap arp Conversion gene_pop1 gpp Amova amovahap ars Amova amovadis arp Amova amovadis ars Amova 56hapdef txt Dna mtdna_hv1 arp Dna mtdna_hvi ars Dna nucl_div arp Dna nucl_div ars Amov
38. 1 1001 1001 1001 1001 1001 1001 1001 1001 100 100 100 100 100 100 100 100 100 100 100 100 00 oO O H t t Oo t t COOCCOOOCOOrFCCOAOC OG rr Ot Pa Roe j 1001110 100101 100100 COOOOCOOCOCOCCOOCOOOCOCOCO fo COOOOCOCOOCOCOCOCOOCOOOCOCCO fo OCOOOOCOCOOFrRrCDCOOCOCOOCCC GO COOrRPGDVAOVOGVCCOCCCOCOCOOCCC 0 oOoo0o0o0000000000000O O COOOOCOCOOCCOOOCOOOCOCOCO CO O Samples al SampleName pop 1 SampleSize 28 SampleData 1 27 40 1i SampleName pop 2 SampleSize 75 SampleData 1 37 17 1 6 21 7 2 1 22 5 11 2 36 139 47 140 141 37 38 SampleName pop 3 SampleSize 48 SampleData 1 46 8 1 12 1 Structure Manual Arlequin ver 3 1 Methodological outlines 48 StructureName A single group of 3 samples NbGroups 1 Group pop 1 pop A pop au 5 6 Example of standard data Genotypic data known gametic phase In this example we have defined 3 samples consisting of standard multi locus data with known gametic phase It means that the alleles listed on the same line constitute a haplotype on a given chromosome For instance the genotype G1 is made up of the two following haplotypes AD on one chromosome and BC on the second A and b being two alleles at the first locus and C and D being two alleles at the sec
39. 7.2.5.2 Slatkin's linearized FST's (Slatkin 1995) 7.2.5.3 M values (M = Nm for haploid populations, M = 2Nm for diploid populations) 7.2.5.4 Nei's average number of differences between populations 7.2.5.5 Relative population sizes: divergence between populations of unequal sizes 7.2.6 Exact tests of population differentiation 7.2.7 Assignment of individual genotypes to populations 7.2.8 Mantel test 8 References 9 Appendix 9.1 Overview of input file keywords 1 INTRODUCTION 1.1 Why Arlequin? Arlequin is the French translation of "Arlecchino", a famous character of the Italian "Commedia dell'Arte". As a character he has many aspects, but he has the ability to switch among them very easily according to his needs and to necessities. This polymorphic ability is symbolized by his colorful costume, from which the Arlequin icon was designed. 1.2 Arlequin philosophy The goal of Arlequin is to provide the average user in population genetics with quite a large set of basic methods and statistical tests, in order to extract information on genetic and demographic features of a collection of population samples. The graphical interface is designed to allow users to rapidly select the different analyses they want to perform on their data. We felt it important to be able to explore the data, to analyze the same data set several times from different perspec
40. 47: 1943-1957. Prim, R. C. 1957. Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36: 1389-1401. Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery. 1992. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge. Rannala, B. and J. L. Mountain. 1997. Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197-9201. Ray, N., M. Currat and L. Excoffier. 2003. Intra-deme molecular diversity in spatially expanding populations. Mol. Biol. Evol. 20(1): 76-86. Raymond, M. and F. Rousset. 1994. GenePop ver. 3.0. Institut des Sciences de l'Evolution, Université de Montpellier, France. Raymond, M. and F. Rousset. 1995. An exact test for population differentiation. Evolution 49: 1280-1283. Reynolds, J., B. S. Weir and C. C. Cockerham. 1983. Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767-779. Rice, J. A. 1995. Mathematical Statistics and Data Analysis, 2nd ed. Duxbury Press, Belmont, CA. Rogers, A. 1995. Genetic evidence for a Pleistocene population explosion. Evolution 49: 608-615. Rogers, A. R. and H. Harpending. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569. Rohlf, F. J. 1973. Algorithm 76. Hierarchical clustering using the minimum spanning tree. The Computer Journal 16: 93-95. Ro
41. 7 1 1 6 Garza Williamson index G W 91 7 1 2 Molecular indices 91 7 1 2 1 Mean number of pairwise differences r 91 7 1 2 2 Nucleotide diversity or average gene diversity over L loci 92 7 1 2 3 Theta estimators 92 7 1 2 3 1 Theta Hom 92 7 1 2 3 2 Theta S 93 7 1 2 3 3 Theta k 94 7 1 2 3 4 Theta 7 94 7 1 2 4 Mismatch distribution 94 7 1 2 4 1 Pure demographic expansion 95 7 1 2 4 2 Spatial expansion 97 7 1 2 5 Estimation of genetic distances between DNA sequences 98 7 1 2 5 1 Pairwise difference 99 7 1 2 5 2 Percentage difference 99 7 1 2 5 3 Jukes and Cantor 99 7 1 2 5 4 Kimura 2 parameters 100 7 1 2 5 5 Tamura 101 7 1 2 5 6 Tajima and Nei 101 7 1 2 5 7 Tamura and Nei 102 7 1 2 6 Estimation of genetic distances between RFLP haplotypes 103 7 1 2 6 1 Number of pairwise difference 103 7 1 2 6 2 Proportion of difference 103 7 1 2 7 Estimation of distances between Microsatellite haplotypes 104 7 1 2 7 1 No of different alleles 104 7 1 2 7 2 Sum of squared size difference 104 7 1 2 8 Estimation of distances between Standard haplotypes 104 7 1 2 8 1 Number of pairwise differences 104 7 1 2 9 Minimum Spanning Network among haplotypes 105 Manual Arlequin ver 3 1 Table of contents 6 7 1 3 Haplotype inference 105 7 1 3 1 Haplotypic data or Genotypic data with known Gametic phase 105 7 1 3 2 Genotypic data with unknown Gametic phase 105 7 1 3 2 1 EM algorithm 105 7 1 3 2 2 EM zipper algorithm 107 7 1 3 2 3 ELB algorithm 107 7 1
42. A33 Cwl0 B70 DR1304 DQ0301 A33 Cwl0 B7801 DR1304 DQ0302 MAN0103 22 A33 Cwl0 B70 DR1301 DQ0301 A33 Cwl0 B7801 DR1302 DQO0501 MANO108 23 A23 Cw6 B35 DR1102 DQO0301 A29 Cw7 B57 DR1104 DQ0602 MANO109 6 A30 Cw4 B35 DR0801 xxx A68 Cw4 B35 DR0801 xxx 5 3 Example of DNA sequence data Haplotypic Here we define 3 population samples of haplotypic DNA sequences A simple genetic structure is defined that just incorporates the three population samples into a single group of populations Profile Title An example of DNA sequence data NbSamples 3 GenotypicData 0 DataType DNA LocusSeparator NONE Data Samples SampleName Population 1 SampleSize 6 SampleData 000 3 GACTCTCTACGTAGCATCCGATGACGATA 001 1 GACTGTCTGCGTAGCATACGACGACGATA 002 2 GCCTGTCTGCGTAGCATAGGATGACGATA SampleName Population 2 SampleSize 8 SampleData 000 1 GACTCTCTACGTAGCATCCGATGACGATA 001 1 GACTGTCTGCGTAGCATACGACGACGATA 002 1 GCCTGTCTGCGTAGCATAGGATGACGATA 003 1 GCCTGTCTGCCTAGCATACGATCACGATA 004 1 GCCTGTCTGCGTACCATACGATGACGATA 005 1 GCCTGTCCGCGTAGCGTACGATGACGATA 006 E GCCCGTGTGCGTAGCATACGATGGCGATA 007 1 GCCTGTCTGCGTAGCATGCGACGACGATA SampleName Population 3 SampleSize 6 SampleData 023 1 GCCTGTCTGCGTAGCATACGATGACGGTA 024 1 GCCTGTCTGCGTAGCGTACGATGACGATA 025 1 GCCTGTCTGCGTAGCATACGATGACGATA Manual Arlequin ver 3 1 Method
43. ARLEQUIN Ver 3 1 An Integrated Software Package for Population Genetics Copyright 1995 2006 Laurent Excoffier All rights reserved Manual Arlequin ver 3 1 ARLEQUIN VER 3 1 USER MANUAL Arlequin ver 3 1 An Integrated Software Package for Population Genetics Data Analysis Authors Laurent Excoffier Guillaume Laval and Stefan Schneider Computational and Molecular Population Genetics Lab CMPG Institute of Zoology University of Berne Baltzerstrasse 6 3012 Bern Switzerland E mail laurent excoffier zoo unibe ch URL http cmpg unibe ch software arlequin3 September 2006 Manual Arlequin ver 3 1 Table of contents Table of contents ARLEQUIN ver 3 1 user manual Table of contents 1 Introduction 1 1 Why Arlequin 1 2 Arlequin philosophy 1 3 About this manual 1 4 Data types handled by Arlequin 1 4 1 DNA sequences 1 4 2 RFLP Data 1 4 3 Microsatellite data 1 4 4 Standard data 1 4 5 Allele frequency data 5 Methods implemented in Arlequin 1 6 System requirements 1 7 Installing and uninstalling Arlequin 1 7 1 Installation 1 7 1 1 Arlequin 3 installation 1 7 1 2 Arlequin 3 uninstallation 8 List of files included in the Arlequin package 9 Arlequin computing limitations 10 How to cite Arlequin 11 Acknowledgements 12 How to get the last version of the Arlequin software 13 What s new in version 3 1 1 13 1 Version 3 0 compared to version 2 1 13 2 Version 3 01 compared to version 3
44. EM algorithm [i]: Set the number of random initial conditions from which the EM algorithm is started to repeatedly estimate haplotype frequencies. The haplotype frequencies globally maximizing the likelihood of the sample are the ones eventually kept. Values of 50 or more are usually appropriate. • Maximum no. of iterations [i]: Set the maximum number of iterations allowed in the EM algorithm. The iterative process will run for at most this number of iterations, but may stop earlier if convergence has been reached. Here, convergence is reached when the sum of the differences between haplotype frequencies in two successive iterations is smaller than the epsilon value defined above. • Use Zipper version of EM [b]: Use the zipper version of the EM algorithm, which builds haplotypes progressively by adding one locus at a time (see section 7.1.3.2.2). No. of loci orders: Defines how many random loci orders should be used in the zipper version of the EM algorithm. The haplotype frequencies obtained for the locus order leading to the best likelihood are shown in the result file. • Recessive data [b]: Specify whether a recessive allele is present. This option applies to all loci. The code for the recessive allele can be specified in the project file (see section 3.2.1). • Estimate standard deviation through bootstrap [b]: Uses a bootstrap approach to estimate the standard deviation of haplotype frequencies. No of bootstrap to perf
45. Estimate genetic structure indices using information on the allelic content of haplotypes as well as their frequencies (Excoffier et al. 1992). The information on the differences in allelic content between haplotypes is entered as a matrix of Euclidean squared distances. The significance of the covariance components associated with the different possible levels of genetic structure (within individuals, within populations, within groups of populations, among groups) is tested using non-parametric permutation procedures (Excoffier et al. 1992). The type of permutation is different for each covariance component (see section 7.2). The minimum spanning tree and network are computed among all haplotypes defined in the samples included in the genetic structure to test (see section 7.2.2). The number of hierarchical levels of the variance analysis and the kind of permutations that are done depend on the kind of data, the genetic structure that is tested, and the options the user might choose. All details are given in section 7.2. • Locus by locus AMOVA [b]: A separate AMOVA can be performed for each locus. For this purpose, we use the same number of permutations as in the global AMOVA. This procedure should be favored when there are some missing data. Note that diploid individuals that are found with missing data for one of their two alleles at a given locus are removed from th
46. FST, which could be indicative of special evolutionary constraints in these populations (selection, bottleneck, etc.). Note that in locus-by-locus analyses, we have noticed that populations with two alleles, one of which is a singleton, will show large negative population-specific FST indices, which can even be smaller than -1. This is clearly an artifact, because SSD(AP) will be very small while SSD(WP) will still be substantial. 7.2.5 Population pairwise genetic distances The pairwise FST's can be used as short-term genetic distances between populations, with the application of a slight transformation to linearize the distance with population divergence time (Reynolds et al. 1983; Slatkin 1995). The pairwise FST values are given in the form of a matrix. The null distribution of pairwise FST values under the hypothesis of no difference between the populations is obtained by permuting haplotypes between populations. The P-value of the test is the proportion of permutations leading to an FST value larger than or equal to the observed one. The P-values are also given in matrix form. Three other matrices are computed from the FST values: 7.2.5.1 Reynolds' distance (Reynolds et al. 1983) Since FST between pairs of stationary haploid populations of size N having diverged t generations ago varies approximately as $F_{ST} = 1 - (1 - 1/N)^{t} \approx 1 - e^{-t/N}$, the genetic distance $D = -\log(1 - F_{ST})$
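As a small worked illustration of the two computations described above, the sketch below turns a set of permuted FST values into a P-value and applies a Reynolds-type transformation, assumed here to be D = -log(1 - FST). The function names are hypothetical, and the permuted values are taken as given; how Arlequin generates them is not reproduced.

```python
import math


def permutation_p_value(observed_fst, permuted_fsts):
    """Proportion of permuted FST values larger than or equal to the observed one."""
    n_as_large = sum(1 for value in permuted_fsts if value >= observed_fst)
    return n_as_large / len(permuted_fsts)


def reynolds_distance(fst):
    """Assumed transformation D = -log(1 - FST), defined for FST < 1."""
    if not fst < 1.0:
        raise ValueError("FST must be smaller than 1")
    return -math.log(1.0 - fst)


# If FST = 1 - exp(-t/N), the transformed distance recovers t/N.
t_over_n = 0.1
print(reynolds_distance(1.0 - math.exp(-t_over_n)))
```

Because FST saturates towards 1 as divergence time grows, the logarithmic transformation is what makes the distance roughly linear in t/N for short divergence times.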
47. If the gametic phase is known, the genotype can be considered as made up of two well-defined haplotypes. For genotypic data with unknown gametic phase, you can consider the two alleles present at each locus as codominant, or you can allow for the presence of a recessive allele. This finally gives four possible forms of genetic data: • Haplotypic data • Genotypic data with known gametic phase • Genotypic data with unknown gametic phase (no recessive alleles) • Genotypic data with unknown gametic phase (recessive alleles) 1.4.1 DNA sequences Arlequin can accommodate DNA sequences of arbitrary length. Each nucleotide is considered as a distinct locus. The four nucleotides "C", "T", "A", "G" are considered as unambiguous alleles for each locus, and the "-" is used to indicate a deleted nucleotide. Usually the question mark "?" codes for an unknown nucleotide. The following notations for ambiguous nucleotides are also recognized: R: A/G (purine); Y: C/T (pyrimidine); M: A/C; W: A/T; S: C/G; K: G/T; B: C/G/T; D: A/G/T; H: A/C/T; V: A/C/G; N: A/C/G/T. 1.4.2 RFLP Data Arlequin can handle RFLP haplotypes of arbitrary length. Each restriction site is considered as a distinct locus. The presence of a restriction site should be coded as a "1", and its absence as a "0". The "-" character should be used to denote the deletion of a site, not its absence due to a point mutation. 1.4.3 Microsatellite
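The ambiguity notation for DNA sequences listed above follows the standard IUPAC nucleotide codes, which can be written down as a small lookup table. The sketch below is only an illustration of the notation: the helper names are hypothetical, and how Arlequin itself treats ambiguous positions when comparing sequences is not stated here and is not implied by this code.

```python
# Standard IUPAC nucleotide codes: each symbol maps to the bases it may stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "M": "AC", "W": "AT", "S": "CG", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def possible_bases(symbol):
    """Return the set of unambiguous nucleotides a symbol can represent."""
    return set(IUPAC_CODES[symbol.upper()])


def compatible(symbol_a, symbol_b):
    """True if the two symbols share at least one possible nucleotide."""
    return bool(possible_bases(symbol_a) & possible_bases(symbol_b))


print(sorted(possible_bases("R")))
print(compatible("R", "Y"))
```

The deletion and unknown-nucleotide characters are deliberately left out of the table, since they are not ambiguity codes.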
48. N² elements of these matrices are not all independent, as there are only N - 1 independent contrasts in the data. This is why the permutation procedure does not permute the elements of the matrices independently. The correlation of the two matrices is classically defined as $r_{XY} = \frac{SP(X,Y)}{\sqrt{SS(X)\,SS(Y)}}$, the ratio of the cross product of X and Y over the square root of the product of the sums of squares. We note that the denominator of the above equation is insensitive to permutation, such that only the numerator will change upon permutation of rows and columns. Upon closer examination, it can be shown that the only quantity that will actually change between permutations is the Hadamard product of the two matrices, noted as $Z_{XY} = \sum_{i=1}^{N} \sum_{j=1}^{i} x_{ij}\, y_{ij}$, which is the only variable term involved in the computation of the cross product. The Mantel testing procedure applied to two matrices then consists in computing the quantity $Z_{XY}$ from the original matrices, permuting the rows and columns of one matrix while keeping the other constant, and each time recomputing the quantity $Z^{*}_{XY}$ and comparing it to the original $Z_{XY}$ value (Smouse et al. 1986). In the case of three matrices, say Y, X1 and X2, the procedure is very similar. The partial correlation coefficients are obtained from the pairwise correlations as $r_{YX_1 \cdot X_2} = \frac{r_{YX_1} - r_{X_1X_2}\, r_{YX_2}}{\sqrt{(1 - r_{X_1X_2}^{2})(1 - r_{YX_2}^{2})}}$. The other relevant partial
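The permutation procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not Arlequin's code: the function names, the default of 999 permutations, the fixed seed, the one-sided P-value and the convention of counting the observed matrix as one of the permutations are all assumptions made here.

```python
import math
import random


def lower_triangle(matrix):
    """Off-diagonal lower-triangular elements of a square matrix."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def hadamard_z(x, y):
    """Sum of element-wise products over the lower triangle (the Z statistic)."""
    return sum(a * b for a, b in zip(lower_triangle(x), lower_triangle(y)))


def matrix_correlation(x, y):
    """Cross product over the square root of the product of sums of squares."""
    a, b = lower_triangle(x), lower_triangle(y)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sp = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return sp / math.sqrt(ss_a * ss_b)


def partial_correlation(r_yx1, r_x1x2, r_yx2):
    """Correlation of Y and X1 after controlling for X2."""
    return (r_yx1 - r_x1x2 * r_yx2) / math.sqrt((1 - r_x1x2 ** 2) * (1 - r_yx2 ** 2))


def mantel_test(x, y, n_perm=999, seed=1):
    """Return (correlation, one-sided P-value); rows and columns of x are permuted together."""
    rng = random.Random(seed)
    z_observed = hadamard_z(x, y)
    order = list(range(len(x)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        permuted = [[x[i][j] for j in order] for i in order]
        if hadamard_z(permuted, y) >= z_observed:
            hits += 1
    return matrix_correlation(x, y), (hits + 1) / (n_perm + 1)
```

Permuting rows and columns with the same ordering keeps each matrix a valid distance matrix, which is why the elements are never shuffled independently.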
49. $p_j$, where $p_{ij}$ is the frequency of the haplotype having allele i at the first locus and allele j at the second locus, and $p_i$ and $p_j$ are the frequencies of alleles i and j, respectively.
$D'_{ij}$: The linkage disequilibrium coefficient $D_{ij}$ standardized by the maximum value it can take ($D_{ij,max}$), given the allele frequencies (Lewontin 1964), as

$$ D'_{ij} = \frac{D_{ij}}{D_{ij,max}} , $$

where $D_{ij,max}$ takes one of the following values:

$$ D_{ij,max} = \min\left(p_i p_j,\ (1-p_i)(1-p_j)\right) \quad \text{if } D_{ij} < 0 , $$
$$ D_{ij,max} = \min\left((1-p_i)\,p_j,\ p_i\,(1-p_j)\right) \quad \text{if } D_{ij} > 0 . $$

$r^2$: Another conventional measure of linkage disequilibrium between pairs of alleles at two loci is the square of the correlation coefficient between allele frequencies, which can be expressed as a function of the linkage disequilibrium measure D as

$$ r^2 = \frac{D_{ij}^{2}}{p_i (1-p_i)\, p_j (1-p_j)} . $$

7.1.5 Hardy-Weinberg equilibrium
To detect significant departure from Hardy-Weinberg equilibrium, we follow the procedure described in Guo and Thomson (1992), using a test analogous to Fisher's exact test on a two-by-two contingency table, but extended to a triangular contingency table of arbitrary size. The test is done using a modified version of the Markov-chain random walk algorithm described in Guo and Thomson (1992). The modified version gives the same results as the original one, but is more efficient from a computational point of view. This test is obviously only possible for genotypic data. If the gametic phase is unknown, the test is only possible for each locus separately. For data with known ga
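As a worked illustration of the three measures defined above, the following Python sketch computes D, D' and r² for one pair of alleles. The function name `ld_measures` is hypothetical, and returning 0 when a denominator vanishes is a convention of this example only.

```python
def ld_measures(p_ij, p_i, p_j):
    """Return (D, D', r2) for allele i at locus 1 and allele j at locus 2.

    p_ij is the haplotype frequency, p_i and p_j the allele frequencies.
    """
    d = p_ij - p_i * p_j
    if d < 0:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    else:
        d_max = min((1 - p_i) * p_j, p_i * (1 - p_j))
    # Convention of this sketch: report 0 when the ratio is undefined.
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

With `p_ij = 0.5` and both allele frequencies equal to 0.5, the two alleles are always found together, and the sketch returns D = 0.25 with D' = 1 and r² = 1.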
50. a amovadis.dis, Disequil hwequil.arp, Batcn pater ex arp, Disequil hwequil.ars, Batch amoval.arp, Disequil ld_gen0.arp, Batch amoval.ars, Disequil ld_gen0.ars, Batch amova2.arp, Disequil ld_gen1.arp, Batch amova2.ars, Disequil ld_gen1.ars, Batch amovaimat.dis, Disequil ld_hap.arp, Batch genotsta.arp, Disequil ld_hap.ars, Batch genotsta.ars, Batch microsat.arp, Freqncy cohen.arp, Batch microsat.ars, Freqncy cohen.ars, Batch missdata.arp, Hepiireg vas Zpoan, Batch missdata.ars, Mapliveg Nie 7 POprars, Batch phenohla.arp, Batch phenohla.ars, Batch relfreq.arp, Batch relfreq.ars, Batch indlevel.arp, Batch indlevel.ars, Mantel custom_corr3mat.arp, Mantel custom_corr3mat.ars, Mantel fst_corr.arp, Mantel fst_corr.ars, Mantel fst_partial_corr.arp, Mantel fst_partial_corr.ars, Microsat 2popmic.arp, Microsat 2popmic.ars, Microsat micdipl.arp, Microsat micdipl.ars, Microsat micdipl2.arp, Microsat micdipl2.ars, Neutrtst chak_tst.arp, Neutrtst chak_tst.ars, Neutrtst ew_watt.arp, Neutrtst ew_watt.ars, Neutrtst Fu_s_test.arp, Neutrtst Fu_s_test.ars
1.9 Arlequin computing limitations
The amount of data that Arlequin can handle mostly depends on the memory available on your computer. However, a few parameters are limited to values within the range shown below.
Portions of Arlequin concerned by the limitations | Limited parameter | Maximum value
Ewens-Watterson and Chakraborty's neutrality tests | Number of
51. aces window 1 r with probability EE a 1 2
Even a large value for γ can fail to prevent a window from growing too large when two consecutive heterozygous loci in an individual are separated by many homozygous loci. The window must then be large in order to contain the necessary minimum of two heterozygous loci. To circumvent the problem of small haplotype counts, which may then result when updating an individual's phase allocation, we can ignore homozygous loci that are separated from the nearest heterozygous locus by more than a given number of intervening homozygous loci. This is the parameter called Heterozygous site influence zone, to be chosen in the ELB tab dialog (see section 6.3.8.4.2.1).
7.1.3.2.3.5 Handling missing data
In handling missing data, the philosophy underpinning ELB is to ignore the affected loci, rather than to impute missing data or to augment the space of possible genotypes. In the presence of missing data, the haplotype counts are not necessarily integers: individuals with missing data at m loci within a current window of length L contribute 1 - m/L to the relevant count for each haplotype at which the remaining L - m loci match the haplotype exactly (or with one mismatch).
Reference: Excoffier et al. 2003
7.1.4 Linkage disequilibrium between pairs of loci
Depending on whether the haplotypic composition of the sample is known or not, we have implemented tw
52. al Institute, University of Berne, http://cmpg.unibe.ch/software/arlequin3, September 2006
The graphical interface is made up of a series of tabbed dialog boxes, whose content varies dynamically depending on the type of data currently analyzed.
6.1 Menus
6.1.1 File Menu
New project: Prompts the Project Wizard dialog box.
Open project: Opens a dialog box to locate an existing project.
Close project: Closes the current project.
Recent projects: Opens a submenu with the 10 most recently opened projects.
Load settings: Loads previously saved computation settings.
Save settings: Saves the current computation settings.
Save settings as: Saves the current computation settings under a specific name.
Exit: Exits Arlequin and closes the current project.
6.1.2 View Menu
Project information: Opens the tab dialog with information on the current project.
Settings: Opens specific tab dialogs to activate some computations and choose their associated settings.
View Project: Views the current project in a text editor.
View Results: Views the computation results in the default web browser.
View Log file: Views the log file in a text editor.
Show button text: Toggles the presence/absence of text associated with toolbar buttons.
6.1.3 Options Menu
53. [Screenshot: Mismatch distribution settings tab, with Molecular distance set to Pairwise difference and Number of bootstrap replicates set to 1000]
• Estimate parameters of demographic expansion: The parameters of an instantaneous demographic expansion are estimated from the mismatch distribution (see section 7.1.2.4) using a generalized least-square approach, as described in Schneider and Excoffier (1999) (see section 7.1.2.4.1).
• Estimate parameters of spatial expansion: Estimate the specific parameters of a spatial expansion, following Excoffier (2004) (see section 7.1.2.4.2).
• Molecular distance: Here we only allow one genetic distance, the mere number of observed differences between haplotypes.
• Number of bootstrap replicates: The number of coalescent simulations performed using the estimated parameters of the demographic or spatial expansion. These parameters will be re-estimated for each simulation in order to obtain their empirical confidence intervals, and the empirical distribution of the output statistics such as the sum of squared deviations between the observed and the expected mismatch, the raggedness index, or percentile values for ea
54. ample. He also noticed that the homozygosity of the sample was less sensitive to the amalgamation, and therefore proposed to use the mutation parameter inferred from the homozygosity ($\hat\theta_{Hom}$, see section 7.1.2.3.1) to compute the probability of observing a random neutral sample with a number of alleles similar to or larger than the observed value, $\Pr(K \ge k_{obs})$ (see section 7.1.2.3.3 to see how this probability can be computed). It is an approximation of the conditional probability of observing some number of alleles given the observed homozygosity.
References: Ewens 1972, Chakraborty 1990
7.1.6.4 Tajima's test of selective neutrality
Tajima's (1989a) test is based on the infinite-site model without recombination, appropriate for short DNA sequences or RFLP haplotypes. It compares two estimators of the mutation parameter theta ($\theta = 2Mu$, with M = 2N in diploid populations or M = N in haploid populations of effective size N). The test statistic D is then defined as

$$ D = \frac{\hat\theta_\pi - \hat\theta_S}{\sqrt{\mathrm{Var}\left(\hat\theta_\pi - \hat\theta_S\right)}} , $$

where $\hat\theta_\pi = \hat\pi$ and $\hat\theta_S = S \big/ \sum_{i=1}^{n-1} (1/i)$, and S is the number of segregating sites in the sample. The limits of confidence intervals around D may be found in Table 2 of Tajima's paper (Tajima 1989a) for different sample sizes. The significance of the D statistic is tested by generating random samples under the hypothesis of selective neutrality and population equilibrium, using a coalesce
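To make the definition of D concrete, here is a short Python sketch that computes S, the mean number of pairwise differences, and D from a set of aligned sequences. The variance term uses the standard constants from Tajima (1989), which are not spelled out in the text above. The function names are hypothetical, and returning `None` when there are no segregating sites is a choice made for this example.

```python
from itertools import combinations
from math import sqrt


def summarize(seqs):
    """Return (S, pi): number of segregating sites and mean pairwise differences."""
    n_sites = len(seqs[0])
    s = sum(len({q[k] for q in seqs}) > 1 for k in range(n_sites))
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(u, v)) for u, v in pairs) / len(pairs)
    return s, pi


def tajimas_d(n, s, pi):
    """Tajima's D for n sequences, S segregating sites, mean pairwise diff. pi."""
    if s == 0:
        return None  # D is undefined without segregating sites
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # theta_pi = pi and theta_S = S / a1; the denominator estimates
    # the standard deviation of their difference.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

A typical call would be `s, pi = summarize(seqs)` followed by `tajimas_d(len(seqs), s, pi)`. The significance of the result would still have to be assessed by simulation, as described above.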
55. are entered, the size of each sample will be checked against the sum of all haplotypic frequencies. If a discrepancy is found, a Warning message is issued in the log file, and the sample size is set to the sum of haplotype frequencies. When relative frequencies are specified, no such check is possible, and the sample size is used to convert relative frequencies to absolute frequencies.
• The data itself
Notation: SampleData=
Possible values: A list of haplotypes or genotypes and their frequencies as found in the sample, entered within braces.
Example:
SampleData={
id1 1 ACGGTGTCGA
id2 2 ACGGTGTCAG
id3 8 ACGGTGCCAA
id4 10 ACAGTGTCAA
id5 1 GCGGTGTCAA
}
Note: The last closing brace marks the end of the sample definition. A new sample definition begins with another keyword SampleName.
FREQUENCY data type
If the data type is set to FREQUENCY, one must only specify for each haplotype its identifier (a string of characters without blanks) and its sample frequency (either relative or absolute). In this case the haplotype should not be defined.
Example:
SampleData={
id1 1
id2 2
id3 8
id4 10
id5 1
}
Haplotypic data
For all data types except FREQUENCY, one must specify for each haplotype its identifier and its sample frequency. If no haplotype list has been defined earlier, one must also define here the allelic content of the haplotype. The haplotype identifier is used to establ
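The sample size check described above can be sketched as follows. This is a hypothetical Python helper written for illustration only: the function name, the `relative` flag and the wording of the warning are assumptions of this example, not Arlequin's code or messages.

```python
def reconcile_sample(declared_size, freqs, relative=False):
    """Return (sample size, absolute frequencies, warnings).

    freqs maps haplotype identifiers to absolute counts, or to relative
    frequencies when relative=True (a flag invented for this sketch).
    """
    warnings = []
    if relative:
        # No check is possible: the declared size converts the
        # relative frequencies to absolute frequencies.
        counts = {h: f * declared_size for h, f in freqs.items()}
        return declared_size, counts, warnings
    total = sum(freqs.values())
    if total != declared_size:
        warnings.append(
            f"declared sample size {declared_size} differs from the sum of "
            f"haplotype frequencies {total}; using {total}"
        )
        declared_size = total
    return declared_size, dict(freqs), warnings
```

With the absolute frequencies of the example above (1, 2, 8, 10 and 1) and a declared `SampleSize` of 20, the sketch would report one warning and use 22 as the sample size.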
56. [Dialog box: handling of unphased multi-locus data. First option: estimate the gametic phase with the ELB algorithm and output the posterior distribution of individuals' phases into a series of arp files; the analysis of the resulting batch file will allow one to get the distribution of any statistic that explicitly depends on phase, like D for instance. Second option: consider unphased data as multi-locus data with unknown gametic phase; analyses requiring explicit phase information will not be possible. Options of the ELB algorithm: Dirichlet prior alpha value, Epsilon value, Het. site influence zone, Gamma value, Sampling interval, Burn-in steps, No. of files to generate in the distribution.]
If the menu item Prompt for handling unphased multi-locus data is checked in the Options menu (see section 6.1.3), this dialog box will appear when projects containing genotypic data with unknown phase are loaded. The two options appearing in the dialog box are self-explanatory, and the settings for the ELB algorithm are described in the Settings for the ELB algorithm and ELB algorithm sections (6.3.8.4.2.1 and 7.1.3.2.3). If you choose to estimate the gametic phase with the ELB algorithm, then Arlequin project files (as many as the value of No. of files to generate in the distribution, defined above) are written in a subdirectory of the result directory called PhaseDistribution. They have the name ELB_EstimatedPhase<Sample number>.arp. Arlequin also outputs a file called ELB_Best_Phases.arp containing for each individual the
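57. at the rate of nucleotide substitution is identical for all 4 nucleotides A, C, G and T. With $n_d$ the number of observed differences between the two haplotypes and L the number of loci (sites):

$$ \hat p = \frac{n_d}{L}, \qquad \hat d = -\frac{3}{4}\log\left(1 - \frac{4}{3}\hat p\right), \qquad V(\hat d) = \frac{\hat p\,(1-\hat p)}{\left(1 - \frac{4}{3}\hat p\right)^{2} L} $$

Gamma correction:

$$ \hat d = \frac{3}{4}\,a\left[\left(1 - \frac{4}{3}\hat p\right)^{-1/a} - 1\right], \qquad V(\hat d) = \frac{\hat p\,(1-\hat p)}{L}\left(1 - \frac{4}{3}\hat p\right)^{-2\,(1/a + 1)} $$

References: Jukes and Cantor 1969, Jin and Nei 1990, Kumar et al. 1993
7.1.2.5.4 Kimura 2-parameters
Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction also allows for multiple substitutions per site, but takes into account different substitution rates between transitions and transversions. The transition/transversion ratio is estimated from the data. With $n_s$ and $n_v$ the observed numbers of transitions and transversions between the two haplotypes:

$$ \hat P = \frac{n_s}{L}, \qquad \hat Q = \frac{n_v}{L}, \qquad c_1 = \frac{1}{1 - 2\hat P - \hat Q}, \qquad c_2 = \frac{1}{1 - 2\hat Q}, \qquad c_3 = \frac{c_1 + c_2}{2} $$

$$ \hat d = -\frac{1}{2}\log\left(1 - 2\hat P - \hat Q\right) - \frac{1}{4}\log\left(1 - 2\hat Q\right), \qquad V(\hat d) = \frac{c_1^{2}\hat P + c_3^{2}\hat Q - \left(c_1\hat P + c_3\hat Q\right)^{2}}{L} $$

Gamma correction:

$$ c_1 = \left(1 - 2\hat P - \hat Q\right)^{-(1/a + 1)}, \qquad c_2 = \left(1 - 2\hat Q\right)^{-(1/a + 1)}, \qquad c_3 = \frac{c_1 + c_2}{2} $$

$$ \hat d = \frac{a}{2}\left[\left(1 - 2\hat P - \hat Q\right)^{-1/a} + \frac{1}{2}\left(1 - 2\hat Q\right)^{-1/a} - \frac{3}{2}\right], \qquad V(\hat d) = \frac{c_1^{2}\hat P + c_3^{2}\hat Q - \left(c_1\hat P + c_3\hat Q\right)^{2}}{L} $$

References: Kimura 1980, Jin and Nei 1990
7.1.2.5.5 Tamura
Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction is an extension of the Kimura 2-parameters method, allowing for unequal nucleotide frequencies. The transition/transversion ratio as well as the overall nucleotide frequencies are computed from the original data.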
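The Jukes-Cantor and Kimura 2-parameter corrections (without gamma correction) can be illustrated with the following Python sketch. The function names are hypothetical, the sequences are assumed to be aligned and to contain only unambiguous nucleotides, and variances are omitted for brevity.

```python
from math import log

PURINES = {"A", "G"}


def transition_transversion(seq1, seq2):
    """Return (P, Q): proportions of transitions and transversions."""
    n_sites = len(seq1)
    n_s = n_v = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (a in PURINES) == (b in PURINES):
            n_s += 1
        else:
            n_v += 1
    return n_s / n_sites, n_v / n_sites


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing sites."""
    return -0.75 * log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from transition and transversion proportions."""
    return -0.5 * log(1 - 2 * P - Q) - 0.25 * log(1 - 2 * Q)
```

As a consistency check, when transversions are exactly twice as frequent as transitions (P = p/3, Q = 2p/3), the Kimura 2-parameter distance reduces to the Jukes-Cantor distance for the same overall proportion p.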
58. at which the test of linkage disequilibrium is considered significant for the output table.
6.3.8.5.1.2 Gametic phase unknown
When the gametic phase is not known, we use a different procedure for testing the significance of the association between pairs of loci (see section 7.1.4.2). It is based on a likelihood ratio test, where the likelihood of the sample evaluated under the hypothesis of no association between loci (linkage equilibrium) is compared to the likelihood of the sample when association is allowed (see Slatkin and Excoffier 1996). The significance of the observed likelihood ratio is found by computing the null distribution of this ratio under the hypothesis of linkage equilibrium, using a permutation procedure.
[Screenshot: Pairwise LD (phase unknown) settings tab, showing No. of initial conditions for EM (10) and the option Generate histograms and table in file LD_DIS.XL]
59. ation of $D_A$ is $D_A = 2\mu\tau$, where μ is the average mutation rate per nucleotide and τ is the divergence time between the two populations. Thus $D_A$ is also expected to increase linearly with divergence times between the populations.
7.2.5.5 Relative population sizes: Divergence between populations of unequal sizes
We have implemented a method to estimate divergence time between populations of unequal sizes (Gaggiotti and Excoffier 2000). The model assumes that two populations have diverged from an ancestral population of size $N_0$ some T generations in the past, and have remained isolated from each other ever since. The sizes of the two daughter populations can be different, but their sum adds up to the size of the ancestral population. From the average number of pairwise differences between and within populations, we try to estimate the divergence time scaled by the mutation rate ($\tau = 2T\mu$), the size of the ancestral population scaled by the mutation rate ($\theta_0 = 2N_0\mu$ for haploid populations and $\theta_0 = 4N_0\mu$ for diploid populations), as well as the relative sizes (k and 1-k) of the two daughter populations. The estimated parameters result from the numerical resolution of a system of three non-linear equations with three unknowns, based on the Broyden method (Press et al. 1992, p. 389). The significance of the parameters is tested by a permutation procedure similar to that used i
60. ations, Arlequin will compare all genotypes to all others and recompute the genotype frequencies. The allelic content of a genotype is entered on two separate lines, in the form of two pseudo-haplotypes.
Examples:
1) Id1 2 ACTCGGGTTCGCGCGC # the first pseudo-haplotype
         ACTCGGGCTCACGCGC # the second pseudo-haplotype
2) my_id 4 6 Ort aGi 2 Of Oh et
If the gametic phase is supposed to be known, the pseudo-haplotypes are treated as truly defined haplotypes. If the gametic phase is not supposed to be known, only the allelic content of each locus is supposed to be known. In this case, an equivalent definition of the upper phenotype would have been:
my_id 4 a ore a
3.2.2.4 Genetic structure
The hierarchical genetic structure of the samples is specified in this optional sub-section. It is possible to define groups of populations. This sub-section starts with the keyword [[Structure]]. The definition of a genetic structure is only required for AMOVA analyses. One must specify:
• A name for the genetic structure
Notation: StructureName=
Possible values: Any string of characters within quotes.
Example: StructureName="A first example of a genetic structure"
Note: This name will be used to refer to the tested structure in the output files.
• The number of groups defined in the structure
Notation: NbGroups=
Possible values: Any integer value.
Example: NbGroups=5
Note: If this value do
61. atrix Ymatrix
UsedYMatrixLabels | Labels defining the sub-matrix of the YMatrix on which the correlation is computed | A series of strings within quotation marks, all enclosed within braces, and if desired, on separate lines
62. beck 1998). The resulting output tables can be used to represent log-log plots of genotype likelihoods for pairs of populations (see Paetkau et al. 1997 and Waser and Strobeck 1998), to identify those genotypes that seem better explained by belonging to a population other than that from which they were sampled. For instance, we have plotted on this graph the log-likelihood of individuals sampled in Algeria (white circles) for two HLA class II loci versus those of Senegalese Mandenka individuals (black diamonds). The overlap of the two distributions suggests that two loci are not enough to provide a clear-cut separation between these two populations. One also sees that there is at least one Mandenka individual whose genotype would be much better explained if it came from the Algerian population than if it came from Eastern Senegal. Note that interpreting these results in terms of gene flow is difficult and hazardous.
7.2.8 Mantel test
The Mantel test consists in testing the significance of the correlation between two or more matrices by a permutation procedure, which yields the empirical null distribution of the correlation coefficient while taking into account the auto-correlations of the elements of the matrix. In more detail, the testing procedure proceeds as follows. Let us first define two square matrices $X = \{x_{ij}\}$ and $Y = \{y_{ij}\}$ of dimension N. The
63. bility of the observed sample is compared to that of a random neutral sample with the same number of alleles and identical size. The probability of the sample selective neutrality is obtained as the proportion of random samples which are less or equally probable than the observed sample.
No. of simulated samples: Number of random samples to be generated for the two neutrality tests mentioned above. Values of several thousands are in order, and 16,000 permutations guarantee to have less than 1% difference with the exact probability in 99% of the cases (see Guo and Thomson 1992).
• Chakraborty's test of population amalgamation: A test of selective neutrality and population homogeneity and equilibrium (Chakraborty 1990). This test can be used when sample heterogeneity is suspected. It uses the observed homozygosity to estimate the population mutation parameter $\theta_{Hom}$. The estimated value of this parameter is then used to compute the probability of observing k alleles or more in a neutral sample drawn from a stationary population. This test is based on Chakraborty's observation that the observed homozygosity is not very sensitive to population amalgamation or sample heterogeneity, whereas the number of observed (low frequency) alleles is more affected by this phenomenon.
Infinite site model
• Tajima's D: This test, described by Tajima (1989a, 1989b, 1993), compares two estimators of the population parameter θ, one being based
64. ch point of the expected mismatch (see section 7.1.2.4). Hundreds to thousands of simulations are necessary to obtain meaningful estimates.
6.3.8.4 Haplotype inference
Depending on the data type, different methods are used to estimate the haplotype frequencies.
6.3.8.4.1 Haplotypic data, or genotypic (diploid) data with known gametic phase
[Screenshot: Haplotype inference (phase known) settings tab, with the options Search for shared haplotypes, Haplotype definition (Use original definition / Infer from distance matrix) and Haplotype frequency estimation]
• Search for shared haplotypes: Look for haplotypes that are effectively similar after computing pairwise genetic distances according to the distance calculation settings in the General Settings secti
65. d 22 and 23 repeats were observed for the second locus, and finally 16 and 17 repeats were found at the third locus.
[Profile]
  Title="A small example of microsatellite data"
  NbSamples=4
  GenotypicData=1
  # Unknown gametic phase between the 2 loci
  GameticPhase=0
  DataType=MICROSAT
  LocusSeparator=WHITESPACE
[Data]
  [[Samples]]
    SampleName="MICR1"
    SampleSize=28
    SampleData={
      Genot1  27  12 23 17
                  13 22 16
      Genot2  1   15 22 16
                  13 22 16
    }
    SampleName="MICR2"
    SampleSize=59
    SampleData={
      Genot3  37  12 24 18
                  12 22 16
      Genot4  1   15 20 18
                  13 22 18
      Genot5  21  14 22 16
                  14 23 16
    }
    SampleName="MICR3"
    SampleSize=30
    SampleData={
      Genot6  17  12 21 16
                  13 22 15
      Genot7  1   12 20 16
                  13 23 16
      Genot8  12  10 22 15
                  12 22 15
    }
    SampleName="MICR4"
    SampleSize=16
    SampleData={
      Genot9  15  13 24 16
                  13 234197
      Genot10 1   12 24 16
                  13 23 16
    }
  [[Structure]]
    StructureName="Test microsat structure"
    NbGroups=2
    # The first group is made up of the first 2 samples
    Group={
      "MICR1"
      "MICR2"
    }
    # The last 2 samples will be put into the second group
    Group={
      "MICR3"
      "MICR4"
    }
5.5 Example of RFLP data (Haplotypic)
In this example, we show how to use a definition list of RFLP haplotypes. Different RFLP haplotypes are first defined in the [[HaplotypeDefinition]] section. The allelic content of each haplotype is then defined after a given identifier. The
66. d be added in random order or not, and how many random orders to implement. After multiple trials, Arlequin outputs the locus order having led to the largest likelihood. This version of the EM algorithm is equivalent to that implemented in the SNPHAP program (http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt) by David Clayton.
7.1.3.2.3 ELB algorithm
Contrary to the EM algorithm, which aims at estimating haplotype frequencies, the ELB algorithm attempts to reconstruct the unknown gametic phase of multi-locus genotypes. Phase updates are made on the basis of a window of neighbouring loci, and the window size varies according to the local level of linkage disequilibrium.
Suppose that we have a sample of n individuals drawn from some population and genotyped at S loci whose chromosomal order is assumed known. Adjacent pairs of loci are assumed to be tightly linked, but S may be large so that the two external loci are effectively unlinked. In this case, reconstructing the gametic phase in one step can be inefficient, because recombination may have created too many distinct haplotypes for their frequencies to be well estimated. Locally, however, recombination may be rare, and to exploit this situation the updates in ELB of the phase at a heterozygous locus are based on windows of neighbouring loci. The algorithm adjusts the window sizes and locations in order to maximize the
67. definition section
MatrixName="none" # name of the distance matrix
MatrixSize=4 # size = number of lines of the distance matrix
MatrixData={
h1 h2 h3 h4 # labels of the distance matrix (identifier of the haplotypes)
0.00000
2.00000 0.00000
1.00000 2.00000 0.00000
1.00000 2.00000 1.00000 0.00000
}
Example 2:
[[DistanceMatrix]] # start the distance matrix definition section
MatrixName="none" # name of the distance matrix
MatrixSize=4 # size = number of lines of the distance matrix
MatrixData=EXTERN "mat_file.dis"
3.2.2.3 Samples
In this obligatory sub-section, one defines the haplotypic or genotypic content of the different samples to be analyzed. Each sample definition begins with the keyword SampleName and ends after a SampleData has been defined. One must specify:
• A name for each sample
Notation: SampleName=
Possible values: Any string of characters within quotes.
Example: SampleName="A first example of a sample name"
Note: This name will be used in the Structure sub-section to identify the different samples, which are part of a given genetic structure to test.
• The size of the sample
Notation: SampleSize=
Possible values: Any integer value.
Example: SampleSize=732
Note: For haplotypic data, the sample size is equal to the haploid sample size. For genotypic data, the sample size should be equal to the number of diploid individuals present in the sample. When absolute frequencies
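The lower-triangular layout of MatrixData shown in the first example can be expanded into a full symmetric matrix. The Python sketch below is a hypothetical reader written for illustration: it only handles the content between the braces, it assumes that "#" starts a comment, and it is not Arlequin's own parser.

```python
def parse_lower_triangle(text):
    """Parse a label line followed by lower-triangular rows.

    Returns (labels, matrix), where matrix is a full symmetric list of lists.
    """
    lines = [ln.split("#", 1)[0].split() for ln in text.strip().splitlines()]
    lines = [ln for ln in lines if ln]          # drop empty/comment-only lines
    labels, rows = lines[0], lines[1:]
    n = len(labels)
    if len(rows) != n:
        raise ValueError("expected one row per label")
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} should contain {i + 1} values")
        for j, value in enumerate(row):
            matrix[i][j] = matrix[j][i] = float(value)   # mirror the triangle
    return labels, matrix


EXAMPLE = """
h1 h2 h3 h4   # labels of the distance matrix
0.00000
2.00000 0.00000
1.00000 2.00000 0.00000
1.00000 2.00000 1.00000 0.00000
"""
labels, matrix = parse_lower_triangle(EXAMPLE)
```

After this call, `matrix[0][1]` and `matrix[1][0]` both hold the distance between haplotypes h1 and h2.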
68. dividuals or populations among individuals, populations, or groups of populations. After each permutation round, we recompute all statistics to get their null distribution. Depending on the tested statistic and the given hierarchical design, different types of permutations are performed. Under this procedure, the normality assumption usual in analysis of variance tests is no longer necessary, nor is it necessary to assume equality of variance among populations or groups of populations. A large number of permutations (1,000 or more) is necessary to obtain some accuracy on the final probability. A system of batches similar to those used in the exact test of linkage disequilibrium (see end of section 7.1.4.1) has been implemented to get an idea of the standard deviation of the P values.
We have implemented here 6 different types of hierarchical AMOVA. The number of hierarchical levels varies from two to four. In each of the situations, we describe the way the total sum of squares is partitioned, how the covariance components and the associated F-statistics are obtained, and which permutation schemes are used for the significance test.
Before enumerating all the possible situations, we introduce some notations:
SSD(T): Total sum of squared deviations
SSD(AG): Sum of squared deviations Among Groups of populations
SSD(AP): Sum of squared deviations Among Populations
SSD(AI): Sum of s
69. e Markov chain, we estimate the probability of observing a table less or equally likely than the observed sample configuration under the null hypothesis of panmixia. For haplotypic data, the table is built using sample haplotype frequencies (Raymond and Rousset 1995). For genotypic data with unknown gametic phase, the contingency table is built from sample genotype frequencies (Goudet et al. 1996). As was done previously, the error on the P value is estimated by partitioning the total number of steps into a given number of batches (see section 7.1.4.1).
7.2.7 Assignment of individual genotypes to populations
It can be of interest to try to determine the origin of particular individuals, knowing a list of potential source populations (e.g. Rannala and Mountain 1997, Waser and Strobeck 1998, Davies et al. 1999). The method we have implemented here is the simplest one, as it consists in determining the log-likelihood of each individual multi-locus genotype in each population sample, assuming that the individual comes from that population. For computing the likelihood, we simply use the allele frequencies estimated in each sample from the original constitution of the samples. We also assume that all loci are independent, such that the global individual likelihood is obtained as the product of the likelihood at each locus. The method we have implemented is inspired from that described in Paetkau et al. (1995, 1997) and Waser and Stro
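The log-likelihood computation described above can be sketched in Python as follows. The function name and data layout are hypothetical. The per-locus genotype probabilities assume Hardy-Weinberg proportions within each population, which is an assumption of this example, and alleles absent from a population sample are given a likelihood of zero, which may not match how Arlequin handles them.

```python
from math import inf, log


def genotype_log_likelihood(genotype, allele_freqs):
    """Log-likelihood of a multi-locus genotype in one population.

    genotype: list of (allele_a, allele_b) pairs, one per locus.
    allele_freqs: list of dicts mapping allele -> frequency, one per locus.
    Loci are treated as independent, so per-locus log-likelihoods are summed.
    """
    total = 0.0
    for (a, b), freqs in zip(genotype, allele_freqs):
        pa, pb = freqs.get(a, 0.0), freqs.get(b, 0.0)
        # Hardy-Weinberg proportions (assumption of this sketch).
        lik = pa * pa if a == b else 2 * pa * pb
        if lik == 0:
            return -inf
        total += log(lik)
    return total
```

Computing this value for the same genotype with the allele frequencies of two different samples gives the pair of coordinates used in a log-log plot of genotype likelihoods.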
70. e analysis for that locus.
Compute Population Specific FST's: Population-specific $F_{ST}$ indices will be computed, as defined in section 7.2.4, for all loci, and for each locus separately if the Locus by locus AMOVA option is checked. Note that this option is only available if a single group is defined in the [[Structure]] section. No test of these coefficients is performed, as they are only provided for exploratory purposes.
No. of permutations: Enter the number of permutations used to test the significance of covariance components and fixation indices. A value of zero will not lead to any testing procedure. Values of several thousands are in order for a proper testing scheme, and 16,000 permutations guarantee to have less than 1% difference with the exact probability in 99% of the cases (Guo and Thomson 1992). The number of permutations used by the program might be slightly larger. This is the consequence of the subdivision of the total number of permutations into batches for estimating the standard error of the P value. Note that if several covariance components need to be tested, the probability of each covariance component will be estimated with this number of permutations. The distribution of the covariance components is output into a tabulated text file called amo_hist.xl, which can be directly read into MS-EXCEL.
Compute Minimum Spanning Network (MSN) among haplotypes: A Minimum Spanning Tree and a Minimum Spanning Network are computed from th
71. e distance matrix used to perform the AMOVA calculations.
Choice of Euclidean square distances:
o Use project distance matrix: Use the distance matrix defined in the project file (if available).
o Compute distance matrix: Compute a given distance matrix based on a method defined below. With this setting selected, the distance matrix potentially defined in the project file will be ignored. This matrix can be generated either for haplotypic data or genotypic data (Michalakis and Excoffier 1996).
o Use conventional F-statistics: With this setting activated, we will use a lower diagonal distance matrix, with zeroes on the diagonal and ones as off-diagonal elements. It means that all distances between non-identical haplotypes will be considered as identical, implying that one will base the analysis of genetic structure only on allele frequencies.
Distance between haplotypes: Select a distance method to compute the distances between haplotypes. Different square Euclidean distances can be used depending on the type of data analyzed.
o Gamma a value: Set the value for the shape parameter a of the gamma function, when selecting a distance allowing for unequal mutation rates among sites. See the Molecular diversity section 7.1.2.5.
6.3.8.7.1.2 AMOVA with genotypic data
72. e having the same name as the project file and the .ars extension. These setting files are convenient when you want to repeat some analyses done previously, or when you want to make different types of computations on several projects, as is possible using batch files (see Batch files in section 3.6), giving you considerable flexibility on the analyses you can perform, and avoiding tedious and repetitive mouse clicks.
2.6 Performing the analyses
The selected analyses can be performed by clicking on the Start button.
If an error occurs during the execution, Arlequin will write diagnostic information in a log file. If the error is not too severe, Arlequin will open the web browser where you can consult the log file. If there is a memory error, Arlequin will shut itself down. In the latter case, you should consult the Arlequin log file before launching a new analysis, in order to get some information on where or at which stage of the execution the problem occurred. To do that, just reopen your last project and press the View Log File button on the toolbar. In any case, the file Arlequin_log.txt is located in the project results directory.
2.7 Interrupting the computations
The computations can be stopp
73. e together with the raw data. There are two ways to create Arlequin projects:
1) You can start from scratch and use a text editor to define your data using reserved keywords.
2) You can let Arlequin create the outline of a project by selecting the tab panel Project Wizard (see section Project Wizard 6.3.4).
[Screenshot: Project wizard tab, with the New project file name field, the Browse, Create project, and Edit project buttons, and the Data type, Controls, and Optional sections panels]
The controls on this tab panel allow you to specify the type of project outline that should be built. Use the Browse button to choose a name and a hard disk location for the project. Once all the settings have been chosen, the project outline is created by pressing the Create Project button. Note that it is not automatically loaded into Arlequin. The name of the data file should have the .arp extension (for ARlequin Project). You can then edit the project by pressing the Edit Project button. Note that this w
74. ect file. Arlequin can create the outline of a project file for you. This tab dialog should allow you to quickly define which type of data you have and some of its properties.
• Browse button: It allows you to specify the name and the directory location of the new project file. Pressing that button opens a File dialog box. The project file should have the extension .arp.
• Create project button: Press that button once you have specified all other properties of the project.
• Edit project button: This button becomes active once you have created an outline, and allows you to begin editing the outline and fill in some data.
• Data type: Specify which type of data you want to analyze (DNA, RFLP, Microsat, Standard, or Frequency). Specify if the data is under genotypic or haplotypic form. Specify if the gametic phase is known (for genotypic data only). Specify if there are recessive alleles (for genotypic data only).
• Controls: Specify the number of population samples defined in the project. Choose a locus separator. Specify the character coding for missing data.
• Optional sections: Specify if you want to include a global list of haplotypes. Specify if you want to include a predefined distance matrix. Specify if you want to include a group structure.
6.3.5 Import data
75. ed at any time by pressing either the Pause or the Stop button on the toolbar. After pressing the Pause button, computations can be resumed by pressing the Resume button. [Screenshot: toolbar with the Resume, Pause, and Stop buttons]
Note that by pressing the Stop button, you have no guarantee that the current computations give correct results. For very large project files, you may have to wait for a few seconds before the calculations are stopped.
2.8 Consulting the results
When the calculations are over, Arlequin will create a result directory, which has the same name as the project file, but with the .res extension. This directory contains all the result files, particularly the main result file, which has the same name as the project file, but with the .htm extension. After the computations, the result file [project name]_main.html is automatically loaded in the default HTML browser. You can also view your results at any time by clicking on the View results button.
3 INPUT FILES
3.1 Format of Arlequin input files
Arlequin input files are also called project files. The project files contain the description of the properties of the data, as well as the raw data themselves. The project file ma
76. ene genealogies within a species. Molecular variance parsimony. Genetics 136: 343-59.
Excoffier, L. and M. Slatkin. 1995. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921-927.
Excoffier, L. and M. Slatkin. 1998. Incorporating genotypes of relatives into a test of linkage disequilibrium. Am. J. Hum. Genet. 171-180.
Excoffier, L., Laval, G., and Balding, D. 2003. Gametic phase estimation over large genomic regions using an adaptive window approach. Human Genomics 1: 7-19.
Excoffier, L., Estoup, A., and Cornuet, J.-M. 2005. Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 169: 1727-1738.
Fu, Y.-X. 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915-925.
Gaggiotti, O. and L. Excoffier. 2000. A simple method of removing the effect of a bottleneck and unequal population sizes on pairwise genetic distances. Proceedings of the Royal Society London B 267: 81-87.
Garza, J.C. and Williamson, E.G. 2001. Detection of reduction in population size using data from microsatellite loci. Mol. Ecol. 10: 305-318.
Goudet, J., M. Raymond, T. de Meeüs, and F. Rousset. 1996. Testing differentiation in diploid populations. Genetics 144: 1933-1940.
Guo, S. and Thompson, E. 1992. Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48: 361-372.
Harpending, R. C. 1994. Signatu
77. equencies of the alleles.
Profile Title Frequency data NbSamples 2 GenotypicData 0 DataType FREQUENCY Data Samples SampleName Population 1 SampleSize 16 SampleData 000 1 001 3 002 1 7 4 003 004 SampleName Population 2 SampleSize 23 SampleData 000 3 001 002 003 004 PONOA
5.2 Example of standard data (genotypic data, unknown gametic phase, recessive alleles)
In this example, the individual genotypes for 5 HLA loci are output on two separate lines. We specify that the gametic phase between loci is unknown, and that the data has a recessive allele. We explicitly define it to be xxx. Note that with recessive data, all single-locus homozygotes are also considered as potential heterozygotes with a null allele. We also provide Arlequin with the minimum frequency for the estimated haplotypes to be listed (0.00001), and we define the minimum epsilon value (sum of haplotype frequency differences between two steps of the EM algorithm) to be reached for the EM algorithm to stop when estimating haplotype frequencies.
Profile Title Genotypic Data Phase Unknown 5 HLA loci NbSamples 1 GenotypicData 1 DataType STANDARD LocusSeparator WHITESPACE MissingData GameticPhase 0 RecessiveData 1 RecessiveAllele xxx Data Samples SampleName Population 1 SampleSize 63 SampleData MANO102 12
78. equin: Some information about Arlequin, its authors, contact address, and the Swiss NSF grants that supported its development.
6.2 Toolbar
Arlequin's toolbar contains icons that are shortcuts to some commonly used menu items, as shown below. Clicking on one of these icons is equivalent to activating the corresponding menu item. [Screenshot: toolbar with the Open project, View project, View results, View Log file, Close project, Start, Pause, and Stop icons]
6.3 Tab dialogs
Most of the methods implemented in Arlequin can be computed irrespective of the data type. Nevertheless, the testing procedure used for a given task (e.g. linkage disequilibrium test) may depend on the data type. The aim of this section is to give an overview of the numerous options which can be set up for the different analyses. The items that appear grayed in Arlequin's dialog boxes indicate that a given task is impossible in the current situation. For example, if you open a project containing haplotypic data, it is not possible to test for Hardy-Weinberg equilibrium; likewise, for STANDARD data, it is not possible to set up the transversion or transition weights, which can only be set up for DNA data. Arlequin's interface usually prevents the user from selecting tasks that are impossible to perform, or from setting up parameters that are not taken into account in the analyses. When describing the different dialog boxes accessible in Arlequin, we have sometimes used t
79. [Screenshot: settings tree listing Haplotype inference, Linkage disequilibrium, Hardy-Weinberg, Pairwise linkage, Mantel test, Mismatch distribution, Molecular diversity indices, and Neutrality tests]
You can navigate in the tree on the left side to select the different types of computations you wish to set up. Depending on your selection, the right part of the tab dialog will show you different parameters to set up.
2.5 Creating and using Setting Files
By settings, we mean any alternative choice of analyses and their parameters that can be set up in Arlequin. As you can choose different types of analyses, as well as different options for each of these analyses, all these choices can be saved into setting files. These files generally take the same name as the project files, but with the extension .ars. Setting files can be created at any time during your work by clicking on the Save button on top of the settings tree. Alternatively, if you activate Use associated settings in the Arlequin configuration pane (see Arlequin configuration, section 2.1), the settings last used on this project will be automatically saved when you close the project, and reloaded when you open it again later. The settings are stored in a fil
80. es.
• Incorporation of a least-square approach to estimate the parameters of an instantaneous spatial expansion from DNA sequence diversity within samples, and computation of bootstrap confidence intervals using coalescent simulations.
• Estimation of confidence intervals for F-statistics, using a bootstrap approach when genetic data on more than 8 loci are available.
• Update of the JavaScript routines in the output HTML files, making them fully compatible with Firefox 1.X.
• A completely rewritten and more robust input file parsing procedure, giving more precise information on the location of potential syntax and format mistakes.
• Use of the ELB algorithm described above to generate samples of phased multi-locus genotypes, which allows one to analyse unphased multi-locus genotype data as if the phase was known. The phased data sets are output in Arlequin projects that can be analysed in batch mode to obtain the distribution of statistics taking phase uncertainty into account.
• No need to define a web browser for consulting the results. Arlequin will automatically present the results in your default web browser (we recommend the use of Firefox, freely available on http://www.mozilla.org/products/firefox/central.html).
1.13.2 Version 3.01 compared to version 3.0
Arlequin 3.01 includes some bug corrections and some additional features.
Additions:
• New editor of genetic structure, allowing one to modify the current Genetic Structure direct
81. es not correspond to the number of defined groups, then calculations will not be possible, and an error message will be displayed.
• The group definitions
  Notation: Group
  Possible values: A list containing the names of the samples belonging to the group, entered within braces. Repeat this for as many groups as you have in your structure. It is of course not allowed to put the same population in different groups. Also note that a comment sign is not allowed after the opening brace, and would lead to an error message. Comments about the group should therefore be placed before the definition of the group.
  Example: NbGroups=2 Group={ population1 population2 population3 } Group={ population4 population5 }
A new Genetic Structure Editor is now available to help you with the process of defining the genetic structure to be tested, see section Defining the Genetic Structure to be tested 2 2 1 1 2 2 4 Mantel test settings
This subsection allows you to specify some distance matrices (Ymatrix, X1 and X2). The goal is to compute a correlation between the Ymatrix and X1, or a partial correlation between the Ymatrix, X1 and X2. The Ymatrix can be either a pairwise population Fst matrix or a custom matrix entered into the project by the user. X1 and X2 have to be defined in the project. This subsection starts with the keyword Mantel. The matrices which are used to test correla
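The two quantities targeted by the Mantel test settings, a correlation between the Ymatrix and X1 and a partial correlation between the Ymatrix, X1 and X2, can be illustrated with the short Python sketch below. It is a minimal illustration under stated assumptions: the correlation is taken as Pearson's r over the off-diagonal elements of lower diagonal matrices, the partial correlation uses the usual first-order formula, and all function names are hypothetical. The permutation procedure used to assess significance is not shown.

```python
from math import sqrt


def lower_triangle(matrix):
    """Return the off-diagonal elements of a lower diagonal (or square) matrix."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i)]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel_correlation(y_matrix, x1):
    """Correlation between the Ymatrix and X1 (assumed: Pearson's r)."""
    return pearson(lower_triangle(y_matrix), lower_triangle(x1))


def partial_mantel_correlation(y_matrix, x1, x2):
    """Correlation between the Ymatrix and X1, controlling for X2."""
    r_y1 = mantel_correlation(y_matrix, x1)
    r_y2 = mantel_correlation(y_matrix, x2)
    r_12 = mantel_correlation(x1, x2)
    return (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))


if __name__ == "__main__":
    # Toy matrices for three populations, given as lower diagonal rows.
    y_matrix = [[0.0], [1.0, 0.0], [2.0, 4.0, 0.0]]
    x1 = [[0.0], [3.0, 0.0], [5.0, 9.0, 0.0]]
    x2 = [[0.0], [1.0, 0.0], [0.0, 0.0, 0.0]]
    print(mantel_correlation(y_matrix, x1))
    print(partial_mantel_correlation(y_matrix, x1, x2))
```

Only the elements below the diagonal are used, so the matrices can be entered in the same lower diagonal layout as in a project file.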
82. f Haplotype definition HaplListName list1 (give any name you wish to this list) HaplList EXTERN hapl_file.hap
3.2.2.2 Distance matrix (optional)
Here, a matrix of genetic distances between haplotypes can be specified. This section is here to provide some compatibility with earlier WINAMOVA files. The distance matrix must be a lower diagonal matrix, with zeroes on the diagonal. This distance matrix will be used to compute the genetic structure specified in the genetic structure section. As specified in AMOVA, the elements of the matrix should be squared Euclidean distances. In practice, they are an evaluation of the number of mutational steps between pairs of haplotypes. One also has to provide the labels of the haplotypes for which the distances are computed. The order of these labels must correspond to the order of rows and columns of the distance matrix. If a haplotype list is also provided in the project, the labels and their order should be the same as those given for the haplotype list. Usually, it will be much more convenient to let Arlequin compute the distance matrix by itself. It is also possible to have the definition of the distance matrix given in an external file. Use the keyword EXTERN followed by the name of the file containing the definition of the matrix. Read Example 2 to see how to proceed.
Example 1: DistanceMatrix start the distance matrix
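The constraints stated above for a user-supplied distance matrix (lower diagonal shape, zeroes on the diagonal, one label per row, labels in row order) can be checked before running an analysis. The Python sketch below is only an illustration of those rules; the function names and the in-memory representation (a list of labels and a list of rows) are assumptions, and nothing here reads an actual project file.

```python
def check_distance_matrix(labels, rows):
    """Raise ValueError if the matrix breaks one of the stated rules."""
    if len(labels) != len(rows):
        raise ValueError("one label is needed per matrix row")
    if len(set(labels)) != len(labels):
        raise ValueError("haplotype labels must be unique")
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i} must hold {i + 1} values (lower diagonal)")
        if row[i] != 0:
            raise ValueError(f"diagonal element of row {i} must be zero")
        if any(d < 0 for d in row):
            raise ValueError("squared Euclidean distances cannot be negative")


def distance(labels, rows, first, second):
    """Look up the distance between two labelled haplotypes."""
    i, j = labels.index(first), labels.index(second)
    if i < j:
        i, j = j, i  # the matrix only stores the lower triangle
    return rows[i][j]


if __name__ == "__main__":
    labels = ["h1", "h2", "h3"]       # hypothetical haplotype labels
    rows = [[0], [2, 0], [1, 3, 0]]   # hypothetical numbers of mutational steps
    check_distance_matrix(labels, rows)
    print(distance(labels, rows, "h2", "h3"))
```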
83. ferences among haplotypes within a population, $\sigma_b^2$ the covariance component due to differences among haplotypes in different populations within a group, and $\sigma_a^2$ the covariance component due to differences among the G populations. The same framework could be extended to additional hierarchical levels, such as to accommodate, for instance, the covariance component due to differences between haplotypes within diploid individuals. Note that in the case of a simple hierarchical genetic structure consisting of haploid individuals in populations, the implemented form of the algorithm leads to a fixation index $F_{ST}$ which is absolutely identical to the weighted average F-statistic over loci, $\hat\theta_w$, defined by Weir and Cockerham (1984) (see Michalakis and Excoffier 1996 for a formal proof). In terms of inbreeding coefficients and coalescence times, this $F_{ST}$ can be expressed as
$$F_{ST} = \frac{f_0 - f_1}{1 - f_1} = \frac{\bar t_1 - \bar t_0}{\bar t_1} \quad \text{(Slatkin 1991)},$$
where $f_0$ is the probability of identity by descent of two different genes drawn from the same population, $f_1$ is the probability of identity by descent of two genes drawn from two different populations, $\bar t_1$ is the mean coalescence time of two genes drawn from two different populations, and $\bar t_0$ is the mean coalescence time of two genes drawn from the same population. The significance of the fixation indices is tested using a non-parametric permutation approach described in Excoffier et al. (1992), consisting in permuting haplotypes in
84. gametic phases estimated with the ELB algorithm, as well as a batch file ELB_PhaseDistribution.arb listing all aforementioned project files. The file ELB_Best_Phases.arp can then be analyzed as if gametic phases were known for the different samples. Keep in mind, however, that the gametic phases are not necessarily correct, and that analyses assuming that the gametic phase is known will not take into account possible gametic phase estimation errors.
6.3.3 Arlequin Configuration
[Screenshot: Arlequin Configuration tab, with the Use associated settings, Append results, Keep AMOVA null distributions, and Prompt for handling unphased multi-locus data checkboxes, and the Helper programs panel with a Text editor path]
Different options can be specified in this tab dialog.
• Use associated settings: By checking the Use associated settings checkbox, the settings and options last specified for your project will be used when opening a project file. When closing a project file, Arlequin automatically saves the current calculation settings for that particular project. Check this box if you want Arlequin to automatically load the settings associated with each project. If this box is unchecked, the same settings will be used for diffe
85. [Screenshot: ELB algorithm settings, next to the settings tree]
• Use ELB algorithm to estimate gametic phase [b]: Check this box if you want to estimate the gametic phase of multi-locus genotypes with the ELB algorithm. See the methodological section on the ELB algorithm (7.1.3.2.3) for a description of the algorithm.
• Dirichlet prior alpha value [f]: Value of the alpha parameter of the prior Dirichlet distribution of haplotype frequencies. Recommended value: a small value like 0.01 for all data types has been found to work well (Excoffier et al. 2003; see section 7.1.3.2.3 for details).
• Epsilon value [f]: Value of the parameter controlling how much haplotypes differing by a single mutation from potentially present haplotypes are weighted. Recommended values: 0.1 for microsatellite data, and 0.01 for other data types (see section 7.1.3.2.3 for details).
• Heterozygote site influence zone [i]: Defines the number of sites adjacent to heterozygote sites that need to be taken into account when computing haplotype frequencies in the Gibbs chain. A value of zero implies that the gametic phase will be estimated only on the basis of heterozygote sites. A negative value will indicate that all sites (homozygotes and heterozygotes) will
86. haplotypes Chakraborty's neutrality tests DNA sequence Sample size Maximum length 2 000 1 000 100 000
Other limitations:
• Line length in input file is limited to 100,000 characters.
• Interleaved format is not supported in Arlequin. This concerns haplotype definition, multilocus genotypes, and distance matrices.
1.10 How to cite Arlequin
Excoffier, L., G. Laval, and S. Schneider (2005) Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online 1: 47-50.
1.11 Acknowledgements
This program has been made possible by Swiss NSF grants No 32 37821 93, 32 047053 96, and 31 56755 99. Many thanks to: David Roessli, Samuel Neuenschwander, Carlo Largiadèr, Pierre Berthier, Mathias Currat, Guillaume Laval, Nicolas Ray, Gerald Heckel, Sabine Fink, Daniel Wegmann, Jean-Marc Kuffer, Yannis Michalakis, Thierry Pun, Montgomery Slatkin, David Balding, Peter Smouse, Oscar Gaggiotti, Alicia Sanchez-Mazas, Isabelle Dupanloup, Estella Poloni, Giorgio Bertorelle, Guido Barbujani, Michele Belledi, Evelyne Heyer, Erika Bucheli, Alex Widmer, Philippe Jarne, Frédérique Viard, Peter de Knijff, Peter Beerli, Matthew Hurles, Mark Stoneking, Rosalind Harding, Frank Struyf, A. J. Gharrett, Jennifer Ovenden, Steve Carr, Marc Allard, Omar Chassin, Alonso Santos, John Novembre, Nelson Fagundes, Eric Minch, Pierre Darl
87. haplotypes, and $p_i$ is the sample frequency of the i-th haplotype. Note that Arlequin outputs the standard deviation of the heterozygosity, computed as $\mathrm{s.d.}(\hat H) = \sqrt{V(\hat H)}$. Reference: Nei 1987, p. 180.
7.1.1.2 Expected heterozygosity per locus
For each locus, Arlequin provides an estimation of the expected heterozygosity, simply as k yy A 2 Ha 2 p i
7.1.1.3 Number of usable loci
Number of loci that show less than a specified amount of missing data. The maximum amount of missing data must be specified in the General Settings tab dialog.
7.1.1.4 Number of polymorphic sites (S)
Number of usable loci that show more than one allele per locus.
7.1.1.5 Allelic range (R)
For MICROSAT data, it is the difference between the maximum and the minimum number of repeats.
7.1.1.6 Garza-Williamson index (G-W)
Following Garza and Williamson (2001), the G-W statistic is given as $G\text{-}W = \frac{k}{R+1}$, where k is the number of alleles at a given locus in a population sample, and R is the allelic range. Originally, the denominator was defined as just R in Garza and Williamson (2001), but this could lead to a division by zero if a sample is monomorphic. This adjustment was introduced in Excoffier et al. (2005). This statistic was shown to be sensitive to population bottlenecks, because the number of alleles is usually more reduced than the range by a recent reduction in population size, such that the distr
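As an illustration of the allelic range and of the Garza-Williamson index described above, here is a minimal Python sketch. It assumes that the alleles of one microsatellite locus are given as a list of repeat numbers, one per gene copy; the function names are hypothetical.

```python
def allelic_range(repeat_numbers):
    """Difference between the maximum and the minimum number of repeats (R)."""
    return max(repeat_numbers) - min(repeat_numbers)


def garza_williamson(repeat_numbers):
    """G-W statistic k / (R + 1) for one locus in one population sample."""
    k = len(set(repeat_numbers))  # number of distinct alleles at the locus
    return k / (allelic_range(repeat_numbers) + 1)


if __name__ == "__main__":
    # All repeat numbers between 10 and 12 are present: no vacant position.
    print(garza_williamson([10, 11, 12, 12]))
    # Only the two extreme alleles remain: three vacant positions.
    print(garza_williamson([10, 14, 14, 10]))
    # A monomorphic sample no longer causes a division by zero.
    print(garza_williamson([10, 10, 10]))
```

The +1 in the denominator is what keeps the monomorphic case defined, since the allelic range is then zero.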
88. hat a stationary haploid population at equilibrium has suddenly passed, t generations ago, from a population size of $N_0$ to $N_1$, then the probability of observing S differences between two randomly chosen non-recombining haplotypes is given by
$$F_S(\tau, \theta_0, \theta_1) = F_S(\theta_1) + \exp\left(-\tau\,\frac{\theta_1 + 1}{\theta_1}\right) \sum_{j=0}^{S} \frac{\tau^j}{j!} \left[ F_{S-j}(\theta_0) - F_{S-j}(\theta_1) \right] \quad \text{(Li 1977)},$$
where $F_S(\theta) = \frac{\theta^S}{(\theta + 1)^{S+1}}$ is the probability of observing two random haplotypes with S differences in a stationary population (Watterson 1975), $\theta_0 = 2uN_0$, $\theta_1 = 2uN_1$, $\tau = 2ut$, and u is the mutation rate for the whole haplotype. Rogers (1995) has simplified the above equation by assuming that $\theta_1 \to \infty$, implying there are no coalescent events after the expansion, which is only reasonable if the expansion size is large. With this simplifying assumption, it is possible to derive the moment estimators of the time to the expansion ($\tau$) and the mutation parameter ($\theta_0$) as
$$\hat\theta_0 = \sqrt{v - m}, \qquad \hat\tau = m - \hat\theta_0 \quad \text{(Rogers 1995)},$$
where m and v are the mean and the variance of the observed mismatch distribution, respectively. These estimators can then be used to plot $F_S(\hat\tau, \hat\theta_0, \infty)$ values. Note, however, that this estimation cannot be done if the variance of the mismatch distribution is smaller than the mean. However, Schneider and Excoffier (1999) find that this moment estimator often leads to an underestimation of the age of the expansion ($\tau$). They rather propose to estimate the parameters of the demographic expansion by a generalized non-linear least-square approach
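The moment estimators of Rogers (1995) mentioned above are simple enough to sketch in Python. The sketch below is an assumption-laden illustration: it takes the estimators as theta0 = sqrt(v - m) and tau = m - theta0, represents the observed mismatch distribution as a list of pair counts indexed by the number of differences, and uses the plain (not bias-corrected) variance. The function name is hypothetical.

```python
from math import sqrt


def rogers_moment_estimators(mismatch_counts):
    """Return (theta0, tau) from an observed mismatch distribution.

    mismatch_counts[s] is the number of haplotype pairs showing s differences.
    None is returned when the variance is smaller than the mean, in which
    case the estimation cannot be done.
    """
    total = sum(mismatch_counts)
    mean = sum(s * c for s, c in enumerate(mismatch_counts)) / total
    variance = sum(c * (s - mean) ** 2
                   for s, c in enumerate(mismatch_counts)) / total
    if variance < mean:
        return None
    theta0 = sqrt(variance - mean)
    tau = mean - theta0
    return theta0, tau


if __name__ == "__main__":
    # Hypothetical distribution: one pair with 0 and one pair with 4 differences.
    print(rogers_moment_estimators([1, 0, 0, 0, 1]))
    # All pairs show exactly 2 differences: variance 0 is below the mean.
    print(rogers_moment_estimators([0, 0, 5]))
```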
89. he following symbols to specify which types of user input were expected:
[f] parameter to be set in the dialog box as a floating number
[i] parameter to be set in the dialog box as an integer
[b] check box (two states: checked or unchecked)
[m] multiple selection radio buttons
[l] list box, allowing the selection of an item in a downward scrolling list
[r] read-only setting, cannot be changed by the user
6.3.1 Open project
[Screenshot: Open Arlequin project or batch file dialog box]
In this dialog box, you can locate an existing Arlequin project on your hard disk. Alternatively, you can use the File | Recent Projects menu to reload one of the last 10 projects on which you worked.
[Screenshot: File menu with the Open project, Recent projects, Load settings, Save settings, Save settings as, and Exit items, and a list of recently opened .arp and .arb files]
6.3.2 Handling of unphased genotypic data
Handling unph
90. htm. The following figure illustrates how results are presented in your HTML browser.
[Figure: Arlequin Result Browser for MicDipl.arp in Mozilla Firefox. The left pane shows a navigation tree (Arlequin log file, run date, Settings, Genetic structure, AMOVA, FIS per pop, Locus by locus AMOVA, Samples); the right pane shows the standard diversity indices for the sample Mari Nord.]
5 EXAMPLES OF INPUT FILES
5.1 Example of allele frequency data
The following example is a file containing FREQUENCY data. The allelic composition of the individuals is not specified. The only information we have are the fr
91. i (Weir and Cockerham 1984; Michalakis and Excoffier 1996).
7.1.2.9 Minimum Spanning Network among haplotypes
We have implemented the computation of a Minimum Spanning Tree (MST) (Kruskal 1956; Prim 1957) between OTUs (Operational Taxonomic Units). The MST is computed from the matrix of pairwise distances calculated between all pairs of haplotypes, using a modification of the algorithm described in Rohlf (1973). The Minimum Spanning Network embedding all MSTs (see Excoffier and Smouse 1994) is also provided. This implementation is the translation of a standalone program written in Pascal called MINSPNET.EXE running under DOS, formerly available on http anthropologie unige ch LGB software win min span net.
7.1.3 Haplotype inference
7.1.3.1 Haplotypic data, or genotypic data with known gametic phase
If haplotype i is observed $x_i$ times in a sample containing n gene copies, then its estimated frequency ($\hat p_i$) is given by
$$\hat p_i = \frac{x_i}{n},$$
whereas an unbiased estimate of its sampling variance is given by
$$V(\hat p_i) = \frac{\hat p_i (1 - \hat p_i)}{n - 1}.$$
7.1.3.2 Genotypic data with unknown gametic phase
7.1.3.2.1 EM algorithm
Maximum-likelihood haplotype frequencies can be estimated using an Expectation-Maximization (EM) algorithm (see e.g. Dempster et al. 1977; Excoffier and Slatkin 1995; Lange 1997; Weir 1996). This procedure is an iterative process aiming at obtaining maximum-likelihood estimates of haplotype frequencies from multi-locus genotype data when the gametic phase
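For haplotypic data, the frequency estimate and its sampling variance reduce to simple counting. The Python sketch below illustrates this, assuming the usual estimators p = x / n and V(p) = p (1 - p) / (n - 1), and assuming that the sample is given as a list with one haplotype label per gene copy; the function name is hypothetical.

```python
from collections import Counter


def haplotype_frequencies(gene_copies):
    """Map each haplotype to (estimated frequency, sampling variance)."""
    n = len(gene_copies)
    estimates = {}
    for haplotype, x in Counter(gene_copies).items():
        p = x / n                        # estimated frequency
        variance = p * (1 - p) / (n - 1)  # unbiased sampling variance
        estimates[haplotype] = (p, variance)
    return estimates


if __name__ == "__main__":
    # Hypothetical sample of four gene copies carrying two haplotypes.
    for name, (p, variance) in haplotype_frequencies(["A", "A", "A", "B"]).items():
        print(name, p, variance)
```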
92. ibution of allele length will show vacant positions. Therefore, the G-W statistic is supposed to be very small in populations having been through a bottleneck, and close to one in stationary populations. Here, we just report the statistic, but do not provide any test.
7.1.2 Molecular indices
7.1.2.1 Mean number of pairwise differences ($\pi$)
Mean number of differences between all pairs of haplotypes in the sample. It is given by
$$\hat\pi = \frac{n}{n-1} \sum_{i=1}^{k} \sum_{j=1}^{k} p_i p_j \hat d_{ij},$$
where $\hat d_{ij}$ is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, $p_i$ is the frequency of haplotype i, and n is the sample size. The total variance (over the stochastic and the sampling process), assuming no recombination between sites and selective neutrality, is obtained as
$$V(\hat\pi) = \frac{3n(n+1)\hat\pi + 2(n^2 + n + 3)\hat\pi^2}{11n^2 - 7n + 6} \quad \text{(Tajima 1993)}.$$
Note that similar formulas are also used for Microsat and Standard data, even though the underlying assumptions of the model may be violated. Note also that Arlequin outputs the standard deviation, computed as $\mathrm{s.d.}(\hat\pi) = \sqrt{V(\hat\pi)}$. References: Tajima 1983; Tajima 1993.
7.1.2.2 Nucleotide diversity or average gene diversity over L loci
It is computed here as the probability that two randomly chosen homologous nucleotide or RFLP sites are different. It is equivalent to the gene diversity at the nucle
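The mean number of pairwise differences can be illustrated with a short Python sketch. It makes several assumptions that should be read as such: the distance between two haplotypes is taken as the simple number of differing sites, the estimator is written as n / (n - 1) times the double sum of p_i p_j d_ij over haplotypes (which is the plain average over all pairs of gene copies), and the variance follows the expression attributed to Tajima (1993). All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_differences(first, second):
    """Number of sites at which two aligned haplotypes differ."""
    return sum(1 for a, b in zip(first, second) if a != b)


def mean_pairwise_differences(gene_copies):
    """Estimate pi from haplotype frequencies and pairwise distances."""
    n = len(gene_copies)
    counts = Counter(gene_copies)
    total = 0.0
    for hap_i, x_i in counts.items():
        for hap_j, x_j in counts.items():
            total += (x_i / n) * (x_j / n) * count_differences(hap_i, hap_j)
    return n / (n - 1) * total


def total_variance(pi, n):
    """Variance of pi over the stochastic and the sampling process."""
    return (3 * n * (n + 1) * pi + 2 * (n * n + n + 3) * pi * pi) / (
        11 * n * n - 7 * n + 6)


if __name__ == "__main__":
    sample = ["AAT", "AAT", "ACT", "GCT"]  # hypothetical aligned haplotypes
    pi = mean_pairwise_differences(sample)
    # Cross-check against the direct average over all pairs of gene copies.
    pairs = list(combinations(sample, 2))
    direct = sum(count_differences(a, b) for a, b in pairs) / len(pairs)
    print(pi, direct, total_variance(pi, len(sample)) ** 0.5)
```

The cross-check in the usage part shows why the n / (n - 1) factor appears: it removes the comparisons of each gene copy with itself from the double sum.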
93. ical, implying that one will base the analysis of genetic structure only on allele frequencies.
6.3.8.7.3 Population differentiation
[Screenshot: Population differentiation settings, with the Exact test of population differentiation checkbox, the No. of steps in Markov chain, No. of dememorization steps, and Significance level fields, and the Generate histogram and table checkbox]
• Exact test of population differentiation [b]: We test the hypothesis of a random distribution of the individuals between pairs of populations, as described in Raymond and Rousset (1995) and Goudet et al. (1996). This test is analogous to Fisher's exact test on a two-by-two contingency table, but extended to a contingency table of size two by (no. of haplotypes). We also do an exact differentiation test for all populations defined in the project by const
94. icular table corresponds to its actual probability under the null hypothesis of linkage equilibrium. A particular table is modified according to the following rules (see also Guo and Thompson 1992, or Raymond and Rousset 1995):
1. We select in the table two distinct lines $i_1, i_2$ and two distinct columns $j_1, j_2$ at random.
2. The new table is obtained by decreasing the counts of the cells $(i_1, j_1)$, $(i_2, j_2)$ and increasing the counts of the cells $(i_1, j_2)$, $(i_2, j_1)$ by one unit. This leaves the marginal allele counts $n_i$ unchanged.
3. The switch to the new table is accepted with a probability equal to
$$R = \frac{n_{i_1 j_1}\, n_{i_2 j_2}}{(n_{i_1 j_2} + 1)(n_{i_2 j_1} + 1)},$$
where the counts are those of the current table, and R is just the ratio of the probabilities of the two tables.
Steps 1-3 are repeated a large number of times to explore a large part of the space of all possible contingency tables having identical marginal counts. In order to start from a random initial position in the Markov chain, the chain is explored for a pre-defined number of steps (the dememorization phase) before the probabilities of the switched tables are compared to that of the initial table. The number of dememorization steps should be large enough (some thousands) to allow the Markov chain to forget its initial state, and make it independent from its starting point. The P-value of the test is then taken as the proportion of the visited tables having a probability smaller than or equal to that of the observed contingency table. A
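The three switching rules and the dememorization phase can be sketched as a small Markov chain in Python. This is an illustration under stated assumptions, not Arlequin's implementation: the probability of a table with fixed margins is taken as proportional to the inverse of the product of the factorials of its cells, a switch is accepted with probability min(1, R), R is computed from the counts of the current table, and all function names and default values are hypothetical.

```python
import random
from math import lgamma


def log_table_weight(table):
    """Log-probability of a table, up to a constant fixed by the margins."""
    return -sum(lgamma(count + 1) for row in table for count in row)


def switch_step(table, rng):
    """Propose one switch (rules 1 to 3); return True if it was accepted."""
    i1, i2 = rng.sample(range(len(table)), 2)     # two distinct lines
    j1, j2 = rng.sample(range(len(table[0])), 2)  # two distinct columns
    # Ratio of the probabilities of the new and the current table.
    ratio = (table[i1][j1] * table[i2][j2]) / (
        (table[i1][j2] + 1) * (table[i2][j1] + 1))
    if rng.random() < ratio:  # never accepted when a decreased cell is empty
        table[i1][j1] -= 1
        table[i2][j2] -= 1
        table[i1][j2] += 1
        table[i2][j1] += 1
        return True
    return False


def exact_test_p_value(observed, n_dememorization=1000, n_steps=10000, seed=1):
    """Proportion of visited tables at most as probable as the observed one."""
    rng = random.Random(seed)
    table = [row[:] for row in observed]
    observed_weight = log_table_weight(observed)
    for _ in range(n_dememorization):  # let the chain forget its start
        switch_step(table, rng)
    hits = 0
    for _ in range(n_steps):
        switch_step(table, rng)
        if log_table_weight(table) <= observed_weight + 1e-9:
            hits += 1
    return hits / n_steps


if __name__ == "__main__":
    print(exact_test_p_value([[5, 2, 1], [3, 4, 2], [1, 1, 6]]))
```

Because each switch removes one unit from two cells and adds one unit to two others in the same lines and columns, the marginal counts never change, which is what keeps the chain inside the set of tables compatible with the observed allele counts.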
95. [Screenshot: Import data tab, with the Source File Browse button, the source Format and Target Format lists, and the Load in Arlequin after translation checkbox]
With this dialog box, you can quickly translate data into several other file formats often used in population genetics analyses. The currently supported formats are:
• Arlequin
• GenePop ver. 1.0
• Phylip ver. 3.5
• Mega ver. 1.0
• Biosys ver. 1.0
• Win Amova ver. 1.55
The translation procedure is as follows:
1. Select the source file with the upper left Browse button.
2. Select the format of the source data file, as well as that of the target file.
3. A default extension, depending on the data format, is automatically given to the target file.
4. The file conversion is launched by pressing the Translate button.
5. In some cases, you might be asked for some additional information, for instance if input data is split into several input files, like in WinAmova.
6. If you have selected the translation of a data file into the Arlequin file format, you will have the option to load the newly created project file into the Arlequin Java Interface.
6.3.6 Loaded Project
96. iles This will be the Ymatrix DistMatMantel 0 00 1 20 0 00 1 17 0 84 1 00 1 23 2 12 0 44 0 00 0 23 0 21 This will be X1 DistMatMantel 0 00 3 20 0 00 2423 1773 299 2023 2623 162 0 00 0 35 1 54 0 00 0 12 0 00 2 32 UsedYMatrixLabels Mao WOM moi TAN LAM 3 3 Example of an input file 00 00 36 The following small example is a project file containing four populations The data type is STANDARD genotypic data with unknown gametic phase Profile Title Fake HLA data NbSamples 4 GenotypicData 1 GameticPhase 0 DataType STANDARD LocusSeparator WHIT MissingData ESPACE Data Samples Samp SampleSize 6 SampleData 1 1 1104 0200 0700 0301 3 3 0302 0200 1310 0402 4 2 0402 0602 1502 0602 Sampl SampleSize 11 SampleData i 1103 0301 2 4 1101 0700 0301 0200 0301 0200 leName A sample of 6 Algerians eName A sample of 11 Bulgarians Manual Arlequin ver 3 1 1500 0301 1103 1202 0301 1500 1600 1301 SampleName A sample o SampleSize 12 SampleData 1 SampleName A sampl 2 SampleSize 8 SampleData 219 239 249 250 254 Structure StructureNa NbGroups Group TA TA Group WA WA 2 sa sa sa sa mpl mpl mpl mpl 1104 1600 1303 1101 1502 1500 101 1101 1302 1101 1500 0402 0301 0101 0301 0301
97. imation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9: 678-687.
Tamura, K. and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
Uzell, T. and K. W. Corbin. 1971. Fitting discrete probability distribution to evolutionary events. Science 172: 1089-1096.
Waser, P. M. and C. Strobeck. 1998. Genetic signatures of interpopulation dispersal. TREE 43-44.
Watterson, G. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256-276.
Watterson, G. 1978. The homozygosity test of neutrality. Genetics 88: 405-417.
Watterson, G. A. 1986. The homozygosity test after a change in population size. Genetics 112: 899-907.
Weir, B. S. 1996. Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sinauer Assoc., Inc., Sunderland, MA, USA.
Weir, B. S. and C. C. Cockerham. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
Weir, B. S. and W. G. Hill. 2002. Estimating F-statistics. Annu. Rev. Genet. 36: 721-750.
Wright, S. 1951. The genetical structure of populations. Ann. Eugen. 15: 323-354.
Wright, S. 1965. The interpretation of population structure by F-statistics with special regard to systems of mating. Evol. 19: 395-420.
Zouros, E. 1979. Mutation rates
98. information for the phase updates.

ELB starts with an arbitrary phase assignment for all individuals in the sample. Associated with each heterozygous locus is a window containing the locus itself and neighboring loci. At each iteration of the algorithm, an individual is chosen at random and its heterozygous loci are successively visited in random order. At each locus visit, two attempts are made to update that window by proposing, and then accepting or rejecting, (i) the addition of a locus at one end of the window, and (ii) the removal of a locus at the other end. The locus being visited is never removed from the window, and each window always includes at least one other heterozygous locus. The two update proposals are made sequentially, so that the window can either grow by one locus, shrink by one locus, or, if both changes are accepted, slide by one locus to the right or to the left. If both proposals are rejected, the window remains unchanged. Next, the phase at the locus being visited is updated, based on the current haplotype pairs within the chosen window of the other individuals in the sample.

7.1.3.2.3.1 Phase updates

Let $h_{11}$ and $h_{12}$ denote the two haplotypes within the window given the current phase assignment, and let $h_{21}$ and $h_{22}$ denote the haplotypes which would result from the alternative phase assignment at the locus being visited. Ideally, we would wish to choose between the two haplotype assignments h 1
99. ion | Possible values

Data, Structure (facultative section)

Keyword: StructureName
  Description: The name of a given genetic structure to test.
  Possible values: A string of characters within quotation marks.
Keyword: NbGroups
  Description: The number of groups of populations.
  Possible values: An integer larger than zero.
Keyword: Group
  Description: The definition of a group of samples, identified by their SampleName listed within braces.
  Possible values: A series of strings within quotation marks, all enclosed within braces and, if desired, on separate lines.

Data, Mantel (facultative section)
Allows computing the partial correlation between YMatrix and X1, X2.

Keyword: MatrixSize
  Description: The size of the matrix entered into the project.
  Possible values: An integer larger than zero.
Keyword: YMatrix
  Description: Specifies which matrix is used as YMatrix.
  Possible values: fst, log_fst, slatkinlinearfst, log_slatkinlinearfst, nm, custom.
Keyword: MatrixNumber
  Description: Number of matrices to be compared with the YMatrix.
  Possible values: 1 (we compute the correlation between YMatrix and X1); 2 (we compute the partial correlation between YMatrix, X1 and X2).
Keyword: YMatrixLabels
  Description: Labels to identify the entries of the YMatrix. In case of YMatrix = fst, these labels should correspond to population names in the sample.
  Possible values: A series of strings within quotation marks, all enclosed within braces and, if desired, on separate lines.
Keyword: DistMatMantel
  Description: A keyword used to define a matrix, which can be either the Ymatrix or another matrix that will be compared with the
  Possible values: The matrix data will be entered as a format free lower diagonal m
100. ionship at equilibrium between migration and drift:

$$F_{ST} = \frac{1}{2M + 1}$$

Therefore M, which is the absolute number of migrants exchanged between the two populations, can be estimated by

$$M = \frac{1 - F_{ST}}{2F_{ST}}$$

If one was to consider that the two populations only exchange with each other and with no other populations, then one should divide the quantity M by a factor 2 to obtain an estimator M = Nm for haploid populations, or M = 2Nm for diploid populations. This is because the expectation of $F_{ST}$ is indeed given by

$$F_{ST} = \frac{1}{\frac{4Nmd}{d - 1} + 1}$$

(e.g. Slatkin 1991), where d is the number of demes exchanging genes. When d is large, this tends towards the classical value $1/(4Nm + 1)$, but when d = 2, the expectation of $F_{ST}$ is $1/(8Nm + 1)$.

7.2.5.4 Nei's average number of differences between populations

As additional genetic distances between populations, we also provide Nei's raw (D) and net ($D_A$) number of nucleotide differences between populations (Nei and Li 1979). D and $D_A$ are respectively computed between populations 1 and 2 as

$$D = \hat{\pi}_{12} = \sum_{i=1}^{k} \sum_{j=1}^{k'} x_{1i} x_{2j} \delta_{ij} \quad \text{and} \quad D_A = \hat{\pi}_{12} - \frac{\hat{\pi}_1 + \hat{\pi}_2}{2}$$

where k and k' are the numbers of distinct haplotypes in populations 1 and 2 respectively, $x_{1i}$ is the frequency of the i-th haplotype in population 1, and $\delta_{ij}$ is the number of differences between haplotype i and haplotype j. Under the same notation concerning coalescence times as described above, the expect
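As a numerical illustration of the migration-drift relation given at the start of this chunk, the following minimal Python sketch converts an $F_{ST}$ value into M and evaluates the expected $F_{ST}$ for a finite number of demes. The function names are hypothetical and are not part of Arlequin; the formulas are those quoted in the text.

```python
def m_value_from_fst(fst: float) -> float:
    """Estimate M = (1 - Fst) / (2 Fst), the equilibrium number of migrants.

    Assumes 0 < Fst <= 1; the relation is undefined for Fst <= 0.
    """
    if not 0.0 < fst <= 1.0:
        raise ValueError("Fst must lie in (0, 1]")
    return (1.0 - fst) / (2.0 * fst)


def expected_fst(nm: float, demes: int) -> float:
    """Expected Fst = 1 / (4 Nm d / (d - 1) + 1) for d demes exchanging genes."""
    if demes < 2:
        raise ValueError("at least two demes are required")
    return 1.0 / (4.0 * nm * demes / (demes - 1) + 1.0)


if __name__ == "__main__":
    print(m_value_from_fst(0.2))     # M for Fst = 0.2
    print(expected_fst(1.0, 2))      # two demes: 1 / (8 Nm + 1)
    print(expected_fst(1.0, 10**6))  # many demes: close to 1 / (4 Nm + 1)
```

With two demes the second function reproduces the $1/(8Nm + 1)$ expectation, which is why M is divided by 2 in that case.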
101. iple follow a Chi-square distribution with $(k_1 - 1)(k_2 - 1)$ degrees of freedom, but it is not always the case in small samples with a large number of alleles per locus. In order to better approximate the underlying distribution of the likelihood-ratio statistic under the null hypothesis of linkage equilibrium, we use the following permutation procedure:
1. Permute the alleles between individuals at one locus only.
2. Re-estimate the likelihood of the data $L_{H'}$ by the EM algorithm. Note that $L_{H^*}$ is unaffected by the permutation procedure.
3. Repeat steps 1-2 a large number of times to get the null distribution of $L_{H'}$, and therefore the null distribution of S.

Note that this test of linkage disequilibrium assumes Hardy-Weinberg proportions of genotypes, and the rejection of the test could also be due to departure from Hardy-Weinberg equilibrium (see Excoffier and Slatkin 1998).

Reference: Excoffier and Slatkin 1998

7.1.4.3 Measures of gametic disequilibrium (haplotypic data)

D, D' and $r^2$ coefficients: Note that these coefficients are computed between all pairs of alleles at different loci, and that their computation assumes that the gametic phase between alleles at different loci is known.

1. D: The classical linkage disequilibrium coefficient measuring deviation from random association between alleles at different loci (Lewontin and Kojima 1960) is expressed as D P Pi
102. is initially restricted to a very small area, and then the range of the population increases over time and over space. The resulting population generally becomes subdivided, in the sense that individuals will tend to mate with geographically close individuals rather than remote individuals. Based on simulations, Ray et al. (2003) have shown that a large spatial expansion can lead to the same signal in the mismatch distribution as a pure demographic expansion in a panmictic population, but only if neighboring sub-populations (demes) exchange many migrants (50 or more). The simulations in Ray et al. (2003) were performed in a two-dimensional stepping-stone model: T generations ago, a haploid population restricted to a single deme of size N began to send migrants to neighboring demes at rate m, progressively colonizing the whole world. During the expansion, the size of each deme followed a logistic regulation with carrying capacity K and intrinsic rate of growth r. During the whole process, neighboring demes continue to exchange a fraction m of migrants. While this model is difficult to describe analytically, Excoffier (2004) derived the expected mismatch distribution under a simpler model of spatial expansion. He assumed that one has sampled genes from a single deme belonging to a population subdivided into an infinite number of demes, each of size N, which would exchange a fraction m of migrants with other demes. This infinite-island model is ac
103. is no obvious way to be sure that the resulting frequencies are those that globally maximize the likelihood of the data. This would need a complete evaluation of the likelihood for all possible genotype configurations of the sample. In order to check that the final frequencies are putative maximum-likelihood estimates, one generally has to repeat the EM algorithm from many different starting points (many different initial haplotype frequencies). Several runs may give different final frequencies, suggesting the presence of several peaks in the likelihood surface, but one has to choose the solution that has the largest likelihood. It may also arise that several distinct peaks have the same likelihood, meaning that different haplotypic compositions explain the observed data equally well. At this point, there is no way to choose among the alternative solutions from a likelihood point of view. Some external information should be provided to make a decision.

Standard deviations of the haplotype frequencies are estimated by a parametric bootstrap procedure (see e.g. Rice 1995), generating random samples from a population assumed to have haplotype frequencies equal to their maximum-likelihood values. For each bootstrap replicate, we apply the EM algorithm to get new maximum-likelihood haplotype frequencies. The standard deviation of each haplotype frequency is then estimated from the resulting distrib
104. is thus approximately proportional to $\tau/N$ for short divergence times.

7.2.5.2 Slatkin's linearized FST's (Slatkin 1995)

Slatkin considers a simple demographic model where two haploid populations of size N have diverged $\tau$ generations ago from a population of identical size. These two populations have remained isolated ever since, without exchanging any migrants. Under such conditions, $F_{ST}$ can be expressed in terms of the coalescence times $\bar{t}_1$, which is the mean coalescence time of two genes drawn from two different populations, and $\bar{t}_0$, which is the mean coalescence time of two genes drawn from the same population. Using the analysis of variance approach, the $F_{ST}$'s are expressed as

$$F_{ST} = \frac{\bar{t}_1 - \bar{t}_0}{\bar{t}_1}$$

(Slatkin 1991, 1995). Because $\bar{t}_0$ is equal to N generations (see e.g. Hudson 1990), and $\bar{t}_1$ is equal to $\tau + N$ generations, the above expression reduces to

$$F_{ST} = \frac{\tau}{\tau + N}$$

Therefore, the ratio $D = F_{ST} / (1 - F_{ST})$ is equal to $\tau/N$, and is therefore proportional to the divergence time between the two populations.

7.2.5.3 M values (M = Nm for haploid populations, M = 2Nm for diploid populations)

This matrix is computed under very different assumptions than the two previous matrices. Assume that two populations of size N, drawn from a large pool of populations, exchange a fraction m of migrants each generation, and that the mutation rate u is negligible as compared to the migration rate m. In this case, we have the following simple relat
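The linearization described in section 7.2.5.2 can be checked numerically. The short Python sketch below (function names are illustrative assumptions, not Arlequin routines) computes the expected $F_{ST}$ under the pure divergence model and verifies that $F_{ST} / (1 - F_{ST})$ recovers $\tau/N$.

```python
def fst_pure_divergence(tau: float, n: float) -> float:
    """Expected Fst = tau / (tau + N) for two isolated haploid populations."""
    if tau < 0 or n <= 0:
        raise ValueError("tau must be >= 0 and N must be > 0")
    return tau / (tau + n)


def slatkin_linearized_fst(fst: float) -> float:
    """Slatkin's linearized distance D = Fst / (1 - Fst)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("Fst must lie in [0, 1)")
    return fst / (1.0 - fst)


if __name__ == "__main__":
    tau, n = 50.0, 100.0
    fst = fst_pure_divergence(tau, n)
    # The linearized distance should equal tau / N (here 0.5).
    print(fst, slatkin_linearized_fst(fst), tau / n)
```

Unlike $F_{ST}$ itself, which saturates towards 1, the transformed distance grows linearly with divergence time under this model.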
105. is unknown (phenotypic data). In this case, a simple gene counting is not possible, because several genotypes are possible for individuals heterozygous at more than one locus. Therefore, a slightly more elaborate procedure is needed.

The likelihood of the sample (the probability of the observed data D, given the haplotype frequencies p) is given by

$$L(D \mid p) = \prod_{i=1}^{n} \sum_{j=1}^{g_i} G_{ij}$$

where the product is over all n individuals of the sample, and the sum is over all $g_i$ possible genotypes of the i-th individual. $G_{ij}$ is the Hardy-Weinberg frequency of the j-th possible genotype of individual i: $2 p_k p_l$ if the genotype is made up of two different haplotypes k and l, or $p_k^2$ if it is made up of two copies of the same haplotype k.

The principle of the EM algorithm is the following:
1. Start with arbitrary (random) estimates of haplotype frequencies.
2. Use these estimates to compute expected genotype frequencies for each phenotype, assuming Hardy-Weinberg equilibrium (the E-step).
3. The relative genotype frequencies are used as weights for their two constituting haplotypes in a gene counting procedure, leading to new estimates of haplotype frequencies (the M-step).
4. Repeat steps 2-3 until the haplotype frequencies reach equilibrium (do not change more than a predefined epsilon value).

Dempster et al. (1977) have shown that the likelihood of the sample could only grow after each step of the EM algorithm. However, there is no guarantee that the resulting haplotype frequencies are maximum-likelihood estimates. They can be just local optimal values. In fact, there
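The four EM steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Arlequin's implementation: each individual is represented by the list of haplotype pairs compatible with its phenotype, the starting frequencies are uniform rather than random, and the function name and data layout are hypothetical.

```python
def em_haplotype_frequencies(phenotypes, epsilon=1e-7, max_iter=1000):
    """Estimate haplotype frequencies by EM.

    phenotypes: one entry per individual, each a list of candidate genotypes
    given as (haplotype_a, haplotype_b) tuples compatible with the phenotype.
    """
    haplotypes = sorted({h for cands in phenotypes for g in cands for h in g})
    # Step 1: arbitrary starting point (uniform here, an assumption).
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    n_genes = 2 * len(phenotypes)
    for _ in range(max_iter):
        counts = dict.fromkeys(haplotypes, 0.0)
        for cands in phenotypes:
            # Step 2 (E-step): Hardy-Weinberg frequency of each genotype.
            probs = [freqs[a] * freqs[b] * (1 if a == b else 2)
                     for a, b in cands]
            total = sum(probs)
            # Step 3 (M-step): weighted gene counting.
            for (a, b), prob in zip(cands, probs):
                weight = prob / total
                counts[a] += weight
                counts[b] += weight
        new_freqs = {h: c / n_genes for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        # Step 4: stop when no frequency changes by more than epsilon.
        if change < epsilon:
            break
    return freqs


if __name__ == "__main__":
    sample = [
        [("AB", "AB")],                # unambiguous homozygote
        [("AB", "ab"), ("Ab", "aB")],  # double heterozygote, phase unknown
        [("ab", "ab")],                # unambiguous homozygote
    ]
    for hap, freq in em_haplotype_frequencies(sample).items():
        print(hap, round(freq, 4))
```

In this toy sample the ambiguous individual is resolved almost entirely as AB/ab, because those haplotypes are supported by the unambiguous homozygotes. Starting from a single point, as here, does not protect against local optima; repeated runs from different starting frequencies would be needed for that.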
106. ish a link between the haplotype and its allelic content, maintained in a local database. Once a haplotype has been defined, it need not be defined again. However, the allelic content of the same haplotype can also be defined several times. The different definitions of haplotypes with the same identifier are checked for equality. If they are found identical, a warning is issued in the log file. If they are found to be different at some loci, an error is issued and the program stops, asking you to correct the error.

For complex haplotypes, like very long DNA sequences, one can perfectly well assign different identifiers to all sequences, each thus having an absolute frequency of 1, even if some sequences turn out to be similar to each other. If the option Infer Haplotypes from Distance Matrix is checked in the General Settings dialog box, Arlequin will check whether haplotypes are effectively different or not. This is a good precaution when one tests the selective neutrality of the sample using Ewens-Watterson or Chakraborty's tests, because these tests are based on the observed number of effectively different haplotypes.

Genotypic data

For each genotype, one must specify its identifier, its sample frequency, and its allelic content. Genotypic data can be entered either as a list of individuals, all having an absolute frequency of 1, or as a list of genotypes with different sample frequencies. During the comput
107. izard only creates an outline, and that you manually need to fill in the data and specify your genetic structure.

2.2.1 Defining the Genetic Structure to be tested

[Screenshot: Genetic Structure Editor. Left pane: "Assign a group number to populations (a value of 0 implies no group assignment)". Right pane: "Resulting structure", showing Group 1 (Tharu, Oriental), Group 2 (Wolof, Peul), Group 3 (Pima, Maya), Group 4 (Finnish, Sicilian), Group 5 (Israeli Arab, Israeli Jew).]

A new Genetic Structure Editor has been implemented in version 3.01. In the left pane, all population samples found in the opened project are listed in the right column, with a corresponding group identifier in the left column. If no Genetic Structure is defined, the 0 identifier will be listed. In the right pane, the resulting structure is shown. Population samples can be assigned to different groups by giving them a new group identifier, like:

[Screenshot: Genetic Structure Editor with group identifiers assigned to population samples.]
108. powerful at detecting departure from equilibrium for higher values of y (see Slatkin 1994a). The results are output in a file called d_dis x

Significance level [f]: The level at which the test of linkage disequilibrium is considered significant for the output table.

6.3.8.5.2 Hardy-Weinberg equilibrium

[Screenshot: Hardy-Weinberg settings pane. Fields shown: No. of steps in Markov chain (100000); No. of dememorization steps (1000); HWE test type (Locus by locus, Whole haplotype, Locus by locus and whole haplotype).]

Perform exact test of Hardy-Weinberg equilibrium [b]: Test of the hypothesis that the observed diploid genotypes are the product of a random union of gametes. This test is only possible for genotypic data. Separate tests are carried out at each locus. Thi
109. 6.3.8 Calculation Settings

[Screenshot: Settings tab. Left pane: tree of tasks (Genetic structure: AMOVA, Population comparisons, Population differentiation, Genotype assignment; Haplotype inference; Linkage disequilibrium: Hardy-Weinberg, Pairwise linkage; Mantel test; Mismatch distribution; Molecular diversity indices; Neutrality tests; General settings). Right pane: "Choose one of the following computations to set up", with links to the same tasks.]

The Settings tab is divided into two zones. On the left, a tree structure allows the user to quickly select which task to perform. The options for those tasks (settings) will appear on the right pane of the tab dialog. If you select the first (Arlequin settings) node on the tree, a list of the different tasks that can be set up appears on the right pane. C
110. licking on these underlined blue links will lead you to the appropriate settings panes. If a particular task has been selected, it will be reflected by a red dot on the left side of the task in the tree structure.

Settings management

Three buttons are also shown on the upper left of the tab dialog:
Reset: Reset all settings to default values and uncheck all tasks.
Load: Load a particular set of settings previously saved into a settings file (extension .ars).
Save: Save the current settings into a given settings file (extension .ars).

6.3.8.1 General Settings

[Screenshot: General settings pane, Project files group, showing the project input file (mtDNAHV1.arp) and the result file.]
111. ll need to be done by hand, and we apologize for that.

3.6 Arlequin batch files

A batch file (with the .arb extension) is simply a text file having on each line the name of a project file that should be analyzed by Arlequin. The number of data files to be analyzed can be arbitrarily large.

If the project type you open is of Batch file type, the Batch file tab panel opens up automatically and allows you to tune the settings of your batch run.

[Screenshot: Batch file tab for batch_ex.arb. Project list: RelFreq.arp, GenotSta.arp, PhenoHLA.arp, Missdata.arp, Microsat.arp, IndLevel.arp, Amova2.arp, Amova1.arp. Settings choice: Use interface settings or Use associated settings. Results to summarize: Gene diversity, Nucleotide composition, Molecular diversity, Mismatch distribution, Linkage disequilibrium, Theta values, AMOVA, Hardy-Weinberg, Tajima's D, Fu's Fs, Chakraborty's test, Population comparisons, Ewens-Watterson's test, Allele frequencies. No. of files: 8. No. of processed files: 0.]
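Since a batch file is described above as a plain text file with one project file name per line, it can be generated with any scripting tool. The Python sketch below is one way to do so; the helper name is hypothetical, the project names are taken from the screenshot, and whether Arlequin expects relative or absolute paths is not stated in this section, so the paths are written exactly as given.

```python
from pathlib import Path


def write_batch_file(batch_path, project_files):
    """Write one project file name per line to an .arb batch file.

    Returns the list of lines written. Paths are written verbatim.
    """
    lines = [str(name) for name in project_files]
    Path(batch_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    projects = ["RelFreq.arp", "GenotSta.arp", "PhenoHLA.arp"]
    write_batch_file("batch_ex.arb", projects)
    print(Path("batch_ex.arb").read_text(encoding="utf-8"))
```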
112. ly in the graphical interface (see section Defining the Genetic Structure to be tested 2.2.1).
Computation of population-specific FST indices, when a single group is defined in the Genetic Structure. This may be useful to recognize populations contributing particularly to the global FST measure. This is also available in the locus-by-locus AMOVA section (see section Population specific FST indices 7.2.4).

1.3.3 Version 3.1 compared to version 3.01

Arlequin 3.1 includes some bug corrections, some improvements, and additional features.

Improvements:
Locus-by-locus AMOVA can now be performed independently from conventional AMOVA. This can lead to faster computations for large sample sizes and large numbers of population samples.
Faster routines to handle long DNA sequences or large numbers of microsatellites.
Faster reading of input file.
Faster computation of demographic parameters from mismatch distribution.
Improved convergence of least-square fitting algorithm.

Additions:
Computation of population-specific inbreeding coefficients, and computation of their significance level.
Computation of the number of alleles, as well as observed and expected heterozygosity, per locus.
Computation of the Garza-Williamson statistic for MICROSAT data.
In batch mode, the summary file (*.sum) now reports the name of the analyzed file, as well as the name of the analyzed population sample.
When saving current settings u
113. ma. 1960. The evolutionary dynamics of complex polymorphisms. Evolution 14: 450-472.
Li, W. H. 1977. Distribution of nucleotide differences between two randomly chosen cistrons in a finite population. Genetics 85: 331-337.
Long, J. C. 1986. The allelic correlation structure of Gainj and Kalam speaking people. The estimation and interpretation of Wright's F-statistics. Genetics 112: 629-647.
Mantel, N. 1967. The detection of disease clustering and a generalized regression approach. Cancer Res. 27: 209-220.
Michalakis, Y. and L. Excoffier. 1996. A generic estimation of population subdivision using distances between alleles with special reference to microsatellite loci. Genetics 142: 1061-1064.
Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
Nei, M. and W. H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76: 5269-5273.
Paetkau, D., W. Calvert, Stirling and C. Strobeck. 1995. Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. 4: 347-54.
Ohta, T. and M. Kimura. 1973. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22: 201-204.
Paetkau, D., L. P. Waits, P. L. Clarkson, L. Craighead and C. Strobeck. 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 1
114. metic phase, it is also possible to test for the non-random association of haplotypes into individuals. Note that this test assumes that the allele frequencies are given. Therefore, this test is not possible for data with recessive alleles, as in this case the allele frequencies need to be estimated.

A contingency table is first built. The k x k entries of the table are the observed genotype frequencies $n_{ij}$, and k is the number of alleles. Using the same notations as in section 8.2.2, the probability to observe the table under the null hypothesis of no association is given by Levene (1949):

$$L_0 = \frac{n! \prod_{i=1}^{k} n_i!}{(2n)! \prod_{i=1}^{k} \prod_{j=1}^{i} n_{ij}!} \, 2^H$$

where H is the number of heterozygote individuals.

Much like it was done for the test of linkage disequilibrium, we explore alternative contingency tables having the same marginal counts. In order to create a new contingency table from an existing one, we select two distinct lines $i_1, i_2$ and two distinct columns $j_1, j_2$ at random. The new table is obtained by decreasing the counts of the cells $(i_1, j_1)$ and $(i_2, j_2)$, and increasing the counts of the cells $(i_1, j_2)$ and $(i_2, j_1)$, by one unit. This leaves the allele counts $n_i$ unchanged. The switch to the new table is accepted with a probability R equal to:

$$R = \frac{L_{n+1}}{L_n} = \frac{n_{i_1 j_1} \, n_{i_2 j_2}}{(n_{i_1 j_2} + 1)(n_{i_2 j_1} + 1)} \cdot \frac{(1 + \delta_{i_1 j_1})(1 + \delta_{i_2 j_2})}{(1 + \delta_{i_1 j_2})(1 + \delta_{i_2 j_1})}$$
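To make the table probability concrete, the Python sketch below evaluates Levene's (1949) expression for a small table using exact rational arithmetic. It is an illustration only: the function name and the dictionary layout (one entry per unordered allele pair) are assumptions, and the formula used is the standard form also given by Guo and Thompson (1992).

```python
from fractions import Fraction
from math import factorial


def levene_probability(genotype_counts):
    """Probability of a genotype table under Hardy-Weinberg equilibrium.

    genotype_counts maps each unordered allele pair (a, b) to the number of
    individuals carrying it; each pair must appear under a single key.
    """
    n = sum(genotype_counts.values())  # number of individuals
    allele_counts = {}
    heterozygotes = 0
    for (a, b), count in genotype_counts.items():
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
        if a != b:
            heterozygotes += count
    numerator = factorial(n) * 2 ** heterozygotes
    for count in allele_counts.values():
        numerator *= factorial(count)
    denominator = factorial(2 * n)
    for count in genotype_counts.values():
        denominator *= factorial(count)
    return Fraction(numerator, denominator)


if __name__ == "__main__":
    # Two individuals, two copies each of alleles A and B: the only two
    # tables with these allele counts are {AA, BB} and {AB, AB}.
    p_homozygotes = levene_probability({("A", "A"): 1, ("B", "B"): 1})
    p_heterozygotes = levene_probability({("A", "B"): 2})
    print(p_homozygotes, p_heterozygotes, p_homozygotes + p_heterozygotes)
```

The probabilities of all tables sharing the same allele counts sum to one, which is what allows the Markov chain to compare a visited table with the observed one.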
115. mple in the Gibbs chain. The Arlequin files (as many as the variable Number of samples defined above) are written in a subdirectory of the result directory called PhaseDistribution. They have the name ELB_EstimatedPhase<Sample number>.arp. Arlequin also outputs a file called ELB_Best_Phases.arp, containing for each individual the gametic phases estimated with the ELB algorithm, as well as a batch file ELB_PhaseDistribution.arb listing all aforementioned project files.

6.3.8.4.2.2 Settings for the EM algorithm

[Screenshot: Haplotype inference via EM algorithm settings pane. Fields shown: Use EM algorithm to estimate ML haplotype frequencies; Perform EM algorithm at the Haplotype level, Locus level, or Haplotype and locus levels; Epsilon value (1e-7); Significant digits for output; No of sta]
116. n AMOVA. Under the hypothesis that the two populations are undifferentiated, we permute individuals between samples, and re-estimate the three parameters in order to obtain their empirical null distribution. The percentile value of the three statistics under the null hypothesis of no differentiation is obtained as the proportion of permuted cases that produce statistics larger than or equal to those observed.

The values of the estimated parameters should be interpreted with caution. The procedure we have implemented is based on the comparison of intra- and inter-population diversities ($\pi$'s), which have a large variance. This means that, for short divergence times, the average diversity found within populations could be larger than that observed between populations. This situation could lead to negative divergence times, and to daughter population relative sizes larger than one or smaller than zero (negative values). Also, large departures from the assumed pure-fission model could lead to observed diversities that would produce aberrant estimators of divergence time and relative population sizes. One should thus make those computations only if the assumptions of a pure-fission model are met and if the divergence time is relatively old. Simulation results have shown that this procedure leads to better results than other methods that do not take unequal population sizes into account when the relative sizes
117. n demographic expansion, which generally lead to large negative Fs values.

The significance of the Fs statistic is tested by generating random samples under the hypothesis of selective neutrality and population equilibrium, using a coalescent simulation algorithm adapted from Hudson (1990). The P-value of the Fs statistic is then obtained as the proportion of random Fs statistics less than or equal to the observation. Using simulations, Fu noticed that the 2% percentile of the distribution corresponded to the 5% cutoff value (i.e. the critical value of the test at the 5% significance level). We indeed confirmed this behavior by our own simulations. Even though this property is not fully understood, it means that an Fs statistic should be considered as significant at the 5% level if its P-value is below 0.02, and not below 0.05.

Reference: Fu 1997

7.2 Inter-population level methods

7.2.1 Population genetic structure inferred by analysis of variance (AMOVA)

The genetic structure of populations is investigated here by an analysis of variance framework, as initially defined by Cockerham (1969, 1973), and extended by others (see e.g. Weir and Cockerham 1984; Long 1986). The Analysis of Molecular Variance approach used in Arlequin (AMOVA, Excoffier et al. 1992) is essentially similar to other approaches based on analyses of variance of gene frequencies, but it takes into account the number of m
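The decision rule for Fu's Fs described earlier in this chunk can be sketched as follows. The simulated Fs values are assumed to come from some external coalescent simulator, which is not shown; the function names and the toy numbers are purely illustrative.

```python
def fs_p_value(observed_fs, simulated_fs):
    """Proportion of simulated Fs values less than or equal to the observation."""
    if not simulated_fs:
        raise ValueError("at least one simulated value is required")
    hits = sum(1 for value in simulated_fs if value <= observed_fs)
    return hits / len(simulated_fs)


def fs_significant_at_5_percent(p_value):
    """Apply the 0.02 threshold recommended for a 5% significance level."""
    return p_value < 0.02


if __name__ == "__main__":
    simulated = [-5.0, -3.0, -1.0, 0.0, 2.0]  # toy values, not real output
    p = fs_p_value(-3.0, simulated)
    print(p, fs_significant_at_5_percent(p))
```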
118. necessary to reach a random starting point corresponding to a table independent from the observed table.

LD coefficients between pairs of alleles at different loci

Compute D, D' and $r^2$ coefficients [b]: between all pairs of alleles at different loci. See section 7.1.4.3.

1. D: The classical linkage disequilibrium coefficient, measuring deviation from random association between alleles at different loci (Lewontin and Kojima 1960), expressed as $D_{ij} = p_{ij} - p_i p_j$.
2. D': The linkage disequilibrium coefficient $D_{ij}$ standardized by the maximum value it can take ($D_{ij,max}$), given the allele frequencies (Lewontin 1964): $D'_{ij} = D_{ij} / D_{ij,max}$.
3. $r^2$: It is another way to standardise the simple measure of linkage disequilibrium $D_{ij}$, as

$$r^2 = \frac{D_{ij}^2}{p_i (1 - p_i) \, p_j (1 - p_j)}$$

Generate histogram and table [b]: Generates a histogram of the number of loci with which each locus is in disequilibrium, and an S by S table (S being the number of polymorphic loci) summarizing the significant associations between pairs of loci. This table is generated for different levels of polymorphism, controlled by the value y: a locus is declared polymorphic if there are at least 2 alleles with y copies in the sample (Slatkin 1994a). This is done because the exact test is more powerful at detecting departure from equilibrium for higher values of y (Slatkin 1994a). The results are output in a file called d_dis xl

Significance level [f]: The level
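The three coefficients defined above can be computed directly from one haplotype frequency and two allele frequencies. The Python sketch below is illustrative only: the function name is hypothetical, and the expression used for the maximum of D (which depends on the sign of D) is the usual one attributed to Lewontin (1964), since the manual text here does not spell it out.

```python
def ld_coefficients(p_ij, p_i, p_j):
    """Return (D, D', r^2) for allele i at one locus and allele j at another.

    p_ij: frequency of the haplotype carrying both alleles.
    p_i, p_j: frequencies of the two alleles.
    """
    d = p_ij - p_i * p_j
    # Largest magnitude D can reach given the allele frequencies
    # (assumed standard definition, depends on the sign of D).
    if d < 0:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    else:
        d_max = min(p_i * (1 - p_j), (1 - p_i) * p_j)
    d_prime = d / d_max if d_max > 0 else 0.0
    denominator = p_i * (1 - p_i) * p_j * (1 - p_j)
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r_squared


if __name__ == "__main__":
    print(ld_coefficients(0.5, 0.5, 0.5))   # complete association
    print(ld_coefficients(0.25, 0.5, 0.5))  # random association
```

Monomorphic loci make both standardized coefficients undefined; the sketch returns 0.0 in that case as a convention, which is an arbitrary choice.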
119. nequilibrium population genetics. TREE 14: 17-21.
Dempster, A., N. Laird and D. Rubin. 1977. Maximum likelihood estimation from incomplete data via the EM algorithm. J. Roy. Statist. Soc. 39: 1-38.
Efron, B. 1982. The Jackknife, the Bootstrap and other Resampling Plans. Regional Conference Series in Applied Mathematics, Philadelphia.
Efron, B. and R. J. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman and Hall, London.
Ewens, W. J. 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87-112.
Ewens, W. J. 1977. Population genetics theory in relation to the neutralist-selectionist controversy. In: Advances in human genetics, edited by Harris, H. and Hirschhorn, K. New York: Plenum Press, p. 67-134.
Excoffier, L. 2003. Analysis of Population Subdivision. In: Balding, D., Bishop, M., Cannings, C., editors. Handbook of Statistical Genetics, 2nd Edition. New York: John Wiley & Sons, Ltd. pp. 713-750.
Excoffier, L. 2004. Patterns of DNA sequence diversity and genetic structure after a range expansion: lessons from the infinite-island model. Mol. Ecol. 13(4): 853-864.
Excoffier, L., P. Smouse and J. Quattro. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
Excoffier, L. and P. Smouse. 1994. Using allele frequencies and geographic subdivision to reconstruct g
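120. Among Populations                      P - 1    SSD(AP)      $n\sigma_a^2 + 2\sigma_b^2 + \sigma_c^2$
Among Individuals Within Populations   N - P    SSD(AI/WP)   $2\sigma_b^2 + \sigma_c^2$
Within Individuals                     N        SSD(WI)      $\sigma_c^2$
Total                                  2N - 1   SSD(T)       $\sigma_T^2$

Where n and the F-statistics are defined by:

$$n = \frac{2N - \sum_{p \in P} \frac{2N_p^2}{N}}{P - 1}$$

$$F_{ST} = \frac{\sigma_a^2}{\sigma_T^2}, \quad F_{IT} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_T^2}, \quad F_{IS} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}$$

- We test $\sigma_c^2$ and $F_{IT}$ by permuting haplotypes among individuals among populations.
- We test $\sigma_b^2$ and $F_{IS}$ by permuting haplotypes among individuals within populations.
- We test $\sigma_a^2$ and $F_{ST}$ by permuting individual genotypes among populations.

7.2.1.7 Genotypic data, several groups of populations, within-individual level

Source of Variation                    Degrees of freedom   Sum of squares (SSD)   Expected mean squares
Among Groups                           G - 1                SSD(AG)                $n''\sigma_a^2 + n'\sigma_b^2 + 2\sigma_c^2 + \sigma_d^2$
Among Populations Within Groups        P - G                SSD(AP/WG)             $n\sigma_b^2 + 2\sigma_c^2 + \sigma_d^2$
Among Individuals Within Populations   N - P                SSD(AI/WP)             $2\sigma_c^2 + \sigma_d^2$
Within Individuals                     N                    SSD(WI)                $\sigma_d^2$
Total                                  2N - 1               SSD(T)                 $\sigma_T^2$

Where the n's and the F-statistics are defined by:

$$n = \frac{2N - \sum_{g \in G} \sum_{p \in g} \frac{2N_p^2}{N_g}}{P - G}, \quad n' = \frac{\sum_{g \in G} \sum_{p \in g} \frac{2N_p^2}{N_g} - \sum_{p \in P} \frac{2N_p^2}{N}}{G - 1}, \quad n'' = \frac{2N - \sum_{g \in G} \frac{2N_g^2}{N}}{G - 1}$$

$$F_{CT} = \frac{\sigma_a^2}{\sigma_T^2}, \quad F_{IT} = \frac{\sigma_a^2 + \sigma_b^2 + \sigma_c^2}{\sigma_T^2}, \quad F_{IS} = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_d^2}, \quad F_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2 + \sigma_d^2}$$

- We test $\sigma_d^2$ and $F_{IT}$ by permuting haplotypes among popula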
121. ng allele is treated as a specific allele. Import/Export routines are still not very flexible.

2 GETTING STARTED
The first thing to do before running Arlequin for the first time is to read the present manual. It will provide you with most of the information you are looking for, so take some time to read it before you seriously start analyzing your data.

2.1 Arlequin configuration
[Screenshot: Arlequin Configuration tab, with the options "Use associated settings", "Append results", "Keep AMOVA null distributions" and "Prompt for handling unphased multilocus data", and the Helper programs text editor set to C:\Program Files\TextPad 4\TextPad.exe]
Before a first use of Arlequin, you need to specify which text editor Arlequin will use to edit project files or view the log file. We recommend the use of a powerful text editor like TextPad, freely available at http://www.textpad.com.

2.2 Preparing input files
The first step in the analysis of your data is to prepare an input data file for Arlequin. This input file is called here a project file. As Arlequin is quite a versatile program, able to analyze several data types, you have to include some information about the properties of your data in the project fil
122. notypic data or haplotypic data. For genotypic data, the diploid information of each genotype is entered on separate lines in the input file.
Gametic phase [r]: Specifies whether the gametic phase is known or unknown when the input file is made up of genotypic data. If the gametic phase is known, then the treatment of the data will be essentially similar to that of haplotypic data.
Data type [r]: Data type specified in the input file.
Dominance [r]: Specifies if the data consists of only co-dominant data or if some recessive alleles can occur.
Recessive allele [r]: Specifies the identifier of the recessive allele.
Locus separator [r]: The character used to separate allelic information at adjacent loci.
Missing data [r]: The character used to represent missing data at any locus. By default, a question mark is used for unknown alleles.

6.3.7 Batch files
[Screenshot: Batch files tab, showing the project list of the batch file, the settings choice ("Use interface settings" or "Use associated settings") and the list of results to summarize]
123. ns Watterson neutrality test
[Screenshot: Neutrality tests settings, with "No. of simulated samples" set to 1000 for both the infinite-allele and the infinite-site tests]
Tests of selective neutrality, based either on the infinite-allele model or on the infinite-site model (see section 7.1.6).

Infinite allele model
Ewens-Watterson neutrality tests [b]: Performs tests of selective neutrality based on Ewens sampling theory in a population at equilibrium (Ewens 1972). These tests are currently limited to sample sizes of 2000 genes or less and 1000 different alleles (haplotypes) or less.
Ewens-Watterson homozygosity test: This test, devised by Watterson (1978, 1986), is based on Ewens sampling theory, but uses as a statistic the quantity F, equal to the sum of squared allele frequencies, which is equivalent to the sample homozygosity in diploids (see section 7.1.6.1).
Exact test based on Ewens sampling theory: In this test, devised by Slatkin (1994b, 1996), the proba
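The statistic F used by the Ewens-Watterson homozygosity test is simple to compute from a list of sampled gene copies. The following is a minimal sketch, not Arlequin's own code; the function name `sample_homozygosity` and the input layout (one allele label per gene copy) are assumptions made for illustration.

```python
from collections import Counter


def sample_homozygosity(alleles):
    """Return F, the sum of squared allele frequencies in a sample.

    `alleles` is a hypothetical input layout: one label per sampled gene copy.
    """
    counts = Counter(alleles)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("the sample is empty")
    # Each allele contributes the square of its relative frequency.
    return sum((c / n) ** 2 for c in counts.values())
```

For example, `sample_homozygosity(["A", "A", "B", "B"])` evaluates two alleles at frequency 0.5 each. Only the observed statistic is computed here; the null distribution still has to be simulated as described in section 7.1.6.1.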
124. nt simulation algorithm adapted from Hudson (1990). The P value of the D statistic is then obtained as the proportion of random D statistics less than or equal to the observation. We also provide a parametric approximation of the P value, assuming a beta distribution limited by the minimum and maximum possible D values (see Tajima 1989a, p. 589). Note that significant D values can be due to factors other than selective effects, like population expansion, bottleneck, or heterogeneity of mutation rates (see Tajima 1993, Aris-Brosou and Excoffier 1996, or Tajima 1996 for further details).
References: Tajima 1993; Aris-Brosou and Excoffier 1996; Tajima 1996.

7.1.6.5 Fu's Fs test of selective neutrality
Like Tajima's (1989a) test, Fu's test (Fu 1997) is based on the infinite-site model without recombination, and is thus appropriate for short DNA sequences or RFLP haplotypes. The principle of the test is very similar to that of Chakraborty described above. Here we evaluate the probability of observing a random neutral sample with a number of alleles equal to or larger than the observed value (see section 7.1.2.3.3 to see how this probability can be computed), given the observed number of pairwise differences, taken as an estimator of $\theta$. In more detail, Fu first calls this probability $S' = \Pr(K \ge k_{obs} \mid \theta = \hat{\theta}_\pi)$ and defines the Fs statistic as the logit of $S'$:
$$F_S = \ln\left(\frac{S'}{1 - S'}\right) \quad \text{(Fu 1997)}$$
Fu (1997) has noticed that the Fs statistic was very sensitive to populatio
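The two computations described above, the logit transform behind Fs and the empirical P value taken as a proportion of simulated statistics, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the probability S' and the simulated statistics are assumed to come from elsewhere (for instance from coalescent simulations).

```python
import math


def fu_fs(s_prime):
    """Logit of S', following the definition Fs = ln(S' / (1 - S'))."""
    if not 0.0 < s_prime < 1.0:
        raise ValueError("S' must lie strictly between 0 and 1")
    return math.log(s_prime / (1.0 - s_prime))


def lower_tail_p_value(observed, simulated):
    """Proportion of simulated statistics less than or equal to the observation."""
    if not simulated:
        raise ValueError("at least one simulated statistic is required")
    return sum(1 for value in simulated if value <= observed) / len(simulated)
```

A small S' gives a negative Fs, and `lower_tail_p_value` applies equally to simulated D or Fs values.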
125. o different ways to test for the presence of pairwise linkage disequilibrium between loci. We describe in detail below how the two tests are done.

7.1.4.1 Exact test of linkage disequilibrium (haplotypic data)
This test is an extension of Fisher's exact probability test on contingency tables (Slatkin 1994a). A contingency table is first built. The $k_1 \times k_2$ entries of the table are the observed haplotype frequencies (absolute values), with $k_1$ and $k_2$ being the number of alleles at locus 1 and 2, respectively. The test consists in obtaining the probability of finding a table with the same marginal totals and which has a probability equal to or less than that of the observed table. Under the null hypothesis of no association between the two tested loci, the probability of the observed table is
$$L_0 = \frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i=1}^{k_1} \left(\frac{n_{i*}}{n}\right)^{n_{i*}} \prod_{j=1}^{k_2} \left(\frac{n_{*j}}{n}\right)^{n_{*j}}$$
where the $n_{ij}$'s denote the count of the haplotypes that have the i-th allele at the first locus and the j-th allele at the second locus, $n_{i*}$ is the overall frequency of the i-th allele at the first locus ($i = 1, \ldots, k_1$), and $n_{*j}$ is the count of the j-th allele at the second locus ($j = 1, \ldots, k_2$).
Instead of enumerating all possible contingency tables, a Markov chain is used to efficiently explore the space of all possible tables. This Markov chain consists in a random walk in the space of all contingency tables. It is done in such a way that the probability to visit a part
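As an illustration of how the probability of a two-locus contingency table can be evaluated, here is a minimal sketch. It assumes the multinomial form of the table probability under independence (the multinomial coefficient of the cell counts times the products of marginal frequencies raised to the marginal counts); the function name and the nested-list table layout are hypothetical. Working on the log scale avoids overflow in the factorials.

```python
import math


def log_table_probability(table):
    """Log-probability of a haplotype count table under no association.

    `table[i][j]` is the count of haplotypes carrying allele i at the first
    locus and allele j at the second locus (hypothetical input layout).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Multinomial coefficient n! / prod(n_ij!), via log-gamma.
    log_p = math.lgamma(n + 1)
    for row in table:
        for n_ij in row:
            log_p -= math.lgamma(n_ij + 1)
    # Marginal terms (m / n) ** m for every row and column total.
    for m in row_totals + col_totals:
        if m > 0:
            log_p += m * math.log(m / n)
    return log_p
```

Because all tables visited by the Markov chain share the same margins, the marginal terms cancel when two tables are compared, so only the cell factorials matter for the ratio of probabilities.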
126. 026 1 GCCTGTCCGCGTAGCATACGGTGACGGTA
027 1 GCCTGTCTGCGTGGCATACGATGACGATG
028 1 GCCTGTCTGCGTAGCATACGATGACGATA

[[Structure]]
StructureName="A group of 3 populations analyzed for DNA"
NbGroups=1
Group={
"Population 1"
"Population 2"
"Population 3"
}

5.4 Example of microsatellite data (Genotypic)
In this example, we show how to prepare a project file consisting of microsatellite data. Four population samples are defined. Only three microsatellite loci have been analyzed in diploid individuals. The different genotypes are output on two separate lines. The frequencies of the different genotypes are listed in the second column of the first line of each genotype. Alternatively, one could just output the genotype of each individual and simply set its frequency to 1. One should, however, be careful to use different identifiers for each individual. It does not matter if different genotype labels refer to the same genotype content. Here, only a few different genotypes have been found in each of the populations, which should not correspond to most real situations, but we wanted to save space. The genotypes consist of the number of repeats found at each locus. The genetic structure to be analyzed consists of 2 groups, each made up of 2 populations. To make things clear, the genotype Genot1 in the first population has been observed 27 times. For the first locus 12 and 13 repeats were observe
127. on
For each pair of populations, the shared haplotypes will be printed out. A table then follows that contains, for every group of identified haplotypes, its absolute and relative frequency in each population. This task is only possible for haplotypic data or genotypic data with known gametic phase.

Haplotype definition
Use original definition [m]: Haplotypes are identified according to their original identifier, without considering the fact that their molecular definition could be identical.
Infer from distance matrix [m]: Similar haplotypes will be identified by computing a molecular distance matrix between haplotypes.

Haplotype frequency estimation
Estimate haplotype frequencies by mere counting [b]: Estimate the maximum-likelihood haplotype frequencies from the observed data using a mere gene counting procedure.
Estimate allele frequencies at all loci: Estimate allele frequencies at all loci separately.

6.3.8.4.2 Genotypic data with unknown gametic phase
When the gametic phase is unknown, two methods can be used to infer haplotypes: the maximum-likelihood EM algorithm or the Bayesian ELB algorithm.
[Screenshot: Haplotype inference settings (phase unknown)]
128. ond locus Note that the same allele identifier can be used in different loci This is obviously true for Dna sequences but it also holds for all other data types Profile Title An example of genotypic data with known gametic phase NbSamples 3 GenotypicData 1 GameticPhase 1 There is no recessive allel RecessiveData 0 Dat aType STANDARD LocusSeparator WHITESPACE Data Samples SampleName standard_pop1 SampleSize 20 SampleData G1 4 A D B oG G2 5 A B A A G3 3 B B B A G4 8 D C D C SampleName standard_pop2 SampleSize 10 SampleData G5 5 G6 5 OO WiC Qe Qs SampleName standard_pop3 SampleSize 15 SampleData G7 3 A Pros Manual Arlequin ver 3 1 Methodological outlines G8 12 A C BB Structure StructureName Two groups NbGroups 2 Group standard_popl Group standard_pop2 standard_pop3 49 Manual Arlequin ver 3 1 Methodological outlines 50 6 ARLEQUIN INTERFACE The interface of Arlequin ver 3 0 has been completely rewritten in C and looks like Ariequin 3 1 oog File View Options Help e Open project B iow project l View results Ey View Log file A Close project Ej Start ej Pause Stop About Ariequin Configuration Project wizard Import data About Arlequin Arlequin ver 3 1 c Laurent Excoffier 1998 2006 Computational and Molecular Population Genetics Lab CMPG Zoologic
129. ons Within Groups
Within Populations | 2N - P | SSD(WP) | $\sigma_c^2$
Total | 2N - 1 | SSD(T) | $\sigma_T^2$

where the n's and the F-statistics are defined by
$$n = \frac{2N - \sum_{g \in G}\sum_{p \in g} \frac{2N_p^2}{N_g}}{P - G}, \qquad n' = \frac{\sum_{g \in G}\sum_{p \in g} \frac{2N_p^2}{N_g} - \sum_{p \in P} \frac{2N_p^2}{N}}{G - 1}, \qquad n'' = \frac{2N - \sum_{g \in G} \frac{2N_g^2}{N}}{G - 1}$$
$$F_{CT} = \frac{\sigma_a^2}{\sigma_T^2}, \qquad F_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}, \qquad F_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_T^2}$$
If the gametic phase is known:
- We test $\sigma_c^2$ and $F_{ST}$ by permuting haplotypes among populations and among groups.
- We test $\sigma_b^2$ and $F_{SC}$ by permuting haplotypes among populations but within groups.
If the gametic phase is not known:
- We test $\sigma_c^2$ and $F_{ST}$ by permuting individual genotypes among populations and among groups.
- We test $\sigma_b^2$ and $F_{SC}$ by permuting individual genotypes among populations but within groups.
In all cases:
- We test $\sigma_a^2$ and $F_{CT}$ by permuting whole populations among groups.

7.2.1.5 Genotypic data, one population, within-individual level
Source of variation | Degrees of freedom | Sum of squares (SSD) | Expected mean squares
Among Individuals | N - 1 | SSD(AI) | $2\sigma_a^2 + \sigma_b^2$
Within Individuals | N | SSD(WI) | $\sigma_b^2$
Total | 2N - 1 | SSD(T) | $\sigma_T^2$

where $F_{IS}$ is defined as
$$F_{IS} = \frac{\sigma_a^2}{\sigma_T^2}$$
- We test $\sigma_a^2$ and $F_{IS}$ by permuting haplotypes among individuals.

7.2.1.6 Genotypic data, one group of populations, within-individual level
Source of variation | Degrees of freedom | Sum of squares (SSD) | Expected mean squares
Amo
130. opens a tab for the conversion of data files from one format to another. This might be useful for users who already have data files set up for other software packages. It is also possible to convert Arlequin data files into other formats. The currently recognized data formats are:
- Arlequin
- GenePop ver. 3.0
- Biosys ver. 1.0
- Phylip ver. 3.5
- Mega ver. 1.0
- Win Amova ver. 1.55
[Screenshot: Import/Export data tab, with the source file and format, the target file and format, the option "Load in Arlequin after translation" and the Translate button]
The translation procedure is more fully described in the Project Wizard section 6.3.5. These conversion routines were written on the basis of the description of the input file format found in the user manuals of each of the aforementioned programs. The tests done with the example files given with these programs worked fine. However, the original reading procedures of the other software packages may be more tolerant than our own, and some data may be impossible to convert. Thus some small corrections wi
131. or smaller than the original sample is recorded. This test is currently limited to sample sizes of 2000 genes or less and 1000 different alleles (haplotypes) or less. It can be used to test the hypothesis of selective neutrality and population equilibrium against either balancing selection or the presence of advantageous alleles.
References: Ewens 1972; Watterson 1978.

7.1.6.2 Ewens-Watterson-Slatkin exact test
This test is essentially similar to Watterson's (1978) test, but instead of using F as a summary statistic, it compares the probabilities of the random samples to that of the observed sample (Slatkin 1994b, 1996). The probability of obtaining a random sample having a probability smaller than or equal to that of the observed sample is recorded. The results are in general very close to those of Watterson's homozygosity test. Note that the random samples are generated as explained for the Ewens-Watterson homozygosity test.
References: Ewens 1972; Slatkin 1994b, 1996.

7.1.6.3 Chakraborty's test of population amalgamation
This test is also based on the infinite-allele model and on Ewens' (1972) sampling theory of neutral alleles. By simulation, Chakraborty (1990) has noticed that the number of alleles in a heterogeneous sample (drawn from a population resulting from the amalgamation of previously isolated populations) was larger than the number of alleles expected in a homogeneous neutral s
132. orm [i]: Set the number of parametric bootstrap replicates of the EM estimation process on random samples generated from a fictive population having haplotype frequencies equal to the previously estimated ML frequencies. This procedure is used to generate the standard deviation of haplotype frequencies. When set to zero, the standard deviations are not estimated.
No. of starting points for s.d. estimation [i]: Set the number of initial conditions for the bootstrap procedure. It may be smaller than the number of initial conditions set when estimating the haplotype frequencies, because the bootstrap replicates are quite time-consuming. Setting this number to small values is conservative, in the sense that it usually inflates the standard deviations.

6.3.8.5 Linkage disequilibrium
6.3.8.5.1 Linkage disequilibrium between pairs of loci
6.3.8.5.1.1 Gametic phase known
[Screenshot: Pairwise LD (phase known) settings, with the checkbox "Linkage disequilibrium between all pairs of loci" and "No. of steps in Markov chain" set to 10000]
133. otide level for DNA data.
$$\hat{\pi}_n = \frac{\sum_{i=1}^{k} \sum_{j<i} p_i p_j \hat{d}_{ij}}{L}$$
$$V(\hat{\pi}_n) = \frac{n+1}{3(n-1)L}\,\hat{\pi}_n + \frac{2(n^2+n+3)}{9n(n-1)}\,\hat{\pi}_n^2$$
Note that similar formulas are used for computing the average gene diversity over L loci for Microsat and Standard data, assuming no recombination and selective neutrality. As above, one should be aware that these assumptions may not hold for these data types. Note also that Arlequin outputs the standard deviation, computed as $s.d.(\hat{\pi}_n) = \sqrt{V(\hat{\pi}_n)}$.
Note that for RFLP data this measure should be considered as the average heterozygosity per RFLP site, which is different from the true diversity at the nucleotide level, for which one would need to know the base composition of the restriction sites.
References: Tajima 1983; Nei 1987, p. 257.

7.1.2.3 Theta estimators
Several methods are used to estimate the population parameter $\theta = 2Mu$, where M is equal to 2N for diploid populations of size N, or equal to N for haploid populations, and u is the overall mutation rate at the haplotype level.

7.1.2.3.1 Theta(Hom)
The expected homozygosity in a population at equilibrium between drift and mutation is usually given by
$$H = \frac{1}{\theta + 1}$$
However, Zouros (1979) has shown that this estimator was an overestimate when estimated from a single or a few loci. Although he gave no closed-form solution, Chakraborty and Weiss (1991) proposed to iteratively solve the following relationship between the expectation of and the
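To make the relationship between homozygosity and theta concrete, the sketch below simply inverts H = 1/(theta + 1). This is only the naive estimator, which the text above describes as biased for one or a few loci; it is not the iterative correction of Chakraborty and Weiss (1991). The function name is hypothetical.

```python
def theta_from_homozygosity(h):
    """Naive estimate of theta obtained by inverting H = 1 / (theta + 1).

    This is the uncorrected estimator, shown for illustration only.
    """
    if not 0.0 < h <= 1.0:
        raise ValueError("homozygosity must lie in (0, 1]")
    return 1.0 / h - 1.0
```

A sample with H = 0.5 gives theta = 1, and a monomorphic sample (H = 1) gives theta = 0.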
134. pecific FIS's [b]: Compute inbreeding coefficients ($F_{IS}$) separately for each population, and test them by permutation of gene copies between individuals within each population. The checkbox "Include individual level" must be checked to enable this option.

6.3.8.7.2 Population comparison
[Screenshot: Population comparisons settings, with the options "Slatkin's distance", "Reynold's distance", "Compute pairwise differences (pi)", "Estimate relative population sizes", "No. of permutations" (100), "Significance level" and "Compute distance matrix"]
Population comparisons [b]: Computes different indexes of dissimilarities (genetic distances) between pairs of populations, like $F_{ST}$ statistics and transformed pairwise $F_{ST}$'s that can be used as short-term genetic distances bet
135. ple of microsatellite data (Genotypic)
5.5 Example of RFLP data (Haplotypic)
5.6 Example of standard data (Genotypic data, known gametic phase)
6 Arlequin interface
6.1 Menus
6.1.1 File Menu
6.1.2 View Menu
6.1.3 Options Menu
6.1.4 Help Menu
6.2 Toolbar
6.3 Tab dialogs
6.3.1 Open project
6.3.2 Handling of unphased genotypic data
6.3.3 Arlequin Configuration
6.3.4 Project Wizard
6.3.5 Import data
6.3.6 Loaded Project
6.3.7 Batch files
6.3.8 Calculation Settings
6.3.8.1 General Settings
6.3.8.2 Diversity indices
6.3.8.3 Mismatch distribution
6.3.8.4 Haplotype inference
6.3.8.4.1 Haplotypic data, or genotypic (diploid) data with known gametic phase
6.3.8.4.2 Genotypic data with unknown gametic phase
6.3.8.5 Linkage disequilibrium
6.3.8.5.1 Linkage disequilibrium between pairs of loci
6.3.8.5.2 Hardy-Weinberg equilibrium
6.3.8.6 Neutrality tests
6.3.8.7 Genetic structure
6.3.8.7.1 AMOVA
6.3.8.7.2 Population comparison
6.3.8.7.3 Population differentiation
6.3.8.8 Genotype assignment
6.3.8.9 Mantel test
7 Methodological outlines
7.1 Intra-population level methods
7.1.1 Standard diversity indices
7.1.1.1 Gene diversity
7.1.1.2 Expected heterozygosity per locus
7.1.1.3 Number of usable loci
7.1.1.4 Number of polymorphic sites (S)
7.1.1.5 Allelic range (R)
136. plotypes

7.1.2.7.1 No. of different alleles
We simply count the number of different alleles between two haplotypes:
$$\hat{d}_{xy} = \sum_{i=1}^{L} \left(1 - \delta_{xy}(i)\right)$$
where $\delta_{xy}(i)$ is the Kronecker function, equal to 1 if the alleles of the i-th locus are identical for both haplotypes, and equal to 0 otherwise. When estimating genetic structure indices, this choice amounts to estimating weighted $F_{ST}$ statistics over all loci (Weir and Cockerham 1984; Michalakis and Excoffier 1996).

7.1.2.7.2 Sum of squared size difference
Counts the sum of the squared number of repeat differences between two haplotypes (Slatkin 1995):
$$\hat{d}_{xy} = \sum_{i=1}^{L} (a_{xi} - a_{yi})^2$$
where $a_{xi}$ is the number of repeats of the microsatellite for the i-th locus. When estimating genetic structure indices, this choice amounts to estimating an analog of Slatkin's $R_{ST}$ (1995) (see Michalakis and Excoffier 1996, as well as Rousset 1996, for details on the relationship between $F_{ST}$ and $R_{ST}$).

7.1.2.8 Estimation of distances between Standard haplotypes

7.1.2.8.1 Number of pairwise differences
Simply counts the number of different alleles between two haplotypes:
$$\hat{d}_{xy} = \sum_{i=1}^{L} \left(1 - \delta_{xy}(i)\right)$$
where $\delta_{xy}(i)$ is the Kronecker function, equal to 1 if the alleles of the i-th locus are identical for both haplotypes, and equal to 0 otherwise.
When estimating genetic structure indices, this choice amounts to estimating weighted $F_{ST}$ statistics over all loc
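The two microsatellite distances described above can be sketched in a few lines. This is an illustration under assumed conventions, not Arlequin's code: a haplotype is represented as a list of repeat counts, one per locus, and the function names are hypothetical.

```python
def count_different_alleles(x, y):
    """Number of loci at which haplotypes x and y carry different alleles."""
    if len(x) != len(y):
        raise ValueError("haplotypes must have the same number of loci")
    return sum(1 for a, b in zip(x, y) if a != b)


def sum_squared_size_difference(x, y):
    """Sum over loci of the squared difference in repeat numbers."""
    if len(x) != len(y):
        raise ValueError("haplotypes must have the same number of loci")
    return sum((a - b) ** 2 for a, b in zip(x, y))
```

For the haplotypes `[12, 13, 20]` and `[12, 15, 18]`, two loci differ and each differs by two repeats, so the first distance counts the loci while the second also weights them by the size of the difference.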
137. population sizes and amounts of electrophoretic variation of enzyme loci in natural populations. Genetics 92: 623-646.

9 APPENDIX
9.1 Overview of input file keywords
Keyword | Description | Possible values
[Profile]
Title | A title describing the present analysis | A string of alphanumeric characters within double quotes
NbSamples | The number of different samples listed in the data file | A positive integer larger than zero
DataType | The type of data to be analyzed (only one type of data per project file is allowed) | STANDARD, DNA, RFLP, MICROSAT, FREQUENCY
GenotypicData | Specifies if genotypic or gametic data is available | 0 (haplotypic data), 1 (genotypic data)
LocusSeparator | The character used to separate adjacent loci | WHITESPACE, TAB, NONE, or any character other than or the character specifying missing data. Default: WHITESPACE
GameticPhase | Specifies if the gametic phase is known (for genotypic data only) | 0 (gametic phase not known), 1 (known gametic phase). Default: 1
RecessiveData | Specifies whether recessive alleles are present at all loci (for genotypic data) | 0 (co-dominant data), 1 (recessive data). Default: 0
RecessiveAllele | Specifies the code for the recessive allele | Any string within quotation marks. This string can be explicitly used in the input file to indicate the occurrence of a recessive homozygote at one or several loci. Default: "null"
MissingData | A character used to 2
138. quared deviations Among Individuals
SSD(WP): Sum of squared deviations Within Populations
SSD(WI): Sum of squared deviations Within Individuals
SSD(AP/WG): Sum of squared deviations Among Populations, Within Groups
SSD(AI/WP): Sum of squared deviations Among Individuals, Within Populations
G: Number of groups in the structure
P: Total number of populations
N: Total number of individuals for genotypic data, or total number of gene copies for haplotypic data
$N_p$: Number of individuals in population p for genotypic data, or total number of gene copies in population p for haplotypic data
$N_g$: Number of individuals in group g for genotypic data, or total number of gene copies in group g for haplotypic data

7.2.1.1 Haplotypic data, one group of populations
Source of variation | Degrees of freedom | Sum of squares (SSD) | Expected mean squares
Among Populations | P - 1 | SSD(AP) | $n\sigma_a^2 + \sigma_b^2$
Within Populations | N - P | SSD(WP) | $\sigma_b^2$
Total | N - 1 | SSD(T) | $\sigma_T^2$

where n and $F_{ST}$ are defined by
$$n = \frac{N - \sum_{p \in P} \frac{N_p^2}{N}}{P - 1}, \qquad F_{ST} = \frac{\sigma_a^2}{\sigma_T^2}$$
- We test $\sigma_a^2$ and $F_{ST}$ by permuting haplotypes among populations.

7.2.1.2 Haplotypic data, several groups of populations
Source of variation | Degrees of freedom | Sum of squares (SSD) | Expected mean squares
Among Groups | G - 1 | SSD(AG) | $n''\sigma_a^2 + n'\sigma_b^2 + \sigma_c^2$
Among Populations Within Groups | P - G | SSD(AP/WG) | $n\sigma_b^2 + \sigma_c^2$
Within Population
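For the one-group haplotypic design of section 7.2.1.1, the coefficient n and the fixation index can be illustrated with a short sketch. It assumes the usual definitions n = (N - sum of N_p squared over N) / (P - 1) and F_ST as the ratio of the among-population variance component to the total variance, with the total taken as the sum of the two components; the function names and inputs are hypothetical, and the variance components themselves are assumed to have been estimated already.

```python
def amova_n(pop_sizes):
    """Average-sample-size coefficient n for a single group of populations.

    `pop_sizes` holds the number of gene copies N_p in each population.
    """
    p = len(pop_sizes)
    if p < 2:
        raise ValueError("at least two populations are required")
    total = sum(pop_sizes)
    return (total - sum(size * size for size in pop_sizes) / total) / (p - 1)


def fst_from_components(sigma_a2, sigma_b2):
    """F_ST = sigma_a^2 / sigma_T^2, assuming sigma_T^2 = sigma_a^2 + sigma_b^2."""
    sigma_t2 = sigma_a2 + sigma_b2
    if sigma_t2 <= 0:
        raise ValueError("total variance must be positive")
    return sigma_a2 / sigma_t2
```

With equal sample sizes, n reduces to the common sample size: two populations of 10 gene copies each give n = 10.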
139. re missing data the global variance components should be different because the degrees of freedom will vary from locus to locus and therefore the estimators of F statistics will also vary Manual Arlequin ver 3 1 Methodological outlines 128 7 2 4 Population specific Fst indices It has been proposed Weir and Hill 2002 p 734 that population specific Fsr indices could be computed such that the global Fsr index would be a weighted average of population specific F gt 7 values as P P For gt n Fori Yn i 1 i l where n is the number of gene copies sampled in the j th population Following on that we propose to use as population specific value for the th population the quantity LI 1 sspcapy sspwwe P 1 n N P 2 OT Fri which satisfies the above equation We assume here that there is a single hierarchical level with genes within populations We therefore follow the notations found in section 7 2 1 1 The option to compute these population specific Fsr indices is offered when a single group of population samples is defined for haplotypic or genotypic data Intuitively these population specific coefficients would represent the degree of evolution of particular populations from a common ancestral population which would have split into all the demes considered in the Genetic Structure These coefficients are provided here mainly to see if some populations do contribute differently than others to the average
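The weighted-average relationship between the global index and the population-specific values can be checked numerically with a small sketch. It only implements the averaging step stated above (weights equal to the number of gene copies sampled in each population); the population-specific values are taken as given, and the function name is hypothetical.

```python
def global_fst_from_specific(specific_fst, gene_copies):
    """Weighted average of population-specific F_ST values.

    `specific_fst[i]` is the value for population i and `gene_copies[i]`
    the number of gene copies sampled in it (hypothetical input layout).
    """
    if len(specific_fst) != len(gene_copies) or not gene_copies:
        raise ValueError("need one weight per population")
    weighted_sum = sum(n * f for n, f in zip(gene_copies, specific_fst))
    return weighted_sum / sum(gene_copies)
```

Populations with larger samples pull the global value toward their own coefficient, which is why unequal sample sizes matter when comparing populations.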
140. re of ancient population growth in a low-resolution mitochondrial DNA mismatch distribution. Hum Biol 66: 591-600.
Hudson R R 1990. Gene genealogies and the coalescent process. pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by Futuyma and J D Antonovics. Oxford University Press, New York.
Jin L and Nei M 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol Biol Evol 7: 82-102.
Jukes T and Cantor C 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, edited by Munro HN. New York: Academic Press, pp. 21-132.
Kimura M 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J Mol Evol 16: 111-120.
Kruskal J B 1956. On the shortest spanning subtree of a graph and the travelling salesman problem. Proc Amer Math Soc 7: 48-50.
Kumar S, Tamura K and M Nei 1993. MEGA: Molecular Evolutionary Genetic Analysis, ver. 1.0. The Pennsylvania State University, University Park, PA 16802.
Lange K 1997. Mathematical and Statistical Methods for Genetic Analysis. Springer, New York.
Levene H 1949. On a matching problem arising in genetics. Annals of Mathematical Statistics 20: 91-94.
Lewontin R C 1964. The interaction of selection and linkage: General considerations; heterotic models. Genetics 49: 49-67.
Lewontin R C and K Koji
141. rent projects
Append results: If the option Append Results is checked, the results of the current computations are appended to those of previous analyses. Otherwise, only the results of the last analysis are written in the result file, and previous results are erased.
Keep AMOVA null distributions: If this option is checked, the null distributions of $\sigma_a^2$, $\sigma_b^2$, $\sigma_c^2$ and $\sigma_d^2$ generated by an AMOVA analysis are written in files having the same name as the project file, but with the extensions .va, .vb, .vc and .vd, respectively.
Helper programs
Text editor: press the Browse button to locate the text editor you want to use to edit or view your project file and to view the Arlequin log file.

6.3.4 Project Wizard
[Screenshot: Project wizard tab, with the new project file name, the data type (STANDARD), the options "Genotypic data", "Known gametic phase" and "Recessive data", the number of samples, the locus separator (WHITESPACE), the missing data character, and the optional sections "Include haplotype list", "Include distance matrix" and "Include genetic structure"]
In order to help you quickly set up a proj
142. rting points for EM algorithm
[Screenshot: EM algorithm settings, with fields for the maximum number of iterations, "Use zipper version of EM", "Recessive data", "Estimate s.d. through bootstrap", "No. of bootstraps to perform", "No. of loci orders" and "No. of starting points for s.d. estimations"]
Use EM algorithm to estimate ML haplotype frequencies [b]: We estimate the maximum-likelihood (ML) haplotype frequencies from the observed data using an Expectation-Maximization (EM) algorithm for multi-locus genotypic data when the gametic phase is not known, or when recessive alleles are present (see section 7.1.3.2).
Perform EM algorithm at the:
Haplotype level [m]: Estimate haplotype frequencies for haplotypes defined by alleles at all loci.
Locus level [m]: Estimate allele frequencies for each locus.
Haplotype and locus levels [m]: The two previous options are performed one after the other.
Epsilon value: Threshold for stopping the EM algorithm. After each iteration, Arlequin checks if the current haplotype frequencies are different from those at the previous iteration. If the sum of differences is smaller than epsilon, the algorithm stops.
Significant digits for output: Precision required for the output of haplotype frequencies. Haplotypes having a zero frequency given the required precision are not output in the result file.
Number of starting points for
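The epsilon stopping rule described above can be sketched as a simple convergence check. This is an assumption-laden illustration: the text only says that the sum of differences between successive frequency vectors is compared to epsilon, so the use of absolute differences, the function name, and the list-based representation of haplotype frequencies are all choices made here, not details confirmed by the manual.

```python
def em_has_converged(previous, current, epsilon):
    """Return True when successive frequency estimates differ by less than epsilon.

    The difference is measured here as the sum of absolute differences
    (an assumption; the manual does not spell out the exact norm).
    """
    if len(previous) != len(current):
        raise ValueError("frequency vectors must have the same length")
    total_change = sum(abs(p - c) for p, c in zip(previous, current))
    return total_change < epsilon
```

In an EM loop, this check would be evaluated after each iteration, together with the maximum number of iterations, so that the loop ends on whichever condition is met first.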
143. ructing a table of size (no. of populations) by (no. of haplotypes) (Raymond and Rousset 1995).
No. of steps in Markov chain [i]: The maximum number of alternative tables to explore. Figures of 100,000 or more are in order. Larger values of the step number increase the precision of the P value as well as that of its estimated standard deviation.
No. of dememorisation steps [i]: The number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. This corresponds to a burn-in. A few thousand steps are necessary to reach a random starting point corresponding to a table independent from the observed table.
Generate histogram and table [b]: Generates a histogram of the number of populations which are significantly different from a given population, and a P x P table (P being the number of populations) summarizing the significant associations between pairs of populations. An association between two populations is considered as significant or not depending on the significance level specified below.
Significance level [f]: The level at which the test of differentiation is considered significant for the output table. If the P value is smaller than the significance level, then the two populations are considered as significantly different.

6.3.8.8 Genotype assignment
[Screenshot: Genotype assignment settings]
144. s. We strongly recommend that you consult the original references provided with the description of a given method if you are in doubt about any aspect of the analysis.

1.4 Data types handled by Arlequin
Arlequin can handle several types of data, either in haplotypic or genotypic form. The basic data types are:
- DNA sequences
- RFLP data
- Microsatellite data
- Standard data
- Allele frequency data
By haplotypic form, we mean that genetic data can be presented under the form of haplotypes (i.e. a combination of alleles at one or more loci). This haplotypic form can result from the analysis of haploid genomes (mtDNA, Y chromosome, prokaryotes), or from diploid genomes where the gametic phase could be inferred by one way or another. Note that allelic data are treated here as a single-locus haplotype.
Ex 1: Haplotypic RFLP data
100110100101001010
Ex 2: Haplotypic standard HLA data
DRB1 0101 DQB1 0102 DPB1 0201
By genotypic form, we mean that genetic data is presented under the form of diploid genotypes (i.e. a combination of pairs of alleles at one or more loci). Each genotype is entered on two separate lines, with the two alleles of each locus being on a different line.
Ex 1: Genotypic DNA sequence data
ACGGCA AAGCATGACATACGGATTGACA
ACGGGA TAGCATGACATTCGGATAGACA
Ex 2: Genotypic Microsatellite data
63 24 32
62 24 30
The gametic phase of a multi-locus genotype may be either known or unknown
145. [Screenshot: Arlequin settings dialog, AMOVA settings for standard AMOVA computations (haplotypic format), with the options Locus by locus AMOVA, Include individual level, No. of permutations, Compute distance matrix, and Print distance matrix.] Compared to haplotypic data, it becomes possible to compute the average inbreeding coefficient (F_IS) with diploid genotypic data. Include individual level for genotype data [b]: Include the intra-individual covariance component of genetic diversity and its associated inbreeding coefficients (F_IS and F_IT). It thus takes into account the differences between genes found within individuals. This is another way to test for global departure from Hardy-Weinberg equilibrium. Compute population s
146. s | N - P | SSD(WP) | $\sigma_c^2$
Total | N - 1 | SSD(T) | $\sigma_T^2$
where the n's and the F-statistics are defined by
$S_G = \sum_{g \in G} \sum_{p \in g} \frac{N_p^2}{N_g}$, $n = \frac{N - S_G}{P - G}$, $n' = \frac{S_G - \sum_{p \in P} N_p^2 / N}{G - 1}$, $n'' = \frac{N - \sum_{g \in G} N_g^2 / N}{G - 1}$,
$F_{CT} = \frac{\sigma_a^2}{\sigma_T^2}$, $F_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}$, and $F_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_T^2}$.
We test $\sigma_c^2$ and $F_{ST}$ by permuting haplotypes among populations among groups.
We test $\sigma_b^2$ and $F_{SC}$ by permuting haplotypes among populations within groups.
We test $\sigma_a^2$ and $F_{CT}$ by permuting populations among groups.
7.2.1.3 Genotypic data, one group of populations, no within-individual level
Source of variation | Degrees of freedom | Sum of squares (SSD) | Expected mean squares
Among Populations | P - 1 | SSD(AP) | $n\sigma_a^2 + \sigma_b^2$
Within Populations | 2N - P | SSD(WP) | $\sigma_b^2$
Total | 2N - 1 | SSD(T) | $\sigma_T^2$
where n and $F_{ST}$ are defined by $n = \frac{2N - \sum_{p \in P} 2N_p^2 / N}{P - 1}$ and $F_{ST} = \frac{\sigma_a^2}{\sigma_T^2}$.
If the gametic phase is known: we test $\sigma_a^2$ and $F_{ST}$ by permuting haplotypes among populations.
If the gametic phase is unknown: we test $\sigma_a^2$ and $F_{ST}$ by permuting individual genotypes among populations.
7.2.1.4 Genotypic data, several groups of populations, no within-individual level
Source of variation | Degrees of freedom | Sum of squares (SSD) | Expected mean squares
Among Groups | G - 1 | SSD(AG) | $n''\sigma_a^2 + n'\sigma_b^2 + \sigma_c^2$
Among Populati | P - G | SSD(AP/WG) | $n\sigma_b^2 + \sigma_c^2$
147. s test is analogous to Fisher's exact test on a two-by-two contingency table, but extended to a contingency table of arbitrary size (see section 7.1.5). If the gametic phase is unknown, the test is only possible locus by locus. For data with known gametic phase, it is also possible to test the association at the haplotypic level within individuals. No. of steps in Markov chain [i]: The maximum number of alternative tables to explore. Figures of 100,000 or more are in order. No. of dememorisation steps [i]: The number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. A few thousand steps are necessary to reach a random starting point corresponding to a table independent from the observed table. HWE test type: Locus by locus [m]: Perform a separate HWE test for each locus. Whole haplotype [m]: Perform a HWE test at the haplotype level (if gametic phase is available). Locus by locus and whole haplotype [m]: Perform both kinds of tests (if gametic phase is available). 6.3.8.6 Neutrality tests [Screenshot: Arlequin settings dialog for the neutrality tests.]
148. sample size n and θ: $E(k) = \theta \sum_{i=0}^{n-1} \frac{1}{\theta + i}$. Instead of the variance of $\hat{\theta}_k$, we give the limits ($\theta_0$ and $\theta_1$) of a 95% confidence interval around $\hat{\theta}_k$, obtained from Ewens (1972): $\Pr(\text{less than } k \text{ alleles} \mid \theta = \theta_0) = 0.025$ and $\Pr(\text{more than } k \text{ alleles} \mid \theta = \theta_1) = 0.025$. These probabilities are obtained by summing up the probabilities of observing k' alleles (k' = 0, ..., k), obtained as (Ewens 1972) $\Pr(K = k \mid \theta) = \frac{|S_n^k| \, \theta^k}{S_n(\theta)}$, where $S_n^k$ is a Stirling number of the first kind (see Abramovitz and Stegun 1970), and $S_n(\theta)$ is defined as $\theta(\theta + 1)(\theta + 2) \cdots (\theta + n - 1)$. 7.1.2.3.4 Theta(π) is estimated from the infinite-site equilibrium relationship between the mean number of pairwise differences ($\hat{\pi}$) and theta (θ): $E(\hat{\pi}) = \theta$ (Tajima 1983), and its variance $V(\hat{\pi})$ is given in section 7.1.1.1. 7.1.2.4 Mismatch distribution. It is the distribution of the observed number of differences between pairs of haplotypes. This distribution is usually multimodal in samples drawn from populations at demographic equilibrium, as it reflects the highly stochastic shape of gene trees, but it is usually unimodal in populations having passed through a recent demographic expansion (Rogers and Harpending 1992; Hudson and Slatkin 1991) or through a range expansion with high levels of migration between neighboring demes (Ray et al. 2003; Excoffier 2004). 7.1.2.4.1 Pure demographic expansion. If one assumes t
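The relationship between θ and the expected number of alleles, E(k) = θ Σ_{i=0}^{n-1} 1/(θ + i), has no closed-form inverse, so θ has to be found numerically. The following is a minimal sketch of one way to do that, using bisection. The function names, the search bounds, and the tolerance are illustrative assumptions, not Arlequin's actual routine.

```python
def expected_alleles(theta, n):
    """Ewens (1972): E(k) = theta * sum_{i=0}^{n-1} 1 / (theta + i)."""
    return sum(theta / (theta + i) for i in range(n))


def theta_k(k, n, lo=1e-8, hi=1e6, tol=1e-10):
    """Solve E(k | theta, n) = k for theta by bisection.

    Assumes 1 < k < n, because E(k) tends to 1 as theta -> 0 and to n
    as theta -> infinity. The bounds lo/hi are arbitrary choices.
    """
    if not 1 < k < n:
        raise ValueError("theta_k is only defined here for 1 < k < n")
    while hi - lo > tol * max(1.0, lo):
        mid = 0.5 * (lo + hi)
        # E(k) increases with theta, so move the bracket accordingly.
        if expected_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `theta_k(expected_alleles(2.5, 20), 20)` recovers a value very close to 2.5, which is a simple round-trip check of the solver.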
149. ser are now asked to choose a file name. Default is [project file name].ars. New sections are provided at the end of the result file in order to report summary statistics computed over all populations: Basic properties of the samples (size, no. of loci, etc.); Heterozygosity per locus; Number of alleles, and total no. of alleles over all pops; Allelic range, and total allelic range over all pops (for microsatellite data); Garza-Williamson index (for microsatellite data); Number of segregating sites, and total over all pops; Molecular diversity indices (theta values); Neutrality tests summary statistics and p-values; Demographic parameters estimated from the mismatch distribution, and p-values. New shortcuts are provided in the left pane of the html result file for F-statistics bootstrap confidence intervals, population-specific FIS, and summary of intra-population statistics. 1.14 Forthcoming developments. Linux version. Incorporation of additional population genetics methods. Suggestions are welcome, but we only have one life. 1.15 Reporting bugs and comments. Problems can be reported on the Arlequin Forum located on the Genetic Software Forum (GSF) at http://www.rannala.org/gsf, hosted by Bruce Rannala. 1.16 Remaining problems. Missing data are not handled properly in the estimation of haplotype frequencies via the EM algorithm and in tests of linkage disequilibrium since the character string coding for missi
150. sion, this file can be read with MS Excel without modification. The format of the file is tab-separated. 4.4 View your results in HTML browser. For very large result files, or result files containing the product of several analyses, it may be of practical interest to view the results in an HTML browser. This can be done simply by activating the button Browse results of the project tab panel, which will then load the result files into your default web browser. In the web browser, the file [project name]_main.htm shows two panes. 1. The left pane contains a tree where each first-level branch corresponds to a run. For each run we have several entries corresponding to the settings used for the calculation, the inter-population analyses (genetic structure, shared haplotypes, etc.), and finally all intra-population analyses, with one entry per population sample. The description of this tree is stored in [project name]_tree.html. At this point it is important to notice that this tree uses the JavaScript files ftiens4.js and ua.js located in Arlequin's installation directory. If you move Arlequin to another location or uninstall Arlequin, the left pane will no longer work. 2. The right pane shows the results concerning the selected item in the left pane. The HTML code of this pane is in the main result file. This file is located in the result sub-directory of your project and is named project name
151. [Screenshot: Arlequin batch file dialog, listing the project files of the batch and the checkboxes of the results to summarize.] The project files found in the selected batch file appear listed in the left pane window. Use associated settings [b]: Use this button if you have prepared settings files associated with each project. Use interface settings [b]: Use this button if you want to use the same predefined calculation settings for all project files. Results to summarize: This option allows you to collect a summary of the results for each file found in the batch list. These results are written in different files having the extension .sum. These summary files will be placed into the same directory as the batch file. List of summary files created by activating different checkboxes (columns: Checkbox, Summary file, Description). Checkboxes: Gene diversity; Nucleotide composition; Molecular diversity; Mismatch distribution; Theta values; Linkage disequilibrium; Hardy-Weinberg; Tajima's test; Fu's Fs test; Ewens-Watterson; Chakraborty's test; Population comparisons; Allele frequencies. Summary files: gen_div.sum nucl_comp s
152. [Screenshot: Arlequin project settings, showing the result file mtDNAHV1_main.htm and the Polymorphism control options (allowed missing level per site 0.05; transition, transversion and deletion weights; haplotype definition: use original definition or infer from distance matrix).] Project file [r]: The name of the project file containing the data to be analyzed; it usually has the .arp extension. Result files: The html file containing the results of the analyses generated by Arlequin; it has the same name as the project file, but the .htm extension. Polymorphism control. Allowed missing level per site [f]: Specify the fraction of missing data allowed for any locus to be taken into account in the analyses. For instance, a level of 0.05 means that a locus with more than 5% of missing data will not be considered in any analysis. This option is especially useful when dealing with DNA data where different individuals have been sequenced for slightly different fragments. Setting a level of zero will force the analysis to consider only those sites that have been sequenced in all individuals. Alternatively, choosing a level of one means that all sites will be considered in the analyses, even if they have not been sequenced in any individual (not a very smart choice, however). Transversion weight [f]: The weight given to transversions when comparing
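The effect of the allowed missing level per site can be illustrated with a small sketch. It keeps a site only if its fraction of missing data does not exceed the chosen level, as described above. The function name, the use of "?" as the missing-data code, and the representation of sequences as equal-length strings are assumptions made for this illustration only.

```python
def usable_sites(sequences, allowed_missing=0.05, missing="?"):
    """Return the indices of sites whose missing fraction is <= allowed_missing.

    `sequences` is a list of equal-length strings (one per individual);
    `missing` is the (assumed) character coding for missing data.
    """
    n_individuals = len(sequences)
    kept = []
    for site in range(len(sequences[0])):
        n_missing = sum(1 for seq in sequences if seq[site] == missing)
        # A site with MORE than the allowed fraction missing is dropped.
        if n_missing / n_individuals <= allowed_missing:
            kept.append(site)
    return kept
```

With a level of zero only fully sequenced sites survive, and with a level of one every site is kept, which matches the two extreme cases described in the text.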
153. standard error on P is estimated by subdividing the total amount of required steps into B batches (see Guo and Thompson 1992, p. 367). A P-value is calculated separately for each batch. Let us denote it by $P_i$ ($i = 1, \ldots, B$). The estimated standard error, s.d.(P), is then calculated from these batch values. The process is stopped as soon as the estimated standard deviation is smaller than a pre-defined value specified by the user. Reference: Raymond and Rousset 1995. 7.1.4.2 Likelihood-ratio test of linkage disequilibrium (genotypic data, gametic phase unknown). For genotypic data where the haplotypic phase is unknown, the test based on the Markov chain described above is not possible, because the haplotypic composition of the sample is unknown and is just estimated. Therefore, linkage disequilibrium between a pair of loci is tested for genotypic data using a likelihood-ratio test, whose empirical distribution is obtained by a permutation procedure (Slatkin and Excoffier 1996). The likelihood of the data assuming linkage equilibrium ($L_{H^*}$) is computed by using the fact that, under this hypothesis, the haplotype frequencies are obtained as the product of the allele frequencies. The likelihood of the data not assuming linkage equilibrium ($L_H$) is obtained by applying the EM algorithm to estimate haplotype frequencies. The likelihood-ratio statistic, given by $S = -2 \log(L_{H^*} / L_H)$, should in princ
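The batch procedure for the standard error of the P-value can be sketched as follows. The exact formula is not legible in this excerpt, so the sketch assumes the usual batch-means estimator, s.d.(P) = sqrt( Σ (P − P_i)² / (B(B − 1)) ), where P is the mean of the B batch values. The function name is hypothetical.

```python
import math


def batch_standard_error(batch_p_values):
    """Return (mean P, estimated s.d. of P) from per-batch P-values.

    Assumes the standard batch-means estimator:
    s.d.(P) = sqrt(sum((P - P_i)^2) / (B * (B - 1))).
    """
    b = len(batch_p_values)
    if b < 2:
        raise ValueError("at least two batches are needed")
    mean_p = sum(batch_p_values) / b
    sum_sq = sum((mean_p - p) ** 2 for p in batch_p_values)
    return mean_p, math.sqrt(sum_sq / (b * (b - 1)))
```

A stopping rule of the kind described above would then compare the second returned value with the user-defined threshold after each additional batch.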
154. [Screenshot: Arlequin settings dialog for linkage disequilibrium, with the options LD coefficients between pairs of alleles at different loci (compute D, D' and r2 coefficients), Generate histograms and table in file LD_DIS.XL, and Significance level 0.05.] Linkage disequilibrium between all pairs of loci [b]: Test for the presence of significant association between pairs of loci, based on an exact test of linkage disequilibrium. This test can be done with all data types except the FREQUENCY data type. The number of loci can be arbitrary, but if there are fewer than two polymorphic loci, there is no point in performing this test. The test procedure is analogous to Fisher's exact test on a two-by-two contingency table, but extended to a contingency table of arbitrary size (see section 7.1.4.1). No. of steps in Markov chain [i]: The maximum number of alternative tables to explore. Figures of 100,000 or more are in order. Larger values will lead to a better precision of the P-value as well as of its estimated standard deviation. No. of dememorization steps [i]: The number of steps to perform before beginning to compare the alternative table probabilities to that of the observed table. It corresponds to a burn-in. A few thousand steps are
155. ter used to code for missing data. Notation: MissingData=. Possible values: a character used to specify the code for missing data, entered between single or double quotes. Example: MissingData= Default value: If haplotype or phenotype frequencies are entered as absolute or relative values. Notation: Frequency=. Possible values: ABS (absolute values), REL (relative values; absolute values will be found by multiplying the relative frequencies by the sample sizes). Example: Frequency=ABS. Default value: ABS. The number of significant digits for haplotype frequency outputs. Notation: FrequencyThreshold=. Possible values: a real number between 1e-2 and 1e-7. Example: FrequencyThreshold=0.00001. Default value: 1e-5. The convergence criterion for the EM algorithm used to estimate haplotype frequencies and linkage disequilibrium from genotypic data. Notation: EpsilonValue=. Possible values: a real number between 1e-7 and 1e-12. Example: EpsilonValue=1e-10. Default value: 1e-7. 3.2.2 Data section. This section contains the raw data to be analyzed. The beginning of the data section is indicated by the keyword [Data]. It contains several sub-sections. 3.2.2.1 Haplotype list (optional). In this sub-section, one can define a list of the haplotypes that are used for all samples. This section is most useful in order to avoid repeating the allelic content of the
156. tion extensions of the Mantel Test of matrix correspondence. Systematic Zoology 35: 627-632. Sokal, R. R. and F. J. Rohlf, 1981. Biometry, 2nd edition. W. H. Freeman and Co., San Francisco, CA. Stewart, F. M., 1977. Computer algorithm for obtaining a random set of allele frequencies for a locus in an equilibrium population. Genetics 86: 482-483. Strobeck, K., 1987. Average number of nucleotide differences in a sample from a single subpopulation: a test for population subdivision. Genetics 117: 149-153. Tajima, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460. Tajima, F., 1989a. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595. Tajima, F., 1989b. The effect of change in population size on DNA polymorphism. Genetics 123: 597-601. Tajima, F., 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A. G., Tokyo, Sunderland, MA: Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37-59. Tajima, F. and Nei, M., 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285. Tajima, F., 1996. The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics 143: 1457-1465. Tamura, K., 1992. Est
157. tion between genetic distances and one or two other distance matrices are defined in this section. One must specify: The size of the matrices used for the Mantel test. Notation: MatrixSize=. Possible values: any positive integer value. Example: MatrixSize=5. The number of matrices among which we compute the correlations. If this number is 2, the correlation coefficient between the YMatrix (see next keyword) and the matrix defined after the DistMatMantel keyword is computed. If this number is 3, the partial correlations between the YMatrix (see next keyword) and the two other matrices are computed. In this case, the Mantel section should contain two DistMatMantel keywords, each followed by the definition of a distance matrix. Notation: MatrixNumber=. Example: MatrixNumber=2. The matrix that is used as genetic distance. If the value is fst, then the correlation between the population pairwise Fst matrix and another matrix is computed. If the value is custom, then the correlation between a project-defined matrix and another matrix is computed. Notation: YMatrix=. Possible values and corresponding YMatrix: fst: Y = Fst; log_fst: Y = log(Fst); slatkinlinearfst: Y = Fst/(1 - Fst); log_slatkinlinearfst: Y = log(Fst/(1 - Fst)); nm: Y = (1 - Fst)/(2 Fst); custom: Y = user-specified in the project. Example: YMatrix="fst". Labels that identify the columns of the YMatrix. In case of YMatrix="fst", the labels should be the names of the populations from which we use
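The YMatrix options listed above are simple element-wise transformations of a pairwise Fst value. The sketch below spells them out. The function name is hypothetical, and the use of the natural logarithm is an assumption, since the excerpt does not state the base of the log.

```python
import math


def transform_fst(fst, kind="fst"):
    """Apply one of the YMatrix transformations to a single Fst value.

    `kind` mirrors the keyword values listed in the manual; the natural
    logarithm is assumed for the log_* variants.
    """
    if kind == "fst":
        return fst
    if kind == "log_fst":
        return math.log(fst)
    if kind == "slatkinlinearfst":
        return fst / (1.0 - fst)
    if kind == "log_slatkinlinearfst":
        return math.log(fst / (1.0 - fst))
    if kind == "nm":
        return (1.0 - fst) / (2.0 * fst)
    raise ValueError(f"unknown YMatrix kind: {kind}")
```

Applying such a function to every off-diagonal entry of the pairwise Fst matrix would give the Y matrix that enters the Mantel correlation; the log and nm variants are undefined for Fst values of zero or below.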
158. tion is needed for such data. Population samples are then only compared for their allelic frequencies. 1.5 Methods implemented in Arlequin. The analyses Arlequin can perform on the data fall into two main categories: intra-population and inter-population methods. In the first category, statistical information is extracted independently from each population, whereas in the second category, samples are compared to each other. Intra-population methods: Standard indices; Molecular diversity; Mismatch distribution; Haplotype frequency estimation; Gametic phase estimation (NEW); Linkage disequilibrium; Hardy-Weinberg equilibrium; Tajima's neutrality test (infinite-site model); Fu's Fs neutrality test (infinite-site model); Ewens-Watterson neutrality test (infinite-allele model); Chakraborty's amalgamation test (infinite-allele model); Minimum Spanning Network (MSN). Short descriptions: Standard indices: Some diversity measures like the number of polymorphic sites, gene diversity. Molecular diversity: Calculates several diversity indices like nucleotide diversity, different estimators of the population parameter θ. Mismatch distribution: The distribution of the number of pairwise differences between haplotypes, from which parameters of a demographic (NEW) or spatial population expansion can be estimated. Haplotype frequency estimation: Estimates the frequency of haplotypes present in the population by maximum likelihood methods. Gametic phase estimation: Estimates the most
159. tions and among groups. We test $\sigma_c^2$ and $F_{IS}$ by permuting haplotypes among individuals within populations. We test $\sigma_b^2$ and $F_{SC}$ by permuting individual genotypes among populations but within groups. We test $\sigma_a^2$ and $F_{CT}$ by permuting populations among groups. 7.2.2 Minimum Spanning Network (MSN) among haplotypes. It is possible to compute the Minimum Spanning Tree (MST) and Minimum Spanning Network (MSN) from the squared distance matrix among haplotypes used for the calculation of F-statistics in the AMOVA procedure. See section 7.1.2.9 for a brief description of the method and references. 7.2.3 Locus-by-locus AMOVA. AMOVA analyses can now be performed for each locus separately, in the same way as at the haplotype level. Variance components and F-statistics are estimated for each locus separately and listed in a global table. The variance components from the different loci are combined to produce synthetic estimators of F-statistics: the variance components estimated at a given level in the hierarchy are summed over loci in the numerator and in the denominator, and the F-statistics are obtained as ratios of these sums. Therefore, the global F-statistics are not obtained as an arithmetic average of each locus' F-statistics (see e.g. Weir and Cockerham 1984, or Weir 1996). If there is no missing data, the locus-by-locus and the haplotype analyses should lead to identical sums of squares, variance components, and F-statistics. If there a
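The difference between a ratio of summed variance components and an average of per-locus ratios can be made concrete with a short sketch. The function names and the two-locus numbers are purely illustrative; only the principle (sum numerators and denominators over loci before dividing) comes from the text above.

```python
def global_f_statistic(among_components, total_components):
    """Combine loci as a ratio of sums of variance components."""
    return sum(among_components) / sum(total_components)


def mean_of_locus_f_statistics(among_components, total_components):
    """Naive alternative: arithmetic mean of the per-locus ratios."""
    ratios = [a / t for a, t in zip(among_components, total_components)]
    return sum(ratios) / len(ratios)


# Hypothetical per-locus variance components for two loci.
sigma_a = [0.1, 0.9]   # among-population component at each locus
sigma_t = [1.0, 3.0]   # total variance at each locus
```

With these made-up values the ratio of sums is 1.0 / 4.0 = 0.25, whereas the mean of the per-locus ratios is (0.1 + 0.3) / 2 = 0.2, so loci with more total variance weigh more in the combined estimator.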
160. tives with different selected options. The statistical tests implemented in Arlequin have been chosen so as to minimize hidden assumptions and to be as powerful as possible. Thus, they often take the form of either permutation tests or exact tests, with some exceptions. Finally, we wanted Arlequin to be able to handle genetic data in many different forms, and to try to carry out the same types of analyses irrespective of the format of the data. Because Arlequin has a rich set of features and many options, the user has to spend some time learning them. However, we hope that the learning curve will not be that steep. Arlequin is made available free of charge, as long as we have enough local resources to support the development of the program. 1.3 About this manual. The main purpose of this manual is to allow you to use Arlequin on your own, in order to limit e-mail exchange with us as far as possible. In this manual, we have tried to provide a description of: 1. The data types handled by Arlequin. 2. The way these data should be formatted before the analyses. 3. The graphical interface. 4. The impact of different options on the computations. 5. Methodological outlines describing which computations are actually performed by Arlequin. Even though this manual contains the description of some theoretical aspects, it should not be considered as a textbook in basic population genetic
161. tribution [Table: analyses available for each data type (cell marks omitted). Rows: Haplotype or allele frequency estimation; Linkage disequilibrium; Hardy-Weinberg equilibrium; Tajima's neutrality test; Fu's neutrality test; Ewens-Watterson neutrality tests; Chakraborty's amalgamation test; Search for shared haplotypes between samples; AMOVA; Minimum Spanning Network (1); Pairwise genetic distances; Exact test of population differentiation; Individual assignment test; Mantel test.] G+: Genotypic data, gametic phase known. G-: Genotypic data, gametic phase unknown. H: Haplotypic data. (1) Computation of a minimum spanning network between haplotypes is only possible if a distance matrix is provided or if it can be computed from the data. 7.1 Intra-population level methods. 7.1.1 Standard diversity indices. 7.1.1.1 Gene diversity. Equivalent to the expected heterozygosity for diploid data. It is defined as the probability that two randomly chosen haplotypes are different in the sample. Gene diversity and its sampling variance are estimated as $\hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right)$ and $V(\hat{H}) = \frac{2}{n(n-1)}\left\{2(n-2)\left[\sum_{i=1}^{k} p_i^3 - \left(\sum_{i=1}^{k} p_i^2\right)^2\right] + \sum_{i=1}^{k} p_i^2 - \left(\sum_{i=1}^{k} p_i^2\right)^2\right\}$, where n is the number of gene copies in the sample, k is the number of
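As an illustration of the gene diversity computation, the sketch below assumes the standard unbiased estimator H = n/(n − 1) (1 − Σ p_i²) and the sampling variance V(H) = 2/(n(n − 1)) {2(n − 2)[Σ p_i³ − (Σ p_i²)²] + Σ p_i² − (Σ p_i²)²}. The function name and the input format (a list of haplotype counts) are choices made for this example.

```python
def gene_diversity(haplotype_counts):
    """Return (H, V(H)) from absolute haplotype counts in one sample.

    Assumes H = n/(n-1) * (1 - sum p_i^2) and the sampling variance
    V(H) = 2/(n(n-1)) * {2(n-2)[sum p^3 - (sum p^2)^2]
                          + sum p^2 - (sum p^2)^2}.
    """
    n = sum(haplotype_counts)
    if n < 2:
        raise ValueError("at least two gene copies are needed")
    freqs = [c / n for c in haplotype_counts]
    s2 = sum(p ** 2 for p in freqs)
    s3 = sum(p ** 3 for p in freqs)
    h = n / (n - 1) * (1.0 - s2)
    var = 2.0 / (n * (n - 1)) * (2 * (n - 2) * (s3 - s2 ** 2) + s2 - s2 ** 2)
    return h, var
```

For a sample of four gene copies split evenly between two haplotypes, `gene_diversity([2, 2])` gives H = 2/3, which is the probability of drawing two different haplotypes without replacement.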
162. trix will be the pairwise Fst matrix between the populations listed after YMatrixLabels. The partial correlations will be based on the 3 by 3 matrix whose labels are listed after UsedYMatrixLabels.
[Mantel]
# size of the distance matrix
MatrixSize=5
# number of declared matrixes
MatrixNumber=3
# what to be taken as the YMatrix
YMatrix="Fst"
# Labels to identify matrix entry and Population
YMatrixLabels pop pop pop pop pop 1 2 3 4 5
# distance matrix
DistMatMantel 0 1 0 0 0 00 20 17 00 lt 12 00 84 123 44 OrROO 0 00 0 23 0 0 21 0
# distance matrix
DistMatMantel 0 00 O OOW 20 47 00 22 00 76 LAB 37 OrROO 0 00 0 37 0 Ose 2s lt 0 X1 00 12 0 00 X2 00 38 0 00
UsedYMatrixLabels pop pop pop 1 3 4
Example 2: we compute the correlation between the YMatrix and another matrix (X1). The YMatrix will be defined after the keyword YMatrix. The correlation will be based on the 3 by 3 matrix whose labels are listed after UsedYMatrixLabels.
[Mantel]
# size of the distance matrix
MatrixSize=5
# number of declared matrixes (1 or 2)
MatrixNumber=2
# what to be taken as YMatrix
YMatrix="Custom"
# Labels to identify matrix entry and Population
YMatrixLabels me wan DN TAN Wn
163. tually equivalent to a continent-island model, where the sampled deme would exchange migrants at rate m with a unique population of infinite size. Some T generations in the past, the continent-island system would be reduced to a single deme of size $N_0$, as in the following figure. [Figure: Continent-island model, showing the sampled deme after the expansion (exchanging migrants at rate m) and before the expansion (T generations ago, size $N_0$).] Under this simple model, the probability that two genes currently sampled in the small deme of size N differ at S sites is given by Me C 0 ae 0 ue 07 Ca (Excoffier 2004) Fy S M Q Q T l i AS gt M D D S jt jas where $\theta_0 = 2N_0 u$, $\theta_1 = 2N u$, $\tau = 2Tu$, and A 01 M 1 and Cag TAF. In Arlequin, we estimate the three parameters of a spatial expansion, τ, θ = θ_0 = θ_1 (here we assume that N = N_0), and M = 2Nm, using the same least-square method as described in the case of the estimation of the parameters of a demographic expansion (see section 7.1.2.4.1). As for the demographic expansion, we also provide the expected mismatch distribution and test the fit to the model by coalescent simulations of an instantaneous expansion under the continent-island model defined above. References: Ray et al. 2003; Excoffier 2004. 7.1.2.5 Estimation of genetic distances between DNA sequences. Definitions: L: Number of loci. Gamma correction: This correction is proposed when the mutation rates cannot be ass
164. u, Jérôme Goudet, François Balloux, Eric Petit, Ettore Randi, Natacha Mesquita, David Foltz, Guoqing Lu, Tomas Hrbek, Corinne Zeroual, Rod Norman, Chew Kiat Heng, Russell Pfau, April Harlin, S. Kark, Jenny Ovenden, Jill Shanahan, and all the other users or beta-testers of Arlequin who have sent us their comments. 1.12 How to get the last version of the Arlequin software? Arlequin will be updated regularly and can be freely retrieved on http://cmpg.unibe.ch/software/arlequin3. 1.13 What's new in version 3.1. 1.13.1 Version 3.0 compared to version 2. Arlequin version 3 now integrates the core computational routines and the interface in a single program written in C++. Therefore, Arlequin does not rely on Java anymore. This has two consequences: the new graphical interface is nicer and faster, but it is less portable than before. At the moment we release a Windows version (2000, XP, and above), and we shall probably release a Linux version later. Support for the Mac has been discontinued. Other main changes include: 1. Correction of many small bugs. 2. Incorporation of two new methods to estimate gametic phase and haplotype frequencies: a. EM zipper algorithm: An extension of the EM algorithm allowing one to handle a larger number of polymorphic sites than the plain EM algorithm. b. ELB algorithm: a pseudo-Bayesian approach to specifically estimate gametic phase in recombining sequenc
165. [Screenshot: Arlequin settings dialog for pairwise linkage disequilibrium, with Significance level 0.05.] Linkage disequilibrium between all pairs of loci [b]: Perform the likelihood-ratio test (see section 7.1.4.2). No. of permutations [i]: Number of random permuted samples to generate. Figures of several thousands are in order, and 16,000 permutations guarantee to have less than 1% difference with the exact probability in 99% of the cases (Guo and Thomson 1992). A standard error for the estimated P-value is estimated using a system of batches (Guo and Thomson 1992). No. of initial conditions for EM [i]: Sets the number of random initial conditions from which the EM algorithm is started to repeatedly estimate the sample likelihood. The haplotype frequencies globally maximizing the sample likelihood will eventually be kept. Figures of 3 or more are in order. Generate histogram and table [b]: Generates a histogram of the number of loci with which each locus is in disequilibrium, and an S by S table (S being the number of polymorphic loci) summarizing the significant associations between pairs of loci. This table is generated for different levels of polymorphism, controlled by the value y: a locus is declared polymorphic if there are at least 2 alleles with y copies in the sample (Slatkin 1994a). This is done because the exact test is more
166. [Screenshot: Arlequin settings dialog for genotype assignment, with the checkbox Perform genotype assignment for all pairs of populations.] Perform genotype assignment for all pairs of populations: Computes the log-likelihood of the genotype of each individual in every sample, as if it were drawn from a population sample having allele frequencies equal to those estimated for each sample (Paetkau et al. 1997; Waser and Strobeck 1998). Multi-locus genotype likelihoods are computed as the product of each locus likelihood, thus assuming that the loci are independent. The output result file lists, for each population, a table of the log-likelihood of each individual genotype in all populations (see section 7.2.7). 6.3.8.9 Mantel test [Screenshot: Arlequin settings dialog for the Mantel test.]
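The multi-locus log-likelihood described above can be sketched in a few lines. The product over loci comes from the text; the per-locus genotype probabilities (p² for a homozygote, 2pq for a heterozygote, i.e. Hardy-Weinberg proportions) are an assumption of this sketch, as are the function name and the data layout.

```python
import math


def genotype_log_likelihood(genotype, allele_freqs):
    """Log-likelihood of a multi-locus diploid genotype in one population.

    genotype:     list of (allele1, allele2) tuples, one per locus
    allele_freqs: list of {allele: frequency} dicts, one per locus
    Assumes independent loci and Hardy-Weinberg genotype proportions.
    """
    log_lik = 0.0
    for (a1, a2), freqs in zip(genotype, allele_freqs):
        p = freqs.get(a1, 0.0)
        q = freqs.get(a2, 0.0)
        locus_lik = p * p if a1 == a2 else 2.0 * p * q
        if locus_lik == 0.0:
            # An allele absent from the population gives zero likelihood.
            return float("-inf")
        log_lik += math.log(locus_lik)  # product of loci = sum of logs
    return log_lik
```

Evaluating the same genotype against the allele frequencies of each sample in turn would fill one row of the per-population table mentioned in the text.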
167. Summary file | Description
gen_div.sum | Gene diversity of each sample
nucl_comp.sum | Nucleotide composition of each sample
mold_div.sum | Molecular diversity indexes of each sample
mismatch.sum | Mismatch distribution for each sample
theta.sum | Different theta values for each sample
d_pro.sum | Significance level of linkage disequilibrium for each pair of loci
link_dis.sum | Number of significantly linked loci per locus
hw.sum | Test of departure from Hardy-Weinberg equilibrium
tajima.sum | Tajima's test of selective neutrality
fu_fs.sum | Fu's Fs test of selective neutrality
ewens.sum | Ewens-Watterson tests of selective neutrality
chakra.sum | Chakraborty's test of population amalgamation
coanst_c.sum | Matrix of Reynolds' genetic distances (in linear form)
NM_value.sum | Matrix of Nm values between pairs of populations (in linear form)
slatkin.sum | Matrix of Slatkin's genetic distance (in linear form)
tau_uneq.sum | Matrix of divergence times between populations, taking into account unequal population sizes (in linear form)
pairdiff.sum | Matrix of mean number of pairwise differences between pairs of samples (in linear form)
pairdist.sum | Different genetic distances for each pair of populations (only clearly readable if there are 2 samples in the project)
allele_freqs.sum | List of allele frequencies for all populations in turn. It becomes difficult to read when more than a single population is present in the project file.
168. umed as uniform for all sites. It had been originally proposed for mutation rates among amino acids (Uzell and Corbin 1971), but it seems also to be the case of the control region of human mtDNA (Wakeley 1993). In such a case, a Gamma distribution of mutation rates is often assumed. The shape of this distribution (the unevenness of the mutation rates) is mainly controlled by a parameter a, which is the inverse of the coefficient of variation of the mutation rate. The smaller the a coefficient, the more uneven the mutation rates. A uniform mutation rate corresponds to the case where a is equal to infinity. $n_d$: Number of observed substitutions between two DNA sequences. $n_s$: Number of observed transitions between two DNA sequences. $n_v$: Number of observed transversions between two DNA sequences. ω: G+C ratio, computed on all the DNA sequences of a given sample. 7.1.2.5.1 Pairwise difference. Outputs the number of loci for which two haplotypes are different: $\hat{d} = n_d$, $V(\hat{d}) = \hat{d}(L - \hat{d})/L$. 7.1.2.5.2 Percentage difference. Outputs the percentage of loci for which two haplotypes are different: $\hat{d} = n_d / L$, $V(\hat{d}) = \hat{d}(1 - \hat{d})/L$. 7.1.2.5.3 Jukes and Cantor. Outputs a corrected percentage of nucleotides for which two haplotypes are different. The correction allows for multiple substitutions per site since the most recent common ancestor of the two DNA sequences. The correction also assumes th
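The two simplest distances above can be computed directly from a pair of aligned sequences. This sketch assumes d = n_d with V(d) = d(L − d)/L for the pairwise difference, and d = n_d / L with V(d) = d(1 − d)/L for the proportion of differences. The function names are illustrative, and gaps or missing data are not treated specially here.

```python
def count_differences(seq1, seq2):
    """Number of sites (n_d) at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


def pairwise_difference(seq1, seq2):
    """Return (d, V(d)) with d = n_d and V(d) = d (L - d) / L."""
    length = len(seq1)
    d = count_differences(seq1, seq2)
    return d, d * (length - d) / length


def proportion_difference(seq1, seq2):
    """Return (d, V(d)) with d = n_d / L and V(d) = d (1 - d) / L."""
    length = len(seq1)
    d = count_differences(seq1, seq2) / length
    return d, d * (1.0 - d) / length
```

For the sequences "ACGT" and "ACGA", the first function pair gives d = 1 with variance 0.75, and the proportion version gives d = 0.25 with variance 0.046875.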
169. under the hypothesis of panmixia.

Assignment test of genotypes: Assignment of individual genotypes to particular populations according to estimated allele frequencies.
Mantel test: Correlations or partial correlations between a set of 2 or 3 matrices. Can be used to test for the presence of isolation by distance.

1.6 System requirements
• Windows 95/98/NT/2000/XP.
• A minimum of 64 MB RAM, and more to avoid swapping.
• At least 10 MB free hard disk space.

1.7 Installing and uninstalling Arlequin
1.7.1 Installation
1.7.1.1 Arlequin 3 installation
1. Download Arlequin3.zip to any temporary directory.
2. Extract all files contained in Arlequin3.zip in the directory of your choice.
3. Start Arlequin by double-clicking on the file WinArl3.exe, which is the main executable file.

1.7.1.2 Arlequin 3 uninstallation
Simply delete the directory where you installed Arlequin. The registries were not modified by the installation of Arlequin.

1.8 List of files included in the Arlequin package
Arlequin files (files marked "required" are needed by Arlequin to run properly):
WinArl3.exe (required): Arlequin main application file, including graphical interface and computational routines.
Arlequin.ini (required): A file containing the description of the last custom settings defined by the user. NOT TO BE MODIFIED BY HAND.
Arl_run.ars: A file containing all the computation settings
170. unknown parameter $\theta$ (Zouros 1979) E 6 of 2a 2 0X3 0 starting with a first estimate of $\hat{\theta}_H$ of 1 H H and equating it to its expectation. Chakraborty and Weiss (1991) give an approximate formula for the standard error of $\hat{\theta}_H$ as 2 0 3 0 s d H HH 1 A 2 0 B 0 4 6 10 2 0 4 s d where s.d.(H) is the standard error of H given in section 7.1.1.1.

For MICROSAT data, Ohta and Kimura (1973) have shown that the expected homozygosity in stationary populations under a pure stepwise mutation model was equal to

$$E(\mathrm{Hom}) = \frac{1}{\sqrt{1 + 2\theta}}$$

where $\theta = 4Nu$ for diploids and $\theta = 2Nu$ for haploid systems. It follows that an estimator of $\theta$ can be obtained for microsatellite data as

$$\hat{\theta}_H = \frac{1}{2}\left[\frac{1}{(1 - H)^2} - 1\right]$$

where H is the expected heterozygosity, estimated as in section 7.1.1.2.

7.1.2.3.2 Theta(S)
$\hat{\theta}_S$ is estimated from the infinite-site equilibrium relationship (Watterson 1975) between the number of segregating sites (S), the sample size (n), and $\theta$ for a sample of non-recombining DNA:

$$S = \theta\, a_1, \qquad \text{where } a_1 = \sum_{i=1}^{n-1} \frac{1}{i}.$$

The variance of $\hat{\theta}_S$ is obtained as

$$V(\hat{\theta}_S) = \frac{a_1^2 S + a_2 S^2}{a_1^2\,(a_1^2 + a_2)} \quad \text{(Tajima 1989)}, \qquad \text{where } a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2}.$$

7.1.2.3.3 Theta(k)
$\hat{\theta}_k$ is estimated from the infinite-allele equilibrium relationship (Ewens 1972) between the expected number of alleles (k), the
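Two of the estimators in this excerpt are simple enough to compute directly. The sketch below assumes the standard forms of these relationships: Watterson's estimator S / a1 with Tajima's (1989) variance, and the inversion of E(Hom) = 1 / sqrt(1 + 2 theta) for microsatellite data. Function names and inputs are illustrative, not Arlequin's.

```python
def harmonic_sums(n):
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n-1."""
    if n < 2:
        raise ValueError("sample size n must be at least 2")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2


def theta_s(seg_sites, n):
    """Theta(S) from S = theta * a1, with its variance.

    Variance used here: (a1^2 S + a2 S^2) / (a1^2 (a1^2 + a2)).
    """
    a1, a2 = harmonic_sums(n)
    theta = seg_sites / a1
    var = (a1 ** 2 * seg_sites + a2 * seg_sites ** 2) / (a1 ** 2 * (a1 ** 2 + a2))
    return theta, var


def theta_h_microsat(h):
    """Theta(Hom) under a pure stepwise mutation model.

    Solves 1 - h = 1 / sqrt(1 + 2 theta) for theta, where h is the
    expected heterozygosity (0 <= h < 1).
    """
    if not 0.0 <= h < 1.0:
        raise ValueError("heterozygosity must lie in [0, 1)")
    return 0.5 * (1.0 / (1.0 - h) ** 2 - 1.0)


if __name__ == "__main__":
    print(theta_s(seg_sites=10, n=20))   # hypothetical sample
    print(theta_h_microsat(0.5))
```

The iterative Theta(Hom) correction of Chakraborty and Weiss (1991) is not attempted here, because its formula is not recoverable from this excerpt.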
171. usset, F. 1996. Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics 142: 1357-1362.
Rousset, F. 2000. Inferences from spatial population genetics, in: Handbook of Statistical Genetics, D. Balding, M. Bishop and C. Cannings (eds.). Wiley & Sons, Ltd.
Schneider, S. and L. Excoffier. 1999. Estimation of demographic parameters from the distribution of pairwise differences when the mutation rates vary among sites: Application to human mitochondrial DNA. Genetics 152: 1079-1089.
Slatkin, M. 1991. Inbreeding coefficients and coalescence times. Genet. Res. Camb. 58: 167-175.
Slatkin, M. 1994a. Linkage disequilibrium in growing and stable populations. Genetics 137: 331-336.
Slatkin, M. 1994b. An exact test for neutrality based on the Ewens sampling distribution. Genet. Res. 64(1): 71-74.
Slatkin, M. 1995. A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457-462.
Slatkin, M. 1996. A correction to the exact test based on the Ewens sampling distribution. Genet. Res. 68: 259-260.
Slatkin, M. and Excoffier, L. 1996. Testing for linkage disequilibrium in genotypic data using the EM algorithm. Heredity 76: 377-383.
Smouse, P. E. and J. C. Long. 1992. Matrix correlation analysis in Anthropology and Genetics. Y. Phys. Anthop. 35: 187-213.
Smouse, P. E., J. C. Long and R. R. Sokal. 1986. Multiple regression and correla
172. utations between molecular haplotypes, which first need to be evaluated. By defining groups of populations, the user defines a particular genetic structure that will be tested (see the input file notations for more details). A hierarchical analysis of variance partitions the total variance into covariance components due to intra-individual differences, inter-individual differences, and/or inter-population differences. See also Weir (1996) for detailed treatments of hierarchical analyses, and Excoffier (2000) as well as Rousset (2000) for an explanation of why these are covariance components rather than variance components.

The covariance components ($\sigma^2$'s) are used to compute fixation indices, as originally defined by Wright (1951, 1965) in terms of inbreeding coefficients, or later in terms of coalescent times by Slatkin (1991).

Formally, in the haploid case, we assume that the $i$-th haplotype frequency vector from the $j$-th population in the $k$-th group is a linear equation of the form

$$\mathbf{x}_{ijk} = \mathbf{x} + \mathbf{a}_k + \mathbf{b}_{jk} + \mathbf{c}_{ijk}.$$

The vector $\mathbf{x}$ is the unknown expectation of $\mathbf{x}_{ijk}$, averaged over the whole study. The effects are $\mathbf{a}$ for group, $\mathbf{b}$ for population, and $\mathbf{c}$ for haplotypes within a population within a group, assumed to be additive, random, independent, and to have the associated covariance components $\sigma_a^2$, $\sigma_b^2$, and $\sigma_c^2$, respectively. The total molecular variance ($\sigma^2$) is the sum of the covariance component due to dif
173. ution of haplotype frequencies. Note, however, that this procedure is quite computer intensive.

Reference: Excoffier and Slatkin 1995.

7.1.3.2.2 EM zipper algorithm
The EM zipper is a simple extension of the EM algorithm, aiming at speeding up the estimation process and allowing the handling of a much larger number of heterozygous sites per individual. The EM algorithm indeed becomes extremely slow when there are more than 20 heterozygous sites per individual, and it is therefore not suited for the analysis of long stretches of DNA with hundreds of polymorphic sites.

The EM zipper therefore begins by estimating frequencies of two-locus haplotypes, then adds another locus to estimate 3-locus haplotype frequencies, then adds another locus to get 4-locus haplotype frequencies, and so on until all loci have been added. At each stage, any n-locus genotype which incorporates an n-locus haplotype with estimated frequency equal to zero is prevented from being extended to n+1 loci, because it is likely that the frequency of an extended (n+1)-locus haplotype would also have been equal to zero. With this method, Arlequin does not need to build all possible genotypes for each individual; it only considers the genotypes whose sub-haplotypes have non-null frequencies, and one can thus handle a much larger number of polymorphic sites than with the conventional EM algorithm.

In Arlequin's tab dialog (see section 6.3.8.4.2.2), one can specify if the loci shoul
174. ween populations (Reynolds et al. 1983; Slatkin 1995), but also Nei's mean number of pairwise differences within and between pairs of populations. The significance of the genetic distances is tested by permuting the haplotypes or individuals between the populations. See section 7.2.3 for more details on the output results (genetic distances and migration rates estimates between populations).

• Compute pairwise $F_{ST}$ [b]: Computes pairwise $F_{ST}$'s for all pairs of populations.
  • Slatkin's distances [b]: Computes Slatkin's (1995) genetic distance derived from pairwise $F_{ST}$ (see section 7.2.5.2).
  • Reynolds's distance [b]: Computes Reynolds' et al. (1983) linearized $F_{ST}$ for short divergence time (see section 7.2.5.1).
• Compute pairwise differences [b]: Computes Nei's average number of pairwise differences within and between populations (Nei and Li 1979) (see section 7.2.5.4).
• Estimate relative population sizes [b]: Computes relative population sizes for all pairs of populations, as well as divergence times between populations, taking into account these potential differences between population sizes (Gaggiotti and Excoffier 2000) (see section 7.2.5.5).
• No. of permutations [i]: Enter the required number of permutations to test the significance of the derived genetic distances. If this number is set to zero, no testing procedure will be performed. Note that this procedure is quite time
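The permutation procedure mentioned in this excerpt can be illustrated with a small sketch. It is an assumption-laden simplification and not Arlequin's implementation: haplotypes are equal-length strings, the test statistic is the uncorrected mean number of pairwise differences between two populations, and the P-value is the proportion of permutations giving a statistic at least as large as the observed one. All names are hypothetical.

```python
import random


def mean_pairwise_diff(group_a, group_b):
    """Mean number of differing loci over all pairs (one haplotype per group)."""
    total = sum(
        sum(x != y for x, y in zip(a, b))
        for a in group_a
        for b in group_b
    )
    return total / (len(group_a) * len(group_b))


def permutation_p_value(pop1, pop2, n_perm=1000, seed=0):
    """Permute haplotypes between two populations and return (observed, P).

    With n_perm == 0 no test is run and P is None, mirroring the
    'No. of permutations' setting described above.
    """
    observed = mean_pairwise_diff(pop1, pop2)
    if n_perm == 0:
        return observed, None
    rng = random.Random(seed)            # fixed seed for reproducibility
    pooled = list(pop1) + list(pop2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # reassign haplotypes at random
        perm1, perm2 = pooled[:len(pop1)], pooled[len(pop1):]
        if mean_pairwise_diff(perm1, perm2) >= observed:
            hits += 1
    return observed, hits / n_perm


if __name__ == "__main__":
    pop1 = ["AAAA"] * 5                  # hypothetical samples
    pop2 = ["TTTT"] * 5
    print(permutation_p_value(pop1, pop2))
```

The cost grows linearly with the number of permutations and quadratically with sample size, which is consistent with the manual's warning that the procedure is time consuming.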
175. y also refer to one or more external data files.

Note that comments beginning with a "#" character can be put anywhere in the Arlequin project files. Everything that follows the "#" character on a line will be ignored by Arlequin. Also note that Arlequin does not support interleaved data, implying that haplotypes, multi-locus genotypes, as well as entire rows of distance matrices must be entered on a single line. A maximum of 100,000 characters can be entered on each line.

3.2 Project file structure
Input files are structured into two main sections, with additional subsections that must appear in the following order:
1) Profile section (mandatory)
2) Data section (mandatory)
   2a) Haplotype list (optional)
   2b) Distance matrices (optional)
   2c) Samples (mandatory)
   2d) Genetic structure (optional)
   2e) Mantel tests (optional)
We now describe the content of each sub-section in more detail.

3.2.1 Profile section
The properties of the data must be described in this section. The beginning of the profile section is indicated by the keyword [Profile] (within brackets). One must also specify:
• The title of the current project (used to describe the current analysis).
  Notation: Title=
  Possible value: Any string of characters within double quotes.
  Example: Title="An analysis of haplotype frequencies in 2 populations"
• The number of samples or populations present in the current project.
  Notation: NbSamples=
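As a small illustration of the start of a profile section, the sketch below writes only the two entries named in this excerpt. The `Key=value` layout and the helper name `profile_section` are assumptions made for the example; a real project file needs further profile entries and a data section that this excerpt does not show.

```python
MAX_LINE_LENGTH = 100_000  # per-line limit stated in the manual


def profile_section(title, nb_samples):
    """Return the first lines of a profile section as text.

    Only the [Profile] keyword, Title and NbSamples are covered; other
    entries are deliberately left out because they are not described here.
    """
    if '"' in title:
        raise ValueError("title must not contain double quotes")
    if nb_samples < 1:
        raise ValueError("NbSamples must be a positive integer")
    lines = [
        "[Profile]",
        f'Title="{title}"',
        f"NbSamples={nb_samples}",
    ]
    for line in lines:
        if len(line) > MAX_LINE_LENGTH:
            raise ValueError("line exceeds the 100,000 character limit")
    return "\n".join(lines)


if __name__ == "__main__":
    print(profile_section(
        "An analysis of haplotype frequencies in 2 populations", 2))
```

Building the text programmatically makes it easy to enforce the single-line and line-length constraints before Arlequin reads the file.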
