Tassel 5 User Guide

1. [Figure: chart legend, DPOLL HOMESTEAD ID1 and DPOLL CLAYTON ID15]
Scatter plots: Use the graph type combo box to select the desired graph type (XY Plot) from the list of options. Select the data to be plotted on the X and Y axes using the appropriate drop-down boxes. If two data series are plotted simultaneously on the Y axis, the 2 Y Axes checkbox will provide an axis for each.
[Figure: XY plot of DPOLL HOMESTEAD ID1 vs. DPOLL CLAYTON ID15 with fitted regression line]
7.6 QQ Plot
7.7 Manhattan Plot
8 GBS Menu
http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf
9 Help Menu
Help provides information about TASSEL and diagnostics.
9.1 Help Manual
9.2 About
9.3 Show Memory
9.4 Logging
10 Tutorial
This tutorial reviews several common scenarios for using TASSEL in order to help the user better understand its capabilities for data manipulation and association analyses. The TASSEL software package includes a tutorial data set that can be downloaded from the TASSEL website (please unzip all files to a directory of your choice). This tutorial data set contains data for phenotype, genotype, population structure, and kinship.
10.1 Missing Phe
2. [Screenshot: TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) 3.0.39 Data Tree listing mdp_traits, mdp_genotype, mdp_kinship and mdp_population_structure, with a result table of 281 rows and 719922 elements]
Three items will be added to the Data Tree after running PCA. The first contains the PCs, the second the eigenvalues, and the last the eigenvectors. Here we use the Chart function in the Result mode to graph the first three PCs, the individual eigenvalue contributions (sometimes called a scree plot), and
3. [Screenshot: MLM options dialog. Compression: Custom Level, No Compression. Variance Component Estimation: P3D (estimate once), Re-estimate after each marker]
An MLM option dialog will pop up as shown above. Choose the default options, which use P3D and compression at the optimum compression level. After the Run button is clicked, the progress bar will start moving. The time required depends on sample size, number of traits, number of markers, and the options chosen in the MLM option dialog. After the progress bar resets to zero, indicating completion of MLM, three reports will be added to the Data Tree. The first two are similar to the reports created by GLM. The most significant SNP is still the same; however, the strength of association is weaker, with a P-value of 7.199x10^(?) vs. 1.1021x10^(?) from GLM, which does not pass the Bonferroni multiple test threshold (5x10^(?)). The third report contains the MLM-specific statistics, including -2 log likelihood, genetic variance, and residual variance components under different levels of compression. These statistics are illustrated by the Chart function in the Result mode as follows:
[Figure: two XY scatter charts, groups vs. -2LnLk and groups vs. Var_genetic and Var_error]
4. [Screenshot: genotype alignment viewer, sites 870 to 901 (positions 136356797 to 157646430). Number of sequences: 261. Number of sites: 2561. Data type: IUPACNucleotide]
[Screenshot: TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) 3.0.39 Data Tree with Sequence (mdp_genotype), Polymorphisms, and Numerical (mdp_population_structure) nodes]
5. [Screenshot: Data Tree with mdp_genotype, mdp_population_structure, mdp_traits, mdp_kinship, and the Synonymizer result]
Table Title: Taxa Synonym Table. Number of columns: 4. Number of rows: 301. Number of elements: 1204.
Taxa synonyms (Synonym Table): 301 unique matches, 73 unmatched.
[Table: Synonym Table with columns TaxaSyno, TaxaRealN, RefIDNum, MatchScore]
Click OK to save the changes.
[Screenshot: Threshold for synonymizer dialog]
6. This function generates a tree or cladogram data set. TASSEL produces neighbor-joining trees using only simple parsimony substitution models. To retrieve cladogram data, first select genotypic data from the Data Tree and then click Analysis > Cladogram. The resulting tree data and the corresponding matrix will appear as separate data sets on the Data Tree. Results can be plotted using Results > Archaeopteryx Tree.
6.4 Kinship
This function generates a kinship matrix from a genotype. To do so, first highlight SNP data, then click on the Analysis > Kinship submenu. The resulting dialog box provides the option to select scaled IBS or pairwise IBS. Clicking OK generates a kinship matrix.
When a genotype file is selected with pairwise IBS, the calculation starts from a distance matrix in which each element (i, j) is equal to the proportion of the SNPs that differ between taxon i and taxon j. Distance is calculated for each pair of taxa, ignoring any sites that have a missing value for one of the taxa. The distance matrix is converted to a similarity matrix by subtracting all values from 2, then scaling so that the minimum value in the matrix is 0 and the maximum value is 2. Kinship can be derived from a set of random SNP data; a minimum of several hundred SNPs spread over the whole genome is recommended. This ad hoc rescaling method was implemented in an earlier version of TASSEL in order to provide a reasonable estimate
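The pairwise IBS steps described in this excerpt (proportion of differing sites, subtraction from 2, rescaling to the range 0 to 2) can be sketched in a few lines of Python. This is an illustration of the description only, not TASSEL's own code; the function name, the one-character-per-site input layout, and the handling of pairs with no comparable sites are assumptions.

```python
def pairwise_ibs_kinship(genotypes, missing="N"):
    """Sketch of the pairwise IBS rescaling described above.

    genotypes: list of equal-length strings, one allele call per site,
    one string per taxon (hypothetical input layout).
    """
    n = len(genotypes)
    # Step 1: distance = proportion of compared sites that differ,
    # ignoring sites where either taxon has a missing call.
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            compared = differing = 0
            for a, b in zip(genotypes[i], genotypes[j]):
                if a == missing or b == missing:
                    continue
                compared += 1
                if a != b:
                    differing += 1
            # Assumption: pairs with no comparable sites get distance 0.
            d = differing / compared if compared else 0.0
            dist[i][j] = dist[j][i] = d
    # Step 2: similarity = 2 - distance.
    sim = [[2.0 - d for d in row] for row in dist]
    # Step 3: rescale linearly so the minimum is 0 and the maximum is 2.
    lo = min(min(row) for row in sim)
    hi = max(max(row) for row in sim)
    if hi == lo:
        return [[2.0] * n for _ in range(n)]
    return [[2.0 * (s - lo) / (hi - lo) for s in row] for row in sim]


# Three made-up taxa with four sites each.
kinship = pairwise_ibs_kinship(["AACC", "AACG", "GGTT"])
```

The diagonal always ends up at the maximum because a taxon differs from itself at no site, which is why the rescaled maximum of 2 sits on the diagonal.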
7. Data can be sorted by clicking on the column header of interest. A secondary sort can be done by holding down the CTRL key and clicking on a second column. Data can be exported to flat files that are either comma-separated (Comma Separated Values, CSV) or tab-delimited. Both formats can then be imported into a spreadsheet program such as Excel. Tables can also be printed.
7.2 Archaeopteryx Tree
Select a Tree data set to use.
https://sites.google.com/site/cmzmasek/home/software/archaeopteryx
[Screenshot: Archaeopteryx 0.955 beta (2010-01-15) displaying the tree for mdp_genotype]
7.3 2D Plot
Displays 2D plots and determines color thresholds. This function is useful for plotting associations in multiple environments. First select the desired result set. Using the drop-down boxes provided, populate row
8. 3 Data Menu
The Data Menu has options to import and export data sets, as well as other data manipulation functions.
3.1 Load
Load provides options to import files for genotypes, phenotypes, population structure, kinship matrices, etc. The tutorial data can be downloaded from the TASSEL website at this link:
http://www.maizegenetics.net/tassel/docs/TASSELTutorialData3.zip
To use the data, the zip file must be uncompressed and saved on your local machine. These tutorial files will load correctly with the Make Best Guess option. Multiple files can be imported simultaneously by highlighting them first (holding the Shift or Control key while clicking) and then clicking the Open button.
[Screenshot: File Loader dialog, Choose File Type to Load: Load Hapmap, Load HDF5, Load VCF, Load Plink, Load Projection Alignment, Load Phylip, Load FASTA File, Load Trait (data, covariates, or factors), Load Square Numerical Ma]
[Screenshot: Open dialog for the TASSELTutorialData folder listing d8_sequence.phy, diploid_SSR.txt, mdp_genotype.hmp.txt, mdp_genotype.plk.map, mdp_genotype.plk.ped, mdp_kinship.txt, mdp_population_structure.txt, mdp_traits.txt]
9. sets. Then click the menu item Data > Intersect Join to create a combined data set.
Association analysis: Highlight the joint data set, then click the menu item Analysis > GLM to perform association analysis. Two reports will be added to the Data Tree.
[Screenshot: Filter Alignment dialog. Minimum Count (out of 281 sequences); Minimum Frequency: 0.05; Maximum Frequency: 1.0; Position Type: Position index; Start Position: 0; End Position: 3092 (of 3092 sites); Remove minor SNP states; Generate haplotypes via sliding window (Haplotype Length, Step Length); Select Chromosomes; Cancel; Filter]
[Screenshot: Filter Traits dialog with traits of type covariate]
One of the reports added to the Data Tree is labeled GLM Marker Test_ followed by the name of the joint data. In addition to the information for traits and markers, the data set contains the following statistics:
marker_F: F value from the F-test on the marker
marker_p: P-value from the F-test on the marker
markerR2: R-squared for the marker after fitting other model terms (population structure)
markerDF: degrees of freedom of the marker
markerMS: mean square of the marker
errorDF: degrees of freedom of the residual error
errorMS: mean square of the residual error
modelDF: degrees of freedom of the model
modelMS: mean square of the model
[Screenshot: TASSEL main window]
10. [Table: excerpt of the GLM marker test results; marker names are cut off in the screenshot]
Trait  Marker   Position  marker_F  marker_p  markerR2   markerMS  errorMS
dpoll  PHM2244  5562502   1.29978   0.27446   0.00693    28.44728  21.88623
dpoll  PZA0309  8075572   0.09464   0.90973   4.9411E-4  2.08047   21.98287
dpoll  PZA0018  8366368   0.14162   0.86803   7.639E-4   3.12032   22.03351
dpoll  PZA0018  8366411   4.48832   0.01214   0.02238    95.08609  21.18523
dpoll  PZA0052  8367944   0.98318   0.32245   0.00277    21.694    22.06508
Clicking marker_p will sort the table by P-value. The smallest P-value is 3.5963x10^(?). A reasonable significance threshold is 1.9x10^-5, which is 5% after Bonferroni multiple test correction (0.05/2559). The denominator in the Bonferroni correction is the total number of SNPs tested. The association was significant.
The other data set added to the Data Tree is labeled GLM Allele Estimates followed by the name of the joint data. For the most significant SNP (highlighted in the figure below), there were two genotypes, AA and GG. There are 220 lines with genotype AA and 41 lines with genotype GG. For the trait dpoll (days to pollination), the difference between the two homozygotes was 3.86 days.
[Screenshot: TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) 5.0.2, GLM allele estimates table with columns Trait, Marker, Obs, Locus_pos, Allele, Estimate]
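The Bonferroni arithmetic quoted in this excerpt (0.05 divided by 2559 tested SNPs) can be reproduced with a short Python sketch. The function names and the marker P-values in the dictionary are hypothetical; only the alpha level and the number of tests come from the tutorial text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_markers(p_values, threshold):
    """Return marker names with p below the threshold, smallest p first."""
    hits = [name for name, p in p_values.items() if p < threshold]
    return sorted(hits, key=p_values.get)


# Alpha and SNP count as given in the tutorial text.
threshold = bonferroni_threshold(0.05, 2559)

# Made-up P-values, for illustration only.
example_p = {"markerA": 1e-9, "markerB": 0.27446, "markerC": 1.0e-5}
hits = significant_markers(example_p, threshold)
```

The quotient is about 1.95x10^-5, which the text reports as 1.9x10^-5.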
11. [Screenshot: genotype table view; class net.maizegenetics.dna.snp.CoreGenotypeTable]
1.1 Executing TASSEL
http://www.maizegenetics.net/tassel/docs/ExecutingTassel.pdf
1.2 Open Source Code
Open source code for TASSEL is available at https://bitbucket.org/tasseladmin/tassel-5-source. The package uses a number of other libraries that are included in the TASSEL distribution. These include a modified version of the PAL library (http://www.cebl.auckland.ac.nz/pal-project/), the COLT library (http://dsd.lbl.gov/~hoschek/colt/), jFreeChart (http://www.jfree.org/jfreechart/), Guava: Google Core Libraries (https://code.google.com/p/guava-libraries/), JUnit (http://junit.org/), Archaeopteryx (https://sites.google.com/site/cmzmasek/home/software/archaeopteryx), and BioJava http://www
12. [Screenshot: Filter Traits dialog with buttons Exclude Selected, Include Selected, Exclude All, Include All, Change Selected Type to Data, Change Selected Type to Covariate, Change Selected Type to Marker, OK, Cancel]
6 Analysis Menu
6.1 Diversity
This executes a basic diversity analysis. Average pairwise divergence (π), segregating sites, and estimates of 4Nμ can be calculated, as well as sliding windows of diversity. To run a diversity analysis, click on a raw sequence alignment and then select Analysis > Diversity.
[Screenshot: Diversity Surveys dialog, Start Base: 0, End Base: 2465]
In the resulting Diversity Surveys dialog box, the various site classes available for analysis are listed on the left. If the sequence has no annotation, then only the Overall and Indels options will be active. A sliding window of diversity can also be calculated across the region. To produce a sliding window, check the box next to Sliding Window and then enter the desired step size and the size of the sliding window. Results can be plotted using Results > Chart or viewed in a table via Results > Table.
6.2 Linkage Disequilibrium
This generates a linkage disequilibrium data set from SNP data.
NOTE: It is important to use only filtered data sets (apply Filter > Sites first) when estimating linkage disequilibrium, as a raw alignment with nume
13. [Figure: LD plot; upper triangle: R squared; lower triangle: P-value; controls: Lock Y Axis to X, Schematic, Save, Save All]
LD plots can be saved in several formats. The Save button saves the area of the graph shown on the screen, while the Save All button saves the entire graph.
7.5 Chart
Chart provides a variety of graphs for visualizing numeric data. This feature can be used to display histograms, XY plots, bar charts, and/or pie charts. Any numeric table data can be charted, including LD results, phenotypic data, diversity results, and association results.
Histograms: Use the graph type combo box to select the desired graph type (Histogram) from the list of options. Up to two different series of data can be plotted together. Users may specify the number of bins to be used in the histogram.
[Figure: histogram; Series 1: DPOLL HOMESTEAD ID1; Series 2: DPOLL CLAYTON ID15; Bins: 30; title: DPOLL HOMESTEAD ID1 & DPOLL CLAYTON ID15 Distribution]
14. Permutation tests for multi-factorial analysis of variance. Journal of Statistical Computation and Simulation 73: 85-113 (2003).
15. [Table: Hapmap example rows for markers PZD00081.2 (C/T, chromosome 1, position 4835542), zagl1.1 (A/C, position 4912526), PZB00919.1 (A/C, position 5353319), and PZB00919.2 (G/T, position 5353655); assembly AGPv1, center Panzea, panel maize282]
3.1.2 HDF5
Hierarchical Data Format, version 5: http://www.hdfgroup.org/HDF5/
3.1.3 VCF
Variant Call Format: http://www.1000genomes.org/wiki/analysis/variant%20call%20format/vcf-variant-call-format-version-42
3.1.4 Plink
Plink is a whole-genome association analysis tool set, which comes with its own text-based data format. The data is stored in a set of two files, a map file and a ped file. The ped file contains all the SNP values and has six mandatory header columns: Family ID, Individual ID, Paternal ID, Maternal ID, Sex, and Phenotype. TASSEL only requires that the Individual ID field be filled in. Each row of the ped file describes a single germplasm line. Note that in Plink an unknown character is represented with a 0. In TASSEL, however, an unknown character is represented with an N, and 0 is used to represent a heterozygous indel. TASSEL will automatically convert between the 0 and the N. Any exported Plink files will represent the heterozygous indel with an insertion and a deletion. The map file describes all the SNPs in the associated ped file, where each row provides information on one SNP. The
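The ped layout described in this excerpt (six leading columns followed by allele calls, with Plink's 0 standing for an unknown allele) can be illustrated with a small Python parser. This is a sketch under stated assumptions, not TASSEL's importer: the function name, the field keys, the pairing of alleles into two-character calls, and the sample line are all hypothetical.

```python
PED_FIELDS = (
    "family_id", "individual_id", "paternal_id",
    "maternal_id", "sex", "phenotype",
)


def parse_ped_line(line):
    """Split one ped row into its six header fields and genotype calls."""
    tokens = line.split()  # accepts tabs or other whitespace
    if len(tokens) < len(PED_FIELDS):
        raise ValueError("a ped row needs six leading columns")
    record = dict(zip(PED_FIELDS, tokens[:6]))
    alleles = tokens[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("allele columns must come in pairs")
    calls = []
    for k in range(0, len(alleles), 2):
        # Plink writes an unknown allele as 0; use N for it here,
        # as the text says TASSEL does on import.
        pair = ["N" if a == "0" else a for a in alleles[k:k + 2]]
        calls.append("".join(pair))
    record["genotypes"] = calls
    return record


# Made-up row: only the Individual ID carries real information.
row = parse_ped_line("FAM1\tB73\t0\t0\t0\t-9\tA\tA\t0\t0\tC\tT")
```

Only the allele columns are converted; a 0 in the Paternal ID or Maternal ID column is left untouched.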
16. algorithms exist to calculate kinship, and their estimates will differ from one another. Secondly, the algorithm in TASSEL treats each genotype as a haplotype. It is not recommended that TASSEL be used to generate a kinship matrix from heterozygous genotypes. In the near future, the TASSEL kinship algorithm will be modified to handle heterozygous diploids.
Can I get marker R-square using SAS Proc Mixed or TASSEL MLM?
SAS Proc Mixed does not produce an R-squared statistic. MLM in TASSEL does. The user manual describes how it is calculated.
Does MLM find more associations than GLM?
Sometimes. MLM has higher statistical power than GLM and may detect more true associations. When the tested genetic markers are confounded with kinship structure, GLM does not correct for that as effectively as MLM and may produce more false positives.
Do I need multiple test correction for the p-value from TASSEL?
Yes.
Can TASSEL handle diploid genotype data?
While TASSEL accepts most common sequence alignment formats, which handle polyploid genotype data (including haploid and diploid), some analyses are not appropriate for heterozygous data. GLM and MLM fit SNPs one at a time, treating each distinct genotype as a separate class. This has the effect of fitting an additive plus dominance model. Separating the two effects is under consideration. Because handling heterozygotes as a third marker class is not appropriate for kinship or LD, those analyses should not be used for that type o
17. biojava.org
1.3 Software Development Tools
jProfiler (http://www.ej-technologies.com/products/jprofiler/overview.html), install4j (http://www.ej-technologies.com/products/install4j/overview.html), NetBeans IDE (https://netbeans.org/), Eclipse (http://www.eclipse.org/), IntelliJ (http://www.jetbrains.com/idea/), Structure101 (http://structure101.com/), TeamViewer (http://www.teamviewer.com/), Bitbucket (https://bitbucket.org/), SourceForge (http://sourceforge.net/), JIRA (https://www.atlassian.com/software/jira), Tower (http://www.git-tower.com/)
1.4 Graphical Interface
TASSEL is organized into five main panels:
1) At the top, menus control functions.
2) The Data Tree at the top left organizes data sets and results. Data sets displayed in the Data Tree must first be selected before a desired function or analysis can be performed. To select multiple data sets, press the CTRL key (Command on Mac) while selecting the data sets.
3) The Report Panel is located below the Data Tree. It displays information about a selected data set from the Data Tree, such as the type of data and how it was created.
4) The Progress Monitoring Panel, below the Report Panel, shows the progress of running tasks and has buttons that can cancel tasks.
5) The Main Panel occupies the right side of the viewing area and displays the content of the selected data set from the Data Tree.
1.5 Pipeline (Command Line Interface)
http://www.maizegenetics.net/tassel/docs/TasselPipelineC
18. by starting the row with <Header name=xxx> followed by a name for each column of data. For instance, to define environments, start the second header row with <Header name=env>. Comment lines may be inserted at the beginning of the file. A comment line begins with the character #.
3.1.8.1 Trait format
This format does not require users to provide information on the number of rows and columns. The file starts with the keyword <Trait> followed by the names of the columns. The column for line should not be labeled.
Example 1: simple list of trait values
<Trait>  EarHT  dpoll  EarDia
811      59.5   NA     NA
33-16    64.75  64.5   NA
[remaining data rows not legible]
Example 2: trait data collected in multiple environments
<Trait>            EarHT  PlantHT  EarHT  PlantHT
<Header name=env>  Loc1   Loc1     Loc2   Loc2
811                59.5   NA       NA     NA
[remaining data rows not legible]
3.1.8.2 Covariate Format
Covariate data uses the same format as trait data, except that the first line must be <Covariate>. This line tells TASSEL that the variables in this file will be used as covariates, not as dependent variables. This is the format to use for population structure covariates.
<Covariate>
<Trait>  Q1  Q2  Q3
[data rows not legible]
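The header conventions in this excerpt (a <Trait> row, optional <Header name=...> rows, an optional leading <Covariate> line, and text values read as missing) can be sketched as a small Python reader. It is an illustration of the described layout, not TASSEL's loader; the function names, the returned dictionary shape, and the use of # as the comment character are assumptions.

```python
import re

# Matches rows such as "<Header name=env>  Loc1  Loc1  Loc2  Loc2".
HEADER_ROW = re.compile(r"<Header\s+name\s*=\s*([^>\s]+)\s*>(.*)", re.IGNORECASE)


def to_value(token):
    """Convert a token to float; NA, NaN, or any other text becomes None."""
    try:
        value = float(token)
    except ValueError:
        return None
    return None if value != value else value  # NaN is not equal to itself


def parse_numeric_text(text, comment_char="#"):
    """Read trait or covariate text into a plain dictionary."""
    parsed = {"kind": None, "traits": [], "headers": {}, "rows": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_char):
            continue
        lowered = line.lower()
        header = HEADER_ROW.match(line)
        if lowered.startswith("<covariate>"):
            parsed["kind"] = "covariate"
        elif lowered.startswith("<trait>"):
            parsed["kind"] = parsed["kind"] or "trait"
            parsed["traits"] = line[len("<trait>"):].split()
        elif header:
            parsed["headers"][header.group(1)] = header.group(2).split()
        else:
            # Any whitespace is a delimiter, so names must not contain blanks.
            taxon, *values = line.split()
            parsed["rows"][taxon] = [to_value(v) for v in values]
    return parsed


example = "\n".join([
    "<Trait>\tEarHT\tPlantHT\tEarHT\tPlantHT",
    "<Header name=env>\tLoc1\tLoc1\tLoc2\tLoc2",
    "811\t59.5\tNA\tNA\tNA",
])
table = parse_numeric_text(example)
```

Because the reader splits on any whitespace, it also shows why embedded blanks in taxon or trait names would shift the columns.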
19. dependent on the trait and population being analyzed, this compressed MLM has improved statistical power compared to the regular MLM. The optimum grouping, with the best model fit for MLM without fitting genetic markers, has the best statistical power for an association test of markers. TASSEL allows users to specify the compression level (average number of individuals per group) or to have the program determine the optimum grouping.
Similar to GLM, MLM performs an association test for each combination of traits and markers. TASSEL provides users several options: 1) to estimate genetic and residual variance for each combination; 2) to get these estimates once for each trait without fitting genetic markers and then to use those estimates to test markers; 3) to use a prior heritability estimate provided by the user. The second option, named P3D (population parameters previously determined), has the same statistical power as the first option. Using the P3D method or a prior heritability can be much faster than calculating heritability for each marker.
Using MLM is very similar to using GLM. The difference is that, in addition to choosing the joint data set or numerical data set, kinship data must also be highlighted before clicking the MLM button to show the MLM option dialog. The No Compression option is the regular MLM, which is equivalent to Custom Level 1. For data sets with large numbers of taxa, the optimal compression optio
20. genotype, filtering markers on minor allele frequency, generating principal components and a kinship matrix to represent population structure and cryptic relationships, optimizing compression level, and performing GWAS.
The command line version of TASSEL, called the Pipeline, provides users the ability to program tasks using a script instead of the graphical user interface (GUI). This feature allows researchers to define tasks using a few lines of code and provides the ability to use TASSEL as part of an analysis pipeline or to perform simulation studies.
We are also building a larger community of scientist developers who are adding functionality to this platform and working together to improve the system. So throughout this user manual you will see how to do most things three different ways: with the GUI, with the pipeline, and with the API (application programming interface).
TASSEL is written in Java, thereby enabling its use with virtually any operating system. It can be installed using Java Web Start technology by simply clicking on a link at www.maizegenetics.net/tassel. A stand-alone version of TASSEL can also be downloaded to use in pipeline mode or in any situation where the user wishes to start the software from a command line.
Getting Started
A quick way to get started using TASSEL is to load the tutorial data and try performing analyses. However, because some of the necessary steps may not be intuitive, we recommend that new users fo
21. >100,000-fold.
• Mixed models based on DNA relationships have come to dominate GWP (Meuwissen et al. 2001) and GWAS (Yu et al. 2006), yet these models can be slow to solve. TASSEL has been a test bed and implements some of the best optimizations, such as EMMA (Kang et al. 2008), plus approaches that optimize variance components once: P3D (Zhang et al. 2010) and EMMAX (Kang et al. 2010). Compression algorithms are also available (Zhang et al. 2010). When used correctly, these optimizations make powerful GWAS computationally possible.
• The code is being continually optimized for larger numbers of cores and clusters. For example, we generally run imputation on 64-core machines. And while Java provides excellent interoperability between systems, its code is about 2-fold slower than optimized C libraries and 10-fold slower than GPU processing for some problems. TASSEL 5 is building out connection layers directly to native code when these efficiencies are needed.
TASSEL was designed for a wide range of users, including those not expert in statistical genetics or computer science. A GWAS using the mixed linear model method to incorporate information about population structure and cryptic relationships can be performed in a few steps by clicking on the proper choices using a graphic interface. All the processes necessary for the analysis are performed automatically, including importing phenotypic and genotype data, imputing missing data, phenotype or
22. inbred lines (mdp_population_structure.txt). The last one is phenotypes for three traits for 282 maize inbred lines (mdp_traits.txt). The statistical model is:
Flowering time = Population structure + Marker effect + residual
Remove monomorphic and low coverage sites: Highlight mdp_genotype and click Filter > Sites on the menu bar. Set Minimum Frequency to 0.05, Maximum Frequency to 1.0, and Minimum Count to 150. Click Filter to create a filtered genotype data set.
Trait selection: Highlight the phenotype and click the menu item Filter > Traits. Uncheck all the traits except flowering time (DPOLL). Make sure that the Type is set to Data. Click OK to create a filtered phenotype.
Covariate selection: The population structure is presented as the proportion of each population. There are three populations, represented as Q1, Q2, and Q3. They sum to 100%. This creates a linear dependency if we use all of them as covariates. While GLM can handle that properly, it will cause MLM to complain and refuse to complete your analysis. We can eliminate the dependency by removing one of the Q variables. In this demonstration we exclude the last one. Highlight mdp_population_structure and click Filter > Traits. Uncheck the last population (Q3). Make sure that the Type is set to Covariate. Then click OK to create a filtered population structure data set.
Joining data: Highlight the three filtered data sets by holding the Control key while selecting the individual data
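The site filter settings used in this step (Minimum Count 150, Minimum Frequency 0.05, Maximum Frequency 1.0) can be pictured with the following Python sketch. It rests on assumptions that the excerpt does not spell out: that Minimum Count refers to the number of non-missing calls at a site, that the frequency bounds apply to the second most common allele, and that each taxon contributes one single-character call. The function name is hypothetical and this is not TASSEL's filter code.

```python
from collections import Counter


def keep_site(calls, min_count=150, min_freq=0.05, max_freq=1.0, missing="N"):
    """Decide whether one site passes the count and frequency filters.

    calls: one single-character allele call per taxon (assumed layout).
    """
    observed = [c for c in calls if c != missing]
    # Low coverage: too few taxa with a usable call at this site.
    if not observed or len(observed) < min_count:
        return False
    ranked = Counter(observed).most_common()
    # Monomorphic sites have no second allele, so their minor frequency is 0.
    minor_count = ranked[1][1] if len(ranked) > 1 else 0
    minor_freq = minor_count / len(observed)
    return min_freq <= minor_freq <= max_freq


# Made-up sites for illustration.
polymorphic = ["A"] * 200 + ["G"] * 20
monomorphic = ["A"] * 220
low_coverage = ["A"] * 100 + ["G"] * 40 + ["N"] * 141
rare_allele = ["A"] * 215 + ["G"] * 5
```

With these defaults, a monomorphic site is dropped by the frequency bound and a sparsely genotyped site by the count bound, matching the two goals named in the step.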
23. map file must contain exactly four columns: Chromosome, rs#, Genetic distance, and Position. TASSEL does not require the Genetic distance field to be filled in. Both files should be tab-delimited. For a more detailed description of the data format, please visit the Plink basic usage and data formats webpage: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml
3.1.5 Projection Alignment
3.1.6 Phylip
Details on the Phylip format are described at the following website: http://evolution.genetics.washington.edu/phylip/doc/sequence.html
3.1.7 FASTA
3.1.8 Numerical Data
This type of format is used for trait and covariate data, such as population structure. Similar to sequence alignment genotype data, numerical data also consists of two parts: a header that defines the data structure and a body containing the main data. Tabs should be used as delimiters. However, any white space character, such as a blank, will be treated as a delimiter as well. As a result, embedded blanks in names will cause data to be imported incorrectly. We suggest representing missing values using NA or NaN. However, any text value will be interpreted as missing data.
There are several formats for numerical data to fit the requirements for modeling. Trait data (dependent variables) can be imported by starting the first line with <Trait> and following that with the trait names. Additional classifiers may also be included in subsequent header rows
24. model to test for association between segregating sites and phenotypes. The analysis optionally accounts for population structure using covariates that indicate degree of membership in underlying populations. A main-effects-only model is automatically built using all variables in the input data. A separate model is built and solved for each trait and marker combination. Any factors, covariates, reps, or locations are included in every model as main effects. How the data is used must be defined either in the input data files or using the Trait Filter after the data has been imported but before it has been joined with a genotype.
General Linear Model (GLM) can be run using a numeric data set only or using numeric data joined to genotype data. If only numeric data is selected, best linear unbiased estimates (BLUEs), or least squares means, will be generated for the taxa for each trait. Note: only factors and covariates intended to control field variation should be included at this stage. Population structure covariates, which are intended to control for marker effects, should only be included when markers are also in the analysis. If numeric data with genotypes are analyzed, each trait-by-marker combination will be tested and two reports will be produced, one containing trait-by-marker F-tests and the other containing allele estimates.
To run GLM, select a data set and then click the GLM button. A dialog box will pop up to allow the user to indicate that a permuta
25. of additive genetic variance, but tends to overestimate that value. Rescaling does not affect its use for correcting for population structure. It only affects the estimate of additive genetic variance and, consequently, heritability. To provide a better estimate of additive genetic variance, an alternative method can be used by selecting scaled IBS. This method, from Endelman and Jannink (2012), codes genotypes as 2, 1, or 0, equal to the count of one of the alleles at that locus. It then replaces missing genotype values with the average genotypic score at that locus before estimating a relationship matrix. Other methods of imputing genotypes prior to calculating Kinship may provide a better result. For instance, rather than using this default treatment of missing values, using the numerical genotype method followed by imputation (described in section 3.3) before running Kinship is a reasonable alternative. When using numerical genotypes, Kinship always applies the scaled IBS method.
Users may also load their own kinship data using Data > Load. Kinship matrices can be calculated using the SPAGeDi software package (http://www.ulb.ac.be/sciences/ecoevol/spagedi.html). Comparisons of methods for calculating kinship can be found in the literature, e.g. Stich et al. (2008).
6.5 GLM (General Linear Model)
This function performs association analysis using a least squares fixed effects linear model. TASSEL utilizes a fixed effects linear
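The first two steps of the scaled IBS description (coding each genotype as the count of one allele, then replacing missing values with the site average) can be sketched in Python as below. The sketch stops before the relationship matrix itself, which the excerpt does not detail. The function names, the two-character diploid call layout, and the choice of counted allele are assumptions for illustration.

```python
def code_site(calls, counted_allele, missing="N"):
    """Code diploid calls at one site as 0, 1, or 2 copies of counted_allele.

    calls: two-character genotype strings such as "AA" or "AG" (assumed layout).
    Missing calls are returned as None.
    """
    coded = []
    for call in calls:
        if missing in call:
            coded.append(None)
        else:
            coded.append(call.count(counted_allele))
    return coded


def impute_site_mean(coded):
    """Replace missing scores with the average observed score at the site."""
    observed = [v for v in coded if v is not None]
    # Assumption: a site with no observed calls is filled with 0.0.
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else float(v) for v in coded]


# Made-up site with one missing genotype.
site_calls = ["AA", "AG", "GG", "NN"]
coded = code_site(site_calls, counted_allele="A")
imputed = impute_site_mean(coded)
```

Mean imputation keeps a missing taxon at the site average, so it contributes nothing to that site's deviation from the mean.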
26. site and itself. The default setting graphs r² in the upper right and p-values in the lower left. This default can be modified by clicking on the buttons in the lower left. The left side of the graph contains a text description with the Chromosome and the Site name. At the bottom of the graph is a display of the position of each site along the chromosome. This display can be hidden by deselecting the Schematic check box. Legends that describe the color scheme appear on the right-hand side of the graph. The number of sites displayed can be selected by entering a number in the white box in the upper right corner or by moving the sliding bar next to it. To move through the graph, use the sliding bars on the right and bottom. The red box in the small white window in the upper left corner will show what portion of the graph is displayed. To move only around the diagonal, select the Lock Y Axis to X check box (recommended when visualizing an LD by sliding window analysis). [Screenshot: LD plot]
27. which family. Also, input genotypes must contain data for only a single chromosome. If the genotype file contains multiple chromosomes, the chromosomes can be separated using the TASSEL separate command. Pedigree File Format: The only file format specific to FSFHap is the pedigree file. The input genotypes can be in any of the formats accepted by TASSEL. The taxa names must exactly match names in the genotype data. If the genotype data contains taxa not included in the pedigree file, only individuals listed in the pedigree file will be analyzed. The pedigree file must contain the names of the individual taxa to be analyzed, the family to which each belongs, the parents, the parent contributions, and the average inbreeding coefficient. The first row in the file must be column headers. The values in the columns should be tab-delimited and are expected to be in the following order: family, taxon, parent1, parent2, parent1Contribution, parent2Contribution, F. The F value is not required, but all other columns are. Example:
family  taxonName  parent1  parent2  contribution1  contribution2  F
fam1    t0001      par1     par2     0.5            0.5            0.92
fam1    t0002      par1     par2     0.5            0.5            0.92
fam2    t0201      par1     par3     0.5            0.5            0.92
fam2    t0202      par1     par3     0.5            0.5            0.92
fam2    t0203      par1     par3     0.5            0.5            0.92
The values for contribution1, contribution2, and F are family means. Those values are read from the first line for a family only and then applied to the entire family. Using the command line for FSFH
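The pedigree file layout described on this page (a header row followed by tab-delimited columns in a fixed order) can be sketched in Python as below. This is a minimal illustration, not part of TASSEL: the function names `write_pedigree` and `taxa_not_in_genotypes` are hypothetical, and the header labels simply reuse the column order given in the text.

```python
import csv
import io

# Column order as listed in the text; F is optional, the rest are required.
COLUMNS = ["family", "taxon", "parent1", "parent2",
           "parent1Contribution", "parent2Contribution", "F"]


def write_pedigree(rows, handle):
    """Write pedigree rows (dicts keyed by COLUMNS) as tab-delimited text.

    The first line written is the header row, as the format requires.
    A missing F value is written as an empty field.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        for required in COLUMNS[:-1]:
            if required not in row:
                raise ValueError(f"missing required column: {required}")
        writer.writerow([row.get(col, "") for col in COLUMNS])


def taxa_not_in_genotypes(rows, genotype_taxa):
    """Return pedigree taxa whose names do not exactly match a genotype taxon."""
    known = set(genotype_taxa)
    return [row["taxon"] for row in rows if row["taxon"] not in known]


# Illustrative rows only; values are not taken from a real data set.
example_rows = [
    {"family": "fam1", "taxon": "t0001", "parent1": "par1", "parent2": "par2",
     "parent1Contribution": 0.5, "parent2Contribution": 0.5, "F": 0.92},
    {"family": "fam1", "taxon": "t0002", "parent1": "par1", "parent2": "par2",
     "parent1Contribution": 0.5, "parent2Contribution": 0.5},
]

buffer = io.StringIO()
write_pedigree(example_rows, buffer)
pedigree_text = buffer.getvalue()
```

Checking names with `taxa_not_in_genotypes` before running the imputation is a cheap way to catch the exact-match requirement on taxa names.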
28. 3.1.8.3 Marker Values as Numerical Co-variates: In some cases, a user may wish to have marker values treated as numerical co-variates. If the first line of the file is <Numeric>, then the data will be imported as numeric data but used as marker data in GLM and MLM. [Example file garbled in extraction: a Numeric header line, a Marker line listing m1 m2 m3 m4 m5, then one row of numeric values per taxon.] Square Numerical Matrix: Kinship can be calculated externally from pedigrees by using SAS Proc Inbreeding, or from markers by using one of several available software packages. The following format is provided to import the resulting kinship estimates. If n represents the number of taxa, the format for kinship files is as follows:
n
Taxa1Name  r11  r12  ...  r1n
Taxa2Name  r21  r22  ...  r2n
...
TaxanName  rn1  rn2  ...  rnn
Here rij (i, j = 1, 2, ..., n) is the element in the kinship matrix located at row i and column j. Missing values are not allowed for the kinship matrix. Important note: The current format is different from the format used in TASSEL version 2.0 or lower. 3.1.10 Table Report: Data can be imported as tab-delimited text files. The first row of the file will be interpreted as column labels and the remaining rows as rows in the table. 3.1.11 TOPM (Tags on Physical Map) 3.2 Export: Options are provided to export sequence data (Hapmap, Plink, Phylip Sequential or Interleaved). Phenotypes and covariate data i
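For the square kinship matrix format described on this page (a first line holding n, then one row per taxon with its name and n values), a small writer might look like the following sketch. The function name `write_kinship` is hypothetical, and the tab separator is an assumption, since the text does not state the delimiter.

```python
import io
import math


def write_kinship(names, matrix, handle, sep="\t"):
    """Write an n x n kinship matrix: first line n, then name plus n values per row.

    sep is an assumption; the format description does not name a delimiter.
    """
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("kinship matrix must be n x n, one row per taxon")
    for row in matrix:
        for value in row:
            # The format does not allow missing values.
            if value is None or math.isnan(value):
                raise ValueError("missing values are not allowed in a kinship matrix")
    handle.write(f"{n}\n")
    for name, row in zip(names, matrix):
        handle.write(sep.join([name] + [str(value) for value in row]) + "\n")


# Illustrative values only.
buffer = io.StringIO()
write_kinship(["TaxonA", "TaxonB"], [[2.0, 0.1], [0.1, 2.0]], buffer)
kinship_text = buffer.getvalue()
```

Validating shape and missing values before writing keeps malformed files from reaching the import step.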
29. Number of Taxa: number of taxa in data set
Number of Sites: number of sites in data set
Sites x Taxa: number of sites multiplied by number of taxa
Number Not Missing: number of allele values not unknown (NN)
Proportion Not Missing: Number Not Missing / Sites x Taxa
Number Missing: number of unknown (NN) values
Proportion Missing: Number Missing / Sites x Taxa
Number Gametes: Number of Sites x Number of Taxa x 2
Gametes Not Missing: number of gametes not unknown
Proportion Gametes Not Missing: Gametes Not Missing / Number Gametes
Gametes Missing: number of unknown (N) gametes
Proportion Gametes Missing: Gametes Missing / Number Gametes
Number Heterozygous: number of heterozygous values
Proportion Heterozygous: Number Heterozygous / Sites x Taxa
[Screenshot: Genotype Summary for mdp_genotype, Allele Summary table with columns Alleles, Number, Proportion, Frequency]
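The overall summary quantities defined above are simple counts and ratios. The sketch below computes them for a small genotype table held as a list of rows (one per taxon) of two-character diploid calls. It is an illustration only: the function name `overall_summary` is hypothetical, and treating a call as heterozygous only when both gametes are known and differ is an assumption, not a statement of how TASSEL counts.

```python
def overall_summary(genotypes):
    """Compute overall summary counts from rows of two-character diploid calls.

    "N" marks an unknown gamete, so "NN" is a fully unknown call.
    """
    taxa = len(genotypes)
    sites = len(genotypes[0]) if taxa else 0
    calls = [call for row in genotypes for call in row]
    total = sites * taxa
    gametes = total * 2

    missing = sum(1 for call in calls if call == "NN")
    gametes_missing = sum(call.count("N") for call in calls)
    # Assumption: heterozygous means both gametes known and different.
    heterozygous = sum(1 for call in calls if "N" not in call and call[0] != call[1])

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "Number of Taxa": taxa,
        "Number of Sites": sites,
        "Sites x Taxa": total,
        "Number Not Missing": total - missing,
        "Proportion Not Missing": ratio(total - missing, total),
        "Number Missing": missing,
        "Proportion Missing": ratio(missing, total),
        "Number Gametes": gametes,
        "Gametes Not Missing": gametes - gametes_missing,
        "Proportion Gametes Not Missing": ratio(gametes - gametes_missing, gametes),
        "Gametes Missing": gametes_missing,
        "Proportion Gametes Missing": ratio(gametes_missing, gametes),
        "Number Heterozygous": heterozygous,
        "Proportion Heterozygous": ratio(heterozygous, total),
    }


# Two taxa by three sites, illustrative calls only.
summary = overall_summary([["AA", "AG", "NN"], ["AA", "NN", "GG"]])
```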
30. [Screenshot: GLM marker test results for trait dpoll, viewed as a table with columns including Marker, Locus, Locus pos, marker F, marker p, markerR2, markerDF, markerMS, errorDF, and errorMS. Table Title: Marker Test; Number of columns: 13; Number of rows: 2559; Number of elements: 33267.]
31. [Screenshot: filter dialog listing site names, with Capture Selected, Capture Unselected, Remove, and Cancel buttons] 5.3 Taxa Names: First, select the genotypic, phenotypic, or population structure data from the data tree. The resulting dialog displays the taxa associated with the selected data. By using either the CTRL or SHIFT key in conjunction with the mouse, the user can select or deselect taxa. Once desired taxa have been moved to the Selected window using the Add > button, the Capture Selected or Capture Unselected buttons will create a new data set containing only the desired taxa. Using the search box:
- [symbol lost in extraction] is the wildcard
- The wildcard is always implied at the end of the search string
- The search string is case sensitive. For example, use [Aa]bc to match taxa beginning with Abc or abc
- A[56] will match anything starting with A5 or A6
[Screenshot: Taxa Filter dialog with Available and Selected lists and Capture Selected, Capture Unselected, Remove, and Cancel buttons] 5.4 Taxa [Screenshot: Filter Taxa by Properties dialog with Min Proportion of Sites Present, Min Heterozygous Proportion (0.0), and Max Heterozygous Proportion (1.0)] 5.5 Traits: Clicking the Traits button on the Data toolbar launches the Trait Filter dialog. This dialog is used with numerical data sets to (1) change the trait type, (2) view, but not change, whether the trait is discrete or
32. [Figure: two plots against number of groups, one of -2LnLk and one of Var genetic and Var error] In the example, 79 are included in the final analysis. When they are clustered into 44 groups, the -2 Log Likelihood reaches a minimum, which indicates the best model fit. The screening of SNPs was performed at this optimum compression level. Note: When two or more individuals are clustered into one group, the variance component for the random effect is not equivalent to the one without compression. Consequently, the heritability derived should not be interpreted as the individual-based heritability. To perform a Genome Wide Association Study (GWAS) on the 3093 SNPs, we need to create a new joint data set containing the filtered phenotype, population structure, and the genome-wide genotype. Highlight the new joint file and the kinship data and click the MLM button. Choose the default options on the MLM option dialog. The analysis will take a minute or two. The output report labeled MLM compression indicates that 259 lines were used in the analysis. With 74 groups, the statistics from the best are as graphed below. [Figure: chart of groups vs -2LnLk and groups vs Var genetic and Var error]
33. [Screenshot: GLM allele estimates table for trait dpoll on chromosome 8] 10.5 Association analysis using MLM: Running MLM in TASSEL is similar to running GLM. The difference is that, in addition to the joint data or numerical data, MLM requires kinship data to define the relationship between individuals. The kinship matrix times a parameter equals the covariance matrix between individuals. Here we use the kinship file from the tutorial data set to fit the following statistical model: Flowering time = Population structure + Marker effect + Individuals + residual. Individuals and the residual are fit as random effects. The other terms are treated as fixed effects. With respect to the marker effect we wil
34. LI pdf 1.6 GBS Pipeline: http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf 2 File Menu: The data tree can be saved in a binary format. 2.1.1 Save Data Tree: This feature allows you to save the entire contents of the Data Tree panel to a default location. This is helpful when the user does not wish to recreate a Tree panel that is already well populated with information the next [remainder of sentence garbled in extraction]. NOTE: The information outlined above for saving a Data Tree is applicable to files that are, in general, version specific. When a new version of TASSEL is released, a data tree saved with a previous version might not load in the new version. For longer-term storage, the best practice is to save individual data sets rather than the entire data tree. 2.1.5 Set Preferences: Currently there is only one preference: whether to retain rare alleles. This is irrelevant for nucleotide data (A, C, G, T, N) because, at that number of states, no data is lost. Other types of data could potentially exceed the maximum of 14 allele states per site. If you Retain Rare Alleles, the lower-frequency allele values will be consolidated into a rare (Z) state. Otherwise, those lower-frequency alleles are changed to unknown (N). [Screenshot: Preferences dialog, Alignment Preferences, with a Retain Rare Alleles check box]
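The effect of the Retain Rare Alleles preference can be illustrated with a short Python sketch. This is not TASSEL's implementation: the function name `recode_rare_alleles` is hypothetical, and the rule used here (keep the `max_states` most frequent states at a site and recode the rest) is an assumption about how the limit is applied.

```python
from collections import Counter


def recode_rare_alleles(alleles, max_states=14, retain_rare=True):
    """Recode allele states at one site that exceed the per-site state limit.

    Assumption: the max_states most frequent states are kept as they are.
    Remaining lower-frequency states become "Z" when retain_rare is True,
    otherwise they become unknown ("N"). Existing "N" values are untouched.
    """
    counts = Counter(allele for allele in alleles if allele != "N")
    kept = {allele for allele, _ in counts.most_common(max_states)}
    replacement = "Z" if retain_rare else "N"
    return [allele if allele == "N" or allele in kept else replacement
            for allele in alleles]


# Illustrative site with three states and a limit of two states.
site = ["A", "A", "A", "B", "B", "C", "N"]
with_rare = recode_rare_alleles(site, max_states=2, retain_rare=True)
without_rare = recode_rare_alleles(site, max_states=2, retain_rare=False)
```

With the default limit of 14, nucleotide data is returned unchanged, which matches the statement that the preference is irrelevant for such data.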
35. Map; Export; Sort Genotype File; Transform; Genotype Numericalization; Collapse Non Major Alleles; Separate Alleles; Transform and/or Standardize Data; Impute Phenotype; PCA; Synonymizer; Synonymize Taxa Names; Intersect Join (Command); Union Join (Command); Merge Genotype Tables (Command, Notes); Separate; Homozygous Genotype. Impute Menu: Genotypic Imputation. Filter Menu: Sites; Site Names; Taxa Names; Traits. Analysis Menu: Diversity; Linkage Disequilibrium; Cladogram; Kinship; GLM (General Linear Model); MLM (Mixed Linear Model); Genomic Selection using Ridge Regression; Geno Summary; Stepwise. Results Menu: Table; Archaeopteryx Tree; 2D Plot; LD Plot; Chart; QQ Plot; Manhattan Plot. GBS Menu. Help Menu: Help Manual; About; Show Memory; Logging. Tutorial: Missing Phenotype Imputation; Principal Component Analysis; Estimation of Kinship using genetic markers; Association analysis using GLM; Association analysis using MLM. Appendix: Nucleotide Codes (Derived from IUPAC); TASSEL Tutorial Data sets; Frequently Asked Questions. REFERENCES. Introduction: While TASSEL has changed considerably since its initial public release in 2001, its primary function continues to be providing tools to investigate the relationship between phenotypes and genotypes. TASSEL has functionality for association study, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, miss
36. User Manual for TASSEL (Trait Analysis by aSSociation, Evolution and Linkage), Version 5.0. The Buckler Lab at Cornell University, August 17, 2014. www.maizegenetics.net/tassel. Disclaimer: While the Buckler Lab at Cornell University has performed extensive testing and results are, in general, reliable, correct, or appropriate, results are not guaranteed for any specific set of data. It is strongly recommended that users validate TASSEL results with other software. Further help: Additional help is available beyond this document. Users are welcome to report bugs and request new features through the TASSEL website. Questions are also welcome to our current team members. For quicker and more precise answers, please address your questions to the most pertinent person. Tassel User Group: http://groups.google.com/group/tassel (recommended), tassel@googlegroups.com. General Information: Ed Buckler (Project leader), esb33@cornell.edu. Data Import, Pipeline: Terry Casstevens, tmc46@cornell.edu. Statistical Analysis: Peter Bradbury, pjb39@cornell.edu. Contributors: Ed Buckler, Terry Casstevens, Peter Bradbury, Zhiwu Zhang, Dallas Kroon, Jeff Glaubitz, Kelly Swarts, Jason Wallace, Fei Lu, Alberto Romero, Cinta Romay, Eli Rodgers-Melnick, Alexander Lipka, Sara Miller, James Harriman, Yogesh Ramdoss, Michael Oak, Karin Holmberg, Natalie Stevens, and Yang Zhang. Citations: Overall Package: Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdo
37. a set into its components. For example, a genotype table would be separated into individual chromosomes. 3.10 Homozygous Genotype: This changes all heterozygous values to unknown (N). 4 Impute Menu 4.1 Genotypic Imputation: TASSEL 5 contains two methods for imputing missing genotype information: one is a generalized approach suitable for all types of populations but optimized for those with higher inbreeding coefficients (FILLIN), and the other is specifically optimized for finding recombination break points in full-sib families (FSFHap). More information on these two methods can be found in: Swarts et al. FSFHap (Full-Sib Family Haplotype Imputation) and FILLIN (Fast, Inbred Line Library ImputatioN) optimize genotypic imputation for low-coverage, next-generation sequence data in crop plants. Plant Genome (in review). FSFHap (Full-Sib Family Haplotype Imputation): FSFHap imputes missing genotypes and corrects genotyping errors for inbred individuals in full-sib families. It is very useful for calling haplotypes in low-coverage GBS data. The individuals must be at least partially inbred because the method relies on finding inbred segments to identify haplotypes. It does not use the parent genotypes directly, but including the parents may be useful for interpreting the results. The algorithms used for imputation analyze one chromosome and family at a time. As a result, a pedigree file must be supplied that indicates which entries belong to
38. ap: FSFHap consists of three TASSEL plugins, CallParentAllelesPlugin, ViterbiAlgorithmPlugin, and WritePopulationAlignmentPlugin, which are called sequentially. A typical command for running FSFHap is as follows (replace items in <> with actual parameter values) for a genotype containing a single chromosome:
run_pipeline.pl -h <genotypeFilename> -CallParentAllelesPlugin -p <pedigreeFilename> -m 0.9 -r 0.5 -logfile <logFilename> -endPlugin -ViterbiAlgorithmPlugin -g true -endPlugin -WritePopulationAlignmentPlugin -f <outputFilename> -m false -o parents -endPlugin
For a genotype file containing multiple chromosomes:
run_pipeline.pl -h <genotypeFilename> -separate -CallParentAllelesPlugin -p <pedigreeFilename> -m 0.9 -r 0.5 -logfile <logFilename> -endPlugin -ViterbiAlgorithmPlugin -g true -endPlugin -WritePopulationAlignmentPlugin -f <outputFilename> -m false -o parents -endPlugin
Options for CallParentAllelesPlugin (options taking a parameter value are specified by Value):
-p or -pedigrees: the pedigree file. Value: filename.
-w or -windowSize: the number of SNPs to examine for each LD cluster. Value: integer, default 50.
-r or -minR: minimum R used to filter SNPs on LD. Value: number between 0 and 1, default 0.2; use 0 for no LD filter.
-m or -maxMissing: maximum proportion of missing data allowed for a SNP. Value: number between 0 and 1, default 0.9.
-f or -minMaf: mini
39. [Screenshot: data tree showing the Genotype Summary results for mdp_genotype. Table Title: Taxa Summary; Number of columns: 9; Number of rows: 281; Number of elements: 2529.] Taxa Summary of mdp_genotype:
- Taxa: index of taxa
- Taxa Name: name of taxa
- Number of Sites: number of sites for taxon (same for all)
- Gametes Missing: number of gametes with unknown (N) value. Every taxa/site combination has two gametes.
- Proportion Missing: Gametes Missing / (Number of Sites x 2)
- Number Heterozygous: number of sites that are heterozygous for taxon
- Proportion Heterozygous: Number Heterozygous / Number of Sites (not counting sites that are unknown, NN)
- Inbreeding Coefficient
- Inbreeding Coefficient Scaled by Missing
6.9 Stepwise 7 Results Menu: Results consists of the functions to present data as tables or graphics. 7.1 Table: Allows data to be displayed in a spreadsheet view and exported into a flat file. To create a table, select a data set from the Data Tree panel, then click on the menu Results > Table. [Screenshot: TASSEL 5.0 Results menu showing Table, Archaeopteryx, 2D Plot, LD Plot, Chart, QQ Plot, and Manhattan Plot] Shown below is an example in which the Taxa Summary is displayed. [Screenshot: Taxa Summary table]
40. as included to test whether or not erroneous het calls result in too many hets being imputed. It appears to have only a small effect on the outcome. The windowLD algorithm handles F2 and later populations effectively but can have problems when parents have some residual heterozygosity. It is recommended that the logfile option be used. The output can be used to identify and diagnose possible problems. The option bcn true should be used for populations with two or more backcrosses. However, using the bc1 option is not necessary, as the default behavior is usually best. Options for ViterbiAlgorithmPlugin:
-g or -fillgaps: if true, then missing values flanked by SNPs from the same parent will be imputed to that parent; false otherwise. Value: true or false, default true.
-h or -phet: expected frequency of heterozygous loci. Used only if the inbreeding coefficient is not specified in the pedigree file. Value: number between 0 and 1, default 0.07.
Options for WritePopulationAlignmentPlugin:
Required: -f or -file: the base file name for the output; .hmp.txt will be appended. Value: filename.
Optional: -m or -merge: if true, then families are merged into a single file; if false, then each family is output to a separate file. Value: true or false, default false.
-o or -outputType: if value = parents, then output parent calls; if value = nucleotides, then output nucleotides; if value = both, then output both in separate files. Default: bo
41. at were taken that produced the error. If the data you are working with is not too sensitive, please include the files which were used in the faulty procedure. If you would rather not post your data file on SourceForge, you may email it to one of the software developers. 2. Where do I turn for more information? If you are having difficulty with a certain aspect of TASSEL, you can either email one of the software developers listed at www.maizegenetics.net or check the TASSEL forum on SourceForge (http://sf.net/projects/tassel), as another user may have already addressed a similar question. There is also a TASSEL discussion group at http://groups.google.com/group/tassel. 3. How do I join the fun (TASSEL on SourceForge)? TASSEL is an open source project distributed under the GNU General Public License. This means that the source code is available and the user is free to modify the code to suit their particular needs. We welcome input from developers and those who wish to become involved in the improvement of this software. The project is hosted on SourceForge (http://sf.net/projects/tassel), thereby allowing anyone to access the most recent changes to the code. This setup makes it convenient for anyone to add special functionality to TASSEL if they so desire. It also serves as a good platform for anyone who wishes to become involved in a bioinformatics software development project. 4. When I click on the most current version of TASSEL web start a p
42. below a user-supplied heterozygosity threshold. For a taxon considered outbred (above the threshold; 2b), the Viterbi option is never used, because in an outbred taxon it is more likely that, if two haplotypes explain a segment, it is heterozygous for those two haplotypes. If the algorithm cannot find haplotypes to satisfy any of these threshold requirements, the segment will not be imputed. The thresholds for the focus block imputation are set based on the mxInbErr and mxHybErr values entered (or defaults). [Figure: FILLIN workflow diagram showing FILLINFindHaplotypesPlugin (generate block haplotypes) and FILLINImputationPlugin (impute back onto sample taxa using haplotypes by block), with separate paths for taxa below mxHet (inbred) and above mxHet (outbred)] Running FILLIN: FILLIN consists of two TASSEL plugins, FILLINFindHaplotypesPlugin and FILLINImputationPlugin, which are called sequentially. If you would like to mask your data and calculate accuracy, use the accuracy flag for FILLINImputationPlugin. If imputing maize, a donor file of haplotypes from 40k taxa can be found on the Panzea website (http://www.panzea.org/lit/data_sets.html). FILLIN can be run either within the TASSEL GUI or through the command line. The options a
43. continuous, and (3) drop one or more traits from the data set. In addition, the dialog can be used to view the trait properties without changing them. If the OK button is clicked, a new data set is created that incorporates the changes, the original data set remains unchanged, and the dialog closes. If the Cancel button is clicked, no data set is created, the original data set remains unchanged, and the dialog closes. Allowable trait types are data, covariate, factor, and marker. Generally, data and covariate traits will be continuous (not discrete) and factor will be discrete. Markers in a numerical data set will be continuous. Discrete-valued markers are better imported as genotypes and filtered using the Sites filter. Clicking Exclude All unchecks the Include box for all traits. Clicking Include All checks the Include box for all traits. The Exclude Selected and Include Selected buttons do the same thing for traits that have been highlighted by selecting them with the mouse. Type can be changed for individual traits by selecting a value in the drop-down box in the Type column for that trait. Type can be changed for multiple traits by selecting those traits, then clicking one of the Change Selected Type to buttons. Important: Once a numerical data set has been joined with genotypes, it can no longer be modified using the trait filter function. [Screenshot: Filter Traits dialog, Modify Trait Properties]
44. culated when no marker is included in the model. Why should I exclude one column of the population structure? For some methods of calculating population structure, such as the software STRUCTURE, the population proportions sum to one. This produces linear dependence between the population co-variates. While the algorithm used by GLM tolerates that dependency, MLM will fail because the design matrix will not be invertible. Excluding one column eliminates linear dependence between columns. Using PC axes to represent population structure does not result in linear dependency because all PC columns are guaranteed to be independent. 10. Can kinship replace population structure? Sometimes. For some traits and populations, the K-only model may be as good as or better than the Q+K model. For others, Q+K may be superior. The Q-only model is not as effective for controlling population structure as the alternatives. Unfortunately, no general guidelines exist for predicting which model will perform best. As a result, an investigator may wish to fit all three models and compare the results. If eliminating false positives is very important, then it may make sense to accept the most conservative model. However, if the objective is to identify candidates for further study and the cost of following up on a false lead is low, the most liberal model may be preferred. Why do TASSEL and SPAGeDi give different kinship estimates? First, many
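Returning to the first answer on this page (why one population structure column should be excluded), the linear dependence can be checked directly. The sketch below is a generic illustration, not TASSEL code: `matrix_rank` is a hypothetical helper using exact fractions, and the membership proportions are made up so that each row sums to one.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix by Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(value) for value in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical membership proportions (strings keep them exact); rows sum to one.
q = [
    ["0.7", "0.2", "0.1"],
    ["0.1", "0.8", "0.1"],
    ["0.3", "0.3", "0.4"],
    ["0.5", "0.25", "0.25"],
]

# Design matrix with an intercept plus all three Q columns: 4 columns.
full_design = [[1] + row for row in q]
# Same design with the last Q column dropped: 3 columns.
reduced_design = [[1] + row[:-1] for row in q]

full_rank = matrix_rank(full_design)
reduced_rank = matrix_rank(reduced_design)
```

The full design has four columns but rank three, because the Q columns add up to the intercept column; after dropping one Q column the remaining three columns are independent.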
45. cy: Minor Allele Gametes / (Number of Taxa x 2 - Gametes Missing)
Gametes Missing: number of gametes with unknown (N) value
Proportion Missing: Gametes Missing / (Number of Taxa x 2)
Number Heterozygous: number of taxa that are heterozygous for site
Proportion Heterozygous: Number Heterozygous / Number of Taxa (not counting taxa that are unknown, NN)
Inbreeding Coefficient
Inbreeding Coefficient Scaled by Missing
[Screenshot: Taxa Summary table for mdp_genotype with columns Taxa Name, Number of Sites, Gametes Missing, Proportion Missing, and Number Heterozygous; the data tree lists the Overall, Allele, Site, and Taxa summaries under Genotype Summary]
46. cy based on masked genotypes. Default: false.
-propSitesMask <Proportion of genotypes to mask if no depth>: proportion of genotypes to mask for accuracy calculation if depth is not available. Default: 0.01.
-depthMask <Depth of genotypes to mask>: depth of genotypes to mask for accuracy calculation if depth information is available. Default: 9.
-propDepthSitesMask <Proportion of depth genotypes to mask>: proportion of genotypes of given depth to mask for accuracy calculation if depth is available. Default: 0.2.
5 Filter Menu 5.1 Sites: The genotype table can be filtered in several ways. For example, monomorphic sites can be eliminated and regions of a sequence can be eliminated. [Screenshot: Filter Alignment dialog with Minimum Count, Minimum Frequency, Maximum Frequency, Position Type, Start Position, End Position, Remove minor SNP states, Generate haplotypes via sliding window, Filter, Select Chromosomes, and Cancel]
Minimum Count: the minimum number of taxa in which the site must have been scored to be included in the filtered data set. GAP or missing data do not count.
Minimum Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the filtered data set.
Start Position / End Position: establishes the range of sites for filtering.
Extract Indels: if selected, indels are extracted fr
47. dies. Nat Genet 42, 348-54 (2010).
Thornsberry, J.M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nature Genetics 28, 286-289 (2001).
Pritchard, J.K., Stephens, M., Rosenberg, N.A. & Donnelly, P. Association mapping in structured populations. American Journal of Human Genetics 67, 170-181 (2000).
Zhao, K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet 3, e4 (2007).
Yu, J.M. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203-208 (2006).
Ware, D. et al. Gramene: a resource for comparative grass genomics. Nucleic Acids Research 30, 103-105 (2002).
Ware, D.H. et al. Gramene, a tool for grass genomics. Plant Physiology 130, 1606-1613 (2002).
Jaiswal, P. et al. Gramene: development and integration of trait and gene ontologies for rice. Comparative and Functional Genomics 3, 132-136 (2002).
Yamazaki, Y. & Jaiswal, P. Biological ontologies in rice databases. An introduction to the activities in Gramene and Oryzabase. Plant and Cell Physiology 46, 63-68 (2005).
Zhao, W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Research 34, D752-D757 (2006).
Canaran, P., Stein, L. & Ware, D. Look-Align: an interactive web-based multipl
48. e sequence alignment viewer with polymorphism analysis support. Bioinformatics 22, 885-886 (2006).
Du, C.G., Buckler, E. & Muse, S. Development of a maize molecular evolutionary genomic database. Comparative and Functional Genomics 4, 246-249 (2003).
SAS. SAS Statistical Analysis Software for Windows, 9.0 ed. (Cary, NC, USA, 2002).
Hardy, O.J. & Vekemans, X. SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618-620 (2002).
Cover, T. & Hart, P. Nearest neighbor pattern classification. Proc. IEEE Trans. Inform. Theory 13 (1967).
Weir. Genetic Data Analysis II (Sunderland, MA, 1996).
Farnir, F. et al. Extensive genome-wide linkage disequilibrium in cattle. Genome Res 10, 220-7 (2000).
Henderson, C.R. Best Linear Unbiased Estimation and Prediction under a Selection Model. Biometrics 31, 423-447 (1975).
Kang, H.M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709-23 (2008).
Laird, N.M. & Ware, J.H. Random-Effects Models for Longitudinal Data. Biometrics 38, 963-974 (1982).
Thornsberry, J.M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nat Genet 28, 286-9 (2001).
Flint-Garcia, S.A. et al. Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J 44, 1054-64 (2005).
Anderson, M.J. & Ter Braak, C.J.F.
49. e table also contains markerR2, mean squares (MS), and degrees of freedom (DF) for the marker effect, for the model corrected for the mean, and for error. If taxa are replicated across reps or environments, then the markers are tested using the taxa-within-marker mean square. If taxa are unreplicated, then the residual mean square is used. MarkerR2 is the marginal R-squared for the marker, calculated as SS Marker (after fitting all other model terms) / SS Total, where SS stands for sum of squares. The following table shows an example of the Allele Estimates output as viewed with Results > Table. [Screenshot: Allele Estimates table with columns Marker, Locus, Locus pos, Allele, and Estimate] For each marker and trait combination, each marker allele is listed along with the number of observations for taxa carrying that allele (Obs), the locus (usually chromosome) and locus position of that marker, the allele, and the estimate of the effect of that allele. Because of the way that GLM codes alleles, the last allele estimate for a marker is always zero and the other allele estimates are relative to that. 6.6 MLM (Mixed Linear Model): This conducts association analysis via a mixed linear model (MLM). A mixed model is one which includes both fixed and random effects. Including random effects gives MLM the ability to incorporate information about relations
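The MarkerR2 definition on this page (SS Marker after fitting all other model terms, divided by SS Total) can be worked through for the simplest case. The sketch below is an illustration under stated assumptions, not TASSEL's GLM: it assumes the only other model term is the mean, so SS Marker is the between-allele sum of squares and SS Total is corrected for the mean; the function name `marker_r2_and_f` is hypothetical, and at least two alleles and more observations than alleles are required.

```python
def marker_r2_and_f(phenotypes, alleles):
    """Marginal R-squared and F statistic for one marker in a mean-only model.

    Assumption: no factors or covariates other than the overall mean, so
    SS Marker is the between-allele sum of squares.
    """
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    ss_total = sum((y - grand_mean) ** 2 for y in phenotypes)

    # Group phenotype values by marker allele.
    groups = {}
    for y, allele in zip(phenotypes, alleles):
        groups.setdefault(allele, []).append(y)

    ss_marker = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_error = ss_total - ss_marker
    marker_df = len(groups) - 1
    error_df = n - len(groups)

    r2 = ss_marker / ss_total
    f_stat = (ss_marker / marker_df) / (ss_error / error_df)
    return r2, f_stat


# Illustrative data: SS Total = 5, SS Marker = 4, SS Error = 1.
r2, f_stat = marker_r2_and_f([1.0, 2.0, 3.0, 4.0], ["A", "A", "B", "B"])
```

In this toy case the marker explains 4 of the 5 units of total sum of squares, so the marginal R-squared is 0.8 and the F statistic is (4 / 1) / (1 / 2) = 8.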
50. ed (required)
hapSize <Preferred haplotype size>: Preferred haplotype block size in sites (use same as in FILLINFindHaplotypesPlugin). (Default: 8000)
hetThresh <Heterozygosity threshold>: Threshold per-taxon heterozygosity for treating taxon as heterozygous (no Viterbi, het thresholds). (Default: 0.01)
mxInbErr <Max error to impute one donor>: Maximum error rate for applying one haplotype to entire site window. (Default: 0.01)
mxHybErr <Max combined error to impute two donors>: Maximum error rate for applying Viterbi with two haplotypes to entire site window. (Default: 0.003)
mnTestSite <Min sites to test match>: Minimum number of sites to test for IBS between haplotype and target in focus block. (Default: 20)
minMnCnt <Min num of minor alleles to compare>: Minimum number of informative minor alleles in the search window (or 10X major). (Default: 20)
mxDonH <Max donor hypotheses>: Maximum number of donor hypotheses to be explored. (Default: 20)
hybNN <true | false>: If true, uses combination mode in focus block, else does not impute. (Default: true)
ProjA <true | false>: Create a projection alignment for high density markers. (Default: false)
impDonor <true | false>: Impute the donor file itself. (Default: false)
nV <true | false>: Suppress system out. (Default: false)
Options for calculating accuracy:
accuracy <true | false>: Masks input file before imputation and calculates accura
51. er Matrix category. Alternatively, impute missing genotype data first, then create the kinship matrix using the imputed data. To impute missing data, highlight the filtered genotype, choose Data > Transform, leave Collapse Non Major Alleles selected, and click Create Dataset. A new data set with Collapse appended will appear in the Numerical folder. Highlight the collapsed data set, choose Data > Transform, select the Impute tab, then click Create Dataset. Highlight the resulting imputed data, then choose Analysis > Kinship.
[Screenshot: TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) 5.0 window showing the resulting taxa-by-taxa matrix]
52. er button. A synonym data set will be placed on the Data Tree panel under Synonyms. Each name in the data set selected second is now listed in the TaxaSynonym column. Next to this column is a TaxaRealName column listing the highest-scoring match derived from the real name data set. The MatchScore column gives an indication of the amount of similarity between the two names, where 0 is no similarity and 1.0 is identity.
Caution: Before the synonyms are applied, we strongly encourage the user to check the match scores, especially for those taxa with low match scores. To do that, the user selects the synonym file and clicks the Synonymizer button. The incorrect matches (usually the ones with the lowest match scores) can be rejected at this point. Sorting on the match score column first makes this a fairly easy process. In the event that some of the taxa are not interpreted correctly, matches can be modified manually. Select the taxon you wish to modify on the left side and then choose a replacement taxon from the right side. Click the arrow button to substitute the taxon. Taxa with no synonym can be identified by selecting them and then clicking No Synonym.
[Screenshot: TASSEL 3.0.36 window with the Synonymizer button on the Data toolbar]
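The review step described above (rank candidate matches by score, then inspect the lowest ones first) can be sketched as follows. This is only an illustration: the guide does not say which similarity algorithm TASSEL uses, so Python's standard `difflib.SequenceMatcher` stands in for it, and the function names `match_score` and `best_matches` are hypothetical.

```python
from difflib import SequenceMatcher


def match_score(real_name, synonym):
    """Similarity in [0, 1]: 0 means nothing in common, 1.0 means identical.

    Assumption: a case-insensitive difflib ratio stands in for TASSEL's score.
    """
    return SequenceMatcher(None, real_name.lower(), synonym.lower()).ratio()


def best_matches(real_names, synonyms):
    """For each synonym, pick the highest-scoring real name.

    Returns (TaxaSynonym, TaxaRealName, MatchScore) rows sorted so that the
    lowest scores, the ones most worth checking by hand, come first.
    """
    rows = []
    for synonym in synonyms:
        best = max(real_names, key=lambda real: match_score(real, synonym))
        rows.append((synonym, best, match_score(best, synonym)))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for row in best_matches(["B73", "Mo17"], ["b73_1", "MO17"]):
        print(row)
```

Sorting ascending by score mirrors the advice to sort on the match score column before rejecting doubtful matches.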
53. f data at the present time. Work to improve handling heterozygotes is ongoing.
16 How to cite TASSEL
The paper that describes TASSEL as a software package and the papers that introduce specific methods implemented in TASSEL should be cited as appropriate, such as the unified Q+K approach, EMMA, compression of mixed linear model, and P3D. For example:
A) Linkage disequilibrium (D', R2 and P-value) were calculated by TASSEL.
B) Association analyses were performed with the mixed linear model approach implemented by TASSEL.
C) GWAS was performed with the compressed mixed linear model approach carried by TASSEL, which also implemented the EMMA and P3D algorithms to reduce computing time (11, 12, 13, 14).
REFERENCES
Bradbury, P.J. et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633-2635 (2007).
Zhang, Z., Buckler, E.S., Casstevens, T.M. & Bradbury, P.J. Software engineering the mixed model for genome-wide association studies on large samples. Brief Bioinform 10, 664-75 (2009).
Kang, H.M. et al. Efficient Control of Population Structure in Model Organism Association Mapping. Genetics 178, 1709-1723 (2008).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355-60 (2010).
Kang, H.M. et al. Variance component model to account for sample structure in genome-wide association stu
54. filtered genotype and click Transform. Use the default option of Collapse non major alleles. Click Create data set.
Imputation of missing values: Highlight the numerical genotype and click Transform, and then click the Impute tab. Use the default options. Click Create data set.
PCA: Highlight the imputed numerical genotype, click Transform, and then click the PCA tab. Change the default option to Components = 3 by choosing Components and typing 3 in the text box. Click Create data set.
[Screenshot: Filter Alignment dialog. Minimum Count: 210 out of 281 sequences; Minimum Frequency: 0.05; Position Type: Position index or Physical Position (AGP); Start Position: 0 (157104); End Position: 2560 (148907116) of 2561 sites; checkboxes Extract Indels and Generate haplotypes via sliding window (Haplotype Length, Step Length)]
[Screenshot: TASSEL 3.0.39 window showing the data tree with mdp_genotype, mdp_population_structure, mdp_traits and mdp_kinship]
55. henotyped lines based on the performance of a training set. To do that, a dataset containing both the genotypes to be predicted and the genotypes of the training set can be joined with a dataset containing the phenotypes of the training set using a union join. All taxa in the phenotype set should have genotypes. If an individual without genotype data is included, all the marker data for that individual will be imputed, which is not a generally useful thing to do.
6.8 Geno Summary
[Screenshot: Genotype Summary dialog with options Genotype Overview, Site Summary and Taxa Summary, and Ok and Close buttons]
[Screenshot: TASSEL 5.0.5 data tree showing mdp_genotype_OverallSummary, mdp_genotype_AlleleSummary, mdp_genotype_SiteSummary and mdp_genotype_TaxaSummary under Result > Genotype Summary]
Table Title: Overall Summary; Number of columns: 2; Number of rows: 14; Number of elements: 28
Overall Summary of mdp_genotype (rows: Number of Taxa, Number of Sites, Sites x Taxa, Number Not Missing, Proportion Not Missing, Number Missing, Proportion Missing, Number Gametes, Gametes Not Missing, Proportion Gametes Not Missing, Gametes Missing, Proportion Gametes Missing, Number Heterozygous, Proportion Heterozygous)
Number of Taxa: 281
Number of Sites: 3093
Sites x Taxa: 869133
Number Not Missing: 837722
Proportion Not Missing: 0.96386
Number Missing: 31411
Proportion Missing: 0.03614
56. hips among individuals. When a genetic marker based kinship matrix (K) is used jointly with population structure (Q), the "Q+K" approach improves statistical power compared to "Q" only. MLM can be described in Henderson's matrix notation as follows:
$y = X\beta + Zu + e$
where $y$ is the vector of observations; $\beta$ is an unknown vector containing fixed effects, including genetic marker and population structure (Q); $u$ is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines; $X$ and $Z$ are the known design matrices; and $e$ is the unobserved vector of random residual. The $u$ and $e$ vectors are assumed to be normally distributed with null mean and variance of
$\mathrm{Var}\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}$
where $G = \sigma_a^2 K$, with $\sigma_a^2$ as the additive genetic variance and $K$ as the kinship matrix. Homogeneous variance is assumed for the residual effect, which means $R = I\sigma_e^2$, where $\sigma_e^2$ is the residual variance. The proportion of genetic variance over the total variance is defined as heritability ($h^2$). When K is derived from pedigrees, the elements of K equal 2 x Probability(IBD), where IBD means that two alleles drawn at random are identical by descent. Generally, K calculated from markers is an IBS matrix. The resulting multiplier is then not $\sigma_a^2$ but some unknown constant times $\sigma_a^2$. Some methods for calculating K, such as those implemented in SPAGeDi, actually use markers to develop an es
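The heritability definition in this excerpt ("the proportion of genetic variance over the total variance") can be written out under the stated variance structure. Here $\sigma_a^2$ denotes the additive genetic variance and $\sigma_e^2$ the residual variance; the numeric values in the worked line are hypothetical and serve only to illustrate the arithmetic.

```latex
% Heritability as the share of total variance that is additive genetic
h^2 = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2}

% Worked example with hypothetical variance components
\sigma_a^2 = 6.0,\quad \sigma_e^2 = 4.0
\quad\Longrightarrow\quad
h^2 = \frac{6.0}{6.0 + 4.0} = 0.6
```

When K is an IBS matrix rather than a pedigree-based one, the genetic variance component is scaled by an unknown constant, as the text notes, so a ratio computed this way should be read with that caveat in mind.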
57. indow (default 8k); this is performed by the first plugin, FILLINFindHaplotypesPlugin. Because short IBD segments may be replicated widely within a species, even between diverse individuals, we recommend supplying all the information available within a species for this step.
The second plugin, FILLINImputationPlugin, uses these haplotypes to impute missing genotypes in target individuals. It does so in multiple steps, first looking for haplotypes that match the minor alleles to a threshold within the whole site window (1a in schematic below). If this fails, it looks for two haplotypes to explain the site window and, assuming this represents a recombination break point between two inbred haplotypes, uses a Viterbi HMM algorithm to model the recombination breakpoints (2a). If two haplotypes cannot be found to explain the whole site window, the algorithm next searches for haplotypes to explain a smaller focus window within the site window, centered on 64 sites at a time and searching to the right and left until enough informative minor alleles are found. It does this by first looking for one haplotype to a threshold (2a), then two, modeling a recombination break between inbred segments (2b), then finally, to a higher threshold, looks for two haplotypes and models the 64-site focus window as heterozygous, combining the two haplotypes together. The thresholds for 2a-c are also set differently based on whether the whole sequence of the target taxon is above or
58. ing, data imputation, and data visualization. TASSEL development has been led by a group focused on maize genetics and genomics, and for these reasons the software has design and computational optimizations that account for the biology found in many plants and breeding situations. Compared to human genetics, many crops are highly diverse both at the nucleotide level and in structural variations (10-50X greater than humans), inbreeding is common, large families are common, and whole genome prediction is being applied daily to real world problems. These biological differences lead to some different optimizations that are of use to many biological systems outside of crops.
One of the design elements driving TASSEL development has been the need to analyze ever larger sets of data. TASSEL 5 has at its heart lots of design optimizations for big data, including:
• Bit-level encoding of nucleotides, so genetic distance and linkage disequilibrium estimates can be made very quickly (20-50X speed increases).
• Extensive use of the HDF5 file format, which has been developed as a robust element of many climate modelers for matrix style data.
• Tools for extracting and calling SNPs from extensive Genotyping-by-Sequencing data (tested for 60,000 samples by over 2.5 million SNPs and 96 million sequence alleles).
• Projection and imputation procedures that are optimized for the large families in crops.
Some of these optimizations permit memory and computational improvements of
59. [Screenshot: Synonymizer dialog with the Apply threshold control]
Once it has been determined that the taxa names were matched correctly, the synonyms can be applied. With the synonyms selected, hold down the CTRL key while clicking on the second synonym data set (the data set whose names you would like to change). Then once again click on the Synonymizer button to apply the new names to the data set.
3.6 Intersect Join
Command: run_pipeline.pl -fork1 -h group1.hmp.txt -fork2 -h group2.hmp.txt -combine3 -input1 -input2 -intersect -export group1_group2_intersect.hmp.txt -runfork1 -runfork2 -runfork3
This joins multiple data sets by the intersection of their taxa. Taxa must be present in both data sets to be included. Select multiple data sets using the CTRL key in conjunction with mouse clicks and then click on the intersection button to join the data sets. Because this function uses taxa names to join data sets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by using the Synonymizer.
3.7 Union Join
Command: run_pipeline.pl -fork1 -h group1.hmp.txt -fork2 -h group2.hmp.txt -combine3 -input1 -input2 -union -export group1_group2_union.hmp.txt -runfork1 -runfork2 -runfork3
This joins multiple data sets by a union of their taxa. Missing data will be inserted if taxa are missing from one data set. Select multiple data sets using the CTRL key in conjunction with mouse clicks and then clic
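The difference between the two joins, and the reason name variation breaks them, can be illustrated with a small sketch. It is not TASSEL's implementation: each table is assumed to be a plain dictionary mapping a taxon name to a list of column values, and the function names and the `missing` placeholder are hypothetical.

```python
def intersect_join(left, right):
    """Keep only taxa present in both tables, concatenating their columns."""
    return {taxon: left[taxon] + right[taxon] for taxon in left if taxon in right}


def union_join(left, right, n_left, n_right, missing=None):
    """Keep every taxon from either table, padding absent columns with `missing`.

    n_left and n_right are the column counts of the two tables.
    """
    taxa = list(left) + [taxon for taxon in right if taxon not in left]
    return {
        taxon: left.get(taxon, [missing] * n_left) + right.get(taxon, [missing] * n_right)
        for taxon in taxa
    }


if __name__ == "__main__":
    traits = {"B73": [1.2], "Mo17": [3.4]}
    markers = {"B73": ["A", "C"], "mo17": ["G", "C"]}  # "mo17" differs only in case
    print(intersect_join(traits, markers))    # only B73 survives
    print(union_join(traits, markers, 1, 2))  # Mo17 and mo17 stay separate rows
```

Because the names are compared exactly, "Mo17" and "mo17" are treated as different taxa, which is the situation the Synonymizer is meant to resolve before joining.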
60. [Screenshot: table with title beginning "Alignment Distance M", Number of columns: 282, Number of rows: 281, listing pairwise distance values for taxa A635 through B37]
10.4 Association analysis using GLM
We use three files from the tutorial data set to perform association analysis using the GLM. The first file, mdp_genotype.hmp.txt, is a set of SNPs scored at 3093 sites on 281 maize inbred lines. The second one is the population structure of 282 maize
61. k on the union button to join the data sets. Because this function uses taxa names to join data sets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by using the Synonymizer.
3.8 Merge Genotype Tables
Command: run_pipeline.pl -fork1 -h group1.hmp.txt -fork2 -h group2.hmp.txt -combine3 -input1 -input2 -mergeAlignments -export group1_group2_merge.hmp.txt -runfork1 -runfork2 -runfork3
This is the most complex merge function and can be considered as a union join across both sites and taxa. (The actual union join only works across taxa.) The resulting genotype table will contain all unique sites and all unique taxa from across the input datasets. If a specific site/taxon combination isn't present in any input dataset, the value is set to missing. If a specific site/taxon combination is present in more than one input file, the output will contain the last value processed. That is, later values overwrite earlier values, even if they conflict. There are plans to change this, but they have not been implemented yet.
Notes:
• This maps to the Data > Merge Genotype Tables menu on the GUI.
• Error if duplicate site names in same file. (Same as with other file loadings.)
• Undefined taxa/site allele values are set to UNKNOWN.
• Duplicate taxa/site set to last Alignment processed.
• Sites are identified by Locus (chromosome), Physical Position, and Site Name.
3.9 Separate
This separates the selected dat
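The merge rule described above (union of sites and taxa, missing where nothing is defined, last value wins on conflict) can be sketched as below. This is an illustrative model only: each input table is assumed to be a dictionary keyed by (site, taxon) that holds only defined calls, and `UNKNOWN` and `merge_genotype_tables` are hypothetical names, not TASSEL identifiers.

```python
UNKNOWN = "N"  # placeholder for an undefined call (assumption for this sketch)


def merge_genotype_tables(*tables):
    """Merge tables keyed by (site, taxon); later tables overwrite earlier ones."""
    sites, taxa, calls = [], [], {}
    for table in tables:
        for (site, taxon), call in table.items():
            if site not in sites:
                sites.append(site)
            if taxon not in taxa:
                taxa.append(taxon)
            calls[(site, taxon)] = call  # last value processed wins
    # Fill every site/taxon combination that no input defined.
    return {(s, t): calls.get((s, t), UNKNOWN) for s in sites for t in taxa}


if __name__ == "__main__":
    first = {("s1", "B73"): "A", ("s2", "B73"): "C"}
    second = {("s2", "B73"): "G", ("s2", "Mo17"): "T"}
    for key, call in merge_genotype_tables(first, second).items():
        print(key, call)
```

In the example, the conflicting call at ("s2", "B73") resolves to "G" because the second table is processed last, and ("s1", "Mo17") is filled with the unknown placeholder.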
62. l demonstrate the analysis using two sets of markers. One is the dwarf8 gene sequence used in the GLM tutorial. The other is a set of 3093 SNPs spread across the maize genome. For the dwarf8 gene sequence, use the joint data set created by following the tutorial for GLM. Solve the mixed linear model by highlighting the joint data set and the kinship data, then clicking the MLM button in Analysis mode.
[Screenshot: TASSEL window with the joint data set and kinship matrix in the data tree, and the MLM Options dialog (Compression Level)]
63. [Screenshot: data tree listing mdp_genotype_OverallSummary, mdp_genotype_AlleleSummary, mdp_genotype_SiteSummary and mdp_genotype_TaxaSummary]
Table Title: Allele Summary; Number of columns: 4; Number of rows: 27; Number of elements: 108
Allele Summary of mdp_genotype
Alleles: Allele values present in data set. Single-letter values are diploid, where some letters represent heterozygous. Two-letter values are major/minor combinations with count of sites.
Number: Number of occurrences.
Proportion: Percentage the value occurs in data set.
Frequency: Percentage the value occurs in data set, not counting unknown (N) values.
[Screenshot: Allele Summary table values]
[Screenshot: TASSEL 5.0.5 window showing the Site Summary table with columns Site Number, Site Name, Chromosome, Physical Position, Number of Taxa and Major Allele]
64. lleles  chrom  pos  strand  assembly  center  protLSID  assayLSID  panel  QCcode  33-16  38-11  4226  4722  A188
PZB00859.1  A/C  1  157104  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  CC  CC  CC  AA
PZA01271.1  C/G  1  1947984  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  GG  CC  GG  CC
PZA03613.2  G/T  1  2914066  +  AGPv1  Panzea  NA  NA  maize282  NA  GG  GG  GG  GG  GG
PZA03613.1  A/T  1  2914171  +  AGPv1  Panzea  NA  NA  maize282  NA  TT  TT  TT  TT  TT
PZA03614.2  A/G  1  2915078  +  AGPv1  Panzea  NA  NA  maize282  NA  GG  GG  GG  GG  GG
PZA03614.1  A/T  1  2915242  +  AGPv1  Panzea  NA  NA  maize282  NA  TT  TT  TT  TT  TT
PZA00258.3  C/G  1  2973508  +  AGPv1  Panzea  NA  NA  maize282  NA  GG  CC  CC  CG  CC
PZA02962.13  A/T  1  3205252  +  AGPv1  Panzea  NA  NA  maize282  NA  TT  TT  TT  TT  TT
PZA02962.14  C/G  1  3205262  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  CC  CC  CC  CC
PZA00599.25  C/T  1  3206090  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  TT  CC  TT  TT
PZA02129.1  C/T  1  3706018  +  AGPv1  Panzea  NA  NA  maize282  NA  TT  CC  CC  CC  CC
PZA00393.1  C/T  1  4175293  +  AGPv1  Panzea  NA  NA  maize282  NA  TT  TT  TT  CC  TT
PZA02869.8  C/T  1  4429897  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  TT  CC  NN  CC
PZA02869.4  C/G  1  4429927  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  CC  CC  NN  GG
PZA02869.2  C/T  1  4430055  +  AGPv1  Panzea  NA  NA  maize282  NA  NN  TT  TT  CC  TT
PZA02032.1  A/T  1  4490461  +  AGPv1  Panzea  NA  NA  maize282  NA  AA  TT  AA  AA  AA
zagl1.5  A/T  1  4835434  +  AGPv1  Panzea  NA  NA  maize282  NA  AA  NN  AA  AA  AA
zagl1.2  A/C  1  4835558  +  AGPv1  Panzea  NA  NA  maize282  NA  CC  CC  CC  CC  CC
zagl1.6  C/T  1  4835558  +  AGPv1  Panzea  NA  NA  maize282
65. llow the tutorial at end of this manual. The objective of this section is to provide information necessary to install and start TASSEL software and to provide a brief overview of the interface.
[Screenshot: TASSEL 5.0.1 main window with the File, Data, Filter, Analysis, Results and Help menus, the data tree, and a genotype table; Number of sequences: 281; Number of sites: 3093]
66. lumns Genetic Var, Residual Var, and -2LnLikelihood list $\sigma_a^2$, $\sigma_e^2$, and minus two times the log-likelihood of the model, respectively. When the P3D option is used, all of the values are the same for a given trait because they are only calculated once. A second table lists the estimated effects of each allele for each marker, similar to the output for GLM. The compression results table, shown below, shows the likelihood, genetic variance, and error variance for each compression level tested during the optimization process. The meaning of groups and compression is discussed above in the description of the compression method. The compression level with the lowest value of -2LnLk is used for testing markers.
[Table: compression results for trait dpoll, with columns including Compression, -2LnLk, Var_genetic and Var_error]
6.7 Genomic Selection using Ridge Regression
This function performs ridge regression to predict phenotypes from genotypes. It is one of the methods used for genomic selection (GS). The input dataset must contain one or more phenotypes and numeric marker data. Optionally, it may also contain factors and covariates. The analysis is run by selecting the input dataset, then clicking the GS button. Because no additional user input is needed, the analysis will run immediately after the button is clicked. All traits will be ana
67. lyzed separately using all of the genotypes, factors, and covariates in the dataset. The output will consist of two new datasets for each trait. One of the datasets will contain genomic estimated breeding values (GEBVs) for each taxon, and the other will contain BLUPs for each marker in the genotype file. The output datasets will appear in the Numerical folder, which holds the input data as well. The output datasets can in turn be used for subsequent analysis. For example, they could be joined with the input data so that the predicted values could be graphed against the original values.
Understanding the input data requirements is important to ensure that the results of the analysis will be correct and useful. Genotypes must be numeric, with one column for each marker. It is expected that the markers are bi-allelic, with the homozygotes coded as 1 and -1 and the heterozygotes coded as 0. However, any reasonable coding scheme will work. For instance, missing data could be replaced by a probability resulting from imputation. If any genotype data is missing, it will be imputed as the average of the marker scores across all taxa for that marker. If a user prefers to use a different method of imputation, then the missing genotypes must be imputed before importing the data into TASSEL. GEBVs will be calculated for all taxa in the dataset, including any lines that have missing phenotype data. A typical use of genomic selection is to predict GEBVs for a set of unp
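The default handling of missing genotypes described in this excerpt (replace each missing score with that marker's average across taxa) can be sketched as follows. This is a minimal illustration rather than TASSEL's code: rows are taxa, columns are markers coded 1, 0 and -1, `None` marks a missing score, and the 0.0 fallback for a marker with no observed scores is an assumption made here.

```python
def impute_marker_means(genotypes):
    """Fill missing scores (None) with the per-marker mean across taxa.

    genotypes: list of rows (one per taxon), each a list of marker scores.
    """
    n_markers = len(genotypes[0])
    means = []
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # Assumption: a marker with no observed data falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if score is None else score for j, score in enumerate(row)]
        for row in genotypes
    ]


if __name__ == "__main__":
    scores = [[1, None], [-1, 0], [None, 1]]
    print(impute_marker_means(scores))
```

A taxon with no genotype data at all would receive the marker means in every column, which is why the guide advises against including such individuals.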
68. mum minor allele frequency used to filter SNPs. If negative, filters on expected segregation ratio from parental contribution. Value: number between -1 and 1, default = -1
b or bc1: use BC specific filter. Value: true or false, default = true
n or bcn: use multiple backcross specific filter. Value: true or false, default = false
logfile: the name of a file to which all logged messages will be printed. Value: filename
Options not taking a parameter value:
cluster: use the cluster algorithm; minMaf defaults to 0.05
subpops: filter sites for heterozygosity in subpopulations
nohets: delete het calls from original data before imputing
windowld: use the window ld algorithm for finding parent haplotypes
The cluster, subpops, nohets and windowld options do not take parameters but only act as flags that include certain features in the analysis. Of those, cluster and windowld are the most useful. When the cluster option is used, a different algorithm is used that does a better job of handling residual heterozygosity in the 23 parents. However, it does not perform well for partially inbred RILs that have only been self-pollinated for one or two generations. If the RILs being imputed are F2's or F3's, the cluster option should not be used. The subpops option should only be used when imputing families of the NAM population developed by the Maize Diversity Project. The nohets option w
69. n may be considerably slower than no compression or user-supplied compression. This is because the algorithm solves the model once for each of a series of compression levels in order to determine the optimal one. All MLM analyses create two output tables: model statistics and model effects. If compression is used, the analysis creates three tables.
[Screenshot: table titled MLM statistics for Filtered_mdp_traits + Filtered_mdp_population_structure + mdp_genotype_chr1_157104_3706018, with columns Trait, Marker, Locus, Site, df, F, p, errordf, markerR2, Genetic Var, Residual Var and -2LnLikelihood, and Print, Export (CSV) and Export (Tab) buttons]
The statistics table shows the results of the tests for each trait. The first line is for the model with no markers. Following that is a single line for each marker tested. The columns labeled df, F, and p are the degrees of freedom, F, and p-value from the F distribution for the test of the marker. The column errordf is the degrees of freedom used for the denominator of the F test. The column labeled markerR2 is the R2 for the marker, calculated based on a formula for R2 for a generalized least squares (GLS) model, as shown here. The co
70. n two taxa to compare genetic distance to evaluate similarity for clustering. (Default: 50)
mxErr <Max combined error to impute two donors>: The maximum genetic divergence allowable to cluster taxa. (Default: 0.05)
hapSize <Preferred haplotype size>: Preferred haplotype block size in sites (minimum 64); will use the closest multiple of 64 at or below the supplied value. (Default: 8192)
minPres <Min sites to test match>: Minimum number of present sites within input sequence to do the search. (Default: 500)
maxHap <Max haplotypes per segment>: Maximum number of haplotypes per segment. (Default: 3000)
minTaxa <Min taxa to generate a haplotype>: Minimum number of taxa to generate a haplotype. (Default: 2)
maxOutMiss <Max frequency missing per haplotype>: Maximum frequency of missing data in the output haplotype. (Default: 0.4)
nV <true | false>: Suppress system out. (Default: false)
extOut <true | false>: Details of taxa included in each haplotype to system out. (Default: false)
Options for FILLINImputationPlugin:
hmp <Target file>: Input HapMap file of target genotypes to impute. Accepts all file types supported by TASSEL 5. (required)
d <Donor Dir>: Directory containing donor haplotype files from output of FILLINFindHaplotypesPlugin. All files with 'gc' in the filename will be read in; only those with matching sites are used. (required)
o <Output filename>: Output file; hmp.txt.gz and .hmp.h5 accept
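The hapSize rule listed above ("minimum 64; will use the closest multiple of 64 at or below the supplied value") amounts to a small piece of integer arithmetic. The sketch below is one reading of that sentence, not code from the plugin: `effective_hap_size` is a hypothetical helper, and treating requests below 64 as 64 is an assumption.

```python
BLOCK = 64  # sites per block, as stated in the option description


def effective_hap_size(requested=8192):
    """Round a requested haplotype size down to a multiple of 64, never below 64."""
    return max(BLOCK, (requested // BLOCK) * BLOCK)


if __name__ == "__main__":
    for requested in (8192, 8000, 200, 10):
        print(requested, "->", effective_hap_size(requested))
```

Both documented defaults, 8192 here and 8000 in the imputation plugin, are already multiples of 64, so they pass through unchanged.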
71. ng either of the minimum eigenvalue associated with each axis, the minimum percent of the variance captured by an axis, or the number of axes. The resulting axes will be sorted by the amount of variance each captures.
[Screenshot: Transform Column Data dialog, PCA tab; Method: Correlation or Covariance; Output: Eigenvalue, Var Prop or Components; Create Dataset button]
3.5 Synonymizer (Synonymize Taxa Names)
This button makes taxa names uniform to permit the joining of data sets. The join functions that generate fused data sets work by matching taxa names. Consequently, if multiple names exist for a given taxon (an added suffix, alternative spellings, different naming conventions, etc.), then the two data sets will not join correctly. To help remedy this, the Synonymizer function allows the taxa names of one data set to replace similar taxa names in the second data set. It relies on an algorithm that calculates the degree of similarity between names, using the name from the first set which is most similar to that in the second data set.
When using the Synonymizer, keep in mind that order of selection matters. Always select the data set with the names you wish to use (the real name) first, and then, while holding down the CTRL key, click on the second data set with the taxa names you wish to change (the synonym). Then click on the Synonymiz
72. notype Imputation
The phenotype file (mdp_traits) will be used to demonstrate the process of imputing missing data. Note that the data set below contains missing values (NaN).
[Screenshot: TASSEL 3.0.37 window showing the Phenotypes table; Number of columns: 4; Number of rows: 301; Number of elements: 1204]
To impute missing data, first select the mdp_traits data set in the Data Tree Panel and then click the Transform button (Data > Transform). The Transform Column Data window will open. Click on the Impute tab in this window. Finally, click on the Create Data set button to create the new data set with missing values imputed. Note that missing values are now filled.
[Screenshot: TASSEL 3.0.37 window showing the Imputed Phenotypic Values table]
10.2 Principal Component Analysis
Principal component analysis (PCA) is a statistical tool that transforms a set of cor
73. [Screenshot: data tree under Genotype Summary (mdp_genotype_OverallSummary and related tables) alongside Site Summary table rows]
Table Title: Site Summary; Number of columns: 35; Number of rows: 3093; Number of elements: 108255
Site Summary of mdp_genotype
Site Number: Index of site
Site Name: Name of site
Chromosome: Chromosome
Physical Position: Physical Position on Chromosome
Number of Taxa: Number of taxa for site (same for all)
Major Allele: The major allele of site
Major Allele Gametes: Number of times major allele occurs for site (up to twice number of taxa)
Major Allele Proportion: Major Allele Gametes / (Number of Taxa * 2); Number of Taxa * 2 is the Number of Gametes for a Site
Major Allele Frequency: Major Allele Gametes / ((Number of Taxa * 2) - Gametes Missing)
Minor Allele: The minor allele of site
Minor Allele Gametes: Number of times minor allele occurs for site
Minor Allele Proportion: Minor Allele Gametes / (Number of Taxa * 2); Number of Taxa * 2 is the Number of Gametes for a Site
Minor Allele Frequen
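The distinction between the Proportion and Frequency columns above comes down to whether missing gametes stay in the denominator. The sketch below follows that reading of the column definitions, in which the operators had to be inferred from the flattened text; `allele_stats` and the example counts are hypothetical.

```python
def allele_stats(allele_gametes, n_taxa, gametes_missing):
    """Return (proportion, frequency) for one allele at one site.

    proportion: share of all gametes (2 per taxon), missing ones included.
    frequency:  share of non-missing gametes only.
    """
    total_gametes = n_taxa * 2
    proportion = allele_gametes / total_gametes
    frequency = allele_gametes / (total_gametes - gametes_missing)
    return proportion, frequency


if __name__ == "__main__":
    # Hypothetical site: 281 taxa, 12 missing gametes, major allele seen 400 times.
    proportion, frequency = allele_stats(400, 281, 12)
    print(round(proportion, 5), round(frequency, 5))
```

With no missing gametes the two values coincide; the more data is missing at a site, the further Frequency rises above Proportion.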
74. om the alignment. If not selected, only point substitutions are extracted. Remove minor SNP states converts tertiary and rarer states to missing data, thereby forcing each site to have only two segregating states. This may help remove sequencing errors. Generate haplotypes via sliding window creates haplotypes from an ordered set of SNPs. Example Pipeline Command that removes SNPs with a minor allele frequency (MAF) of less than 5%:
run_pipeline.pl -fork1 -h mdp_genotype.hmp.txt -filterAlign -filterAlignMinFreq 0.05 -export Filtered_genotype -runfork1
5.2 Site Names First select the genotypic data from the data tree. The resulting dialog displays the site names associated with the selected data. By using either the CTRL or SHIFT key in conjunction with the mouse, the user can select or deselect site names. Once the desired site names have been moved to the Selected window using the Add > button, the Capture Selected or Capture Unselected buttons will create a new data set containing only the desired site names. Using the search box: The wildcard is always implied at the end of the search string. The search string is case sensitive. For example, use [Aa]bc to match site names beginning with Abc or abc. PZ[AB] will match anything starting with PZA or PZB. [Screenshot: Site Name Filter dialog with Available and Selected lists]
75. ps groups Var genetic Var error The strongest associated SNP is at 193565357 bp on chromosome 3. The P value is 1.3027x10. The threshold is 3.2331x10^-6 at a significance level of 1% after Bonferroni multiple test correction (0.01/3093). The association was not significant. As illustrated below, the output labeled GLM Allele Estimates shows the marker effects assigned to genotypes for each SNP. The GLM is also the same. For example, the first SNP at 157104 bp on chromosome 1 had three genotypes, AA, CC and AC, coded as A, C and M based on the IUPAC code (see Appendix, Nucleotide Codes). [Screenshot: TASSEL 3.0.39 showing the MLM effects table] 11 Appendix 11.1 Nucleotide Codes Derived from IUPAC 11.2 TASSEL Tutorial Data sets http://www.maizegenetics.net/tassel/docs/TASSELT
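The Bonferroni threshold quoted above is simply the significance level divided by the number of tests (0.01/3093). A minimal Python sketch of that arithmetic follows; the function names are hypothetical and the P value passed at the end is a made-up number used only to show the comparison.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True when a single test passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Values from the text: 1% significance level, 3093 SNPs tested.
threshold = bonferroni_threshold(0.01, 3093)
print(f"{threshold:.4e}")

# Made-up P value, for illustration only.
print(is_significant(1e-4, 0.01, 3093))
```

A P value has to fall below this threshold, not below 0.01 itself, to be declared significant after correction.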
76. _pipeline.pl -SortGenotypeFilePlugin -inputFile [filename] -outputFile [filename] -fileType [Hapmap or VCF]
The fileType flag is optional and is only needed if the input file's extension doesn't match a known file extension (.hmp.txt, .vcf, etc.). 3.4 Transform This suite of functions allows multiple data manipulations on genotype and phenotype (numerical) data. When a genotype data set is selected, the data are transformed to numbers. When a numerical data set is selected, mathematical transformation, data imputation, and principal component analysis (PCA) can be performed. The Transform columns tags will be displayed in a Data dialog box with three tabs: Trans, Impute and PCA. 3.4.1 Genotype Numericalization Two options are provided to transform genotype from character to numerical, as shown in the following dialog box. [Screenshot: dialog offering Collapse Non Major Alleles and Separate Alleles, with Create Dataset and Close buttons] 3.4.1.1 Collapse Non Major Alleles This function assigns 1 to the major allele and 0 to any other alleles. The converted genotypes are saved in a new numerical data set. 3.4.1.2 Separate Alleles This function assigns an indicator (1 for present and 0 for absent) for each allele. The converted genotypes are saved in a new numerical data set. 3.4.2 Transform and/or Standardize Data The Trans dialog box is the default selection, as shown below. In the Column list, select the column(s) you wish to transform. Then select the type of transfo
77. re the same for both. A typical command sequence for running FILLIN through the command line is as follows (replace items in <> with actual parameter values):
run_pipeline.pl -FILLINFindHaplotypesPlugin -hmp <genotypeFilename> -o outDonorDir
run_pipeline.pl -FILLINImputationPlugin -hmp <genotypeFilename> -d donorDir -o <outFile.hmp.txt.gz>
To run FILLIN from the GUI, go to Impute > FILLINFindHaplotypesPlugin or FILLINImputationPlugin. Options for FILLINFindHaplotypesPlugin: -hmp <Target file>: Input genotypes to generate haplotypes from. Usually best to use all available samples from a species. Accepts all file types supported by TASSEL 5 (required). -o <Donor dir/file basename>: Output file directory name, or new directory path. The directory will be created if it doesn't exist. Outfiles will be placed in the directory, given the same name, and appended with the substring '.gc#s#.hmp.txt' to denote chromosome and section (required). -mxDiv <Max divergence from founder>: Maximum genetic divergence from founder haplotype to cluster sequences. Default: 0.01. -mxHet <Max heterozygosity of output haplotypes>: Maximum heterozygosity of output haplotype. Heterozygosity results from clustering sequences that either have residual heterozygosity or clustering sequences that do not share all minor alleles. Default: 0.01. -minSites <Min sites to cluster>: The minimum number of sites present i
78. related variables into a smaller number of uncorrelated variables called principal components (PCs). The first PC captures as much of the variation as possible, and the succeeding PCs account for a decreasing fraction of the remaining variance. Another application of PCA is to use PCs derived from genetic markers to represent population structure. This method requires much less computing time than maximum likelihood estimation. As most marker data are characters, numericalization must be performed first. A common approach for converting character marker scores is to set one of the homozygotes to 0, the other homozygote to 2, and the heterozygote to 1. For haploids, the conversion can be performed simply by coding one allele as 0 and the other as 1. The TRANSFORM function in TASSEL converts the major allele to 0. All the other alleles are collapsed to a single class and coded as 1. PCA requires that all variables have variation and no missing values. As a result, filtering the genotype to eliminate monomorphic markers and imputing missing values may be necessary. Imputing missing values can be done before or after numericalization. Here we demonstrate how to generate PCs from the genotype file in the tutorial data. Remove monomorphic sites: Make sure TASSEL is in Data mode. Highlight the genotype and click Site. Set the minimum frequency to 0.05 and have Remove minor SNP states checked. Click Filter. Numericalization: Highlight the
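The common 0/1/2 coding described above (one homozygote 0, heterozygote 1, other homozygote 2) can be sketched as follows. This is an illustration, not TASSEL's own conversion: the function name is hypothetical, the choice to count copies of the less common allele is an assumption, and missing calls are returned as None so that they can be imputed or filtered afterwards.

```python
from collections import Counter


def additive_codes(genotypes, missing="N"):
    """Code diploid calls at one site as copies of the less common allele.

    Hypothetical helper: the more common homozygote becomes 0, the
    heterozygote 1, the other homozygote 2, and missing calls None.
    """
    alleles = Counter(a for g in genotypes for a in g if a != missing)
    if len(alleles) != 2:
        # Monomorphic or multi-allelic sites need filtering first.
        raise ValueError("expected exactly two alleles at the site")
    # Break count ties alphabetically so the result is deterministic.
    minor = min(alleles, key=lambda a: (alleles[a], a))
    codes = []
    for g in genotypes:
        if missing in g:
            codes.append(None)
        else:
            codes.append(sum(1 for a in g if a == minor))
    return codes


site = ["AA", "AC", "CC", "AA", "NN", "AA"]
codes = additive_codes(site)
print(codes)
```

The ValueError on monomorphic sites reflects the requirement stated above that every variable entering PCA must show variation.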
79. revious version appears. What should I do? The previous version of TASSEL web start was cached on your machine. To replace it with the most current version, click the Start button in Windows followed by Run. Type javaws and then click OK. In the window that opens, keep the most current version of TASSEL and delete the rest. 5. What should I substitute for missing values in TASSEL? For numerical data in version 3 format, use NA or NaN. For numerical data in version 2 format, use -999 for missing values. For SNP data, use N. Kinship does not allow missing values. 6. Is it possible to change data names in the Data Tree? Yes. Click on the desired data name in the Data Tree, wait for one second, and then click it again (or immediately hit the F2 key). Rename the data set and then hit Enter to save the change. 7. How can I create a TASSEL icon on the desktop? Click Start on Microsoft Windows and select Control Panel, then double click Java to show the Java Control Panel. In the Temporary Internet Files section, click the View button to show the Java Cache Viewer. Move the mouse over the TASSEL application, click the right button, and select Install Shortcuts. 8. Why do I get empty squares in MLM association analysis? The empty square means null information. The major reasons include non-convergence in the estimation of variance components or that the statistic in question was not calculated. For example, marker F, p and R2 are not cal
80. rmation you wish to execute. Selecting the Standardize checkbox will transform data by subtracting the column mean from the value of the trait and then dividing by the column's standard deviation. Clicking on the Create Data set button will place a dataset containing only the selected columns in the Data Tree. [Screenshot: Trans tab listing EarHT, dpoll and EarDia with their percent missing data, and the options Raise to Power, Take Log (Base 10) and Standardize] 3.4.3 Impute Phenotype The k-nearest neighbor algorithm is used to impute missing phenotype data. If data is missing for a taxon for one of the traits, the algorithm finds other taxa (neighbors) that are most like it for the non-missing traits. It uses the average of the neighbors to impute the missing data. Click on the Impute tab to display the following: [Screenshot: Impute tab listing EarHT, dpoll and EarDia, with the options Manhattan Distance, Euclid Distance, Unweighted Average, Weighted Average, Number of Neighbors (K) and Min Freq of Row Data] 3.4.4 PCA Principal component analysis (PCA) can only be performed on a numerical data set without missing values. Two methods are available: correlation or covariance. This determines whether a correlation or covariance matrix will be used as the basis for the analysis. The default, correlation, is a reasonable choice for genetic data. The number of PCA axes in the output data set can be controlled by selecti
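The standardization and k-nearest-neighbor imputation described above can be sketched in plain Python. This is a rough illustration under several assumptions, not TASSEL's implementation: the function names are hypothetical, the sample standard deviation is used, distances are Euclidean over the traits both taxa have, and neighbors are averaged without weights.

```python
import math
import statistics


def standardize(values):
    """Subtract the column mean and divide by the column standard deviation."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)  # assumption: sample standard deviation
    return [None if v is None else (v - mean) / sd for v in values]


def knn_impute(rows, k=2):
    """Fill None entries with the unweighted mean of the k nearest taxa.

    rows holds one list of trait values per taxon. Distance is computed
    only over traits that both taxa have scored.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


z = standardize([2.0, 4.0, 6.0])
traits = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
filled = knn_impute(traits, k=2)
print(z, filled[3])
```

In the example, the last taxon resembles the first two on the trait it has, so its missing value is filled with their average rather than being pulled toward the distant third taxon.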
81. rous invariant bases will take a very long time and consume a large amount of memory to calculate. [Screenshot: Linkage Disequilibrium dialog: LD type Sliding Window, LD Window Size 50 (Sliding Window LD with 153375 comparisons), heterozygous calls set to missing, Accumulate R2 Results] Linkage disequilibrium between any set of polymorphisms can be estimated by clicking on a filtered set of polymorphisms and then using Analysis > Link. Diseq. At this time, D', r² and P-values will be estimated. The current version calculates LD between haplotypes with known phase only (unphased diploid genotypes are not supported; see PowerMarker or Arlequin for genotype support). D' is the standardized disequilibrium coefficient, a useful statistic for determining whether recombination or homoplasy has occurred between a pair of alleles. r² represents the correlation between alleles at two loci, which is informative for evaluating the resolution of association approaches. D' and r² can be calculated when only two alleles are present. If multiple alleles are present, a weighted average of D' or r² is calculated between the two loci. This weighted average is determined by calculating D' or r² for all possible combinations of alleles and then weighting them according to the alleles' frequencies. Note: It is not entirely certain that this procedure fully accounts for allele number effects. P-values are de
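For the two-allele case described above, D' and r² can be computed directly from phased haplotype counts. The sketch below uses the standard textbook definitions and is not taken from TASSEL's source; the function name and the example haplotypes are hypothetical.

```python
def ld_stats(haplotypes):
    """Return (D, D', r2) for two biallelic loci.

    haplotypes is a list of (allele_at_locus_1, allele_at_locus_2)
    tuples with known phase. Textbook definitions are used.
    """
    n = len(haplotypes)
    a, b = haplotypes[0]  # reference alleles: those on the first haplotype
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    p_ab = sum(1 for h in haplotypes if h == (a, b)) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Ten phased haplotypes: 4 AB, 1 Ab, 1 aB, 4 ab.
haps = [("A", "B")] * 4 + [("A", "b")] + [("a", "B")] + [("a", "b")] * 4
d, d_prime, r2 = ld_stats(haps)
print(round(d, 4), round(d_prime, 4), round(r2, 4))
```

D' rescales D by its maximum possible value given the allele frequencies, which is why it reaches 1 whenever one of the four haplotypes is absent, while r² reaches 1 only when the two loci are perfectly correlated.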
82. s exported as numerical trait data. Table Reports are exported as a tab delimited table. For numerical data, the function of Export is similar to the Table function in Results mode. [Screenshot: Export dialog, Choose File Type to Export: Write Hapmap, Write HDF5, Write VCF, Write Plink, Write Phylip (Sequential), Write Phylip (Interleaved), Write Tab Delimited] 3.3 Sort Genotype File TASSEL 5 has strict requirements for the sites in a genotype file. Each site must be unique (as defined by its locus (chromosome), position, and name), and the sites must be in order in the file. Genotype files produced by other programs (and also earlier versions of TASSEL) often do not meet this second requirement and throw an error when TASSEL tries to load them. It can be difficult to recreate TASSEL's internal sort order by hand, so this plugin allows the user to sort an input genotype file according to TASSEL's rules and output it to a new file, ready for further analysis. This sort is not done automatically at load time because the computational cost of sorting large files can be very large. We feel it's better for users to know what they're getting into instead of being surprised by it. There is currently only support for sorting Hapmap and VCF files. To sort a genotype file from the GUI, just select Data > Sort Genotype File and fill in the appropriate parameters in the popup dialog. To sort a file from the command line, use the following command: run
83. s with Environment, columns with Site, and value with PermuteP. The cutoff value for coloring can be chosen either by inputting a value in the text box or by using the slider tool to the right of the text box. Users can mouse over any box to view the value associated with that box, as shown here. [Screenshot: 2D chart of PermuteP for CLAYTON ID15 and HOMESTEAD ID1 with Cutoff 0.001] If P-value coloring is desired, simply check the P-value box, as shown below. [Screenshot: the same 2D chart with the P Value box checked] By checking the P-value box, Cutoff selection tools will be disabled and fields will instead be colored according to a grayscale keyed at 0.01 and 0.05. This key can be shown by clicking on the icon next to the P-value check box. 7.4 LD Plot Displays the results from a linkage disequilibrium analysis. After selecting the desired result from the Data Tree, choose Results > LD Plot. The graph that is generated displays LD between pairs of sites calculated with the analysis step. The black diagonal represents LD between each
84. ss Y, Buckler ES (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635. Genotyping by Sequencing: Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES (2014) TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. PLoS ONE 9(2): e90346. Mixed Model GWAS: Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, Buckler ES (2010) Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42:355-360. The TASSEL project is supported by the National Science Foundation and the USDA-ARS. Reference Links: Main Web Site: http://www.maizegenetics.net/tassel Open source code: https://bitbucket.org/tasseladmin/tassel-5-source Wiki: https://bitbucket.org/tasseladmin/tassel-5-source/wiki Table of Contents Introduction Getting Started Executing TASSEL Open Source Code Software Development Tools Graphical Interface Pipeline Command Line Interface GBS Pipeline File Menu Save Data Tree Open Data Tree Save Data Tree As Open Data Tree Set Preferences Data Menu Load Hapmap HDF5 (Hierarchical Data Format version 5) VCF (Variant Call Format) Plink Projection Alignment Phylip FASTA Numerical Data Trait format Covariate Format Marker Values as Numerical Co-variates Square Numerical Matrix Table Report TOPM (Tags on Physical
85. termined by two methods. If only two alleles are present at both loci, then a two-sided Fisher's Exact test is calculated. Note: Previous editions of TASSEL used a one-sided test, but TASSEL version 1.0.8 and later use a two-sided test. If more than two alleles are present, permutations are used to calculate the proportion of permuted gamete distributions that are less probable than the observed gamete distribution under the null hypothesis of independence. When calculating linkage disequilibrium, users have the option of employing Rapid Permutations. If this option is selected, the algorithm will compute either a fixed number of permutations or run until 10 permutations are found that are more significant than the observed P-value. While this slightly reduces P-values, it also saves a large amount of computational time. If an unbiased P-value is desired, then the user must unselect the Rapid Permutations check box. Full Matrix LD calculates LD for every combination of sites in the alignment. Sliding Window LD calculates LD for sites within a window of sites surrounding the current site. The LD Window Size determines the width of the window on one side of the current site. Linkage disequilibrium results can be plotted using Results > LD Plot or viewed in a table via Results > Table. 6.3 Cladogram [Screenshot: Create Tree dialog with a Save distance matrix checkbox and Clustering Method set to Neighbor Joining]
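The cost difference between Full Matrix LD and Sliding Window LD described above can be sketched by counting site pairs. The functions below are hypothetical and assume that each site is compared with up to Window Size following sites and that every pair is counted once. Under that assumption, 3093 sites (the number of sites in the tutorial genotype data) with a window of 50 give 153375 pairs, which agrees with the comparison count shown in the Linkage Disequilibrium dialog.

```python
def full_matrix_pairs(n_sites):
    """Number of unordered site pairs when every site meets every other."""
    return n_sites * (n_sites - 1) // 2


def sliding_window_pairs(n_sites, window):
    """Number of pairs when each site meets at most `window` later sites.

    Assumption: the window extends `window` sites to one side and each
    pair is counted once.
    """
    return sum(min(window, n_sites - 1 - i) for i in range(n_sites))


n_sites = 3093
windowed = sliding_window_pairs(n_sites, 50)
full = full_matrix_pairs(n_sites)
print(windowed, full)
```

The full matrix grows with the square of the number of sites, while the sliding window grows roughly linearly, which is why the windowed option is the practical choice for large alignments.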
86. th -d or -diploid: if true, output is AA/CC/AC; if false, output is A/C/M. Value: true or false (default: false). -c or -minCoverage: the minimum coverage for a monomorphic SNP to be included in the nucleotide output. Value: number between 0 and 1 (default: 0.1). -x or -maxMono: the maximum minor allele frequency used to call monomorphic SNPs (default: 0.01). For individual families, only polymorphic SNPs are imputed. When merge = false, only those SNPs appear in the output. When merge = true, SNPs that are polymorphic in any family will be written to output. For any site, if SNP coverage is high enough in a family to determine with confidence that it is monomorphic for that family, then all individuals in that family will be imputed to the monomorphic value at that site. The minCoverage and maxMono options set the thresholds for deciding whether a site will be called monomorphic in a family. If either of the options is set to a value of NaN, then missing values at monomorphic sites will not be imputed. FILLIN (Fast, Inbred Line Library ImputatioN) The generalized approach: FILLIN imputes missing genotypes in two steps: 1) haplotype generation (FILLINFindHaplotypesPlugin) and 2) imputation of the resulting haplotypes back onto the target samples (FILLINImputationPlugin). Haplotypes are generated by collapsing low coverage but inbred segments that share identity by state to an (optionally) user supplied threshold value by site w
87. the cumulative eigenvalue contributions. The eigenvalues are of interest because they equal the variance explained by each of the PCs. [Figure: two XY scatter plots. Left: PC vs Individual Proportion and Cumulative Proportion. Right: PC 1 vs PC 2 and PC 3] 10.3 Estimation of Kinship using genetic markers While PCs can be used to capture major population subdivisions, kinship can be used to capture more subtle relationships. This section shows how to create a kinship matrix based on the same SNP data used to calculate PCs. Remove monomorphic sites: Highlight the genotype and choose Filter > Sites on the menu bar. Set the threshold on MAF to 0.05, check Remove minor SNP states, then click Filter. Estimate kinship: Highlight the filtered genotype and click Analysis > Kinship. Leave Scaled IBS selected in the Choose Kinship Method dialog and click OK. A kinship matrix will be added to the data tree und
88. timate of the IBD relationship matrix. For those values of K, the resulting variance estimate can be considered an estimate of the genetic variance, as long as the assumptions of the method used to derive K are not violated for the population being analyzed. One implication is that two different K matrices may give very different estimates of the genetic variance and heritability yet produce the same model fit and test of marker association. TASSEL implements several methods to improve statistical power and reduce computing time. The Restricted Maximum Likelihood (REML) estimates of the genetic and error variances are obtained through the Efficient Mixed Model Association (EMMA) algorithm, which is much faster than the expectation and maximization (EM) algorithm. TASSEL also implements a method called compression, which reduces the dimensionality of the kinship matrix to reduce computational time and improve model fitting. When MLM is used without compression (compression = 1), each taxon belongs to its own group. At the other extreme, GLM can be interpreted as maximum compression (compression = n) with all taxa in a single group. In that case, it is not possible to estimate the random effect independently of error, and it is absorbed into e. Between these two extremes, taxa can be grouped using cluster analysis based on kinship. When n individuals are compressed into s clusters (groups), the kinship among individuals is replaced with the kinship among groups. At some grouping levels
89. tion test should be run and to allow the number of permutations to be changed. The permutation test will be run using the method suggested by Anderson and Ter Braak (2003), which calculates the predicted and residual values of the reduced model containing all terms except markers, then permutes the residuals and adds them to the predicted values. When the GLM options dialog is closed, the user is presented with a dialog allowing the output to be saved to a file rather than stored in memory and displayed by TASSEL. This option is useful when the output is expected to be very large and risks exceeding available RAM. The following table shows an example of the Marker Test output as viewed with Results > Table. [Screenshot: Marker Test table with columns Marker, Locus, pos, marker F, marker p, markerR2, markerDF, markerMS and errorDF] In addition to displaying the F statistics and p values for the requested F tests, th
90. trix (i.e. kinship), Load a Genetic Map, Load a Table Report, Make Best Guess. [Screenshot: File Format dialog] 3.1.1 Hapmap Hapmap is a text based file format for storing sequence data. All the information for a series of SNPs, as well as the germplasm lines, is stored in one file. The first row contains the header labels, and each additional row contains all the information associated with a single SNP. The first 11 columns describe attributes of the SNP, while each of the following columns describes the SNP value for a single germplasm line. The first 12 columns of the first 10 rows should look like this, where Line 1 is the beginning of the germplasm line names. While all 11 header columns are required, not all 11 of the columns need to be filled in for TASSEL to correctly interpret the data. The only required fields are chrom (Chromosome name) and pos (Position). In the example below, genotype values are represented by 2 characters (i.e. AA). Note that you can record those as single character values (see Nucleotide Codes in the Appendix). For TASSEL to correctly read Hapmap data, the data must be in order of position within each chromosome, and the file should be TAB delimited (the example below is in Excel only for easy viewing). If some of the data is missing, the correct number of TABs must still be present so that TASSEL can properly assign data to columns. rst a
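The two Hapmap requirements stated above (every row keeps the full number of TAB-separated fields, and positions are in order within each chromosome) can be checked before loading a file. The following Python sketch is a hypothetical pre-flight check, not part of TASSEL. The column names chrom and pos come from the text; the other header names in the example are the conventional Hapmap labels and are assumed here.

```python
def check_hapmap(lines):
    """Return a list of (line_number, problem) for a Hapmap-style file.

    Hypothetical checker. It only verifies the field count of each row
    and the position order within each chromosome.
    """
    header = lines[0].rstrip("\n").split("\t")
    chrom_col = header.index("chrom")
    pos_col = header.index("pos")
    problems = []
    last_pos = {}
    for number, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            problems.append((number, "wrong number of TAB-separated fields"))
            continue
        chrom, pos = fields[chrom_col], int(fields[pos_col])
        if chrom in last_pos and pos < last_pos[chrom]:
            problems.append((number, "position out of order"))
        last_pos[chrom] = pos
    return problems


def row(name, chrom, pos, calls):
    """Build one SNP row, leaving the optional attribute columns empty."""
    return "\t".join([name, "", chrom, str(pos)] + [""] * 7 + calls)


# Conventional Hapmap header labels (assumed) plus two germplasm lines.
hdr = "\t".join(["rs#", "alleles", "chrom", "pos", "strand", "assembly#",
                 "center", "protLSID", "assayLSID", "panelLSID", "QCcode",
                 "Line1", "Line2"])
good = [hdr, row("s1", "1", 100, ["AA", "CC"]),
        row("s2", "1", 250, ["AC", "AA"]), row("s3", "2", 50, ["GG", "GG"])]
bad = [hdr, row("s1", "1", 250, ["AA", "CC"]),
       row("s2", "1", 100, ["AC", "AA"]), "s3\t\t2\t50"]
print(check_hapmap(good))
print(check_hapmap(bad))
```

Rows with empty attribute columns still pass, because the TABs are present; the truncated last row of the second example fails because its missing fields were dropped rather than left empty.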
91. utorialData3.zip
Filename                          Type                  Format
d8_sequence.phy                   Genotype              Phylip Alignment
mdp_genotype.hmp.txt              Genotype              Hapmap Alignment
mdp_genotype.plk.ped              Genotype              Plink Alignment
mdp_genotype.plk.map
mdp_kinship.txt                   Kinship               Numerical (square matrix)
mdp_population_structure.txt      Population structure  Numerical trait data
mdp_traits.txt                    Phenotype             Numerical trait data
File 1 is the sequence of the dwarf8 gene with 2466 sites on 91 maize inbred lines. The data was described by the paper on the association between Dwarf8 and flowering time. Files 2-6 are 3093 SNPs on 281 maize association inbred lines. The data was presented in three formats: Hapmap, Plink and Flapjack. The data was created by the PANZEA project funded by NSF. Details of the data can be found at http://www.panzea.org. Files 5 and 6 are a pair in Plink format. File 7 is kinship created by Yu et al. File 8 is population structure of 282 maize inbred lines. File 9 is phenotype on three traits, including flowering time, on 282 maize inbred lines. 11.3 Frequently Asked Questions 1. What do I do if TASSEL misbehaves? TASSEL is an open source software project hosted on SourceForge and has a bug tracking list at http://sf.net/projects/tassel, where you can notify the developer community of problems. In order for a bug to be fixed, we must be able to replicate the problem. Thus it is important to document the steps th
