User Manual for Version 3
The Buckler Lab at Cornell University
         Contents
1. Data can be sorted by clicking on the column header of interest. A secondary sort can be done by holding down the CTRL key and clicking on a second column.

Data can be exported to flat files that are either comma separated (Comma Separated Values, CSV) or tab delimited. Both formats can then be imported into a spreadsheet program such as Excel. Tables can also be printed.

4.2 Tree Plot

Displays the results of cladogram analysis.

After running Analysis > Cladogram, select the desired data set and then click Tree Plot in Results mode (Results > Tree Plot). Trees can be visualized in either a Normal or Circular layout.

These images can be printed, saved in JPEG format, or saved as a Scalable Vector Graphics (SVG) file.

4.3 2D Plot

Displays 2D plots and determines color thresholds. This function is useful for plotting associations in multiple environments.

First, select the desired result set. Using the drop-down boxes provided, populate rows with "Environment", columns with "Site", and value with "PermuteP". The cutoff value for coloring can be chosen either by inputting a value in the text box or by using the slider tool to the right of the text box
2. Kang, H.M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709-1723 (2008).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355-60 (2010).
Kang, H.M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348-54 (2010).
Thornsberry, J.M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nature Genetics 28, 286-289 (2001).
Pritchard, J.K., Stephens, M., Rosenberg, N.A. & Donnelly, P. Association mapping in structured populations. American Journal of Human Genetics 67, 170-181 (2000).
Zhao, K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet 3, e4 (2007).
Yu, J.M. et al. A unified mixed model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203-208 (2006).
Casstevens, T.M. & Buckler, E.S. GDPC: connecting researchers with multiple integrated data sources. Bioinformatics 20, 2839-2840 (2004).
Ware, D. et al. Gramene: a resource for comparative grass genomics. Nucleic Acids Research 30, 103-105 (2002).
Ware, D.H. et al. Gramene, a tool for grass genomics. Plant Physiology 130, 1606-1613 (2002).
Jaiswal, P. et al. Gramene: development and integration of trait and gene ontologies for rice. Comparative and Functional Genomics 3, 132-136 (2002).
3. To connect to more than one database, simply repeat the process outlined above.

In the figures of the following sections, only the GDPC area will be displayed if other areas are deemed irrelevant.

6.6.2 Data Query

GDPC is equipped with several tabs to query data, namely Taxa, Taxon Parents, Loci, Genotype Experiments, Environment Experiment, and Localities. Within each tab, any retrieved data will be displayed in the "Filtered List". Choose attributes by checking the desired boxes, located beneath the Filtered List. After an attribute is selected, values of that attribute from the database are displayed. For example, using the Taxa tab, choose the Germplasm type field and then select the desired value. After clicking the Get Data button, the subset of taxa from the database that meets these criteria will appear in both the Filtered List and the Working List.

Items listed in the Working List can be modified by the user. To do so, first break the link between the Filtered List and the Working List by clicking on
4. Yamazaki, Y. & Jaiswal, P. Biological ontologies in rice databases. An introduction to the activities in Gramene and Oryzabase. Plant and Cell Physiology 46, 63-68 (2005).
Zhao, W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Research 34, D752-D757 (2008).
Canaran, P., Stein, L. & Ware, D. Look-Align: an interactive web-based multiple sequence alignment viewer with polymorphism analysis support. Bioinformatics 22, 885-886 (2006).
Du, C.G., Buckler, E. & Muse, S. Development of a maize molecular evolutionary genomic database. Comparative and Functional Genomics 4, 246-249 (2003).
SAS. SAS: Statistical Analysis Software for Windows, 9.0 ed. (Cary, NC, USA, 2002).
Hardy, O.J. & Vekemans, X. SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618-620 (2002).
Cover, T. & Hart, P. Nearest neighbor pattern classification. Proc IEEE Trans Inform Theory 13 (1967).
Weir. Genetic Data Analysis II (Sunderland, MA, 1996).
Farnir, F. et al. Extensive genome-wide linkage disequilibrium in cattle. Genome Res 10, 220-7 (2000).
Henderson, C.R. Best Linear Unbiased Estimation and Prediction under a Selection Model. Biometrics 31, 423-447 (1975).
Kang, H.M. et al. Efficient control of population structure in mod
5. $u$ is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines, $X$ and $Z$ are the known design matrices, and $e$ is the unobserved vector of random residual. The $u$ and $e$ vectors are assumed to be normally distributed with null mean and variance of

$$\mathrm{Var}\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}$$

where $G = \sigma^2_a K$, with $\sigma^2_a$ as the additive genetic variance and $K$ as the kinship matrix. Homogeneous variance is assumed for the residual effect, which means $R = I\sigma^2_e$, where $\sigma^2_e$ is the residual variance. The proportion of genetic variance over the total variance is defined as heritability:

$$h^2 = \frac{\sigma^2_a}{\sigma^2_a + \sigma^2_e}$$

When $K$ is derived from pedigrees, the elements of $K$ equal 2 x Probability(IBD), where IBD means that two alleles drawn at random are identical by descent. Generally, $K$ calculated from markers is an IBS matrix. The resulting multiplier is then not $\sigma^2_a$, but some unknown constant times $\sigma^2_a$. Some methods for calculating $K$, such as those implemented in SPAGeDi, actually use markers to develop an estimate of the IBD relationship matrix. For those values of $K$, the resulting variance estimate can be considered an estimate of $\sigma^2_a$, as long as the assumptions of the method used to derive $K$ are not violated for the population being analyzed. One implication is that two different $K$ matrices may give very different estimates of $\sigma^2_a$ and heritability yet produce the same model fit and test of marker association.

TASSEL implements several methods to improve statistical power and r
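As a small worked illustration of the heritability definition in this section, the sketch below computes $h^2$ from two variance components. The function name and the numeric values are hypothetical and are not TASSEL output; the point is only to make the ratio concrete.

```python
def heritability(genetic_var: float, residual_var: float) -> float:
    """Return h2 = genetic_var / (genetic_var + residual_var)."""
    if genetic_var < 0 or residual_var < 0:
        raise ValueError("variance components must be non-negative")
    total = genetic_var + residual_var
    if total == 0:
        raise ValueError("total variance is zero; heritability is undefined")
    return genetic_var / total


# Hypothetical variance components, for illustration only.
h2 = heritability(genetic_var=3.0, residual_var=1.0)
print(h2)  # 0.75
```

Because a marker-based kinship matrix may be an unknown constant multiple of the pedigree-based one, a value computed this way is best read as relative to the particular K that produced the genetic variance estimate.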
6. Users can "mouse over" any box to view the value associated with that box.

If P-value coloring is desired, simply check the P value box.

By checking the P value box, Cutoff selection tools will be disabled and fields will instead be colored according to a grayscale key. This key can be shown by clicking on the "?" icon next to the P value check box.

4.4 LD Plot

Displays the results from the linkage disequilibrium analysis.

After selecting the desired result from the Data Tree panel, click on the "LD Plot" button while in "Results" mode (Results > LD Plot).

The graph that is generated displays LD between all possible pairs of sites. The black diagonal represents LD between each site and itself. The default setting graphs r² in the upper right and p-values in the lower left. This default can be modified by clicking on the radio buttons in the lower left. The left side of the graph contains a text description of the gene (or chromosome) and the site within the ge
7. See Kinship
Square Numerical Matrix
Standalone
Table
Transform
Web start
8. 1.1 Installation

The graphical version of TASSEL can be installed in one of three ways: using Java Web Start, as a stand-alone application, or using the source code.

1.1.1 Web start

TASSEL can be installed using Java Web Start technology, which automatically checks for the most recent version of TASSEL each time the application is executed. In addition, Java Web Start will ensure that the correct version of the Java Runtime Environment is running, thus avoiding complicated installation and upgrade procedures. Users should use Web Start unless they have a specific reason to use one of the other installation methods.

To begin, Java Web Start (JWS) must be installed prior to the installation of TASSEL. JWS is included as part of Java Runtime Environment (JRE) 5.0 and above. PCs and Macs will most likely have JWS already installed. If you need to install Java, the most recent version is available at http://www.java.com. The easiest way to tell if it is installed on your computer is to try running TASSEL from the following link: http://www.maizegenetics.net/tassel

If you will be using TASSEL frequently and would prefer to launch the application from your desktop rather than
9. ANALYSIS
6.3 ESTIMATION OF KINSHIP USING GENETIC MARKERS
6.4 ASSOCIATION ANALYSIS USING GLM
6.5 ASSOCIATION ANALYSIS USING MLM
6.6 IMPORTING DATA FROM A DATABASE (VIA GDPC)
6.6.1 CONNECTING WITH A DATABASE
6.6.2 DATA QUERY
6.6.3 IMPORTING GDPC DATA INTO TASSEL
6.6.4 SAVING GDPC QUERY RESULTS
7 APPENDIX
7.1 NUCLEOTIDE CODES (DERIVED FROM IUPAC)
7.2 TASSEL TUTORIAL DATA SETS
7.3 BIOGRAPHY OF TASSEL
7.4 FREQUENTLY ASKED QUESTIONS
1. WHAT DO I DO IF TASSEL MISBEHAVES?
2. WHERE DO I TURN FOR MORE INFORMATION?
3. HOW DO I JOIN THE FUN (TASSEL ON SOURCEFORGE)?
4. HOW DO I CHANGE THE AMOUNT OF MEMORY USED? WHAT DO I DO WHEN THE "EXCEPTION JAVA.LANG.OUTOFMEMORYERROR" APPEARS?
5. WHEN I CLICK ON THE MOST CURRENT VERSION OF TASSEL WEB START, A PREVIOUS VERSION APPEARS. WHAT SHOULD I DO?
6. WHAT SHOULD I SUBSTITUTE FOR MISSING VALUES IN TASSEL?
7. IS IT POSSIBLE TO CHANGE DATA NAMES IN THE DATA TREE?
8. HOW CAN I CREATE A TASSEL ICON ON DESKTOP?
9. WHY DO I GET EMPTY SQUARES IN MLM ASSOCIATION ANALYSIS?
10. WHY SHOULD I EXCLUDE ONE COLUMN OF THE POPULATION STRUCTURE?
11. CAN KINSHIP REPLACE POPULATION STRUCTURE?
12. WHY DO TASSEL AND SPAGEDI GIVE DIFFERENT KINSHIP ESTIMATES?
13. CAN I GET MARKER R SQUARE USING SAS PROC MIXED OR TASSEL MLM?
14. DOES MLM FIND MORE ASSOCIATIONS THAN GLM?
15. DO I NEED MULTIPLE TEST CORRECTION FOR THE P VALUE FROM TASSEL?
16. CAN TASSEL HANDLE DIPLOID GENOTYPE DATA?
17. HOW TO CI
10. Tree Panel and then click the Transform button (Data > Transform). The "Transform Column Data" window will open. Click on the Impute tab in this window. Finally, click on the Create Data set button to create the new data set with missing values imputed.

Note that missing values are now filled.

6.2 Principal Component Analysis

Principal component analysis (PCA) is a statistical tool that transforms a set of correlated variables into a smaller number of uncorrelated variables called principal components (PCs). The first PC captures as much of the variation as possible, and the succeeding PCs account for a decreasing fraction of the remaining variance. Another application of PCA is to use PCs derived from genetic markers to represent population structure. This method requires much less computing time than maximum likelihood estimation. As most marker data are characters, numericalization must be performed first. A common approach for converting character marker scores is to set one of the homozygotes to 0, the other homozygote to 2, and the heterozygote to 1. For haploids, the conversion can be simply p
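The 0/1/2 coding for diploid markers described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TASSEL's own transformation code: genotypes are assumed to be two-character strings, `N` is assumed to mark a missing allele, and the alphabetically first allele is arbitrarily chosen as the homozygote scored 0. The function name `numericalize_site` is hypothetical.

```python
def numericalize_site(genotypes, missing="N"):
    """Convert diploid genotype calls at one site to 0/1/2 scores.

    One homozygote is scored 0, the other homozygote 2, and the
    heterozygote 1. Calls containing the missing symbol become None.
    """
    for g in genotypes:
        if len(g) != 2:
            raise ValueError(f"expected a two-character genotype, got {g!r}")
    # Distinct observed alleles; the alphabetically first one is scored 0
    # (an arbitrary choice made for this sketch).
    alleles = sorted({a for g in genotypes for a in g if a != missing})
    if len(alleles) > 2:
        raise ValueError(f"site is not biallelic: {alleles}")
    ref = alleles[0] if alleles else None
    scores = []
    for g in genotypes:
        if missing in g:
            scores.append(None)
        else:
            # Count copies of the allele that is not the reference allele.
            scores.append(sum(1 for a in g if a != ref))
    return scores


print(numericalize_site(["AA", "AC", "CC", "NN", "CA"]))  # [0, 1, 2, None, 1]
```

Since PCA needs a data set without missing values, the `None` entries would have to be imputed or the affected sites dropped before the analysis.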
11. algorithm as is used in determining linkage disequilibrium.

5.3 Preferences

The Quality Score Colors tab, found in the Preferences dialog box, allows the user to set cutoff values for visualizing quality score values on a sequence alignment or a set of called SNPs.

To set a desired threshold, simply adjust the slider on the left side of the dialog. Ns, dashes, and alignments without any quality score information have a default value of -1 (minus one).

6 Tutorial

This tutorial reviews several common scenarios for using TASSEL in order to help the user better understand its capabilities for data manipulation and association analyses. The TASSEL software package includes a tutorial data set that can be downloaded from the TASSEL website (please unzip all files to a directory of your choice). This tutorial data set contains data for phenotype, genotype, population structure, and kinship.

6.1 Missing Phenotype Imputation

The phenotype file mdp_traits will be used to demonstrate the process of imputing missing data. Note that the data set contains missing values (NaN).

To impute missing data, first select the mdp_traits data set in the Data
12. are not appropriate for heterozygous data. GLM or MLM fit SNPs one at a time, treating each distinct genotype as a separate class. This has the effect of fitting an additive plus dominance model. Separating the two effects is under consideration. Because handling heterozygotes as a third marker class is not appropriate for kinship or LD, those analyses should not be used for that type of data at the present time. Work to improve handling of heterozygotes is ongoing.

17. How to cite TASSEL?

The paper that describes TASSEL as a software package and the papers that introduce specific methods implemented in TASSEL should be cited as appropriate, such as the unified "Q+K" approach, EMMA, compression of mixed linear model and P3D. For example:

A. Linkage disequilibrium (D', R² and P value) were calculated by TASSEL.
B. Association analyses were performed with the mixed linear model approach implemented by TASSEL.
C. GWAS was performed with the compressed mixed linear model approach carried by TASSEL, which also implemented the EMMA and P3D algorithms to reduce computing time.

REFERENCES

Bradbury, P.J. et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633-2635 (2007).
Zhang, Z., Buckler, E.S., Casstevens, T.M. & Bradbury, P.J. Software engineering the mixed model for genome-wide association studies on large samples. Brief Bioinform 10, 664-75 (2009).
13. by revisiting the website, Java Web Start can be used to manually launch TASSEL each time and/or to create a shortcut. Access the Java Application Cache Viewer by going to Start > Settings > Control Panel > Java. From the General tab, click on Settings in the Temporary Internet Files section and then click on View, and the Java Application Cache Viewer will appear. (Another way of achieving this is by going to Start > Run and typing in javaws.) The TASSEL icon should now be visible and can be used to launch the application. Shortcuts can be created from the menu of the Java Application Cache Viewer (Application > Install Shortcuts).

1.1.2 Stand alone

Downloading a "stand-alone" version is recommended for anyone who has a slow Internet connection. While Java Web Start is a very good way of deploying software, it does not ask the user before attempting to download updates. Thus, a slow Internet connection may start a download process that requires an unreasonable amount of time to complete. If you are not interested in disabling your network connection each time before starting TASSEL, we recommend downloading the stand-alone version, which does not attempt to update the program. However, given that TASSEL is a Java application, a Java Runtime Environment (version 1.6.0 or greater) is still required. To get the stand-alone version, download tassel3.0_standalone.zip from the TASSEL web site. To run the stand-alone version, double click o
14. each trait by marker combination will be tested and two reports will be produced, one containing trait by marker F-tests and the other containing allele estimates.

To run GLM, select a data set and then click the GLM button. A dialog box will pop up to allow the user to indicate that a permutation test should be run and to allow the number of permutations to be changed. The permutation test will be run using the method suggested by Anderson and Ter Braak (2003), which calculates the predicted and residual values of the reduced model (containing all terms except markers), then permutes the residuals and adds them to the predicted values. When the GLM options dialog is closed, the user is presented with a dialog allowing the output to be saved to a file rather than stored in memory and displayed by TASSEL. This option is useful when the output is expected to be very large and risks exceeding available memory.

The Marker Test output can be viewed with Results Table. In addition to displaying the F statistics and p-values for the requested F-tests, the table also contains markerR2, mean squares (MS) and degrees of freedom (DF) for the marker effect, for the model (corrected for the mean), and for error. If taxa are replicated (across reps or environments), then the markers are tested using the taxa within marker mean square. If taxa are unreplicated, then the residual mean square is used. Marker
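The residual-permutation idea described for the GLM permutation test can be sketched as follows. This is a deliberately simplified illustration, not TASSEL's implementation: the reduced model is assumed to contain only the overall mean, the marker is tested with a one-way F statistic over genotype classes, and the function names, data values, and comparison tolerance are all hypothetical.

```python
import random


def one_way_f(values, groups):
    """F statistic for a one-way classification (marker genotype classes)."""
    n = len(values)
    grand_mean = sum(values) / n
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    k = len(by_group)
    means = {g: sum(vs) / len(vs) for g, vs in by_group.items()}
    ss_between = sum(len(vs) * (means[g] - grand_mean) ** 2
                     for g, vs in by_group.items())
    ss_within = sum((v - means[g]) ** 2
                    for g, vs in by_group.items() for v in vs)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p(values, groups, n_perm=1000, seed=0):
    """Permute reduced-model residuals and add them back to the predictions."""
    rng = random.Random(seed)
    # Reduced model without the marker: here just the overall mean.
    predicted = sum(values) / len(values)
    residuals = [v - predicted for v in values]
    observed = one_way_f(values, groups)
    hits = 0
    for _ in range(n_perm):
        shuffled = residuals[:]
        rng.shuffle(shuffled)
        permuted = [predicted + r for r in shuffled]
        # Small relative tolerance guards against floating-point round-off.
        if one_way_f(permuted, groups) >= observed * (1 - 1e-12):
            hits += 1
    # Count the observed data as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical trait values for two genotype classes at one marker.
trait = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
marker = ["A", "A", "A", "A", "C", "C", "C", "C"]
f_stat, p_value = permutation_p(trait, marker)
print(f_stat, p_value)
```

With covariates such as population structure in the reduced model, the predicted values would come from fitting those terms rather than from the mean alone; the permutation step itself would stay the same.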
15. for each trait. The first line is for the model with no markers. Following that is a single line for each marker tested. The columns labeled "df", "F", and "p" are the degrees of freedom, F, and p-value from the F distribution for the test of the marker. The column "errordf" is the degrees of freedom used for the denominator of the F-test. The column labeled "markerR2" is the R² for the marker, calculated based on a formula for R² for a generalized least squares (GLS) model.

The columns "Genetic Var", "Residual Var", and "-2LnLikelihood" list $\sigma^2_a$, $\sigma^2_e$, and minus two times the model likelihood, respectively. When the P3D option is used, all of the values are the same for a given trait because they are only calculated once. A second table lists the estimated effects of each allele for each marker, similar to the output for GLM.

The compression results table shows the likelihood, genetic variance, and error variance for each compression level tested during the optimization process. The meaning of groups and compression is discussed above in the description of the compression method. The compression level with the lowest value of -2LnLk is used for testing markers.

3.8 Ridge Regression

This function performs ridge regression to
16. handling of  s and non-standard characters
Added Sliding Haplotype functionality
Changed LD Fisher's Exact p-value to use two-sided p-value
Added ability to visualize sequence quality scores
"Synonymize": match taxa names between data sets
GLM analysis improvements
Code change preventing large data sets from being shown in JTable
Update of GDPC which allows automatic restoration of last data source connection
Data transformation utilities added
K Nearest Neighbor data imputation added
Association analysis with Mixed Linear Model
Taxa name "synonymizer" added
Basic heterozygosity handling added
Many ease-of-use improvements
Fixed problem loading genotype data
Mixed Linear Model changes: output NaN if non-converged
Fixed problem loading genotype data
Detection of duplicate ID in kinship
Correction on progress bar with MLM
Starting values of NaN from previous marker are no longer used
MLM: significant speed improvement (10x faster)
GLM: added user-defined F-tests; output taxa or marker means
Principal Components Analysis
Architecture restructure and pipeline version for advanced users
Genetic marker data numerical transformation
MLM implemented P3D algorithm, increasing speed by an order of magnitude (at least ten times)
May 2009: EMMA implemented
November 2009: TASSEL Version 3 release, redesigned for large genomic data and large samples
April 2010: Compression of MLM implemented

7.4 Frequently As
17. of 3093 SNPs spread across the maize genome.

For the dwarf8 gene sequence, use the joint data set created by following the tutorial for GLM. Solve the mixed linear model by highlighting the joint data set and the kinship data, then clicking the MLM button in Analysis mode.

The MLM option dialog will pop up. Choose the default options, which use P3D and compression at the optimum compression level. After the Run button is clicked, the progress bar will start moving. The time required will depend on sample size, number of traits, number of markers, and the options chosen in the MLM option dialog. After the progress bar is reset to zero, indicating completion of MLM, three reports will be added to the data tree. The first two are similar to the reports created by GLM. The most significant SNP is still the same; however, the strength of association is weaker, with a P value of 7.199x10   vs. 1.1021x10  from GLM, which does not pass the Bonferroni multiple test threshold (5x10  ).

The third report contains the MLM-specific statistics, including -2 Log Likelihood, genetic variance and residual variance components under different levels of compression. These statistics are illustrated by th
18. significant level of 1% after Bonferroni multiple test correction (0.01/3093). The association was not significant. The output labeled "GLM_Allele Estimates" shows the marker effects assigned to genotypes for each SNP. The GLM is also the same. For example, the first SNP at 157104 bp on chromosome 1 had three genotypes, AA, CC and AC, coded as A, C, and M based on the IUPAC code (see Appendix: Nucleotide Codes).

6.6 Importing Data from a Database (via GDPC)

GDPC, middleware that is integrated into TASSEL, allows the user to import data from a database. To display GDPC in TASSEL, click on the GDPC button in Data mode. General rules for working with databases include: 1) establish a connection with the database; 2) define a query; 3) once the desired data is in GDPC, load the data from GDPC into TASSEL.

6.6.1 Connecting with a Database

To establish a connection with a database, click the Add Conn button followed by the button of the database you wish to add. Then click Ok. In the example below, we chose Panzea.
19. taxa names can prevent proper joining. Taxa names can be made uniform by using the "Synonymizer".

2.11 Intersection Join

This button joins multiple data sets by the intersection of their taxa. Taxa must be present in both data sets to be included.

Select multiple data sets using the CTRL key in conjunction with mouse clicks, and then click on the intersection button to join the data sets.

Because this function uses taxa names to join data sets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by using the "Synonymizer".

3 Analysis Mode

Analysis mode consists of the following options:

3.1 Diversity

This button executes a basic diversity analysis. Average pairwise divergence (π), segregating sites, and θ estimates (4Nμ) can be calculated, as well as sliding windows of diversity.

To run a diversity analysis, click on a raw sequence alignment, and then select Analysis > Diversity.

In the resulting Diversity Surveys dialog box, the various site classes available for analysis are listed on the left. If the sequence has no annotation, then only the "Overall" and "Indels" options will be active. A sliding window of diversity can also be calculated across the region. To prod
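The intersection join described in section 2.11 amounts to keeping only the taxa that occur in every selected data set. The sketch below illustrates that idea with plain dictionaries keyed by taxon name; it is not TASSEL code, and the function name and data values are hypothetical.

```python
def intersect_join(*datasets):
    """Join data sets (dicts keyed by taxon name) on the taxa they all share."""
    if not datasets:
        return {}
    shared = set(datasets[0])
    for data in datasets[1:]:
        shared &= set(data)
    # Keep the taxon order of the first data set.
    return {taxon: [data[taxon] for data in datasets]
            for taxon in datasets[0] if taxon in shared}


# Hypothetical genotype and trait values; note the "mo17" spelling variant.
genotypes = {"B73": "AA", "Mo17": "CC", "A188": "AC"}
traits = {"B73": 12.5, "mo17": 10.1, "A188": 11.0}

joined = intersect_join(genotypes, traits)
print(joined)  # {'B73': ['AA', 12.5], 'A188': ['AC', 11.0]}
```

The taxon spelled `Mo17` in one data set and `mo17` in the other is silently dropped, which is the kind of mismatch the Synonymizer is meant to resolve before joining.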
20. that the variables in this file will be used as covariates, not as dependent variables. This is the format to use for population structure covariates.

Example:

<Covariate>
<Trait> Q1 Q2 Q3
33-16 0.014 0.972 0.014
38-11 0.003 0.993 0.004
4226 0.071 0.917 0.012
4122 0.035 0.854 0.111
A188 0.013 0.982 0.005

2.2.7.3 TASSEL version 2.1 formats

Version 2.1 formats for numeric data will continue to be supported to provide backward compatibility. However, that format does not identify covariates as such. As a result, any covariates imported using this format will need to be properly identified using the "Trait filter" function described later in the manual.

2.2.7.4 Repeated measurements

A format for repeated measurements may be implemented in the future.

2.2.8 Square Numerical Matrix

Kinship can be calculated externally from pedigrees by using SAS Proc Inbreeding, or from markers by using software packages such as SPAGeDi. The following format is provided to import the resulting kinship estimates.

If n represents the number of taxa, the format for kinship files is as follows:

n
Taxa1Name r11 r12 ... r1n
Taxa2Name r21 r22 ... r2n
...
TaxanName rn1 rn2 ... rnn

Here rij (i, j = 1, ..., n) is the element in the kinship matrix located at row i and column j. Missing values are not allowed for the kinship matrix.

Important note: The current format is different from the format used in TASSEL version 2.0 or lower.

2.2.9 Genetic Map

Genetic Map is a l
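To make the square numerical matrix layout from section 2.2.8 concrete, the following sketch reads such a file into a list of taxa names and a list of rows. It is a hypothetical reader, not part of TASSEL, and it assumes whitespace-delimited fields with the taxa count n on the first line; the taxa names and values in the example are made up.

```python
import math


def parse_kinship(text):
    """Parse a square kinship matrix: a count line, then one row per taxon."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])  # assumed leading line holding the number of taxa
    rows = lines[1:]
    if len(rows) != n:
        raise ValueError(f"expected {n} rows, found {len(rows)}")
    taxa, matrix = [], []
    for row in rows:
        fields = row.split()
        if len(fields) != n + 1:
            raise ValueError(f"row for {fields[0]!r} does not have {n} values")
        values = [float(x) for x in fields[1:]]
        # Missing values are not allowed in a kinship matrix.
        if any(math.isnan(v) for v in values):
            raise ValueError(f"missing value in row for {fields[0]!r}")
        taxa.append(fields[0])
        matrix.append(values)
    return taxa, matrix


# Hypothetical three-taxon example.
example = """3
LineA 2.0 0.25 0.0
LineB 0.25 2.0 0.5
LineC 0.0 0.5 2.0
"""
taxa, matrix = parse_kinship(example)
print(taxa)          # ['LineA', 'LineB', 'LineC']
print(matrix[0][1])  # 0.25
```

Rejecting `NaN` entries at load time mirrors the rule that missing values are not allowed, so a problem surfaces before any analysis is attempted.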
21. trait data or covariates. Kinship must be loaded as a square numerical matrix.

Users can either specify the file type or use the "Guess" option to let the program determine the file type. As an example, we describe how the "Guess" function can be used to import all the files from the tutorial data set. The tutorial data can be downloaded from the TASSEL website or using this link: http://www.maizegenetics.net/tassel/docs/TASSELTutorialData3.zip

To use the data, the zip file must be uncompressed and saved in a folder that the user specifies. To import data, click the LOAD button. The File Loader dialog box will then pop up to let the user choose the files and specify a format. For the files in the tutorial data set, the default "Guess" function will load all the files correctly. Multiple files can be imported simultaneously by highlighting them first (holding the Shift or Control key while clicking) and then clicking the Open button.

2.2.1 BLOB

A Binary Large Object (BLOB) is a collection of binary data stored as a single entity. In TASSEL, BLOBs are used to compress large data sets into more manageable sizes. For sequence data, three types of BLOBs are used: SNP value BLOB, position BLOB and SNP ID BLOB. The three BLOBs are used to store individual SNP values, SNP position within the genome and the SNP identifiers, respectively.

A BLOB is composed of two components: a header and a body. The header for each BLOB is 1024 bytes long, while the length
22. traits, the algorithm finds other taxa (neighbors) that are most like it for the non-missing traits. It uses the average of the neighbors to impute the missing data. Click on the Impute tab to display the imputation options.

2.8.4 PCA

Principal component analysis (PCA) can only be performed on a numerical data set without missing values. Two methods are available: correlation or covariance. This determines whether a correlation or covariance matrix will be used as the basis for the analysis. The default, correlation, is a reasonable choice for genetic data. The number of PCA axes in the output data set can be controlled by selecting the minimum eigenvalue associated with each axis, the minimum percent of the variance captured by an axis, or the number of axes. The resulting axes will be sorted by the amount of variance each captures.

2.9 Synonymize Taxa Names

This button makes taxa names uniform to permit the joining of data sets.

The join functions that generate fused data sets work by matching taxa names. Consequently, if multiple names exist for a given taxon (an added suffix, alternative spellings, different naming conventions, etc.), then the two data sets will not join correctly. To help remedy this, the Synonymizer function allows the taxa names of one data set to replace similar taxa names in the second data set. It relies on an algorithm that calcula
23. R2 is the marginal R² for the marker, calculated as SS Marker (after fitting all other model terms) / SS Total, where SS stands for sum of squares. The Allele Estimates output can also be viewed with Results Table.

For each marker and trait combination, each marker allele is listed along with the number of observations for taxa carrying that allele (Obs), the locus (usually chromosome) and locus position of that marker, the allele, and the estimate of the effect of that allele. Because of the way that GLM codes alleles, the last allele estimate for a marker is always zero and the other allele estimates are relative to that.

3.7 Mixed Linear Model

This conducts association analysis via a mixed linear model (MLM). A mixed model is one which includes both fixed and random effects. Including random effects gives MLM the ability to incorporate information about relationships among individuals. When a genetic marker based kinship matrix (K) is used jointly with population structure (Q), the "Q+K" approach improves statistical power compared to "Q" only. MLM can be described in Henderson's matrix notation as follows:

$$y = X\beta + Zu + e$$

where $y$ is the vector of observations, $\beta$ is an unknown vector containing fixed effects, including genetic marker and population structure (Q),
24. TE TASSEL
REFERENCES
INDEX

INTRODUCTION

While TASSEL has changed considerably since its initial public release in 2001, its primary function continues to be providing tools to investigate the relationship between phenotypes and genotypes. As indicated by its title, "Trait Analysis by aSSociation, Evolution and Linkage", TASSEL has multiple functions, including association study, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation and data visualization.

One of the design elements driving TASSEL development has been the need to analyze ever larger sets of data. For example, the MLM (mixed linear model) function for association analysis originally used an EM (expectation-maximization) algorithm, which is a common method for solving mixed models but is relatively slow. Subsequently, developers implemented the EMMA algorithm to increase computing speed. Model compression was added to that to improve speed and statistical power for association study. Another technique, which optimizes variance components once and then uses the estimates to test markers, now provides the ability to screen the large numbers of markers used in genome-wide association studies (GWAS). The method was independently described by Zhang et al. and Kang et al. in 2010. This method was
25. Use the graph type combo box to select the desired graph type (XY Plot) from the list of options. Select the data to be plotted on the X and Y axes using the appropriate drop-down boxes. If two data series are plotted simultaneously on the Y axis, the "2 Y Axes" checkbox will provide an axis for each.

[Figure: XY plot of DPOLL HOMESTEAD ID1 vs. DPOLL CLAYTON ID15]

5 Menus

The menus in TASSEL include the File, Tools, GDPC, and Help menus. The File menu is mainly used to save the entire data tree, which includes the data loaded into TASSEL and the data created within TASSEL. A previously saved data tree can be loaded into TASSEL. This function lets users save their intermediate results. The Tools menu contains the contingency test and an option to set preferences.

GDPC (Genomic Diversity and Phenotype Connection) is a software package for retrieving data, such as SNPs and phenotypic data, from open database sources. It can also be started using the "GDPC" button in Data mode. Its use is described earlier in the manual.

5.1 File Menu

Individual data sets on the data tree and the entire data tree can be saved. An individual data set is saved in the genotype format for sequence data or numerical format for ph
26. User Manual for Trait Analysis by aSSociation, Evolution and Linkage

Version 3

The Buckler Lab at Cornell University

(August 28, 2011)

www.maizegenetics.net/tassel

Disclaimer: While the Buckler Lab at Cornell University has performed extensive testing and results are, in general, reliable, correct or appropriate results are not guaranteed for any specific set of data. It is strongly recommended that users validate TASSEL results with other software.

Further help: Additional help is available beyond this document. Users are welcome to report bugs or request new features through the TASSEL website. Questions are also welcome to our current team members. For quicker and more precise answers, please address your questions to the most pertinent person.

General Information: Ed Buckler (Project leader), esb33@cornell.edu
Data import, Pipeline: Terry Casstevens, tmc46@cornell.edu
Statistical analysis: Peter Bradbury, pjb39@cornell.edu; Zhiwu Zhang, zz19@cornell.edu

Contributors: Yogesh Ramdoss, Michael E. Oak, Karin J. Holmberg, N. Stevens, and Yang Zhang.

The TASSEL project is supported by the National Science Foundation and the USDA-ARS.

Main Web Site: http://www.maizegenetics.net/tassel
Open source code: http://sourceforge.net/projects/tassel
Modified version of the PAL library is used: http   www       auckland  ac nz pal project
Database access is achieved by GDPC middleware: http
27. ach combination of traits and markers. TASSEL provides users several options: (1) to estimate genetic and residual variance for each combination; (2) to get these estimates once for each trait without fitting genetic markers and then to use those estimates to test markers; (3) to use a prior heritability estimate provided by the user. The second option, named P3D (population parameters previously determined), has the same statistical power as the first option. Using the P3D method or using a prior heritability can be much faster than calculating heritability for each marker.

Using MLM is very similar to using GLM. The difference is that, in addition to choosing the joint data set (or numerical data set), kinship data must also be highlighted before clicking the MLM button to show the MLM option dialog. The option of "No Compression" is the regular MLM, which is equivalent to "Custom level 1". For data sets with large numbers of taxa, the optimal compression option may be considerably slower than no compression or user-supplied compression. This is because the algorithm solves the model once for each of a series of compression levels in order to determine the optimal one.

MLM analyses create two output tables: model statistics and model effects. If compression is used, the analysis creates three tables.

[Figure: MLM output tables]

The statistics table shows the results of the tests
28. ata set is selected, mathematical transformation, data imputation and principal component analysis (PCA) can be performed. The transform options will be displayed in a Data dialog box with three tabs: Trans, Impute and PCA.

2.8 Transform

2.8.1 Genotype Numericalization

Two options are provided to transform genotype from character to numerical, as shown in the following dialog box.

[Figure: Genotype numericalization dialog box]

2.8.1.1 Collapse Non Major Alleles

This function assigns 1 to the major allele and 0 to any other alleles. The converted genotypes are saved in a new numerical data set.

2.8.1.2 Separate Alleles

This function assigns an indicator (1 for present and 0 for absent) for each allele. The converted genotypes are saved in a new numerical data set.

2.8.2 Transform and/or Standardize Data

The Trans dialog box is the default selection, as shown below. In the Column list, select the column(s) you wish to transform. Then select the type of transformation you wish to execute. Selecting the Standardize checkbox will transform data by subtracting the column mean from the value of the trait and then dividing by the column's standard deviation. Clicking on the Create Data set button will place a data set containing only the selected columns in the Data Tree.

[Figure: Trans tab of the Data dialog box]

2.8.3 Impute Phenotype

The k-nearest neighbor algorithm is used to impute missing phenotype data. If data is missing for a taxon for one of the
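The Collapse Non Major Alleles and Standardize options described in this chunk can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not TASSEL's code: the function names are hypothetical, ties for the major allele are broken arbitrarily, missing values are not handled, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the manual does not say which form TASSEL uses.

```python
from collections import Counter
from statistics import mean, stdev


def collapse_non_major(alleles):
    """Code the most frequent (major) allele as 1 and every other allele as 0."""
    major = Counter(alleles).most_common(1)[0][0]
    return [1 if allele == major else 0 for allele in alleles]


def standardize(values):
    """Subtract the column mean, then divide by the column standard deviation."""
    column_mean = mean(values)
    column_sd = stdev(values)  # sample standard deviation; an assumption
    return [(value - column_mean) / column_sd for value in values]


# Hypothetical input data, for illustration only.
coded = collapse_non_major(["A", "A", "G", "T"])
scaled = standardize([1.0, 2.0, 3.0])
```

After standardization the column has mean 0 and standard deviation 1, which puts traits measured in different units on a comparable scale.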
29. axa from the right side. Click on the arrow button to substitute the taxa. Taxa with no synonym can be identified by selecting them and then clicking "No Synonym". Click OK to save the changes.

[Figure: Threshold for synonymizer dialog box]

Once it has been determined that the taxa names were matched correctly, the synonyms can be applied. With the synonyms selected, hold down the CTRL key while clicking on the second synonym data set (the data set whose names you would like to change). Then once again click on the Synonymizer button to apply the new names to the data set.

2.10 Union Join

This button joins multiple data sets by a union of their taxa. Missing data will be inserted if taxa are missing from one data set.

Select multiple data sets using the CTRL key in conjunction with mouse clicks, and then click on the union button to join the data sets.

Because this function uses taxa names to join data sets, any variation in
30. ction (GDPC). GDPC is middleware that eliminates the need for end users of data to understand various database schemas and write SQL queries to extract data. Instead, the GDPC browser provides a single, easy-to-use interface, which can extract genotype and phenotype data from a variety of sources. Currently, GDPC has connections to the following databases:

Gramene diversity for maize, wheat and rice (http://www.gramene.org/db/diversity/diversity_view)
Panzea (http://www.panzea.org)
GRIN (http://www.ars-grin.gov)

GDPC can be used within TASSEL or as a stand-alone application. To display GDPC in TASSEL, click on the GDPC button in Data mode.

Data is available for import once the user has defined the desired filters and data is visible in either the Genotypes or Phenotypes tab. To load data, activate either the Genotypes or Phenotypes tab (depending on the data you wish to import) and then click the Load button.

For additional information about GDPC, please see http://www.maizegenetics.net/gdpc

2.2 Load

This function provides options to import files for genotypes, phenotypes, population structure, and kinship matrices. Several common sequence formats are accepted for genotype data, including BLOB, Hapmap, Plink, and Flapjack, and a general format for polymorphism data. Some file types used by TASSEL version 2 are also supported for backward compatibility. Phenotype and population structure can be imported as numerical
31. cts/tassel), thereby allowing anyone to access the most recent changes to the code. This setup makes it convenient for anyone to add special functionality to TASSEL if they so desire. It also serves as a good platform for anyone who wishes to become involved in a bioinformatics software development project.

4. How do I change the amount of memory used? What do I do when the "Exception java.lang.OutOfMemoryError" appears?

If you are working with very large data sets or are running memory-intensive procedures, there may be occasions when TASSEL runs out of memory. For most routine usage, however, TASSEL memory is sufficient. Memory issues usually result from attempting to execute a procedure like LD on a raw sequence alignment instead of selected SNPs. You may also experience a memory issue if you are not sufficiently specific when retrieving information through GDPC. By default, TASSEL is allocated up to 512 MB of memory on your computer. If more is available on your computer, you can increase the amount allocated by downloading the "stand alone" version of TASSEL and opening a command line window (in Windows use Start > Run and type in "cmd" or "command") to run TASSEL from a command line. Use "cd" to go to the directory containing the stand-alone jar file, then start TASSEL by typing the following:

    java -Xms256M -Xmx768M -jar sTASSEL.jar

Where "-Xms...M" specifies the starting memory available and "-Xmx...M" specifies the maximum m
32. e Chart function on the Result mode as follows:

[Figures: groups vs. -2LnLk; groups vs. Var genetic and Var error]

In the example, 79 are included in the final analysis. When they are clustered into 44 groups, the -2 Log Likelihood reaches a minimum, which indicates the best model fit. The screening of SNPs was performed at this optimum compression level.

Note: When two or more individuals are clustered into one group, the variance component for the random effect is not equivalent to the one without compression. Consequently, the heritability derived should not be interpreted as the individual-based heritability.

To perform a Genome-Wide Association Study (GWAS) on the 3093 SNPs, we need to create a new joint data set containing the filtered phenotype, population structure, and the genome-wide genotype. Highlight the new joint file and the kinship data and click the MLM button. Choose the default options on the MLM option dialog. The analysis will take a minute or two. The output report labeled "MLM compression" indicates that 259 lines were used in the analysis. With 74 groups, the statistics from the best are as graphed below.

[Figures: groups vs. -2LnLk; groups vs. Var genetic and Var error]

The strongest associated SNP is at 193565357 bp on chromosome 3. The P value is 1 302710     The threshold is 3.2331x10^-6 at
33. ection Join on Data mode to create a combined data set.
5. Association analysis: Highlight the joint data set, then click GLM in Analysis mode to perform association analysis. Two reports will be added to the data tree.

[Figure: TASSEL window after running GLM]

One of the reports added to the data tree is labeled "GLM_Marker Test", followed by the name of the joint data. In addition to the information for traits and markers, the data set contains the following statistics:

marker_F: F value from the F test on marker
marker_p: P value from the F test on marker
markerR2: R² for the marker after fitting other model terms (population structure)
markerDF: Degrees of freedom of marker
markerMS: Mean square of marker
errorDF: Degrees of freedom of residual error
errorMS: Mean square of residual error
modelDF: Degrees of freedom of model
modelMS: Mean square of model

[Figure: GLM_Marker Test table]

Clicking "marker_p" will sort the table by P value. The smallest P value is 1 1021x10  for SNP at position 6. The threshold is 5x10^-4 at a significance level of 1% after Bonferroni multiple test correction (0.01/20). T
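The Bonferroni threshold quoted in the GLM tutorial chunk above (a 1% significance level divided by the 20 SNPs tested) can be reproduced with a short calculation. The sketch below is illustrative only; the function names are hypothetical and are not part of TASSEL.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True when a single test passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Values taken from the tutorial text: 1% significance level, 20 SNPs tested.
threshold = bonferroni_threshold(0.01, 20)
```

The denominator is the total number of markers tested, so the same function applies to a genome-wide scan by passing the number of SNPs in that scan.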
34. educe computing time. The Restricted Maximum Likelihood (REML) estimates of the genetic and residual variance components are obtained through the Efficient Mixed Model Association (EMMA) algorithm, which is much faster than the expectation and maximization (EM) algorithm.

TASSEL also implements a method called compression, which reduces the dimensionality of the kinship matrix to reduce computational time and improve model fitting. When MLM is used without compression (compression = 1), each taxon belongs to its own group. At the other extreme, GLM can be interpreted as maximum compression (compression = n) with all taxa in a single group. In that case, it is not possible to estimate the random effect independently of error, and the random effect is absorbed into the residual error. Between these two extremes, taxa can be grouped using cluster analysis based on kinship. When n individuals are compressed into s clusters (groups), the kinship among individuals is replaced with the kinship among groups. At some grouping levels, dependent on the trait and population being analyzed, this compressed MLM has improved statistical power compared to the regular MLM. The optimum grouping with the best model fit for MLM without fitting genetic markers has the best statistical power for an association test of markers. TASSEL allows users to specify the compression level (average number of individuals per group), or to have the program determine the optimum grouping.

Similar to GLM, MLM performs an association test for e
35. el organism association mapping. Genetics 178, 1709-23 (2008).
Laird, N.M. & Ware, J.H. Random-Effects Models for Longitudinal Data. Biometrics 38, 963-974 (1982).
Thornsberry, J.M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nat Genet 28, 286-9 (2001).
Flint-Garcia, S.A. et al. Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J 44, 1054-64 (2005).
Anderson, M.J. & Ter Braak, C.J.F. Permutation tests for multi-factorial analysis of variance. Journal of Statistical Computation and Simulation 73, 85-113 (2003).
36. emory available to the Java Virtual Machine. You may set the values higher or lower as your hardware dictates. Alternatively, you can modify the start_tassel.bat or start_tassel.pl file that comes with the standalone distribution.

5. When I click on the most current version of TASSEL web start, a previous version appears. What should I do?

The previous version of TASSEL web start was cached on your machine. To replace it with the most current version, click the Start button in Windows, followed by Run. Type javaws and then click OK. In the window that opens, keep the most current version of TASSEL and delete the rest.

6. What should I substitute for missing values in TASSEL?

For numerical data in version 3 format, use NA or NaN. For numerical data in version 2 format, use "-999" for missing values. For SNP data, use           For SSR data, use          Kinship does not allow missing values.

7. Is it possible to change data names in the Data Tree?

Yes. Click on the desired data name in the Data Tree, wait for one second, and then click it again, or immediately hit the F2 key. Rename the data set and then hit Enter to save the change.

8. How can I create a TASSEL icon on the desktop?

Click "Start" on Microsoft Windows and select "Control Panel", then double-click Java to show the "Java Control Panel". In the "Temporary Internet Files" section, click the "View" button to show the "Java Cache Viewer". Move mouse over TASSEL application a
37. ences/ecoeval/spagedi.html. Comparisons of methods for calculating kinship can be found in the literature (e.g., Stich et al. 2008).

3.6 General Linear Model

This function performs association analysis using a least squares fixed effects linear model.

TASSEL utilizes a fixed effects linear model to test for association between segregating sites and phenotypes. The analysis optionally accounts for population structure using covariates that indicate degree of membership in underlying populations. A main-effects-only model is automatically built using all variables in the input data. A separate model is built and solved for each trait and marker combination. Any factors, covariates, reps or locations are included in every model as main effects. How the data is used must be defined either in the input data files or using the Trait Filter after the data has been imported but before it has been joined with a genotype.

General Linear Model (GLM) can be run using a numeric data set only or numeric data joined to genotype data. If only numeric data is selected, best linear unbiased estimates (BLUEs, or least squares means) will be generated for the taxa for each trait. (Note: only factors and covariates intended to control field variation should be included at this stage. Population structure covariates, which are intended to control for marker effects, should only be included when markers are also in the analysis.) If numeric data with genotypes are analyzed
38. enotype within the Data folder in TASSEL. Results will look as follows:

[Figure: genotype data loaded from GDPC]

To load phenotype data from GDPC into TASSEL, first click on the GDPC button in Data mode. Then choose the Phenotypes tab, followed by the Load button. The phenotype data is then loaded into TASSEL and labeled as "4 traits environ". To view the uploaded data, select "4 traits environ" from the Phenotypes folder in TASSEL. Results will appear as follows:

[Figure: phenotype data loaded from GDPC]

6.6.4 Saving GDPC Query Results

All query results, including both genotype and phenotype queries, can be saved as either tab-delimited text files or XML files. To export results as a tab-delimited text file, first choose the Query tab and then click on the Export button. To save results in XML format, click the Save As button. Location and file name must be specified in both situations. Data in XML format can be imported back into GDPC by clicking on the Open button.

7 Appendix

7.1 Nucleotide Codes (Derived from IUPAC)

Code    Meaning                       GG  T TT  R       Y       8       w AT     GT  AX         insertion homozygous   o a        deletion homozygous   N Unknown

7.2 TASSEL Tutorial Data sets

The data set contains 9 files and can be downloaded at: http   www  maizegenetics net tassel docs  TASSEL TutorialData3 zip

File   File name           F
39. enotype, covariate, and kinship. The data tree is saved in a binary format.

5.1.1 Save Data Tree

This feature allows you to save the entire contents of the Data Tree panel to a default location. This is helpful when you do not wish to recreate an already well-populated Data Tree panel the next time you start the program. To save a Data Tree, select File > Save Data Tree.

5.1.2 Open Data Tree

To restore a Data Tree that was previously saved, select File > Open Data Tree.

5.1.3 Save Data Tree As

To save the contents of a Data Tree to a specific location or to give it a specific name, select File > Save Data Tree As.

5.1.4 Open Data Tree

To restore a Data Tree from a specific location, select File > Open Data Tree.

NOTE: The information outlined above for saving a Data Tree is applicable to files that are, in general, version specific. When a new version of TASSEL is released, a data tree saved with a previous version might not load in the new version. For longer-term storage, the best practice is to save individual data sets rather than the entire data tree.

5.1.5 Save Selected As

To export data to one of the supported file types, select File > Save Selected As.

5.2 Contingency Test

This utility calculates a chi-square contingency test or Fisher exact test (when using only the 2 x 2 table of observations) using the same
40. erformed by coding one allele as 0 and the other as 1. The TRANSFORM function in TASSEL converts the major allele to 0. All the other alleles are collapsed to a single class and coded as 1. PCA requires that all variables should have variation and should not have missing values. As a result, filtering genotype to eliminate monomorphic markers and imputing missing values may be necessary. Imputing missing values can be done before or after numericalization. Here we demonstrate how to generate PCs from the genotype file in the tutorial data.

1. Remove monomorphic sites: Make sure TASSEL is in Data mode. Highlight the genotype and click Site. Set the minimum frequency to 0.05 and have "Remove minor SNP status" checked. Click Filter.
2. Numericalization: Highlight the filtered genotype and click Transform. Use the default option of "Collapse non major alleles". Click Create data set.
3. Imputation of missing values: Highlight the numerical genotype, click Transform, and then click the Impute tab. Use the default options. Click Create data set.
4. PCA: Highlight the imputed numerical genotype, click Transform, and then click the PCA tab. Change the default option to "Components 3" by choosing Components and typing 3 in the text box. Click Create data set.
41. ge is determined by calculating D' or r² for all possible combinations of alleles, and then weighting them according to the allele's frequency. Note: It is not entirely certain that this procedure fully accounts for allele number effects.

P values are determined by two methods. If only two alleles are present at both loci, then a two-sided Fisher's Exact test is calculated. Note: Previous editions of TASSEL used a one-sided test, but TASSEL version 1.0.8 and later use a two-sided test. If more than two alleles are present, permutations are used to calculate the proportion of permuted gamete distributions that are less probable than the observed gamete distribution under the null hypothesis of independence.

When calculating linkage disequilibrium, users have the option of employing "Rapid Permutations". If this option is selected, the algorithm will compute either a fixed number of permutations or run until 10 permutations are found that are more significant than the observed P value. While this slightly reduces P values, it also saves a large amount of computational time. If an unbiased P value is desired, then the user must unselect the "Rapid Permutations" check box.

Linkage disequilibrium results can be plotted using Results > LD Plot or viewed in a table via Results > Table.

3.3 Cladogram

This function generates a tree or cladogram data set. TASSEL produces neighbor joining trees using only simple parsimony s
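For the simplest case in the linkage disequilibrium discussion above, two loci with two alleles each, D' and r² can be computed directly from the four gamete (haplotype) counts. The sketch below uses the standard textbook definitions and is not taken from TASSEL's source; the function name and argument names are hypothetical, D' is reported as an absolute value, and returning 0.0 when a locus is monomorphic is a choice made only for this illustration.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from the counts of the four gametes at two biallelic loci.

    n_Ab, for example, counts gametes carrying allele A at locus 1 and b at locus 2.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n                      # frequency of the AB gamete
    p_A = (n_AB + n_Ab) / n              # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n              # frequency of allele B at locus 2
    d = p_AB - p_A * p_B                 # raw disequilibrium coefficient D

    # D' rescales D by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical counts: only AB and ab gametes observed (complete disequilibrium).
complete = ld_stats(5, 0, 0, 5)
# Hypothetical counts: all four gametes equally frequent (independence).
independent = ld_stats(25, 25, 25, 25)
```

With more than two alleles per locus, the text describes averaging these pairwise values weighted by allele frequency; that step is not shown here.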
42. he denominator in the Bonferroni correction is the total number of SNPs tested. The association was significant.

The other data added to the data tree is labeled "GLM_Allele Estimates", followed by the name of the joint data. For the most significant SNP at position 6, there were two genotypes (CC and GG). There are 62 lines with genotype CC and 10 lines with genotype GG. For the trait dpoll (days to pollination), the difference between the two homozygotes was 6.63755 days.

[Figure: GLM_Allele Estimates table]

6.5 Association analysis using MLM

Running MLM in TASSEL is similar to running GLM. The difference is that, in addition to the joint data (or numerical data), MLM requires kinship data to define the relationship between individuals. The kinship matrix times a parameter equals the covariance matrix between individuals. Here we use the kinship file from the tutorial data set to fit the following statistical model:

Flowering time = Population structure + Marker effect + Individuals + residual

Individuals and the residual are fit as random effects. The other terms are treated as fixed effects. With respect to the marker effect, we will demonstrate the analysis using two sets of markers. One is the dwarf8 gene sequence used in the GLM tutorial. The other is a set
43. here are several formats for numerical data to fit the requirement for modeling. Trait data (dependent variables) can be imported by starting the first line with "<Trait>" and following that with the trait names. Additional classifiers may also be included in subsequent header rows by starting the row with "<Header name=xxx>", followed by a name for each column of data. For instance, to define environments, start the second header row with "<Header name=env>". Comment lines can be inserted at the beginning of the file as long as each comment line begins with the "#" character.

This format does not require users to provide information on the number of rows and columns. The file starts with the key word "<Trait>" followed by the names of the columns. The column for line should not be labeled.

2.2.7.1 Trait format

Example 1: simple list of trait values
811 59 5
33 16 64 75 64 5 NA
38 11 92 25 68 5 37 897
4226 6515 59 5 32121933
4722 81 13 71 5 32 421
        27 5 62 31 419

Example 2: traits data collected in multiple environments
AT Plantat        Plantit
lt Header name env   1061 Locl Locl 1002
B11 59 5 NA NA
33 16 64 75 121 5 NA
      92 25 15318 37 897 83 4
4226 6515 130 1 32 21933 621
4722 81 13 165 7 32 421 90 1
A188 27 5 110 2 31 419 79 6

2.2.7.2 Covariate Format

Covariate data uses the same format as trait data except that the first line must be "<Covariate>". This line tells TASSEL
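The trait layout described in this chunk (a "<Trait>" header line with trait names, optional "<Header ...>" classifier rows, then one white-space-delimited row per taxon) is simple enough to read with a few lines of Python. The sketch below is an illustration of the layout, not TASSEL's importer: the function names and the sample data are hypothetical, "#" as the comment marker is an assumption, and classifier rows are skipped rather than interpreted.

```python
def to_number(token):
    """Convert a field to float; NA, NaN, or any other text becomes None (missing)."""
    try:
        value = float(token)
    except ValueError:
        return None
    return None if value != value else value  # NaN is the only float unequal to itself


def parse_trait_text(text):
    """Return (trait_names, {taxon: [values]}) for text in the <Trait> layout."""
    # Drop blank lines and comment lines; the "#" marker is an assumption.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    header = lines[0].split()
    if header[0] != "<Trait>":
        raise ValueError("first non-comment line must start with <Trait>")
    trait_names = header[1:]
    rows = {}
    for ln in lines[1:]:
        if ln.startswith("<Header"):
            continue  # classifier rows (e.g. environments) are skipped in this sketch
        fields = ln.split()  # any white space acts as a delimiter
        rows[fields[0]] = [to_number(tok) for tok in fields[1:]]
    return trait_names, rows


# Hypothetical file contents, for illustration only.
sample = "# example\n<Trait>\theight\tdays\nlineA\t59.5\tNA\nlineB\t64.75\t64.5\n"
names, rows = parse_trait_text(sample)
```

Because any white space separates fields, a taxon or trait name containing a blank would be split into two fields, which is the import problem the manual warns about.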
44. Three items will be added to the data tree after running PCA. The first is the PCs, the second is the eigenvalues, and the last is the eigenvectors. Here we use the Chart function in the Result mode to graph the first three PCs, the individual eigenvalue contributions (sometimes called a scree plot), and the cumulative eigenvalue contributions. The eigenvalues are of interest because they equal the variance explained by each of the PCs.

[Figures: PC vs. individual proportion and cumulative proportion; PC 1 vs. PC 2 and PC 3]

6.3 Estimation of Kinship using genetic markers

While PCs can be used to capture major population subdivisions, kinship can be used to capture more subtle relationships. This section shows how to create a kinship matrix based on the same SNP data used to calculate PCs.

1. Remove monomorphic sites: Highlight the genotype and click Site in Data mode. Set the threshold on MAF to 0.05, check "Remove minor SNP status", then click Filter.
2. Estimate kinship: Highlight the filtered genotype and click Kinship in Data mode. A kinship matrix will be added to the data tree under the Matrix category.

[Figure: kinship matrix in the data tree]

6.4 Association analysis using GLM

We use th
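The individual and cumulative eigenvalue contributions mentioned in the PCA discussion above follow directly from the eigenvalues, since each eigenvalue equals the variance explained by its PC. The sketch below shows that arithmetic; the function name and the eigenvalues are hypothetical and are not taken from the tutorial output.

```python
from itertools import accumulate


def variance_proportions(eigenvalues):
    """Return (individual, cumulative) proportions of variance for each PC."""
    total = sum(eigenvalues)
    individual = [ev / total for ev in eigenvalues]
    cumulative = list(accumulate(individual))
    return individual, cumulative


# Hypothetical eigenvalues, sorted from largest to smallest.
individual, cumulative = variance_proportions([2.0, 1.0, 1.0])
```

Plotting the individual proportions against the PC number gives the scree plot, and the cumulative proportions show how many PCs are needed to reach a chosen share of the variance.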
45. inning of the file as long as any comment lines begin with the symbol "#". Columns are tab delimited. Numeric values are allowed but, by default, will be treated as classification variables, not as covariates in analyses.

In some cases, a user may wish to have marker values treated as numerical covariates. If the first line of the file is "<Numeric>", then the data will be imported as numeric data but used as marker data in GLM and MLM.

Example 2   Marker        0

Note to TASSEL 2.1 users: The polymorphism format specified in TASSEL v2.1 is still supported to provide backward compatibility.

2.2.6 Phylip

The Phylip format used by TASSEL version 2.1 will continue to be supported. Details of the Phylip format are described at the following website: http://evolution.genetics.washington.edu/phylip/doc/sequence.html

2.2.7 Numerical data

This type of format is used for trait and covariate data such as population structure. Similar to sequence alignment genotype data, numerical data also consists of two parts: a header that defines data structure and a body containing the main data. Tabs should be used as delimiters. However, any white space character, such as a blank, will be treated as a delimiter as well. As a result, embedded blanks in names will cause data to be imported incorrectly. We suggest representing missing values using "NA" or "NaN". However, any text value (e.g.        ) will be interpreted as missing data. T
46. ist of markers with chromosome and map position and, optionally, physical position. It can be used by GLM and MLM to provide genetic positions in the output files. It is not used as part of the analysis. The input format is:

First line: <Map> (as is, including the brackets)
Following lines: marker name, chromosome name, genetic position, physical position (actual data)

Example:
<Map>
marker1  ...  213  2456873
marker2  ...  521  52345691

There is no header line as such. Marker name, chromosome name, and genetic position are required. Physical position is optional and not used at this time. It is there because it is anticipated that information from this map may be used to convert between physical and genetic position at some time in the future.

2.3 Export
Options are provided to export sequence data as BLOB, Hapmap, Plink, Flapjack, or Phylip (Sequential or Interleaved). Phenotype and covariate data are exported as numerical trait data. Table Reports are exported as tab-delimited tables.

This button has the same function as "Save selected as" on the File menu. For numerical data, the function of Export is similar to the Table function in Results mode.

2.4 Sites
The alignment can be filtered in several ways. Monomorphic sites can be eliminated, and regions of a sequence can be eliminated.

Minimum Count: the minimum number of taxa in which the site must have been scored to be included in the filte
47. ked Questions

1. What do I do if TASSEL misbehaves?

TASSEL is an open source software project hosted on SourceForge and has a bug tracking list at http://sf.net/projects/tassel where you can notify the developer community of problems. In order for a bug to be fixed, we must be able to replicate the problem. Thus, it is important to document the steps that produced the error. If the data you are working with is not too sensitive, please include the files that were used in the faulty procedure. If you would rather not post your data file on SourceForge, you may email it to one of the software developers.

2. Where do I turn for more information?

If you are having difficulty with a certain aspect of TASSEL, you can either email one of the software developers listed at www.maizegenetics.net or check the TASSEL forum on SourceForge (http://sf.net/projects/tassel), as another user may have already addressed a similar question. There is also a TASSEL discussion group at http://groups.google.com/group/tassel.

3. How do I join the fun? TASSEL on SourceForge

TASSEL is an open source project distributed under the GNU General Public License. This means that the source code is available and the user is free to modify the code to suit their particular needs. We welcome input from developers and those who wish to become involved in the improvement of this software. The project is hosted on SourceForge (http://sf.net/proje
48. link

Plink is a whole genome association analysis toolset, which comes with its own text-based data format. The data is stored in a set of two files: a map file and a ped file.

The ped file contains all the SNP values and has six mandatory header columns: Family ID, Individual ID, Paternal ID, Maternal ID, Sex, and Phenotype. TASSEL only requires that the Individual ID field be filled in. Each row of the ped file describes a single germplasm line. Notice that in Plink an unknown character is represented with "0". In TASSEL, however, an unknown character is represented with "N", and "0" is used to represent a heterozygous indel. TASSEL will automatically convert between the "0" and the "N". Any exported Plink files will represent the heterozygous indel with a "+" (insertion) and a "-" (deletion).

The map file describes all the SNPs in the associated ped file, where each row provides information on a single SNP. The map file must contain exactly four columns: Chromosome, rs#, Genetic distance, and Position. TASSEL does not require the Genetic distance field to be filled in.

Both files should be TAB delimited.

For a more detailed description of the data format, please visit the Plink basic usage and data formats webpage (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml).

2.2.4 Flapjack

Flapjack is a software tool for graphical genotyping and haplotype visualization. The program is capable of outputting data in its own text based da
49. m information content (PIC), and "Haplotype PIC". Overall score is essentially an estimate of the ability to design a single base pair extension reaction in the region.

These results can be exported by using a table (Results > Table).

3.5 Kinship
This function generates a kinship matrix from a set of random SNPs. To do so, first highlight the SNP data, then click on the "Analysis" button, followed by the "Kinship" button. The resulting kinship data will be added as a data set on the Data Tree panel.

When a genotype file is selected, the kinship matrix is generated by first using the TASSEL Cladogram function to calculate a distance matrix. Each element d_ij of the distance matrix is equal to the proportion of the SNPs that differ between taxon i and taxon j. The distance matrix is converted to a similarity matrix by subtracting all values from 2, then scaling so that the minimum value in the matrix is 0 and the maximum value is 2. Kinship can be derived from a set of random SNP data (a minimum of several hundred SNPs spread over the whole genome is recommended).

Warning: This method currently works correctly only for homozygous inbred lines. The method will be modified in the near future to work with heterozygous taxa. At that point, this warning will be removed.

Users may also load their own kinship data using Data > Load. Kinship matrices can be calculated using the SPAGeDi software package (http://www.ulb.ac.be/sci
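The kinship calculation described in the excerpt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TASSEL's own code: the function names are hypothetical, every site is assumed to be scored (no missing data) for homozygous inbred lines, and the "scaling" step is read as a linear min-max rescale onto the range 0 to 2.

```python
from typing import List, Sequence


def distance_matrix(genotypes: Sequence[str]) -> List[List[float]]:
    """d_ij = proportion of SNP sites that differ between taxon i and taxon j.

    Assumption: one allele character per site and no missing calls.
    """
    n = len(genotypes)
    n_sites = len(genotypes[0])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = sum(a != b for a, b in zip(genotypes[i], genotypes[j]))
            dist[i][j] = dist[j][i] = diffs / n_sites
    return dist


def kinship_from_distance(dist: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract every distance from 2, then rescale linearly so min = 0, max = 2.

    Assumption: the rescale is (s - min) * 2 / (max - min).
    """
    sim = [[2.0 - d for d in row] for row in dist]
    lo = min(min(row) for row in sim)
    hi = max(max(row) for row in sim)
    if hi == lo:
        raise ValueError("all taxa are identical; the matrix cannot be rescaled")
    scale = 2.0 / (hi - lo)
    return [[(s - lo) * scale for s in row] for row in sim]


# Three hypothetical inbred lines scored at four SNP sites.
lines = ["AACC", "AACT", "GGTT"]
kinship = kinship_from_distance(distance_matrix(lines))
for row in kinship:
    print(row)
```

With this toy input the most distant pair of lines receives a kinship of 0 and each line's kinship with itself is 2, which matches the range stated in the text.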
50. n the JAR file (sTASSEL.jar). Alternatively, from a command prompt (in Windows, go to Start > Run and type in "cmd" or "command"), change into the tassel3.0_standalone directory and execute this command:

start_tassel.bat (for Windows)
start_tassel.pl (for UNIX)

1.1.3 Open source code
Open source code for the TASSEL software package is available at http://sourceforge.net/projects/tassel.

The package uses a number of other libraries that are included in the TASSEL distribution. These include a modified version of the PAL library (http://www.cebl.auckland.ac.nz/pal-project/), the COLT library (http://dsd.lbl.gov/~hoschek/colt/), and jFreeChart (http://www.jfree.org/jfreechart/). GDPC middleware (http://www.maizegenetics.net/gdpc) provides database access.

1.2 Panels
TASSEL is organized into five main panels:
(1) The Control Panel at the top contains menus and buttons to control functions.
(2) The Data Tree Panel is located beneath the Control Panel on the left side. This panel organizes data sets and results. Data sets displayed in the Data Tree Panel must first be selected before a desired function or analysis can be performed. To select multiple data sets, press the CTRL key while selecting the data sets.
(3) The Report Panel is located below the Data Tree Panel. It displays information about a selected data set from the Data Tree Panel, such as the type of data and how it was created.
(4) The Progress Monitoring Panel below the Repor
51. named P3D by Zhang et al. and EMMAX by Kang et al.

TASSEL was designed for a wide range of users, including those not expert in statistics or computer science. A GWAS using the mixed linear model method to incorporate information about population structure and cryptic relationships can be performed in a few steps by clicking on the proper choices in a graphic interface. All the processes necessary for the analysis are performed automatically, including importing phenotypic and genotype data, imputing missing data (phenotype or genotype), filtering markers on minor allele frequency, generating principal components and a kinship matrix to represent population structure and cryptic relationships, optimizing the compression level, and performing GWAS.

The command line version of TASSEL, called the Pipeline, lets users program tasks using a script instead of the graphic user interface (GUI). This feature allows researchers to define tasks using a few lines of code and makes it possible to use TASSEL as part of an analysis pipeline or to perform simulation studies.

Due to the increasing availability of open data sources, TASSEL utilizes a data browser from the Genomic Diversity and Phenotype Connection (GDPC) project to provide an interface to relational databases. As a result, TASSEL users can access any data source that provides a GDPC service. Using this middleware, which provides a common graphical interface, TASSEL user
52. nd click right button and select "Install Shortcuts".

9. Why do I get empty squares in MLM association analysis?

An empty square means null information. The major reasons include non-convergence in the estimation of variance components or that the statistic in question was not calculated. For example, marker F and R² are not calculated when no marker is included in the model.

10. Why should I exclude one column of the population structure?

For some methods of calculating population structure, such as the software STRUCTURE, the population proportions sum to one. This produces linear dependence between the population covariates. While the algorithm used by GLM tolerates that dependency, MLM will fail because the design matrix will not be invertible. Excluding one column eliminates linear dependence between columns. Using PC axes to represent population structure does not result in linear dependency because all PC columns are guaranteed to be independent.

11. Can kinship replace population structure?

Sometimes. For some traits and populations, the K-only model may be as good as or better than the Q+K model. For others, Q+K may be superior. The Q-only model is not as effective for controlling population structure as the alternatives. Unfortunately, no general guidelines exist for predicting which model will perform best. As a result, an investigator may wish to fit all three models and compare the results. If eliminating false po
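The linear dependence described in question 10 above can be checked numerically. The sketch below is illustrative only: the Q values are made up, the helper names are hypothetical, and the model is assumed to include an intercept column. It computes the rank of the design matrix with exact fractions, once with all three population columns and once with the last column dropped.

```python
from fractions import Fraction
from typing import List, Optional, Sequence


def matrix_rank(rows: Sequence[Sequence[Fraction]]) -> int:
    """Rank by Gaussian elimination; exact because the entries are Fractions."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot: Optional[int] = next(
            (r for r in range(rank, n_rows) if m[r][col] != 0), None
        )
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def design_matrix(q_rows: Sequence[Sequence[str]], keep: int) -> List[List[Fraction]]:
    """Intercept column followed by the first `keep` population columns."""
    return [[Fraction(1)] + [Fraction(v) for v in row[:keep]] for row in q_rows]


# Hypothetical population proportions (Q1, Q2, Q3); each row sums to one.
Q = [
    ("0.7", "0.2", "0.1"),
    ("0.1", "0.8", "0.1"),
    ("0.3", "0.3", "0.4"),
    ("0.5", "0.25", "0.25"),
    ("0.2", "0.2", "0.6"),
]

full = design_matrix(Q, keep=3)      # intercept + Q1 + Q2 + Q3
reduced = design_matrix(Q, keep=2)   # intercept + Q1 + Q2

print(matrix_rank(full), "of", len(full[0]), "columns")        # 3 of 4 columns
print(matrix_rank(reduced), "of", len(reduced[0]), "columns")  # 3 of 3 columns
```

With all three columns the rank is one less than the number of columns, so the matrix is not invertible; after dropping Q3 the remaining columns are independent.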
53. ne, or genetic position within the chromosome. At the bottom of the graph is a display of the position of each site along the gene or chromosome. This display can be hidden by deselecting the "Schematic" checkbox. Legends describing the color scheme appear on the right hand side of the graph.

[Figure: Linkage Disequilibrium plot]

LD plots can be printed, saved in JPEG format, or saved as a Scalable Vector Graphics (SVG) file. An SVG file is useful for creating publication quality graphics which can be easily sized using an editor such as Adobe Illustrator, Corel Draw, or OpenOffice.org Draw 2.0.

4.5 Chart
Chart provides a variety of graphs for visualizing numeric data.

This feature can be used to display histograms, XY plots, bar charts, and/or pie charts. Any numeric table data can be charted, including LD results, phenotypic data, diversity results, and association results.

Histograms: Use the graph type combo box to select the desired graph type (Histogram) from the list of options. Up to two different series of data can be plotted together. Users may specify the number of bins to be used in the histogram.

[Figure: DPOLL HOMESTEAD ID1 & DPOLL CLAYTON ID15 Distribution]

Scatter plots:
54. of the body depends on the type of BLOB and on the amount of data being stored.

For a more detailed description of the structure and information contained within the header and body, refer to the GDPDM BLOB Specifications (http://www.maizegenetics.net/gdpdm/docs/20100526/GDPDMBlobSpecification20100526.pdf).

2.2.2 Hapmap

Hapmap is a text-based file format for storing sequence data. All the information for a series of SNPs as well as the germplasm lines is stored in one file. The first row contains the header labels, and each additional row contains all the information associated with a single SNP. The first 11 columns describe attributes of the SNP, while each of the following columns describes the SNP value for a single germplasm line. The first 12 columns of the first row should look like this, where "Line 1" is the beginning of the germplasm line names:

rs#  alleles  chrom  pos  strand  assembly#  center  protLSID  assayLSID  panelLSID  QCcode  Line 1

While all 11 header columns are required, not all 11 of the columns need to be filled in for TASSEL to correctly interpret the data. The only required fields are "chrom" (Chromosome name) and "pos" (Position).

For TASSEL to correctly read Hapmap data, the data must be in order of chromosome and position within each chromosome, and the file should be TAB delimited. If some of the data is missing, the correct number of tabs must still be present so that TASSEL can properly assign data to columns.

2.2.3 P
55. ormat
1  d8_sequence.phy  Genotype  Phylip Alignment
2  mdp_genotype.hmp.txt  Genotype  Hapmap Alignment
3  mdp_genotype.flpjk.geno  Genotype  Flapjack Alignment
4  mdp_genotype.flpjk.map
5  mdp_genotype.plk.ped  Genotype  Plink Alignment
6  mdp_genotype.plk.map
7  mdp_kinship.txt  Kinship
8  mdp_population_structure.txt  Population structure
9  mdp_traits.txt  Phenotype  Numerical trait data

File 1 is the sequence of the dwarf8 gene with 2466 sites on 91 maize inbred lines. The data was described by the paper on the association between Dwarf8 and flowering time.

Files 2-6 are 3093 SNPs on 281 maize association inbred lines. The data was presented in three formats (Hapmap, Plink, and Flapjack). The data was created by the PANZEA project funded by NSF. Details of the data can be found at http://www.panzea.org.

Files 3 and 4 are the pair for the Flapjack format.
Files 5 and 6 are the pair for the Plink format.
File 7 is the kinship created by Yu et al.
File 8 is the population structure of 282 maize inbred lines.
File 9 is the phenotype on three traits, including flowering time, on 282 maize inbred lines.

7.3 Biography of TASSEL

Dates: 2001; December 2004; February 2005; March 2005; April 2005; June 2005; October 2005; January 2006; March 2006; September 2006; October 2006; September 2007; April 2008; June 2008
Entries: First public release; Scoreable SNP Extractor; Updated Main Panel; StepClade update; Fixed
56. predict phenotypes from genotypes. It is one of the methods used for genomic selection (GS).

The input dataset must contain one or more phenotypes and numeric marker data. Optionally, it may also contain factors and covariates. The analysis is run by selecting the input dataset, then clicking the "GS" button. Because no additional user input is needed, the analysis will run immediately after the button is clicked. All traits will be analyzed separately using all of the genotypes, factors, and covariates in the dataset. The output will consist of two new datasets for each trait. One of the datasets will contain genomic estimated breeding values (GEBVs) for each taxon and the other will contain BLUPs for each marker in the genotype file. The output datasets will appear in the "Numerical" folder, which holds the input data as well. The output datasets can in turn be used for subsequent analysis. For example, the output could be joined with the input data so that the predicted values could be graphed against the original values.

Understanding the input data requirements is important to ensure that the results of the analysis will be correct and useful. Genotypes must be numeric with one column for each marker. It is expected that the markers are bi-allelic, with the homozygotes coded as 1 and -1 and the heterozygotes coded as 0. However, any reasonable coding scheme will work. For instance, missing data could be replaced by a probability resulting from imputa
57. red data set. GAP or missing data do not count.
Minimum Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the filtered data set.
Start Position, End Position: establishes the range of sites for filtering.
Extract Indels: if selected, indels are extracted from the alignment. If not selected, only point substitutions are extracted.
Remove minor SNP states: converts tertiary and rarer states to missing data ("?"), thereby forcing sites to have only two types of segregating sites at a locus. This may help remove sequencing errors.
Generate haplotypes via sliding window: creates haplotypes from an ordered set of SNPs.

2.5 Taxa

Select either genotypic, phenotypic, or population structure data from the data tree. The resulting dialog box displays the selected data in table format. By using either the CTRL or SHIFT key in conjunction with the mouse, the user can select or deselect taxa rows. Once the desired taxa have been selected, the "Capture Selected" or "Capture Unselected" buttons will create a new data set containing only the captured taxa.

2.6 Traits

Clicking the "Traits" button on the "Data" toolbar launches the Trait Filter dialog. This dialog is used with numerical data sets to (1) change the trait type, (2) view, but not change, whether the trait is discrete or continuou
58. ree files from the tutorial data set to perform association analysis using the GLM. The first file is the dwarf8 gene sequence with 2466 sites on 91 maize inbred lines. The second one is the population structure of 282 maize inbred lines. The last one is phenotypes for three traits for 282 maize inbred lines. The statistical model is:

Flowering time = Population structure + Marker effect + residual

1. Remove monomorphic sites. Highlight the genotype and click Site in Data mode. Set the threshold on MAF to 0.05, then click Filter.
2. Trait selection. Highlight the phenotype and click Trait in Data mode. Uncheck all the traits except flowering time (DPOLL). Make sure that the Type is set to Data. Click OK to create a filtered phenotype.
3. Covariate selection. The population structure is presented as the proportion of each population. There are three populations, represented as Q1, Q2, and Q3. They sum to 100%. This creates linear dependency if we use all of them as covariates. We can eliminate the dependency by removing one of them. In this demonstration, we exclude the last one. Highlight the population structure data set and click Trait in Data mode. Uncheck the last population (Q3). Make sure that the Type is set to Covariate. Then click OK to create a filtered population structure data set.
4. Joining data. Highlight the three filtered data sets by holding the Control key while selecting the individual data sets. Then click Inters
59. s and (3) drop one or more traits from the data set. In addition, the dialog can be used to view the trait properties without changing them. If the "OK" button is clicked, a new data set is created that incorporates the changes, the original data set remains unchanged, and the dialog closes. If the "Cancel" button is clicked, no data set is created, the original data set remains unchanged, and the dialog closes. Allowable trait types are data, covariate, factor, and marker. Generally, data and covariate traits will be continuous, not discrete, and factor will be discrete. Markers in a numerical data set will be continuous. Discrete valued markers are better imported as sequence or polymorphisms.

Clicking "Exclude All" unchecks the "Include" box for all traits. Clicking "Include All" checks the "Include" box for all traits. The "Exclude Selected" and "Include Selected" buttons do the same thing for traits that have been highlighted by selecting them with the mouse.

Important: Once a numerical data set has been joined with genotypes, it can no longer be modified using the trait filter function.

2.7 Impute SNPs

This function is used to impute missing genotypes. A sequence data type is required to use the function.

This suite of functions allows multiple data manipulations on genotype and phenotype (numerical) data. When a genotype data set is selected, the data are transformed to numbers. When a numerical d
60. s can avoid writing SQL queries to access data. Currently, GDPC provides connections to Panzea, Gramene, Germinate, and GRIN (USDA's Germplasm Resources Information Network).

TASSEL is written in Java, thereby enabling its use with virtually any operating system. It can be installed using Java Web Start technology by simply clicking on a link at www.maizegenetics.net/tassel. A stand-alone version of TASSEL can also be downloaded to use in pipeline mode or in any situation where the user wishes to start the software from a command line.

1 Getting Started

A quick way to get started using TASSEL is to load the tutorial data and try performing analyses. However, because some of the necessary steps may not be intuitive, we recommend that new users follow the tutorial at the end of this manual. The objective of this section is to provide the information necessary to install and start TASSEL software and to provide a brief overview of the interface.

Most functions are organized into three modes: Data, Analysis, and Results, which correspond to the first three buttons on the TASSEL interface as shown below. Clicking one of these buttons changes the functions represented by the second row of buttons. Those three modes are described in detail in the subsequent sections of this manual. The screen shot shows TASSEL after the tutorial files have been loaded.

[Figure: TASSEL interface with the tutorial files loaded]
61. s can either save this genotype data in several formats or upload it to TASSEL. However, before outlining these procedures, let us finish the query by exploring phenotypes. To get data from experiments conducted in 2000, first select the Environment Experiments tab, followed by the Repetition checkbox.

Select the desired repetitions in 2000 as the values to be used for filtering, then click the Get Data button. The subset of data that meets these criteria is returned as follows:

[Figure: Environment Experiments filtered by repetition]

Now extract phenotype data by clicking on the Phenotypes tab. Traits can only be extracted one at a time. Choose Days to Silk from the Ontology field. Make sure no Taxa are selected and that all Environment Experiments retrieved in the previous step are selected. Click the Get Data button, then the Merge button, leaving only Accession checked under the Taxa Properties section. Leave Locality and Repetition checked under the Environment Experiments Properties section. Data are merged as follows:

[Figure: merged phenotype data]

6.6.3 Importing GDPC data into TASSEL
Genotype and phenotype data must be loaded in separate steps. To load genotype data, first click on GDPC in Data mode. Then click on the Genotypes tab, followed by the Load button. The genotype data is then loaded into TASSEL and labeled as "Genotype". To view the uploaded data, click on G
62. sitives is very important, then it may make sense to accept the most conservative model. However, if the objective is to identify candidates for further study and the cost of following up on a false lead is low, the most liberal model may be preferred.

12. Why do TASSEL and SPAGeDi give different kinship estimates?
A. First, many algorithms exist to calculate kinship and their estimates will differ from one another. Second, the algorithm in TASSEL treats each genotype as a haplotype. It is not recommended that TASSEL be used to generate a kinship matrix from heterozygous genotypes. In the near future, the TASSEL kinship algorithm will be modified to handle heterozygous diploids.

13. Can I get Marker R-square using SAS Proc Mixed or TASSEL MLM?
SAS Proc Mixed does not produce an R² statistic. MLM in TASSEL does. The user manual describes how it is calculated.

14. Does MLM find more associations than GLM?
Sometimes MLM has higher statistical power than GLM and may detect more true associations. When the tested genetic markers are confounded with kinship structure, GLM does not correct for that as effectively as MLM and may produce more false positives.

15. Do I need multiple test correction for the p-value from TASSEL?
A. Yes.

16. Can TASSEL handle diploid genotype data?
A. While TASSEL accepts most common sequence alignment formats which handle polyploid genotype data, including haploid and diploid, some analyses
63. t Panel shows the progress of running tasks and has buttons that can be used to cancel tasks.
(5) The Main Panel occupies the right side of the viewing area. It displays the content of a selected data set from the Data Tree Panel.

Functions in TASSEL are accessed by buttons and menus on the Control Panel.

The three buttons on the top left are the Mode Selectors (Data, Analysis, and Results). The buttons below the Mode Selectors change when a new Mode Selector is clicked. The modes are described in sections 2-4. To the right of the Mode Selectors are the Progress Bar and the Delete, Print, Save, and Help buttons.

2 Data Mode

Data mode serves the purpose of importing and managing data. Data mode is the default mode when TASSEL starts. Click on the Data button to switch to this mode.

TASSEL has two ways of importing data. One way is via GDPC to import data from databases. The other way is via flat files formatted as genotypes (e.g. hapmap, flapjack, and plink), phenotypes (trait data), population structure, and kinship matrices.

The preliminary data manipulations include filtering data by site or taxa, joining data, and data transformation.

2.1 GDPC

Genotype and phenotype data generated from numerous genomic research projects are still valuable resources for the public, even after results are published. Some of these data have been migrated to several databases and can be accessed using Genomic Diversity and Phenotype Conne
64. ta format. Like Plink, the data is stored in a set of two files: a map file and a .geno genotype file.

The genotype file contains all the SNP values. Each column in the first row contains a SNP ID, except for the first column, which is blank. The first column of the following rows contains the germplasm line names. TASSEL requires that all fields be filled out in order for data to be read correctly.

The map file describes all the SNPs associated with the genotype file. Each row describes a single SNP. There are three columns in the map file for Flapjack: SNP ID, Chromosome, and Position, all of which are required for TASSEL to run correctly.

Both files should be TAB delimited.

For a more detailed description of the Flapjack data file format, please visit the Flapjack data import website (http://bioinf ... DataImportDialog.shtml).

2.2.5 Polymorphism

A general format that accepts almost any type of marker data can also be used. Any alphanumeric character is allowed. Diploid data can be represented by separating alleles with a colon, for example B:B. All loci in a file must have the same ploidy level. The first line starts with the symbol <Marker> followed by the marker names. Subsequent lines must start with the name of the individual or taxon genotyped, followed by the marker scores in the same order as the header. Comments can be inserted at the beg
65. tes the degree of similarity between names, using the name from the first set which is most similar to that in the second data set.

When using the Synonymizer, keep in mind that the order of selection matters. Always select the data set with the names you wish to use (the "real" name) first, and then, while holding down the CTRL key, click on the second data set with the taxa names you wish to change (the "synonym"). Then click on the Synonymizer button. A synonym data set will be placed on the Data Tree panel under Synonyms. Each name in the data set selected second is now listed in the TaxaSynonym column. Next to this column is a TaxaRealName column listing the highest scoring match derived from the "real" name data set. The MatchScore column gives an indication of the amount of similarity between the two names, where 0 is no similarity and 1.0 is identity.

Caution: Before the synonyms are applied, we strongly encourage the user to check the match scores, especially for those taxa with low match scores. To do that, the user selects the synonym file and clicks the "Synonymizer" button. The incorrect matches (usually the ones with the lowest match scores) can be rejected at this point. Sorting on the match score column first makes this a fairly easy process.

In the event that some of the taxa are not interpreted correctly, matches can be modified manually. Select the taxa you wish to modify on the left side, and then choose a replacement t
66. the Link/Unlink button. The button will now appear as [icon]. This activates the Add selected items, Add all items, Remove, and Remove all buttons. Remove all items from the working list, then select items with a name starting with the letter D. Click on the Add selected items button to move them to the Working List. The resulting Working List is shown as follows:

[Figure: Working List]

To filter data by polymorphism type, first click on the Genotype Experiments tab, check the Polymorphism Type and Producer checkboxes (fields), and then select SNP and Jim. Finally, click the Get Data button to reveal the subset of data that meets these criteria. Results for this example are shown below:

[Figure: Genotype Experiments filtered by polymorphism type and producer]

Genotype data can be extracted from the database by clicking on the Genotypes tab, followed by the Get Data button. After a moment, genotype data will be displayed as follows:

[Figure: extracted genotype data]

User
67. tion. If any genotype data is missing, it will be imputed as the average of the marker scores across all taxa for that marker. If a user prefers to use a different method of imputation, then the missing genotypes must be imputed before importing the data into TASSEL.

GEBVs will be calculated for all taxa in the dataset, including any lines that have missing phenotype data. A typical use of genomic selection is to predict GEBVs for a set of unphenotyped lines based on the performance of a training set. To do that, a dataset containing both the genotypes to be predicted and the genotypes of the training set can be joined with a dataset containing the phenotypes of the training set using a union join. All taxa in the phenotype set should have genotypes. If an individual without genotype data is included, all the marker data for that individual will be imputed, which is not a generally useful thing to do.

4 Results Mode

Results mode consists of the functions to present data as tables or graphics.

4.1 Table
Allows data to be displayed in a spreadsheet view and exported into a flat file.

To create a table, select a data set from the Data Tree panel, then click on the "Results" button followed by the "Table" button (Results > Table). Shown below is an example in which diversity estimates are displayed.

[Figure: Diversity estimates table]
68. ubstitution models. To retrieve cladogram data, first select genotypic data from the Data Tree panel and then click on the "Analysis" button, followed by the "Cladogram" button. The resulting tree data and the corresponding matrix will appear as separate data sets on the Data Tree panel.

Results can be plotted using Results > Tree Plot.

3.4 SNP Extract

SNP Extract extracts SNPs from a raw sequence alignment into a useful format for export. Additionally, this function provides information for designing genotyping assays.

Below is a detailed explanation of the SNP Extractor Dialog.

Minimum Site Frequency: the minimum frequency for which the site must have a good base.
Minimum SNP Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the resulting data set.
Minimum Surrounding Bases: the minimum number of good bases on at least one side of the SNP.
Minimum Good SBE Bases: the minimum number of good bases on at least one side of the SNP.
Filter SNPs to Biallelic: converts tertiary and rarer states to missing data, thereby forcing each site to have only two segregating states at any particular locus. This helps to remove bad sequence effects.

Results are displayed on the Data Tree panel and include SNPs along with their context. Additional information is also provided, including: the location of the nearest polymorphisms on either side, polymorphis
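As a rough illustration of two of the dialog options above (Minimum SNP Frequency and Filter SNPs to Biallelic), the following Python sketch filters a single alignment column. It is not TASSEL's implementation. The function name `filter_site`, the `MISSING` symbol, and the reading of "minority polymorphism frequency" as the frequency of the second most common state are all assumptions made for the example.

```python
from collections import Counter

MISSING = "N"  # assumed missing-data symbol for this sketch


def filter_site(calls, min_snp_freq=0.05, biallelic=True):
    """Apply a minimum SNP frequency and an optional biallelic filter.

    calls: list of single-character base calls for one site, one per taxon.
    Returns the (possibly modified) list of calls, or None if the site is
    invariant or its second most common state is rarer than min_snp_freq.
    """
    counts = Counter(c for c in calls if c != MISSING)
    ranked = [state for state, _ in counts.most_common()]
    if len(ranked) < 2:
        return None  # invariant site, not a SNP

    total = sum(counts.values())
    if counts[ranked[1]] / total < min_snp_freq:
        return None  # minority state too rare

    if biallelic:
        # Keep the two most common states; third and rarer states become missing.
        keep = set(ranked[:2])
        return [c if c in keep else MISSING for c in calls]
    return list(calls)


site = list("AAAAGGGT")
filtered = filter_site(site)
```

In this example the single T call is converted to missing data, leaving A and G as the only segregating states.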
69. uce a sliding window, check the box next to "Sliding Window", and then enter the desired step size and size of the sliding window.

Results can be plotted using Results > Chart or viewed in a table via Results > Table.

3.2 Linkage Disequilibrium

This button generates a linkage disequilibrium data set from SNP data.

NOTE: It is important to use only filtered data sets (apply Data > Sites first) when estimating linkage disequilibrium, as a raw alignment with numerous invariant bases will take a very long time and consume a large amount of memory to calculate.

Linkage disequilibrium between any set of polymorphisms can be estimated by clicking on a filtered set of polymorphisms and then using Analysis > Linkage Disequilibrium. At this time, D', r² and P-values will be estimated. The current version calculates LD between haplotypes with known phase only (unphased diploid genotypes are not supported; see PowerMarker or Arlequin for genotype support).

D' is the standardized disequilibrium coefficient, a useful statistic for determining whether recombination or homoplasy has occurred between a pair of alleles.

r² represents the correlation between alleles at two loci, which is informative for evaluating the resolution of association approaches.

D' and r² can be calculated when only two alleles are present. If multiple alleles are present, a weighted average of D' or r² is calculated between the two loci. This weighted avera
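For the two-allele case described above, D' and r² can be computed from phased haplotypes using the standard textbook definitions. The Python sketch below shows that calculation. It is an illustration only, not TASSEL's code: the function name `ld_stats` and the input layout (a list of two-locus haplotypes) are assumptions, D' is reported as an absolute value, and neither P-values nor the multi-allele weighted average are covered.

```python
def ld_stats(haplotypes):
    """Return (D', r^2) for two biallelic loci from phased haplotypes.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples.
    Uses D = p_AB - p_A * p_B, with the first haplotype's alleles
    taken as the reference alleles A and B.
    """
    n = len(haplotypes)
    ref_a, ref_b = haplotypes[0]

    p_a = sum(1 for a, _ in haplotypes if a == ref_a) / n
    p_b = sum(1 for _, b in haplotypes if b == ref_b) / n
    p_ab = sum(1 for a, b in haplotypes if a == ref_a and b == ref_b) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")

    d = p_ab - p_a * p_b

    # The maximum possible |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))

    d_prime = abs(d) / d_max
    r_squared = d * d / denom
    return d_prime, r_squared


# Complete association: only two haplotypes are observed.
complete = ld_stats([("A", "C"), ("A", "C"), ("G", "T"), ("G", "T")])

# Linkage equilibrium: all four haplotypes are equally frequent.
equilibrium = ld_stats([("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")])
```

Invariant sites make the r² denominator zero, which is one reason the note above recommends filtering sites before estimating linkage disequilibrium.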
70. www.maizegenetics.net/gdpc

Table of Contents

INTRODUCTION
1 GETTING STARTED
1.1 INSTALLATION
1.1.1 WEB START
1.1.2 STAND ALONE
1.1.3 OPEN SOURCE CODE
1.2 PANELS
2 DATA MODE
2.2 LOAD
2.2.1 BLOB
2.2.2 HAPMAP
2.2.3 PLINK
2.2.4 FLAPJACK
2.2.5 POLYMORPHISM
2.2.7 NUMERICAL DATA
2.2.8 SQUARE NUMERICAL MATRIX
2.2.9 GENETIC
2.3 EXPORT
2.4 SITES
2.5 TAXA
2.6 TRAITS
2.7 IMPUTE SNPS
2.8 TRANSFORM
2.8.1 GENOTYPE NUMERICALIZATION
2.8.2 TRANSFORM AND/OR STANDARDIZE DATA
2.8.3 IMPUTE PHENOTYPE
2.8.4 PCA
2.9 SYNONYMIZER
2.10 UNION JOIN
2.11 INTERSECTION JOIN
3 ANALYSIS
3.1 DIVERSITY
3.2 LINKAGE DISEQUILIBRIUM
3.3 CLADOGRAM
3.4 SNP EXTRACT
3.5 KINSHIP
3.6 GENERAL LINEAR MODEL (GLM)
4 RESULT MODE
4.1 TABLE
4.2 TREE PLOT
5.1 FILE MENU
5.1.1 SAVE DATA TREE
5.1.2 OPEN DATA TREE
5.1.3 SAVE DATA TREE AS
5.1.4 OPEN DATA TREE
5.1.5 SAVE SELECTED AS
5.2 CONTINGENCY TEST
5.3 PREFERENCES
6 TUTORIAL
6.1 MISSING PHENOTYPE IMPUTATION
6.2 PRINCIPAL COMPONENT
    