Running FaST-LMM

1. IND3 INDO 1 0 0 5 05 025 IND1 0 5 13 0 05 0 5 INDA 0425 Ugo 1 0 0 5

SNP dosages are specified using a .dat file.

Example dosage file:

SNP     A1  A2  Fam1 Ind1   Fam1 Ind2   Fam2 Ind3
rs0001  A   C   0.98 0.02   1.00 0.00   0.00 0.01
rs0002  G   A   0.00 1.00   0.00 0.00   0.99 0.01

This file represents data for two SNPs on three individuals. The first three columns list the SNP, the first nucleotide, and the second nucleotide. The minor allele is coded A1 and the major allele is coded A2. Each genotype is represented by two numbers. Here, the two numbers for the first SNP represent the probability of an A/A and then an A/C genotype. The probability of a C/C genotype is 1 minus the sum of these.

The header row is optional, but if used it must start with "SNP A1 A2" and have a FamilyId IndividualId pair for each genotype probability pair. If there is no header, the genotype entries must be in the same order as found in the .fam file. Dosage files typically do not contain missing data, but "-9 -9" may be used to specify a missing entry.

To use a dosage file, replace the -file and -fileSim options with -dosage and -dosageSim, respectively. In addition to the .dat file, a .fam file is required. The entries in the .dat file must correspond to entries in the .fam file. A .map file is optional and will fill out the additional SNP location information.

Running FaST-LMM

Once you have prepared the files in the proper format, you can run FaST-LMM. Here is a sample call on the s
2. read directly from this file.

-simOut <filename>: specifies that genetic similarities are to be written to this file.

-linreg: specifies that linear regression will be performed. When this option is used, no genetic similarities should be specified.

-covar <filename>: optional file containing the covariates.

-missingPhenotype <dbl>: identifier for missing values. If the phenotype for an individual is missing, then the individual is ignored. If a covariate value for an individual is missing, then it is mean-imputed. Default: -9.

-out <filename>: the name of the output file. Default value is [basefilename].out.txt.

-simLearnType [Full|Once]: if set to Once (the default), then delta (the ratio of residual to genetic covariance) is optimized only for the null model and used for each alternative model. If set to Full, then the ratio is re-estimated for each alternative model.

-simType [RRM|COVARIANCE]: if set to RRM (the default), then the RRM is used for genetic similarity. If set to COVARIANCE, then the empirical SNP covariance matrix is used.

-ML: use maximum likelihood parameter learning (default is ML with the likelihood ratio test).

-REML: use restricted maximum likelihood parameter learning (default: ML). REML will automatically invoke the F-test.

-REGSE: use the F-test with ML or REML.

-brentStarts <int>: number of interval boundary points for the optimization of delta (see Section 2.1 of the Supplemental Information).
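The handling of missing values described for -missingPhenotype above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not FaST-LMM's actual code: the function names, the dictionary layout keyed by (familyID, individualID), and the choice to average over all individuals with an observed value are assumptions made for the example.

```python
MISSING = -9.0  # assumed missing-value token (the default given in the text)


def drop_missing_phenotypes(pheno, missing=MISSING):
    """Return only the individuals whose phenotype value is present."""
    return {key: value for key, value in pheno.items() if value != missing}


def mean_impute(covariates, missing=MISSING):
    """Replace each missing covariate entry by the mean of the observed
    entries in the same column (hypothetical helper for illustration)."""
    keys = list(covariates)
    n_cols = len(covariates[keys[0]]) if keys else 0
    means = []
    for j in range(n_cols):
        observed = [covariates[k][j] for k in keys if covariates[k][j] != missing]
        # A column with no observed values is filled with 0.0 (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return {
        key: [means[j] if value == missing else value for j, value in enumerate(row)]
        for key, row in covariates.items()
    }
```

With a phenotype dictionary such as {("1", "IND0"): 2.0, ("1", "IND2"): -9.0}, the first helper keeps only IND0; the second fills each missing covariate with its column mean.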
3. Default: 100.

-brentMaxIter <int>: maximum number of iterations per interval for the optimization of delta. Default: 1e5.

-brentMinLogVal <double>: lower interval threshold for log(delta) optimization. Default: -10.

-brentMaxLogVal <double>: upper interval threshold for log(delta) optimization. Default: 10.

-brentTol <double>: convergence tolerance of Brent's method used to optimize delta. Default: 1e-6.

-runGwasType [RUN|NORUN]: run GWAS, or exit after computing the spectral decomposition of the genetic similarity matrix. Use NORUN to cache the spectral decomposition. This option, in combination with the next, is useful for parallelizing the tests of many SNPs. Default: RUN.

-eigen <directoryname>: load the spectral decomposition object from the directory name. The computations leading to the spectral decomposition of the genetic similarity matrix are skipped (note that the SNP file specifying the genetic similarities must still be given).

-eigenOut <directoryname>: save the spectral decomposition object to the directory name. Can be used with the -runGwasType option.

-numjobs <int>: partition the SNPs into <int> groups and run FaSTLMM on the partition specified by -thisjob.

-thisjob <int>: specifies which partition of SNPs (created by -numjobs) to process with FaSTLMM.

-extract <filename>: this is a SNP filter option. FaSTLMM will only analyze the SNPs explicitly listed in the filen
4. INK file.

Phenotype (under -verboseOut): the name of the phenotype as specified in the header of the phenotype file. NoName means that no header row was specified.
Pvalue: the p-value computed for the SNP tested.
Qvalue: the q-value computed for the SNP tested, estimated from the p-values of all test SNPs in the PLINK file using the procedure of Benjamini and Hochberg.
N: the sample size, i.e. the number of individuals used for this analysis.
NumSNPsExcluded (under -excludeByGeneticDistance)
IndexExclusionStart (under -excludeByGeneticDistance)
DOF (under -verboseOut): the degrees of freedom of the statistical test.
NullLogLike: the log likelihood of the null model.
AltLogLike: the log likelihood of the alternative model.
SnpWeight: the fixed-effect weight of the SNP.
SnpWeightSE: the standard error of the SnpWeight.
WaldStat: the Wald statistic of the SnpWeight.
NullLogDelta: the ratio between the residual variance and the genetic variance, δ = σ²e/σ²g, on the null model.
NullGeneticVar: the genetic variance σ²g on the null model.
NullResidualVar: the residual variance σ²e on the null model.
NullBias: the offset term in the null model.
LogDelta (under -verboseOut): the ratio between the residual variance and the genetic variance, δ = σ²e/σ²g, on the alternative model.
GeneticVar (under -verboseOut): the genetic variance σ²g on the alternative model.
ResidualVar (under -verboseOut): the residual variance σ²e on the alter
5. User Manual

FaST-LMM: Factored Spectrally Transformed Linear Mixed Models
Version 1.08
Microsoft Research
March 13, 2012

Introduction

FaST-LMM, which stands for Factored Spectrally Transformed Linear Mixed Models, is a program for performing genome-wide association studies (GWAS) on large data sets. It runs on both Windows and Linux systems, and has been tested on data sets with over 120,000 individuals.

This software is available as open source under the Apache License, version 2.0, at http://mscompbio.codeplex.com. A copy of the Apache License can also be found in the root of the project in the file LICENSE.TXT.

For help with the software, please contact:
Christoph Lippert (christoph.a.lippert@gmail.com)
Jennifer Listgarten (jennl@microsoft.com)
Carl Kadie (carlk@microsoft.com)
Bob Davidson (bobd@microsoft.com)
David Heckerman (heckerma@microsoft.com)

Citing FaST-LMM

If you use FaST-LMM in any published work, please cite both the software (using the link http://mscompbio.codeplex.com) and the manuscript describing it:

C. Lippert, J. Listgarten, Y. Liu, C.M. Kadie, R.I. Davidson, and D. Heckerman. FaST Linear Mixed Models for Genome-Wide Association Studies. Nature Methods, published online 4 Sep 2011 (doi:10.1038/nmeth.1681).

Also, we would appreciate it if you let us know that you are citing it.

Installing FaST-LMM

FaST-LMM is available as a zip file that extracts to these directories:

fastlmm\Bin co
6. ame (no header, one SNP per line), where the SNP is indicated by the rs or SNP identifier.

-extractSim <filename>: this is a genetic similarity SNP filter option. FaSTLMM will only use SNPs explicitly listed in the filename for computing genetic similarity.

-extractSimTopK <filename> <int>: similar to -extractSim, this is a genetic similarity SNP filter option. FaSTLMM will only use the first <int> SNPs explicitly listed in the filename for computing genetic similarity.

-verboseOut: enable a more detailed and verbose output file with more columns. See the output description.

-MaxThreads <int>: the option is passed to the MKL math libraries to suggest the level of parallelism to use. Assigning a number larger than the number of cores on your machine may cause the program to run slower. Assigning a number less than the number of cores on your machine may allow your computer to run FastLmmC without consuming all the CPU resources in different phases of the program. The -MaxThreads option is currently ignored when using the ACML math libraries.

References

1. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007). PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.

7. Revision History

Date        Author(s)             Description of Changes
12/2/2011   Heckerman, Davidson   Update for v1.04; add dosage support.
3/13/2012                         Update for v1.08; document new output formats option -verboseOut; document new similarity option -extractSim.
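The SNP filter files read by -extract, -extractSim, and -extractSimTopK are described as header-less text files with one SNP identifier per line. A minimal Python sketch for producing such a file, and for previewing which identifiers a "first <int> SNPs" selection would cover, might look like the following. The helper names and the file name in the usage note are hypothetical; only the file layout comes from the manual text.

```python
def write_snp_list(path, snp_ids):
    """Write one SNP identifier per line, with no header row."""
    with open(path, "w") as handle:
        for snp_id in snp_ids:
            handle.write(f"{snp_id}\n")


def read_snp_list(path, top_k=None):
    """Read identifiers back, skipping blank lines.

    If top_k is given, return only the first top_k identifiers,
    mirroring the "first <int> SNPs" wording used for -extractSimTopK.
    """
    with open(path) as handle:
        ids = [line.strip() for line in handle if line.strip()]
    return ids if top_k is None else ids[:top_k]
```

For example, write_snp_list("top_snps.txt", ["snp2", "snp110", "snp55"]) creates a three-line file, and read_snp_list("top_snps.txt", top_k=2) returns the first two identifiers in file order.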
8. e at http://www.microsoft.com/download/en/details.aspx?id=17017. With the HPC library installed, no additional libraries are required to compile the C# version of FaSTLMM. Double-click the Gwas FaSTLMM .sln file to load Visual Studio and then build the solution. If a reference to the HPC library is not resolved automatically during the load, examine the references and double-click the indicated library. If the HPC library installed properly, Visual Studio should successfully resolve the request and you can proceed with your build. The program builds in Gwas\bin.

For the C++ version: FastLmmC uses a 3rd-party math library for advanced math functions and performance. FastLmmC can use either Intel's MKL or AMD's ACML math libraries. Once you have installed the appropriate library, use the Visual Studio IDE to select the appropriate configuration from the solution and build. ACML requires an additional step to tell Visual Studio where it is located. You must set the environment variable ACML_ROOT to point to your install location or the libraries will not be located, for example:

C:\> set ACML_ROOT=C:\AMD\acml4.4.0

You can find more about the math libraries at their respective web sites:
http://software.intel.com/en-us/articles/intel-mkl/
http://developer.amd.com/libraries/acml/pages/default.aspx

With a math library installed, no additional libraries are required to compile the C++ version of FaSTLMM (FastLmmC). Double-click the FastLmmC .sln file to load Visual Studio and then build the solution associated with your library.

9. Building the C++ version of FaSTLMM for Linux

FaSTLMM is primarily developed and tested on Windows, although we are able to build the C++ version for Linux. We provide a simple script file that uses the GNU toolset with the 3rd-party math library to compile the sources in a Linux environment.

FastLmmC uses a 3rd-party math library for advanced math functions and performance. The program has been run on Ubuntu Linux and can use either Intel's MKL or AMD's ACML math libraries for Linux. Once you have selected and installed the appropriate library, you can build using the appropriate script file located in the Cpp directory. A review of the two files DOMKL_linux and DoAcml_linux will show very simple scripts that compile the program using g++ and then link the .o files with the appropriate math library. The .o files are written to version-specific directories, so it is necessary to create the appropriate directory prior to running the script. For more details, see the script.

You can find more about the math libraries for Linux at their respective web sites:
http://software.intel.com/en-us/articles/intel-mkl/
http://developer.amd.com/libraries/acml/pages/default.aspx

Data preparation

FaST-LMM uses four input files containing (1) the SNP data to be tested, (2) the SNP data used to determine the genetic similarities between individuals, which can be differ
10. earch scope of log δ values, and -brentMaxLogVal <double> for the maximum of the search scope of log δ values. By default, the search is set conservatively to span 100 intervals over log δ values between -10 and 10.

Command line options

-file <basefilename>: basename for PLINK's .map and .ped files.

-bfile <basefilename>: basename for PLINK's binary .bed, .fam, and .bim files.

-tfile <basefilename>: basename for PLINK's transposed .tfam and .tped files.

-dosage <basefilename>: basename for PLINK's .dat, .fam, and optionally .map files.

-pheno <filename>: name of the phenotype file.

-mpheno <index>: index for the phenotype in the -pheno file to process, starting at 1 for the first phenotype column. Cannot be used together with -pheno-name. Default: 1.

-pheno-name <name>: name of the phenotype in the -pheno file to process. If this option is used, the phenotype name must be specified in the header row. Cannot be used together with -mpheno.

-fileSim <basefilename>: basename for PLINK's .map and .ped files for computing genetic similarity.

-bfileSim <basefilename>: basename for PLINK's binary .bed, .fam, and .bim files for building genetic similarity.

-tfileSim <basefilename>: basename for PLINK's transposed .tfam and .tped files for building genetic similarity.

-dosageSim <basefilename>: basename for PLINK's .dat, .fam, and optionally .map files for building genetic similarity.

-sim <filename>: specifies that genetic similarities are to be
11. ent from (1), (3) the phenotype data, and (4) optionally, a set of covariates.

When the realized relationship matrix (RRM) is used for genetic similarity, and when the number of SNPs used to construct the RRM is less than the number of individuals, the runtime and memory footprint of FaST-LMM scale linearly in the number of individuals in the data. When this condition is not met, the runtime and memory footprint of FaST-LMM are cubic and quadratic in the number of individuals, respectively.

All input files should be in ASCII. Both SNP files ((1) and (2) above) should be in PLINK format (ped/map, tped/tfam, bed/bim/fam, or fam/dat/map). For the most speed, use the binary format in SNP-major order. The phenotype entries in these files must be set to some dummy value and will be ignored (our software uses a separate phenotype file). Sex should be encoded as a single digit. See the PLINK manual, http://pngu.mgh.harvard.edu/~purcell/plink/ [1], for further details. Missing SNP values will be mean-imputed. Dosage files are also allowed (see the end of this section).

The required file containing the phenotype ((3) above) uses the PLINK alternate phenotype format. It should have at least three columns: <familyID>, <individualID>, and any number of <phenotype value>. The columns are delimited by whitespace (<tab> or <space>). The default option is to test the first phenotype only. A missing value should be denoted by -9, but this can be changed (see options below).

12. The first column, <familyID>, is joined with the second column, <individualID>, to create a unique key for the individual that matches an entry for an individual in the PLINK files above.

Example phenotype file for two phenotypes (fastlmm\data\sampledata\pheno.txt):

1 IND0  2  3.05043
1 IND1  2  1721797
1 IND2 -9  4.19592
1 IND3  2  3.4492
1 IND4  1  8.99843
1 IND5  1  0.768613
1 IND6  2  6.73734

Optionally, the phenotype file may also have a header row, for example:

FID IID MyPheno YourPheno

The optional file containing covariates should have at least three columns: <familyID>, <individualID>, and any number of <covariate value>. The columns should be tab-delimited. The token for missing values must be the same as that used in the phenotype file. All covariates are processed. Covariate files should not have a header row.

Example covariate file (fastlmm\data\sampledata\covariate.txt): [rows for individuals IND0 through IND6; the column values are not legible in this copy]

Instead of SNP data from which genetic similarities are computed, the user may provide the genetic similarities directly using the -sim <filename> option. The file containing the genetic similarities should be tab-delimited and have both row and column labels for the individual IDs. The value in the top-left corner of the file should be "var".

Example similarities file:

var IND0 IND1 IND2
13. native model.
NullBias (under -verboseOut): the offset term in the alternative model.
SNPIndex: the column index of the SNP tested in the PLINK file, starting at 1.
SNPCount: the number of SNPs tested.

Speed vs. accuracy considerations

FaST-LMM inference involves a search over the ratio δ of genetic and environmental variances. As this step represents a non-convex optimization, FaST-LMM performs an optimization procedure over several intervals on a logarithmic scale, invoking iterative calls to the likelihood function. The total run time of this step scales linearly in the sample size, times a constant that approximately equals the number of intervals considered for the search.

For maximum speed, the command line option -simLearnType Once is set by default, removing this constant factor for every SNP tested. Using this option, the ratio δ is found on the null model only and is fixed to that value throughout the testing procedure. Note, though, that on some data sets this could lead to a slight loss of power when SNPs with a large effect are tested. Use the command line option -simLearnType Full to perform exact LMM inference that avoids this potential loss of power by refitting the ratio δ of variances for every SNP tested.

Additionally, the number and coarseness of the search intervals can be adjusted via the command line options -brentStarts <int> for the number of intervals, -brentMinLogVal <double> for the minimum of the s
14. ntains the compiled executable files
fastlmm\Cpp contains C++ source and project files
fastlmm\CSharp contains C# source and project files
fastlmm\Data\sampledata contains sample data and a command script
fastlmm\Doc contains project documentation
fastlmm\Externals contains other code FaSTLMM depends on

There are executables for Windows (64-bit) and for Ubuntu Linux (64-bit) under the fastlmm\Bin directory, and all required .dll files are included in the respective directories. These executables use the MKL math library, which is optimized for Intel processors but also runs on AMD processors. If one of these options is suitable, please skip ahead to the section "Data preparation" to see how to run FaST-LMM on your data. If not, please see the next section.

Compiling FaST-LMM

In addition to the source code, the following external dependencies must be installed in order to build FaSTLMM.

Building for Windows

Both the C# and C++ versions require Visual Studio 2010 (VS). Any edition of VS, from Express through Ultimate, is capable of building FaSTLMM. If you do not already have a copy of Visual Studio, the Visual Studio 2010 Express edition can be freely downloaded from http://www.microsoft.com/express/downloads/.

For the C# version: parts of the program are capable of running against the Microsoft HPC cluster environment. To build, you must install the HPC Pack 2008 R2 Client Utilities Redistributable Package with Service Pack 2. This is freely availabl
15. t is loaded in Excel, it should look as follows:

SNP     Chromosome  GeneticDistance  Position  Pvalue    Qvalue    N
snp2    1           0                3         5.42E-08  1.08E-05  200
snp110  1           0                111       6.29E-03  6.29E-01  200
snp55   1           0                56        1.60E-02  8.15E-01  200
snp167  1           0                168       1.74E-02  8.15E-01  200
snp140  1           0                141       2.04E-02  8.15E-01  200
snp171  1           0                172       3.14E-02  8.26E-01  200
snp144  1           0                145       3.27E-02  8.26E-01  200

[The same rows continue with the columns NullLogLike, AltLogLike, SNPWeight, SNPWeightSE, WaldStat, NullLogDelta, NullGeneticVar, NullResidualVar, and NullBias; their values are not reliably legible in this copy.]

The standard and -verboseOut columns are:

SNP: the rs or SNP identifier for the SNP tested. Taken from the PLINK file.
Chromosome: the chromosome identifier for the SNP tested, or 0 if unplaced. Taken from the PLINK file.
Genetic Distance: the location of the SNP on the chromosome. Taken from the PLINK file. Any units are allowed, but typically centimorgans or morgans are used.
Position: the base-pair position of the SNP on the chromosome (bp units). Taken from the PL
16. ynthetic data provided in the zip file (from the directory containing that data, and assuming fastlmmc is in the path):

> fastlmmc -tfile geno_test -tfileSim geno_cov -pheno pheno.txt -covar covariate.txt -mpheno 1

You should see something like the following output on the screen (using the C++ version):

FastLmmC v1.08.20120313 Factored Spectrally Transformed Linear Mixed
Copyright Microsoft Corporation
Compiled Mar 13 2012 at 06:20:44 by BOBDOL for Windows using MKL v10.3.5 Build 20110720
Start Processing CommandLine
End Processing CommandLine
Start Loading FastLmm Data
Start Loading Covariance Data
PLINK fileset
Number of Individuals Selected
Number of Phenotypes Individual
Number of SNPs Individual
Number of SNPs Individual Used
End Processing PLINK fileset
End Loading Covariance Data
Start Loading Test Data
PLINK fileset
Number of Individuals Selected
Number of Phenotypes Individual
Number of SNPs Individual
Number of SNPs Individual Used
End Processing PLINK fileset
End Loading Test Data
End Loading FastLmm Data
ET Eigensym Processing Processing Compute Compute Woodbury low rank
Compute GWAs using LMM
Start lowrank training
Lowrank training done
Write output
Total elapsed time
geno_cov 270 1 200 200 geno_cov geno_test 270 1 200 200 geno_test done geno_test.out.txt 311 918 ms

When the output file geno _test out tx
