genomeSIMLA - A forward time simulation for genetic

1. it can slow down processing when only one or two small chromosomes are in use.

FAST_LD_POOL_SIZE integer
FAST_LD_POOL_SIZE 3000
Sets the number of chromosomes used in sampled plots. If this is larger than the population, the entire population is used.

FAST_LD_PLOT_SIZE
FAST_LD_PLOT_SIZE 10000
Sets the maximum number of SNPs used in sampled LD production. If the size is greater than a given chromosome, the entire chromosome is used.

NO_FASTLD
NO_FASTLD
Unlike most other options, this takes no parameters. When it is encountered, genomeSIMLA skips the sampled LD calculations and renders only complete LD plots. Be aware that this lengthens a growth scan in which you are unsure which generation is best to draw from. Use it only when you already know what your LD will look like, or when you want statistics on the entire population at every drop.

Ritchie Lab Software genomeSIMLA Reference

Locus Generation
genomeSIMLA can create chromosomes in two ways. The first describes one or more block types and populates a chromosome randomly with one or more of these blocks. The other uses a locus description file containing SNP names and positional information.

Block Based Locus Generation
Block based generation produces chromosomes using completely random draws. The user specifies one or mo
2. it regardless of variation or growth curve details. It is very important for non-logistic growth rates, especially exponential, where growth could occur very fast and cause memory problems.

MIN_POOL_SIZE integer
MIN_POOL_SIZE 1500
Sets the semi-hard lower limit for population size. This is evaluated for all generations other than 0, so it is possible to set the initial population below MIN_POOL_SIZE and cause a hard spike in population at generation 1.

TARGET_POP_SIZE integer
TARGET_POP_SIZE 100000
If TARGET_POP_SIZE is greater than 0, genomeSIMLA uses the value as a hard limit for advancement: once the population reaches the specified size, all advancement and population growth cease.

Dataset Generation
The entire purpose of genomeSIMLA is data. genomeSIMLA can generate 2 types of data sets: case control and basic pedigrees. It can produce any number of different data sets and guarantees that, in a truly diverse population, no single individual will be reused across the data sets generated during a single run.

Both pedigree and case control data sets allow the use of a label. The label is used as part of the filename and lets the user quickly recognize different data sets. Labels can contain any character except slashes and spaces.

Both types of data sets can have the following types of error:
genotype_error float Exact Portion
3. of SNPs which are not derived via cross over. This error is applied evenly across SNPs.
phenocopy float Percentage of the affected individuals in a given data set whose affected status was not determined by the chosen model.
missing float Percentage of SNPs that will be missing.

Case Control
DATASET CC label affected unaffected genotype_error phenocopy missing
DATASET CC sample_01 500 500 0.05 0.1 0.15
This line creates datasets with 500 affected and 500 unaffected individuals, each with 5% genotype error and 15% missing data. Of the 500 affected individuals, 50 will not have been evaluated with the model.

Pedigree Data
Pedigree data is slightly more complicated because you can specify multiple types of family structures to be added to your dataset. The affected/unaffected numbers simply describe the number of children in those categories.
DATASET PED label genotype_error phenocopy missing
DATASET PED family_01 0.05 0.1 0.15
This sets up a framework for datasets with 5% genotype error and 15% missing data; 10% of all affected children will not have been evaluated with the disease model. This only sets up the data set framework. Until you add family types to it, the data sets will be empty.
DATASET FAMTYPE affected unaffected extra_sibs number_of_families
DATASET FAMTYPE 1 1 1 250
This sets up a type of family which will be added to the data set. A given data set can hav
4. sets were extracted from as well as the random seed used.

Locus File Format
Line 1: A line describing which chromosome the file was derived from. It is ignored when the file is read.
Line 2: Indicates the number of loci contained within the file. genomeSIMLA doesn't actually parse that number, so again this line is not used for reading.
Line 3: Column headers. This is for the user's benefit and is not used during reading.
Lines 4 to N+3: Each line describes a single locus. The following 6 columns should be present in the order listed. Each column must have a value on each line, and columns should be separated by whitespace (multiple spaces or tabs are fine).
Col 1 Label: This is usually the RS number; however, it can be any label one wants to use. All SNPs must have unique labels.
Col 2 Freq Allele 1: Allele 1's allele frequency.
Col 3 Freq Allele 2: Allele 2's allele frequency.
Col 4 Recombination Fraction: Chance that an odd number of recombinations took place between this SNP and the previous SNP in the genome.
Col 5 Position: The physical position on the chromosome, relative to the beginning of that chromosome (NOT the genome). These values should be in base pairs.
Col 6 Description (optional): A note that can be added. Currently this isn't used anywhere.
The last line of the file should be an empty l
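The locus file layout described in this excerpt can be read with a few lines of code. The following is a minimal Python sketch, not genomeSIMLA's own parser: the function name parse_locus_text, the Locus record and the sample data are all hypothetical. It assumes only what the format description states: three header lines that are skipped, then whitespace separated columns with an optional trailing description.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """One row of a locus file (hypothetical record type for illustration)."""
    label: str
    freq_allele1: float
    freq_allele2: float
    recomb_fraction: float
    position: int
    description: str


def parse_locus_text(text: str) -> list[Locus]:
    """Parse locus-file text: skip the 3 header lines, then read one locus per line."""
    loci = []
    seen_labels = set()
    for line in text.splitlines()[3:]:
        fields = line.split()          # any run of spaces or tabs separates columns
        if not fields:                 # tolerate the empty last line
            continue
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns, got: {line!r}")
        label = fields[0]
        if label in seen_labels:       # all SNPs must have unique labels
            raise ValueError(f"duplicate SNP label: {label}")
        seen_labels.add(label)
        loci.append(Locus(
            label,
            float(fields[1]),
            float(fields[2]),
            float(fields[3]),
            int(float(fields[4])),     # position in base pairs
            " ".join(fields[5:]),      # optional description, may be empty
        ))
    return loci


# Hypothetical sample data, invented for this sketch.
SAMPLE = """Chromosome 1 (hypothetical example)
2 loci
Label Freq_Al1 Freq_Al2 Recomb Position Description
rs0001 0.25 0.75 0.0001 1200 first locus
rs0002 0.40 0.60 0.0002 5400
"""

loci = parse_locus_text(SAMPLE)
```

Because the manual says the locus count on line 2 is not parsed, the sketch ignores it as well instead of validating against it.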
5. 00 000 possible loci. To make the task as easy as possible, genomeSIMLA can limit the loci presented and present them in a sorted fashion, where the topmost SNP shown most closely matches the user's specifications. The following commands are used to set up searches. Users can have as many searches as they like, even if they don't need them all for setting up their models.

A search describes 3 qualities: minor allele frequency ranges, types of blocks the SNP is contained within, and location. Each of the ranges contains three pieces: Target, Min and Max. Currently LOCUS_SELECTOR has 2 ranges: minor_allele_frequency and block_size.

LOCUS_SELECTOR label float float float integer integer integer description
LOCUS_SELECTOR rare_loci 0.2 0.15 0.23 4 2 10 The following loci are moderately rare and appear in a block
This creates a new search called rare_loci which will only contain SNPs whose minor allele frequency is between 0.15 and 0.23 and which are found in blocks of up to 10 SNPs. The SNPs are ranked so that those closest to a minor allele frequency of 0.2 and in blocks with 4 SNPs come first. Notice that the min/max values are not evenly distributed around the target. The score is based on the relative distance from the target for that particular arm. So a SNP with a MAF just a bit larger than 0.2 would score very similarly to one that had a MAF of just under 0.23. The block size and MAF weigh
6. 01 0.0002 0.000001 0.00001 0.5
Adds a block definition to the chromosome most recently defined using ADD_CHROMOSOME. The example above creates a block that ranges from 5 to 10 SNPs. 0.0001 and 0.0002 represent the chance of a cross over event occurring between the previous SNP (if one exists) and the first SNP in the block. This effectively describes how far the block is from that last SNP. The next two describe the chance of a recombination occurring between any two SNPs inside the block itself. The last parameter is the probability that this block will be drawn.

Other Block Related Settings
DEFAULT_ALLELE_FREQ float float
DEFAULT_ALLELE_FREQ 0.1 0.5
This lets the user define the min/max allele frequencies used during the configuration of a new block based chromosome.

File Based Chromosome Configuration
File based chromosome files have all of the information necessary to simulate a chromosome. There are two reasons one would use such files:
  (1) To mimic one or more regions from a real genome
  (2) To precisely control a region's density as part of a research project
The Ritchie Lab has made a set of these files available which represent a large portion of the Affymetrix 500K coverage. As we produce others, it is expected they will be made available for use as well. These files allow SNPs to be distributed very similarly to real human assays, though the actual LD patterns will depend largely on the generations the data
7. EIGHT 10
Adjusting the weight determines how important the attribute is.

ODDSRATIO float
ODDSRATIO 1.25

ODDSWEIGHT float
ODDSWEIGHT 1

MARGVAR float
MARGVAR 0.0000001
Sets the target marginal variance. This determines how pure an epistatic model you want. The higher the value, the more likely there will be main effects.

MARGWEIGHT float
MARGWEIGHT 100

PENTARGET float
PENTARGET 0.15
Specifies the target prevalence of the disease.

The following parameters are associated with the GA portion of simPEN. For more information about how to use them, please see the simPEN user's manual, available at the genomeSIMLA website. References to pool sizes, generations, populations and mutation below are completely unrelated to the forward time simulation.

GEN 15000 Number of generations to be tried before ending.
POPSIZE 1000 The search population.
DEMES 100 Multiple pools of penetrance values.
MUTATE 0.01 Frequency of mutation.
CROSS 0.6 Rate of cross over (1 per genome).
SUBMODELS ON Turning this on will possibly catch a pattern where a smaller model contained within a larger model exists with enough strength to represent a potential problem.
UPDATE 100 Specifies how many generations pass between progress reports.

The following two par
8. NT
DROP_COUNT 5
Indicates the total number of drops to be performed, including the initial drop. Taking all three of the previous DROP related examples together, genomeSIMLA would perform 5 drops, at generations 500, 600, 700, 800 and 900. Note that the calling parameters can change how drop points are interpreted.

Graphical Plot Settings
The following general parameters control various aspects of the graphical reporting during a given drop.

MAX_SNPS_PER_ROW integer
MAX_SNPS_PER_ROW 3000
To render a general overview of an entire chromosome, it is necessary to set a maximum number of SNPs that can be drawn on a single row. This is not a hard setting; it is more of a suggestion. genomeSIMLA distributes the SNPs evenly across all rows to avoid leaving a small chunk at the bottom.

BLOCK_REPORT_SIZE integer
BLOCK_REPORT_SIZE 30
Determines the number of detailed blocks that are reported. Each report takes disk space as well as time to generate. If your configuration uses 22 chromosomes with a report size of 30, there will likely be over 1200 charts drawn and a fair amount of information added to the final report. However, if the setting is too small, it might be more difficult to find the preferred SNP. Each Block Report consists of 2 graphs. The first is expected to be smaller (fewer SNPs on either side of the block of interest). The second is generally zoomed o
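The drop schedule in this excerpt (first drop at generation 500, a drop every 100 generations, 5 drops in total) is simple arithmetic. The Python sketch below reproduces it; the function name drop_generations is hypothetical, and the sketch assumes the three settings combine exactly as the example describes, without the command line overrides.

```python
def drop_generations(first_drop: int, frequency: int, count: int) -> list[int]:
    """List the generations at which a drop occurs.

    The count includes the initial drop, as the DROP_COUNT description states.
    """
    return [first_drop + i * frequency for i in range(count)]


# Values taken from the manual's three DROP examples.
schedule = drop_generations(first_drop=500, frequency=100, count=5)
print(schedule)  # [500, 600, 700, 800, 900]
```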
9. PORT_BUFFER_SIZE On/Off
LD_BUFFER_SIZE 50
WRITE_LD_REPORT On/Off
WRITE_LD_REPORT On
DRAW_RSQUARED_PLOTS On/Off DRAW_DPRIME_PLOTS On/Off
DRAW_RSQUARED_PLOTS Off
MAX_SNP_DISTANCE integer
MAX_SNP_DISTANCE 500000
CLOSE_POOLS_BETWEEN_DROPS On/Off
CLOSE_POOLS_BETWEEN_DROPS On
FAST_LD_POOL_SIZE integer
FAST_LD_POOL_SIZE 3000
FAST_LD_PLOT_SIZE
FAST_LD_PLOT_SIZE 10000
NO_FASTLD
Locus Generation
Block Based Locus Generation
DEFAULT_BLOCK min max float float float float
DEFAULT_BLOCK 5 10 0.01 0.015 0.00001 0.000025
ADD_CHROMOSOME integer label
ADD_CHROMOSOME 5 chromosome_1
ADD_BLOCK chr_idx snp_idx float float float float float
ADD_BLOCK 5 10 0.0001 0.0002 0.000001 0.00001 0.5
Other Block Related Settings
DEFAULT_ALLELE_FREQ float float
DEFAULT_ALLELE_FREQ 0.1 0.5
File Based Chromosome Configuration
Locus File Format
Locus Miscellany
ALLELE_FREQUENCY chr_idx snp_idx float float
ALLELE_FREQUENCY 1 5 0.25 0.75
Population Control
GROWTH_RATE LINEAR initial_population variation growth_rate
GROWTH_RATE LINEAR 30000 0.05 10.0
GROWTH_RATE EXPONENTIAL initial_population variation growth_rate
GROWTH_RATE EXPONENTIAL 700 0.05 0.3
GROWTH_RATE LOGISTIC initial_population variation growth_rate ca
10. TWEIGHT 10
ODDSRATIO float
ODDSRATIO 1.25
ODDSWEIGHT float
ODDSWEIGHT 1
MARGVAR float
MARGVAR 0.0000001
MARGWEIGHT float
MARGWEIGHT 100
PENTARGET float
PENTARGET 0.15
GEN 15000
POPSIZE 1000
DEMES 100
MUTATE 0.01
CROSS 0.6
SUBMODELS ON
UPDATE 100
LOCI 2
FREQ 0.2 0.8
Main Effects and Interactions with SIMLA
DEFINE_MODEL SIMLA INDEX simla_cfg float int chrom_id snp_id MIN/MAJ float float
DEFINE_MODEL SIMLA INDEX interactions.simla 0.05 2 1 5 MIN 0.26 0.0
DEFINE_MODEL SIMLA LABEL simla_cfg float int snp_label MIN/MAJ float float
DEFINE_MODEL SIMLA LABEL interactions.simla 0.05 2 RL5 MIN 0.26 0.0
SIMLA configuration file
1x2x3 0.26

Introduction

Purpose of this manual
This manual contains details for configuring and running the application genomeSIMLA. If this is your first time using the software, we highly recommend that you take a few minutes to download and work through one or more tutorials. Then, once familiar with the capabilities of the software, users can refer to this guide when making changes to the basic configuration settings.

Conventi
11. ameters are legacy and have no effect on the production of valid models. However, the error checking currently requires them to be present. Just use these values to satisfy the error checking code for now.
LOCI 2
FREQ 0.2 0.8

Main Effects and Interactions with SIMLA
SIMLA is a simulation program that allows the researcher to specify varying levels of both linkage and linkage disequilibrium among and between markers and disease loci. SIMLA was specifically designed for the simultaneous study of linkage and association methods in extended pedigrees, but the penetrance specification algorithm can also be used to simulate samples of unrelated individuals (e.g. cases and controls).

Users indicate to genomeSIMLA that a SIMLA based model is to be used with a line similar to one of the two below:
DEFINE_MODEL SIMLA INDEX simla_cfg float int chrom_id snp_id MIN/MAJ float float
DEFINE_MODEL SIMLA INDEX interactions.simla 0.05 2 1 5 MIN 0.26 0.0
DEFINE_MODEL SIMLA LABEL simla_cfg float int snp_label MIN/MAJ float float
DEFINE_MODEL SIMLA LABEL interactions.simla 0.05 2 RL5 MIN 0.26 0.0
Both lines do the same thing, except for how they specify which loci are to be associated with the model. The filename specified by simla_cfg represents the list of interactions (see below). If no interactions are required, you may use the keyword NO_INTERACTIONS; otherwise the file must exist. The next number (0.05 in the e
12. ch command, we'll describe those properties here and just refer to them as if they were a type.

Integer
Parameters specified in this way simply refer to a whole number. In general these values should be equal to or greater than 0, except when specified otherwise.

Float
Values specified as float are decimal values.

Index
If a parameter is listed as an index, it refers to the index (starting at 1) the user wishes to select.

max
This is generally an integer value representing the upper bound of some value. In some cases, such as minor allele frequency, it might represent a floating point value.

min
This is generally an integer value representing the lower bound of some value. In some cases, such as minor allele frequency, it is possible that it represents a floating point value.

On/Off
These parameters accept a boolean (Yes/No) type setting. Users can use ON/OFF or YES/NO to set them.

filename
When a configuration refers to a file for input or output, the filename type is generally used. This can be either a fully qualified path such as /home/torstees/wga, or it can be specified as a path relative to the directory where the application was run, such as data/goodfilename. It can also be just a plain filename, as long as the file itself is available from the directory in which the application was run.

label
A label refers to a parameter whose value can be any text string withou
13. e as many different types of families as the user needs. The affected/unaffected counts represent the number of affected/unaffected children a given family MUST have. The number of extra sibs indicates that a random number from 0 to extra_sibs will be added to the family. All children will be evaluated for status; however, by adding extra sibs you can have larger families which vary by the number of affected siblings. It is perfectly acceptable to have 0 extra_sibs or unaffected sibs. Affected sibs MUST be greater than or equal to 1.
DATASET FAMTYPE 1 0 0 150
This would add 150 trios to the data set.
DATASET FAMTYPE 2 1 0 75
This would add 75 AAU families to the data set.
DATASET FAMTYPE 1 0 3 50
This would add 50 families with between 1 and 4 children, with at least 1 affected sib in them.
Parents' status is evaluated and written to the dataset but is not considered when determining whether or not the family will be included in the dataset.

Performance Note
It is important to note that pedigrees with more than 1 affected individual can be computationally difficult. Children are created by actually crossing over the parents, just as is done during generational advancement. If the children don't meet the necessary family shape, all individuals are thrown away. For instance, for a disease with a prevalence of 0.1%, it would take the production of almost 1,000,000 families before we found a family with 2 affected sibs, and it gets worse as
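The figure of almost 1,000,000 families can be checked with a back-of-the-envelope calculation. The Python sketch below is a simplification, not genomeSIMLA's procedure: it assumes each child is affected independently with probability equal to the prevalence (ignoring any familial correlation induced by the disease model), and the function name expected_families is hypothetical.

```python
def expected_families(prevalence: float, affected_sibs: int) -> float:
    """Expected number of families generated before one has the required affected sibs.

    Assumes (for illustration only) that each required child is affected
    independently with probability `prevalence`. The number of attempts then
    follows a geometric distribution with mean 1 / p_family.
    """
    p_family = prevalence ** affected_sibs
    return 1.0 / p_family


# A prevalence of 0.1% (0.001) and 2 affected sibs gives roughly one million families.
estimate = expected_families(prevalence=0.001, affected_sibs=2)
print(f"{estimate:,.0f} families expected")
```

Under the same assumption, requiring a third affected sib multiplies the expected effort by another factor of 1,000, which illustrates why such family types are costly to generate.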
14. em I should be sure to copy the style sheet to the parent directory of the new location. If the style sheet isn't found, the report will just be harder to read.

LD_REPORT_BUFFER_SIZE On/Off
LD_BUFFER_SIZE 50
Sets the number of SNPs around the block in the detailed plot. In the example above, 50 SNPs on either side of a block are drawn.

WRITE_LD_REPORT On/Off
WRITE_LD_REPORT On
Causes genomeSIMLA to produce a complete report of the pairwise LD values. Users should be aware that this file can be very large.

DRAW_RSQUARED_PLOTS On/Off
DRAW_DPRIME_PLOTS On/Off
DRAW_RSQUARED_PLOTS Off
By default, both RSquared and DPrime plots are drawn. If the user wants to save time and disk space, they can opt to turn one or both charts off.

MAX_SNP_DISTANCE integer
MAX_SNP_DISTANCE 500000
Lets the user determine how far apart two SNPs can be before genomeSIMLA stops calculating LD values between them. Lowering this value from the default (500K) can speed up LD processing.

CLOSE_POOLS_BETWEEN_DROPS On/Off
CLOSE_POOLS_BETWEEN_DROPS On
In general, it is assumed that genomeSIMLA will be used to produce very large populations (1 million unique chromosomes) with genomes that approach 500K. To manage this on a single computer, genomeSIMLA must close pools down when they aren't currently in use. This frees up valuable memory, allowing us to do this without gigabytes of RAM. However
15. enomeSIMLA
LOCUS_SELECTOR label float float float integer integer integer description
LOCUS_SELECTOR rare_loci 0.2 0.15 0.23 4 2 10 The following loci are moderately rare and appear in a block
ADD_REGION label snp_start snp_stop
ADD_REGION rare_loci rs321412 rs543231
MAX_LOCI_PER_CHROM_REPORTED Integer
MAX_LOCI_PER_CHROM_REPORTED 50
Disease Modeling
Penetrance Table Disease Models
DEFINE_MODEL PENTABLE INDEX pen_file chrom_id snp_id chrom_id snp_id
DEFINE_MODEL PENTABLE INDEX disease.pen 1 5
DEFINE_MODEL PENTABLE LABEL pen_file snp_label snp_label
DEFINE_MODEL PENTABLE LABEL disease.pen RL5
Penetrance File Configuration
FREQ_THRESHOLD Float
FREQ AaBbCcDd etc float
FREQ A 0.2
FREQ a 0.8
PENTABLE
model identification penetrance
AABB 0.171
AABb 0.155
Purely Epistatic Models with simPEN
DEFINE_MODEL SIMPEN INDEX simpen_cfg chrom_id snp_id chrom_id snp_id
DEFINE_MODEL SIMPEN INDEX disease.simpen 1 5
DEFINE_MODEL PENTABLE LABEL simpen_cfg snp_label snp_label
DEFINE_MODEL PENTABLE simpen_cfg disease.pen RL5
simPEN File Configuration
HERIT float
HERIT 0.01
HERITWEIGHT float
HERI
16. es the filename to be used to control genomeSIMLA's overall behavior. If the configuration is available from within the current working directory, the filename alone is sufficient. If the file exists in another directory, a fully qualified or relative path should be provided along with the filename itself.

ld (optional)
When the ld command is present, no generational advancement will be performed and complete LD analysis will be performed on the specified pool. If no generation is specified via the l flag, generation 0 is assumed. All other commands are ignored in the presence of this flag.

datasets (optional)
When the datasets command is present, no generational advancement will be performed and data sets will be drawn from the specified pool. If no generation is specified via the l flag, generation 0 is assumed. All other commands are ignored in the presence of this flag.

p project name (optional)
Specifying a project name lets the user override the default behavior of using the name of the configuration file as the base name for all of the products generated by execution. This can include a relative or fully qualified path, as long as a base filename is present (i.e. data/affy or /home/torstees/simulated_data/affy). All files generated will start with this string.

l generation to load (Integer, optional)
Specifies the generation to load. This assumes that a previous run has been completed and pools at the specified ge
17. genomeSIMLA Reference Manual rev 1.0.1
genomeSIMLA: A forward time simulation for genetic data
http://chgr.mc.vanderbilt.edu/genomeSIMLA

Table of Contents
Introduction
Purpose of this manual
Conventions Used
Random Numbers
Common Parameters
Integer
Float
Index
max
min
On/Off
filename
label
description
Using genomeSIMLA
Command Line Arguments
genomeSIMLAs config file ld datasets p project name l Integer d Integer Integer Integer s Integer
Config file
ld optional
datasets optional
p project name optional
l generation to load Integer optional
d first generation to drop generations between drops drop count optional
s seed integer
General Parameters
The following parameters control the basic behavior of the application
SEED integer
SEED 23125
Drop Points
FIRST_DROP_POINT integer
FIRST_DROP_POINT 500
DROP_FREQUENCY integer
DROP_FREQUENCY 100
DROP_COUNT
DROP_COUNT 5
Graphical Plot Settings
MAX_SNPS_PER_ROW integer
MAX_SNPS_PER_ROW 3000
BLOCK_REPORT_SIZE integer
BLOCK_REPORT_SIZE 30
FONT filename
FONT FreeMonoBold.ttf
CSS_FILENAME filename
CSS_FILENAME genomesimla.css
LD_RE
18. imPEN
simPEN is a method for using a Genetic Algorithm (GA) to evolve purely epistatic models. With few exceptions, the configuration details are considered beyond the scope of this document; however, a few details will be covered, such as those that specify target odds ratios and heritability.

Users indicate to genomeSIMLA that a penetrance based model is to be used with a line similar to one of the two below:
DEFINE_MODEL SIMPEN INDEX simpen_cfg chrom_id snp_id chrom_id snp_id
DEFINE_MODEL SIMPEN INDEX disease.simpen 1 5
DEFINE_MODEL PENTABLE LABEL simpen_cfg snp_label snp_label
DEFINE_MODEL PENTABLE simpen_cfg disease.pen RL5
Both lines do the same thing, except for how they specify which loci are to be associated with the model. The configuration file for simPEN must be a separate file.

When deciding which weights are most appropriate, users should keep in mind that the values themselves can differ drastically: a weight of 1 differs in effectiveness for a value whose target is 0.1 compared with one whose target is 0.000001. The values used in the following examples were determined to be reasonable starting points for obtaining good results from the simPEN module.

simPEN File Configuration
A small number of parameters make up a simPEN configuration file.

HERIT float
HERIT 0.01
Specifies that the target heritability will be 0.01.

HERITWEIGHT float
HERITW
19. ine (the last entry should contain a return character).

It should be noted that when genomeSIMLA sets up the loci, allele 1 is ALWAYS the minor allele, regardless of the locus frequency in the file. This is only important if a user were to draw datasets from a pool at generation 0: their interpretation of A and a could differ from genomeSIMLA's. When drawing data sets from generation 0, A is ALWAYS the minor allele. Also, allele frequencies are not exact, even in large populations. When setting up a disease model for generation 0, it is recommended to let genomeSIMLA create the pool, drop generation 0 (it defaults to this), and assign model loci based on allele frequencies found in the locus file generated during the initialization.

Locus Miscellany
ALLELE_FREQUENCY chr_idx snp_idx float float
ALLELE_FREQUENCY 1 5 0.25 0.75
This sets the frequency of allele 1 of SNP 5 on chromosome 1 to 25% and the second allele to 75%.

Population Control
Currently there is a single population in genomeSIMLA, though each individual could have several different chromosomes. This population is grown using one of several growth rates. During a generational advancement, individuals are drawn with replacement from the current population, mated using Hardy-Weinberg mating, and added to the new pool until it reaches its target size. Growth rates share many parameter
20. neration were created.

d first generation to drop, generations between drops, drop count (optional)
This allows the user to override the drop configuration found in the configuration file.

s seed (integer)
Allows the user to override the seed specified in the configuration.

General Parameters
The following parameters control the basic behavior of the application.

SEED integer
SEED 23125
Sets the seed for all random number calls. Seeds can range from 0 to 4.2 billion.

Drop Points
Drop points are points in simulated time (generations) where the entire contents of the pool(s) are written to disk and analyzed. Reports are produced in HTML format to help the user interpret the current state of the pool. Drop points are designed to let the user track the state of LD within the population. If the population is large enough, any drop point can be the source for dataset generation.
The reports initially written for a given generation are produced using sampling. Prior to selecting loci for modeling diseases, users are expected to extract detailed reports from the generation of interest.

FIRST_DROP_POINT integer
FIRST_DROP_POINT 500
Sets the first drop point to be performed at generation 500.

DROP_FREQUENCY integer
DROP_FREQUENCY 100
Causes genomeSIMLA to drop every N generations once it has reached the initial drop point. The example says to drop every 100 generations.

DROP_COU
21. o be associated with the disease model. Block based chromosomes (see BLOCK_DEFINITION) are labeled RLN, where N is a number between 1 and however many loci there are across all chromosomes being simulated. Otherwise, the labels are based on information found inside the locus files that were used to populate the simulation; most likely this will be an RS number. There can be no duplicately labeled SNPs.

Penetrance File Configuration
A small number of parameters make up the configuration details of a penetrance file. All but the threshold must be fully specified, even if the value is 0.0.

FREQ_THRESHOLD Float
Specifies the maximum variation from the specified allele frequencies that will be tolerated before execution is halted.

FREQ AaBbCcDd etc float
FREQ A 0.2
FREQ a 0.8
Using letter notation for specifying penetrance cells, this command lets the user tell genomeSIMLA what the intended frequency for a given allele should be. The user MUST specify all alleles that are expected to be involved in the given model.

PENTABLE
This just indicates to genomeSIMLA that the various penetrances are about to follow.
model identification penetrance
AABB 0.171
AABb 0.155
Each possible combination must be present, regardless of whether its value is anything other than 0.0. Penetrance tables should be written as a separate file from the main configuration.

Purely Epistatic Models with s
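Because every genotype combination must appear in the penetrance table, it can help to enumerate the expected cells before writing the file. The Python sketch below lists them for a model with a given number of loci; the helper names genotype_labels and missing_cells are hypothetical, and the label style (AABB, AABb, ...) is inferred from the two example rows in this excerpt, so treat the ordering and the heterozygote spelling as assumptions.

```python
from itertools import product


def genotype_labels(n_loci: int) -> list[str]:
    """All 3**n_loci genotype labels, e.g. AABB, AABb, AAbb, AaBB, ... for 2 loci."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[:n_loci]
    # Three genotypes per locus: homozygous upper, heterozygous, homozygous lower.
    per_locus = [(c + c, c + c.lower(), c.lower() * 2) for c in letters]
    return ["".join(combo) for combo in product(*per_locus)]


def missing_cells(table: dict[str, float], n_loci: int) -> list[str]:
    """Return the genotype labels that the table does not yet define."""
    return [label for label in genotype_labels(n_loci) if label not in table]


# The two rows shown in the manual's example; the other seven cells are still missing.
partial_table = {"AABB": 0.171, "AABb": 0.155}
labels = genotype_labels(2)
missing = missing_cells(partial_table, 2)
print(len(labels), "cells expected;", len(missing), "still missing")
```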
22. odeling affection status: user generated penetrance tables, simPEN (purely epistatic models), and SIMLA (main effect interactions). Each method requires its own configuration details.

Penetrance Table Disease Models
Users can use predefined penetrance tables to assign status to models. The only requirement is that the user specify the allele frequencies for each possible allele at each model locus. This helps ensure that the appropriate meaning of a given cell is being applied. genomeSIMLA will not proceed to use a model if the actual allele frequencies differ too much from those specified in the configuration.

Users indicate to genomeSIMLA that a penetrance based model is to be used with a line similar to one of the two below:
DEFINE_MODEL PENTABLE INDEX pen_file chrom_id snp_id chrom_id snp_id
DEFINE_MODEL PENTABLE INDEX disease.pen 1 5
DEFINE_MODEL PENTABLE LABEL pen_file snp_label snp_label
DEFINE_MODEL PENTABLE LABEL disease.pen RL5
Both lines do the same thing. The first tells genomeSIMLA to load the penetrance table in disease.pen and use locus 5 on chromosome 1 as the single disease locus. The contents of the specified penetrance table must match the number of model loci specified on the configuration line. Otherwise genomeSIMLA will generate an error or, worse, become confused and generate misleading data sets. The second example simply uses labels to specify which loci are t
23. ome users will use the ADD_CHROMOSOME command, indicating how many blocks to draw and possibly giving it a label. The user then applies blocks to the chromosome using the ADD_BLOCK command. When the draws are made, the blocks associated with a given chromosome are drawn based on their probability, including the possibility of using the default block if necessary.

DEFAULT_BLOCK min max float float float float
DEFAULT_BLOCK 5 10 0.01 0.015 0.00001 0.000025
The default block is used when the probabilities of a given chromosome's blocks don't sum to 1.0. Otherwise it is the same as a regular block.
The first two parameters specify the minimum and maximum number of SNPs that will be created. The next two represent the range of distances at which this block falls from the previous SNP on the chromosome. The last two represent the range of distances between SNPs within the block itself. DEFAULT_BLOCK should be set prior to the definition of any chromosomes (and thus any other blocks).

ADD_CHROMOSOME integer label
ADD_CHROMOSOME 5 chromosome_1
Adds a new chromosome to the genome. The first parameter represents how many blocks to draw, and the last (optional) parameter is the label that will be used in naming files and on the reports. The example above creates a chromosome with 5 blocks, named chromosome_1.

ADD_BLOCK chr_idx snp_idx float float float float float
ADD_BLOCK 5 10 0.00
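One way to picture the draw described in this excerpt is as a weighted random choice in which the default block receives whatever probability the user-defined blocks leave over. The Python sketch below illustrates that reading only; it is not genomeSIMLA's implementation, and the function name draw_blocks, the block names and the tuple layout are hypothetical.

```python
import random


def draw_blocks(blocks, default_name, n_blocks, rng):
    """Draw n_blocks block types for one chromosome.

    blocks: list of (name, probability) pairs for the user-defined blocks.
    The default block is assigned the leftover probability 1.0 - sum(probabilities),
    which is how this sketch interprets "used when the probabilities don't sum to 1.0".
    """
    total = sum(p for _, p in blocks)
    if total > 1.0 + 1e-9:
        raise ValueError("block probabilities sum to more than 1.0")
    names = [name for name, _ in blocks]
    weights = [p for _, p in blocks]
    leftover = max(0.0, 1.0 - total)
    if leftover > 0.0:
        names.append(default_name)
        weights.append(leftover)
    return rng.choices(names, weights=weights, k=n_blocks)


# One user block drawn with probability 0.5; the default block takes the other 0.5.
rng = random.Random(23125)  # seeded so the draw is reproducible
drawn = draw_blocks([("custom_block", 0.5)], "default_block", 5, rng)
print(drawn)
```

Passing a seeded random.Random instance keeps the draw reproducible, in the same spirit as the SEED setting.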
24. ons Used

There are two conventions used throughout this document. These text conventions are intended to help distinguish examples from configuration parameters.

Random Numbers

genomeSIMLA uses an open source implementation of the Mersenne Twister pseudo-random number generator, available at http://agner.org/random. When using genomeSIMLA to generate data, the following should be kept in mind in order to ensure that products are as reproducible as possible:
• At the beginning of execution of any kind (population initialization, generational advancement, dataset extract, etc.), the random seed will be set.

Configuration details are listed first, in bold, left-aligned with the rest of the text. The first word(s) are the keywords, which specify what is being changed. Each keyword or phrase has some number of parameters. These are listed in the order they should appear in the configuration line. In some cases, parameters can be repeated or are optional. Those are denoted inside brackets. Configuration details are generally followed immediately by an example line.

This is an example

Examples show how an actual entry would look and are followed by some descriptive information to help the user understand how the example would affect genomeSIMLA's runtime.

Common Parameters

There are a number of parameters which are used commonly across multiple configuration settings. In order to simplify the descriptions of the various properties of ea
25. owth rate
GROWTH_RATE EXPONENTIAL 700 0.05 0.3

This is just a basic exponential growth based on the growth rate specified.

GROWTH_RATE LOGISTIC initial_population variation growth_rate carrying_capacity

This is considered to be one of the preferred models for describing growth rates. The carrying capacity represents the peak potential, which could be caused by various reasons. For our needs, it is the size of pool required for drawing data sets.

GROWTH_RATE RICHARDS initial_population variation growth_rate carrying_capacity time_of_max_growth polarity

Richards' logistic is just an enhanced logistic curve with two extra parameters that determine when growth starts and just how steep the growth will be. By pushing the time_of_max_growth forward, the population hovers at initial_population for some amount of time. This small population will produce rich LD patterns, which tend to be carried forward in time once growth begins. However, this small population dramatically increases the risk of fixing alleles.

General Growth Rate Parameters

MAX_POOL_SIZE integer
MAX_POOL_SIZE 90000

Sets the hard upper limit for population size. Every generation is compared against this value and can NEVER exceed
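The logistic curve and the hard upper limit described above can be illustrated with a short Python sketch. It uses the textbook logistic function, which is an assumption: the manual does not give the exact formula genomeSIMLA applies. The function names and the numeric values in the usage note are hypothetical.

```python
import math


def logistic_population(initial_population, growth_rate, carrying_capacity, generation):
    """Textbook logistic curve (assumed form, not taken from genomeSIMLA's source)."""
    ratio = (carrying_capacity - initial_population) / initial_population
    return carrying_capacity / (1.0 + ratio * math.exp(-growth_rate * generation))


def capped_population(curve_value, max_pool_size):
    """Round the curve value to whole individuals and enforce the hard upper limit."""
    return min(int(round(curve_value)), max_pool_size)


def logistic_sizes(initial_population, growth_rate, carrying_capacity,
                   max_pool_size, generations):
    """Population size at each generation from 0 to `generations`, capped at max_pool_size."""
    return [
        capped_population(
            logistic_population(initial_population, growth_rate, carrying_capacity, g),
            max_pool_size,
        )
        for g in range(generations + 1)
    ]
```

For instance, `logistic_sizes(750, 0.02, 120000, 90000, 1000)` starts at 750 and never reports a size above 90000, even though the carrying capacity in this made-up example is higher than the cap.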
26. re block configurations which will be applied randomly to create the loci on a given chromosome. There are 3 elements involved in this process.

Block Definitions: These describe 4 things:
• Min/Max number of SNPs that can be associated with the block
• Min/Max recombination fraction for the first SNP (how far away that SNP is from the previous SNP on the chromosome)
• Min/Max recombination fraction for each of the remaining SNPs contained in the block
• Probability that this block will be drawn

When a chromosome draws a block definition to be used to construct a set of loci, it randomly draws the number of SNPs based on the block's Min/Max value. Then, for each SNP, it determines the distance between that SNP and its predecessor. All but the first SNP use the second set of Min/Max recombination values; the first SNP is drawn from the first set. This allows the user to space the block further out from the SNPs in front of it (or not).

Default Block: When a chromosome is deciding which block definition to use next, it uses the probabilities associated with the blocks. It is possible for the sum to be less than 1.0. The difference between the sum and 1.0 is the probability that the default block will be used. The default block is common to ALL chromosomes and should be defined before any other blocks or chromosomes. With the exception of probability, the default block has the same parameters as regular blocks.

Chromosome: To create a block based chromos
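The block-drawing procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, and the draws are assumed to be uniform within each Min/Max range, which the manual does not state.

```python
import random
from dataclasses import dataclass


@dataclass
class BlockDef:
    min_snps: int
    max_snps: int
    first_min: float   # recombination-fraction range for the first SNP of the block
    first_max: float
    inner_min: float   # recombination-fraction range for every later SNP
    inner_max: float
    probability: float = 0.0  # ignored for the default block


def pick_block(blocks, default_block, rng):
    """Pick a block by its probability; the shortfall from 1.0 goes to the default block."""
    draw = rng.random()
    cumulative = 0.0
    for block in blocks:
        cumulative += block.probability
        if draw < cumulative:
            return block
    return default_block


def draw_block_distances(block, rng):
    """Return one distance per SNP: the first from the first range, the rest from the second."""
    n_snps = rng.randint(block.min_snps, block.max_snps)
    distances = [rng.uniform(block.first_min, block.first_max)]
    distances += [rng.uniform(block.inner_min, block.inner_max) for _ in range(n_snps - 1)]
    return distances


def build_chromosome(n_blocks, blocks, default_block, seed=0):
    """Draw n_blocks blocks and concatenate their SNP distances."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_blocks):
        distances.extend(draw_block_distances(pick_block(blocks, default_block, rng), rng))
    return distances
```

With an empty block list every draw falls through to the default block, so `build_chromosome(5, [], BlockDef(5, 10, 0.01, 0.015, 0.00001, 0.000025))` yields between 25 and 50 SNP distances.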
27. rrying_capacity
    GROWTH_RATE LOGISTIC initial_population variation growth_rate carrying_capacity
    GROWTH_RATE RICHARDS initial_population variation growth_rate carrying_capacity time_of_max_growth polarity
General Growth Rate Parameters
    MAX_POOL_SIZE integer
    MAX_POOL_SIZE 90000
    MIN_POOL_SIZE integer
    MIN_POOL_SIZE 1500
    TARGET_POP_SIZE integer
    TARGET_POP_SIZE 100000
Dataset Generation
    Case Control
        DATASET CC label affected unaffected genotype_error phenocopy missing
        DATASET CC sample 01 500 500 0.05 0.1 0.15
    Pedigree Data
        DATASET PED label genotype_error phenocopy missing
        DATASET PED family 01 0.05 0.1 0.15
        DATASET FAMTYPE affected unaffected extra_sibs number_of_families
        DATASET FAMTYPE 1 1 1 250
        DATASET FAMTYPE 1 0 0 150
        DATASET FAMTYPE 2 1 0 75
        DATASET FAMTYPE 1 0 3 50
    General Data Set Configuration Parameters
        DATASET_COUNT integer
        DATASET_COUNT 500
        BINARY_DATASETS Yes/No
        BINARY_DATASETS Yes
        USE_STD_PEDIGREE_HEADER on/off
        USE_STD_PEDIGREE_HEADER On
Locus Searching
28. s

Below is a list of parameters that are used in each of the growth curves.

initial_population
    The population that is created at generation 0.

variation (float)
    This value is used to simulate imperfect growth curves. It represents the percentage of fluctuation around the curve's value at a given generation. The amount of fluctuation is actually 1/2 the variation, so it is possible for the population at generation N+1 to be smaller than at N.

growth_rate (float)
    This is the rate of growth. While it is applied differently for each model, the higher the growth rate, the faster the growth.

carrying_capacity (integer)
    This is used in logistic-style growths and specifies the ceiling of the growth curve. As the population approaches this value, it becomes less and less exponential in nature until it becomes static.

time_of_max_growth (integer)
    Used only in Richards' logistic, this parameter effectively moves the exponential part of an S curve along the X axis in the direction of the generation specified.

polarity (float)
    Used only in Richards' logistic, this parameter affects the draw of the curve toward the carrying capacity.

To set up a growth rate, the user should configure one of the following:

GROWTH_RATE LINEAR initial_population variation growth_rate
GROWTH_RATE LINEAR 30000 0.05 10.0

This is just a straight line that grows by growth_rate each generation.

GROWTH_RATE EXPONENTIAL initial_population variation gr
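As a rough illustration of how the variation parameter interacts with a growth curve, the Python sketch below applies a fluctuation of at most half the variation, in either direction, to a linear curve. The uniform distribution, the decision to leave generation 0 exact, and all function names are assumptions made for this example; the manual does not specify them.

```python
import random


def linear_curve(initial_population, growth_rate, generation):
    """Straight line that grows by growth_rate individuals each generation."""
    return initial_population + growth_rate * generation


def apply_variation(curve_value, variation, rng):
    """Perturb the curve value by up to half the variation in either direction (assumed uniform)."""
    half = variation / 2.0
    return int(round(curve_value * (1.0 + rng.uniform(-half, half))))


def simulate_linear(initial_population, variation, growth_rate, generations, seed=0):
    """Population sizes for generations 0..generations; generation 0 is left exact."""
    rng = random.Random(seed)
    sizes = [int(initial_population)]
    for g in range(1, generations + 1):
        sizes.append(apply_variation(linear_curve(initial_population, growth_rate, g), variation, rng))
    return sizes
```

Calling `simulate_linear(30000, 0.05, 10.0, 50)` mirrors the GROWTH_RATE LINEAR example: the curve gains 10 individuals per generation, while the fluctuation can be as large as 2.5% of the curve value, so a later generation can easily be smaller than the one before it.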
29. t whitespace. These labels are generally used for reporting, but in many cases are used to determine filenames. As a result, users should avoid using unusual characters in the string that could cause problems with filenames. Because spaces and tabs are used to separate each parameter on a given line, labels cannot contain spaces.

description
    A description is a chunk of text that can contain spaces. It will always be at the very end of a line and is generally optional.

Using genomeSIMLA

Except in very specific cases, generating data sets with genomeSIMLA is a multistep process. At the very least, users must run genomeSIMLA forward through time, performing at least 1 drop along the way. It is this drop that the user's data sets will be drawn from. In addition to generational advancement and data set production, genomeSIMLA can pick up from a specified generation and advance further through time, or perform complete LD analysis. To control genomeSIMLA in this way, we offer a small number of parameters. It is important to note that a handful of these parameters must appear in a certain order: those that lack a T flag, where T is some parameter designator.

Command Line Arguments

genomeSIMLAs config file ld datasets p project name 1 Integer d Integer Integer Integer s Integer

Config file Specifi
30. ts are different. MAF is currently weighted higher than block size. However, each block a SNP is found in will add more to its final score, meaning it will rise higher in the report.

The description of a LOCUS_SELECTOR gets used in the locus report. Be as descriptive as is necessary. However, there can be no newline characters in it. Spaces are allowed, though.

ADD_REGION label snp_start snp_stop
ADD_REGION rare loci rs 321412 rs 543231

This adds a region to the selector rare loci. This region is bounded by the two SNPs rs 321412 and rs 543231. Both SNPs must be found and must exist on the same chromosome. By default, all searches are performed over the entire genome. However, if a user wishes to restrict the region to search, they can do so by adding regions: once a single region has been added, only the regions that have been added will be searched. To add a whole chromosome, simply add the first and last SNP geographically.

MAX_LOCI_PER_CHROM_REPORTED Integer
MAX_LOCI_PER_CHROM_REPORTED 50

Instructs genomeSIMLA to report at most N loci per chromosome for each selector described. Loci are ranked according to how well they fit the criteria, which could be rather extensive. By setting this value to a reasonable number, the locus report can be kept at a manageable size. Setting the value to 1 will catch all possible matching loci.

Disease Modeling

genomeSIMLA comes with 3 options for m
31. ut to show a bit more of the surrounding SNPs.

FONT filename
FONT FreeMonoBold.ttf

genomeSIMLA requires access to a TrueType font in order to write labels and details onto the graphical portions of the reports. This font should be available to genomeSIMLA at execution time. If the file can't be found, a large number of warnings will be written to STDOUT and none of the graphs will have any textual information on them, but execution will continue.

CSS_FILENAME filename
CSS_FILENAME genomesimla css

In order to make the reporting flexible, each report refers to a stylesheet which contains the necessary information about shading, spacing, and other details. An example stylesheet is provided with the application as well as with each of the examples. Users are welcome to change this to suit their needs, and should be aware that editing the stylesheet does not require anything to be done with genomeSIMLA. However, the stylesheet must be found where the configuration file says it is when the reports are read. The example above indicates that there will be a file named genomesimla.css that resides in the directory above the one from which the report is read. In other words, if there is a report named

home torstees genomesimla data test1 index 50 html

and I used the setting from the example above, the following file must exist:

home torstees genomesimla genomesimla css

If I copy the report(s) to a new filesyst
32. xample above is the target prevalence. The last parameter before the loci is the maximum interaction size. This is just the maximum number of loci that will be interacting with one another. The MIN/MAJ value determines whether the disease is associated with the minor or major allele. The next parameter specifies the beta value associated with that locus. Finally, the last value required for each locus is the type. A 0.0 represents a recessive trait; the locus becomes more dominant as it approaches 1.0. Each locus to be considered must have each of these parameters: locus specification, MIN/MAJ, beta, and type.

SIMLA configuration file

The simla.cfg is just a file that specifies the beta values associated with each of the interactions desired. For each interaction, specify it in the following way:

1x2x3 0.26

This tells genomeSIMLA that the 1st, 2nd, and 3rd loci (in the order they are encountered on the DEFINE_MODEL line) interact with a beta value of 0.26. Users can add as many or as few interactions as they wish.
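To make the interaction syntax concrete, here is a minimal Python sketch that reads entries such as `1x2x3 0.26` into a mapping from locus indices to beta values. It is not genomeSIMLA's own parser: the function names are hypothetical, and the sketch assumes one interaction per line with the loci and the beta separated by whitespace.

```python
def parse_interaction(line):
    """Split an entry like '1x2x3 0.26' into ((1, 2, 3), 0.26)."""
    loci_part, beta_part = line.split()
    loci = tuple(int(token) for token in loci_part.lower().split("x"))
    return loci, float(beta_part)


def parse_simla_cfg(text, n_model_loci):
    """Parse every non-blank line; loci are 1-based positions on the DEFINE_MODEL line."""
    interactions = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        loci, beta = parse_interaction(line)
        if any(index < 1 or index > n_model_loci for index in loci):
            raise ValueError(f"interaction {loci} refers to a locus outside 1..{n_model_loci}")
        interactions[loci] = beta
    return interactions
```

Rejecting indices outside the number of model loci is a design choice of this sketch; it catches a mismatch between the interaction file and the model definition before any data are generated.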
33. you add in more required affected sibs. Most data sets can be generated in a few minutes, but be aware of the possibility of long delays for large numbers of affected sibs and rare disease models.

General Data Set Configuration Parameters

DATASET_COUNT integer
DATASET_COUNT 500

This indicates the number of files that will be created per data set. In this example, each data set created by this configuration would result in 500 unique files.

BINARY_DATASETS Yes/No
BINARY_DATASETS Yes

This compresses data sets dramatically, allowing whole-genome-size data sets to occupy a minimal amount of disk space. This format was developed in house and is not supported by any products other than those produced at the lab, and only now are we beginning to implement it in our own applications. If you are interested in the format, we will make it available on the wiki in the near future. In the meantime, feel free to contact us at genomeSIMLA@chgr.mc.vanderbilt.edu. This is currently not supported for pedigree datasets.

USE_STD_PEDIGREE_HEADER on/off
USE_STD_PEDIGREE_HEADER On

When on, all pedigree data sets will have 10 column headers. When off, the header count will be 6 columns.

Locus Searching

The main goal for genomeSIMLA is the production of realistic data sets. These might be very large, and choosing disease loci can be a daunting task when presented with over 2
