Untitled - CLC bio
Contents
1. Figure 1.18: An example of a variant that is filtered out when the pyro error filter is applied with settings 3 and 0.8, but not with settings 3 and 0.5. [Dialog screenshot: Result handling. Wizard steps: 1 Choose where to run, 2 Select read mappings, 3 Low frequency variant parameters, 4 General filters, 5 Noise filters, 6 Result handling. Output options: Create track, Create annotated table, Create report. Result handling: Open or Save. Log handling: Open log.] Figure 1.19: Output options. [Table view screenshot, 74,865 rows; columns include Region, Type, Reference, Allele, Reference allele, Length, Zygosity, Count, Coverage, Frequency, Probability, Forward read count, Reverse read count, Forward/reverse balance, Average quality, Read count and Read coverage.] Figure 1.20: A variant track shown in the table view. variant caller might detect two variants, A and G, at a given position in which the reference is A. In this case the variant corresponding to allele A will have Yes in the reference allele column entry and the variant corresp
2. User manual for Combined Variant Detection Beta Plugin 1.0. Windows, Mac OS X and Linux. April 25, 2014. This software is for research purposes only. CLC bio, a QIAGEN Company, Silkeborgvej 2, Prismet, DK-8000 Aarhus C, Denmark. Contents: 1 The Combined Variant Detection plugin; 1.1 Introduction; 1.2 The Variant Detection tools; 1.2.1 Basic Variant Detection; 1.2.2 Fixed Ploidy Variant Detection; 1.2.3 Low Frequency Variant Detection; 1.2.4 The Error Model estimation; 1.3 [section title illegible], with entries including Reference Masking, Coverage and count filters, Noise filters, Quality filters, Direction and position filters and Technology specific filters; 1.4 [section title illegible]; 1.4.1 The variant track output; 1.4.2 The annotated table output; 1.4.3 [entry illegible]; 2 Installation of the Combined Variant Detection; 3 Uninstall; Bibliography.
3. [Reads track and table view screenshot.] Figure 1.16: The same data as shown in figure 1.15, but now with the Disconnect paired reads option switched on in the reads track side panel. Technology specific filters. Remove pyro error variants: This filter can be used to remove insertions and deletions in the reads that are likely to be due to pyro-like errors in homopolymer regions. There are two types of such errors. They may occur either (1) at the immediate ends of homopolymer regions or (2) as an overspill a few nucleotides downstream of a homopolymer region. In case 1 the exact number of repeats of the same nucleotide is uncertain, and a sequence like AAAAAAAA is sometimes reported as AAAAAAAAA. In case 2 a sequence like CGAAAAAGTCG may sometimes get an overspill insertion of an A between the T and C, so that the reported sequence is CGAAAAAGTACG. Note that the removal is done in the reads as a very first step, before calling the initial 1 bp variants. There are two parameters that must be specified for this filter. In homopolymer regions with minimum length: Only insertion or deletion variants in homopolymer regions of at least this length will be removed. With f
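The "minimum length" parameter above depends on locating homopolymer regions in a sequence. The following is a minimal Python sketch of that one step only, not the plugin's implementation: the function name `homopolymer_runs` and the half-open coordinate convention are assumptions made for illustration, and the second (frequency) parameter is not modeled because its description is cut off in this excerpt.

```python
from itertools import groupby


def homopolymer_runs(sequence, min_length=3):
    """Return (start, end, base) for every run of one repeated base.

    Only runs of at least min_length bases are reported. Coordinates are
    0-based and end is exclusive (an illustrative convention, not the
    plugin's).
    """
    runs = []
    position = 0
    for base, group in groupby(sequence):
        length = len(list(group))
        if length >= min_length:
            runs.append((position, position + length, base))
        position += length
    return runs


# The manual's overspill example: five A's starting at index 2.
print(homopolymer_runs("CGAAAAAGTCG"))  # [(2, 7, 'A')]
```

An insertion or deletion would then be a candidate for removal only if it falls at the end of, or a few nucleotides downstream of, one of the reported runs.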
4. [Track list screenshot: variant tracks with 884, 796 (lowFreq) and 233 variants above the chr2 mapping reads track (140,961 reads); the lowFreq track (796 rows) is open in the table view. Legible rows: chr2, region 91669382, SNV, reference G, allele G, reference allele Yes, length 1, heterozygous, count 196, coverage 201, frequency 97.51, probability 1.00, forward read count 184, reverse read count 43, forward/reverse balance 0.19; chr2, region 91669386, SNV, reference A, allele A, reference allele Yes, length 1, heterozygous, count 174, coverage 204, frequency 85.29, probability 1.00, forward read count 163, reverse read count 38, forward/reverse balance 0.19.] Figure 1.4: A variant is highlighted that is detected by the Basic and the Low Frequency but not by the Fixed Ploidy Variant Caller. The variant track for the Low Frequency Variant Caller variants is opened in the table view at the bottom of the figure. The variant is present at a moderate frequency in a hig
5. [Screenshot: reads track (region 1564283 to 1565532, 828 reads, mills guided locally realigned) with two variant tracks open in the table view, one with 2 rows and one with 4 rows. Legible rows: chromosome 1, region 1564712, SNV, reference G, allele T, reference allele No, heterozygous, count 55, coverage 95, frequency 57.89, probability 1.00, average quality 34.78; chromosome 1, SNV, reference G, allele G, reference allele Yes, heterozygous, count 40, coverage 95, frequency 42.11, probability 1.00, average quality 35.45.]
6. [Dialog screenshot residue: 1.0.] Figure 1.9: The Low Frequency Variant Detection parameters. 1.2.4 The Error Model estimation. The Fixed Ploidy and Low Frequency Variant Detection tools both rely on statistical models for the sequencing error rates. An error model is assumed and estimated for each quality score. Typically, low quality read nucleotides will have a higher error rate than high quality nucleotides. In the error models, different types of errors have their own parameter, so if, for example, A's more often tend to result in erroneous G's than in other nucleotides, that is also recognized by the error models. The parameters are all estimated from the data set being analyzed, so they will adapt to the sequencing technology used and the characteristics of the particular sequencing runs. Information on the estimated error rates can be found in the Reports (Section 1.4). [Report excerpt: 1.5 Estimated frequencies of actual to called bases, quality scores 20-29 (called across, actual below); table values illegible. Number of sequenced bases with quality scores 20-29: 382,854,867. 1.6 Estimated frequencies of actual to called bases, quality scores 30-39 (called across, actual below); legible rows, with row and column headers largely illegible: 99.979, 0.001, 0.008, 0.003, 0.008, 0.001; 0.010, 99.976, 0.002, 0.008, 0.002, 0.001; 0.008, 0.001, 99.974, 0.012, 0.004, 0.001; 0.003, 0.008, 0.002, 99.983, 0.003, 0.001; 0.000, 0.000, 0.000, 0.000, 0.000, 100.000.] Number of seq
7. Chapter 1 The Combined Variant Detection plugin. 1.1 Introduction. The Combined Variant Detection plugin contains three tools for detecting variants: the Basic Variant Detection tool, the Fixed Ploidy Variant Detection tool, and the Low Frequency Variant Detection tool. The tools differ in their underlying assumptions about the data, and hence differ in their assessments of when there is enough information in the data for a variant to be called. The tools, and the assumptions that they make about the data, are described in detail in Section 1.2. The tools share a set of filters. They relate to (a) which areas and positions of the read mappings should be inspected for variants, (b) which reads in the data should be considered when this assessment is done, (c) requirements on the coverage, frequency and absolute counts of variant-carrying reads, and (d) the quality and neighborhood composition of the area surrounding the variant. The filters are described in detail in Section 1.3. The variant callers operate in a step-wise fashion. In the first step, each nucleotide position is examined for the presence of a variant. In the second step, neighboring variant positions are examined to see if the variants are carried by the same reads. If so, the variants are joined: neighboring SNVs into MNVs, neighboring insertions and deletions into longer insertions and deletions, and neighboring SNVs
8. 4 but the statistical model for the sample is different. It does not make any assumptions about the ploidy of the sample. Instead, a statistical test is performed at each site to determine if the nucleotides observed in the reads at that site could be due simply to sequencing errors, or if they are significantly better explained by there being one or more alleles other than the reference present in the sample at some unknown frequency. If the latter is the case, a variant corresponding to the significant allele will be called with its estimated frequency. The Low Frequency Variant Detection tool has one parameter (Figure 1.9). Significance: this parameter determines the cut-off value for the statistical test for the variant not being due to sequencing errors. The higher the value you set, the more variants are called. The Low Frequency Variant Detection tool is suitable for analysis of samples of mixed tissue types (such as cancer samples) in which low-frequency variants are likely to be present, as well as for samples for which the ploidy is unknown or not well defined. The tool also calls more abundant variants and can be used for analysis of samples with ploidy larger than four. [Dialog screenshot: Low Frequency Variant Detection, Low frequency variant parameters, Significance]
9. [Track list screenshot: comparison tracks with legible labels and variant counts inLowFreqV2 notinFixedV2 (542), inLowFreqV2 notinBasicV2 (9), inBasicV2 notinFixedV2 (623) and inBasicV2 notinLowFreqV2 (99), above the chr2 mapping reads track (140,961 reads).] Figure 1.2: The differences in variants called by the three variant callers. The variant callers have all been run with the same filter settings, namely the defaults for the Low Frequency Variant Caller. Figure 1.2 shows variant calls produced by the three variant callers when run with the same filter settings (more precisely, those that are default for the Low Frequency Variant Caller). The Basic Variant Caller calls the most variants and the Fixed Ploidy the fewest (the numbers of called variants are shown in the left part of the figure under the variant track names basicV2, LowFreq and FixedV2). The Fixed Ploidy Variant Caller calls a subset of those called by the Low Frequency
10. biases (e.g. induced by the amplification or sequencing protocol) that may occur in samples. They should be used with care, as there is always the risk that a real variant has the characteristics of a systematically induced variant. [Dialog screenshot: Fixed Ploidy Variant Detection, Noise filters. Wizard steps: 1 Choose where to run, 2 Select read mappings, 3 Fixed ploidy variant parameters, 4 General filters, 5 Noise filters. Quality filters: Base quality filter. Direction and position filters: Read direction filter. Technology specific filters: Remove pyro error variants (checked), In homopolymer regions with minimum length 3, And with frequency below 0.9.] Figure 1.12: Noise filters. Quality filters. Base quality filter: The base quality filter can be used to ignore the reads whose nucleotide at the potential variant position is of dubious quality. This is assessed by considering the quality of the nucleotides in the read in the region around the nucleotide position. There are three parameters that determine the base quality filter. Neighborhood radius: This parameter determines the region size. When a neighborhood radius of five is used, each nucleotide in a read is evaluated based on the nucleotides in the read 5 positions upstream and 5 positions downstream of the examined site, a total of 11 nucleotide
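The neighborhood radius arithmetic above (radius 5 gives 5 + 1 + 5 = 11 positions) can be sketched in Python as follows. This is an illustration only: the function name `neighborhood_indices` is hypothetical, and clipping the window at the ends of the read is an assumption, since this excerpt does not say how the plugin treats positions near a read end.

```python
def neighborhood_indices(read_length, position, radius=5):
    """Indices of the read positions within `radius` of `position`.

    The window is centered on `position` (0-based) and is clipped to the
    read boundaries, which is an assumption made for this sketch.
    """
    start = max(0, position - radius)
    end = min(read_length, position + radius + 1)
    return range(start, end)


# In the middle of a read, radius 5 covers 11 nucleotides.
print(len(neighborhood_indices(100, 50)))  # 11
# Near the start of the read the window is shorter under this assumption.
print(list(neighborhood_indices(100, 2)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```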
11. read count: The number of countable forward reads supporting the allele (see under Count above for an explanation of countable reads). Reverse read count: The number of countable reverse reads supporting the allele (see under Count above for an explanation of countable reads). Forward/reverse balance: The minimum of the fraction of countable forward reads and countable reverse reads carrying the variant among all countable reads carrying the variant (see under Count above for an explanation of countable reads). Average quality: The average read quality score of the bases supporting a variant. Read count: The number of countable reads supporting the allele. Only countable reads are considered (see under Count above for an explanation of countable reads). Note that each read in an overlapping pair contributes 1. To view the reads in pairs in a reads track as single reads, check the Disconnect paired reads option in the side panel of the reads track. Please see the column Count above for a column that reports the value for fragments rather than for reads. Read coverage: The read coverage at this position. Only countable reads are considered (see under Count above for an explanation of countable reads). Note that each read in an overlapping pair contributes 1. To view the reads in pairs in a reads track as single reads, check the Disco
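The Forward/reverse balance definition above can be written out as a short Python sketch. The function name `forward_reverse_balance` and the choice to return 0.0 when there are no supporting reads are assumptions made for illustration. The example counts are values legible in the manual's table screenshots (forward 184 and reverse 43 with balance 0.19; forward 3 and reverse 11 with balance 0.21).

```python
def forward_reverse_balance(forward_count, reverse_count):
    """Minimum of the forward and reverse fractions among supporting reads.

    Both counts are countable reads carrying the variant. Returning 0.0
    for a variant with no supporting reads is an assumption of this sketch.
    """
    total = forward_count + reverse_count
    if total == 0:
        return 0.0
    return min(forward_count, reverse_count) / total


print(round(forward_reverse_balance(184, 43), 2))  # 0.19
print(round(forward_reverse_balance(3, 11), 2))    # 0.21
```

A value close to 0.5 means the variant is supported about equally from both directions, while a value near 0 means the support comes almost exclusively from one direction.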
12. [Reads track side panel residue: Float variant reads to top, Show quality scores, Matching residues as dots, Only show coverage graph, Highlight reverse paired reads. Table view, 854 rows. Legible rows: chromosome 1, region 32671628, SNV, reference A, allele A, reference allele Yes, length 1, heterozygous, count 14, coverage 24, frequency 58.33, probability 1.00, forward read count 3, reverse read count 11, forward/reverse balance 0.21, average quality 24.57; chromosome 1, region 36807592, SNV, reference T, allele T, reference allele Yes, length 1, heterozygous, count 34, coverage 46, frequency 73.91, probability 0.95, forward read count 4, reverse read count 30, forward/reverse balance 0.12, average quality 37.65.] Figure 1.13: An example of a variant that is removed by the base quality filter. [Track list screenshot: Homo sapiens hg19 sequence, variant tracks withBaseQfilter (7,032 variants) and removedByBaseQfilter, and a reads track (mills guided locally realigned, 14,679,825 reads).]
13. Variant caller, which in turn calls a subset of those called by the Basic Variant caller, in spite of the fact that there are 9 variants in the Low Frequency variant track that are not in the Basic Variant track. Although those 9 variants are in fact not in the Basic Variant track, they are sub-variants of variants in that track. The highlighted variant in the figure is an example of this. The Basic variant caller has called a heterozygous 2 bp MNV. The Low Frequency variant caller has judged that one of the SNVs constituting this 2 bp MNV is likely to be the result of sequencing errors and has only called one of the SNVs. [Track list screenshot: chr2 genome (positions 91,669,360 to 91,669,420) with variant tracks basicV2 (884 variants), lowFreq (796 variants) and fixedV2 (233 variants) above the chr2 mapping reads track.]
14. Chapter 3 Uninstall. Plugins are uninstalled using the plugin manager: Help in the Menu Bar | Plugins and Resources, or Plugins in the Toolbar. This will open the dialog shown in figure 3.1. [Dialog screenshot: Manage Plugins and Resources, with tabs Download Plugins and Manage Resources, listing installed plugins such as Annotate with GFF file, CLC Microbial Genome Finishing Module and CLC Workbench Client Plugin, and buttons Proxy Settings, Check for Updates, Install from File and Close.] Figure 3.1: The plugin manager with plugins installed. The installed plugins are shown in this dialog. To uninstall: Click the Combined Variant Detection Uninstal
15. [Reads track screenshot: Homo sapiens hg19 sequence with variant tracks labeled "with read direction" and "no read direction" above the mapped reads.]
16. Figure 1.14: The same data as in figure 1.13, now with the Show quality scores option in the reads track switched on. Direction and position filters. Many sequencing protocols are prone to various types of amplification-induced biases and errors. The Read direction and Read position filters are aimed at weeding out variants that are likely to originate from such biases. Read direction filter: The read direction filter removes variants that are almost exclusively present in either forward or reverse reads. For many sequencing protocols such variants are most likely to be the result of amplification-induced errors. Note, however, that the filter is NOT suitable for amplicon data, as for this you will not expect coverage of both forward and reverse reads. The filter has a single parameter. Direction frequency: Variants that are not supported by at least this frequency of reads from each direction are removed. Read position filter: The read position filter attempts to remove systematic errors in a similar fashion to the Read direction filter, but is also suitable for amplicon data. It removes variants that are located differently in the reads carrying them than would be expected given the general location of the reads covering the variant site. This is done by categorizing each sequenced nucleotide (or gap) according to the mapping direction of the read and also where in the
17. Figure 1.15: An example of a variant that is filtered out by the Read Direction filter. Note that variant calling was done ignoring non-specific matches and broken pair reads, so only the 16 intact paired reads (the blue reads) are considered. To see the direction of the reads, you must adjust the viewer settings in the Reads track side panel to Disconnect paired reads. This has been done in Figure 1.16. Now it becomes apparent that the variant is found in the forward reads (that is, the green reads) of the 16 intact paired reads and in no reverse reads, except the three that come from broken pairs and which were ignored. The variant is therefore removed by the read direction filter. Figure 1.17 shows an example of a variant that is removed by the read position filter but not by the read direction filter. The variant is only present in a portion of the reads that cover the variant, and the portion of the reads that carry the variant have the variant occurring in read positions that are systematically different from what you would expect given the general placement of reads covering the variant (e.g. none of the reads that start after position 186,641,600 carry the variant).
18. [Track list screenshot: Homo sapiens hg19 sequence, variant tracks withBaseQfilter (7,032 variants) and removedByBaseQfilter, and a reads track (mills guided locally realigned, 14,679,825 reads), with the Track List Settings side panel open: Data aggregation above 100bp, Graph color, Fix maximum of coverage graph, Hide insertions below, Highlight variants.]
19. [Track list screenshot with the Track List Settings side panel open. Side panel options: Find, Track layout, DNA sequence track, Variants track, Reads track, Data aggregation above 100bp, Graph color, Fix maximum of coverage graph, Hide insertions below, Highlight variants, Float variant reads to top, Disconnect paired reads, Show quality scores, Matching residues as dots, Show read type specific coverage, Only show coverage graph, Highlight reverse paired reads, Text format. Table view of removedByBaseQfilter, 854 rows, listing SNVs at regions 32671628 and 36807592 on chromosome 1.]
20. [Track list screenshot: the basicV2 variant track (884 rows) is open in the table view.] Figure 1.3: A variant is highlighted that is detected by the Basic Variant Caller but not by the Low Frequency or the Fixed Ploidy Variant Caller. The variant track for the Basic Variant Caller variants is opened in the table view at the bottom of the figure. The variant is present at a low frequency in a high coverage position and is likely to have been caused by sequencing error. In figure 1.3 a variant is highlighted that is detected by the Basic Variant Caller but not by the Low Frequency or the Fixed Ploidy Variant Caller. The variant is present at a low frequency in a high coverage position. The Low Frequency Variant Caller compares this evidence to the er
21. and insertions or deletions into replacements. The filters are applied at various stages: some before the initial 1 bp variants are found and some after. Figure 1.1 shows a schematic representation of the procedure. As the tools differ in their model assumptions about the data, they will not call the same variants. However, when run with the same filter settings, you will generally find that: The Basic Variant Caller will call the highest number of variants. The Low Frequency Variant Caller will call a subset of the variants called by the Basic Variant caller. The variants called by the Basic Variant Caller that the Low Frequency Variant Caller will NOT call are those that, according to the error model that the Low Frequency Variant Caller estimates from the data, are likely to have been caused by sequencing errors. [Figure 1.1 schematic, first part; legible text: Call 1 bp variants. Filters: Pyro error, Base quality, Probability/significance, Read filters (broken, non-specific), Masking, max coverage, General filters (min count, min coverage and frequency), Read direction, read position. Examine each position and call 1 bp variants. Method (iterative): 1. Mark potential variant positions given General filters. 2. Estimate error rates accounting for potential variant positions. 3. Given the estimated error model, call single position variants. 4. Go to 1.] Method: In all regions where you have neighboring 1 bp variants, tabulate how m
22. [Plugin manager screenshot: list of plugins available for download, including Annotate with GFF file, Batch Rename, Biobase Genome Trax, Blast2GO PRO and Additional Alignments, with the description of the Additional Alignments plugin shown.] Figure 2.1: The plugins that are available for download.
23. ants that span multiple positions (see below). The rare variant tool makes statistical tests for the various possible explanations for each site. This means that the probability for the called variant must be estimated separately, since it is not part of the actual variant calling. This is done by assigning prior probabilities to the various explanations for a site in a way that makes the probability for two explanations equal in exactly the situation where the statistical test shifts from preferring one explanation to the other. For a given single site variant, the probability is then calculated as the sum of probabilities for all the explanations containing that variant. So if a G variant is called, the reported probability is the sum of probabilities for these configurations: G, A/G, C/G, G/T, A/C/G, A/G/T, C/G/T and A/C/G/T (and also all the configurations containing deletions together with G). For multi-position variants, an estimate is made of the probability of observing the same read data if the variant did not exist and all observations of the variant were due to sequencing errors. This is possible since a sequencing error model is found for both the fixed ploidy and rare variant tools. The probability column contains one minus this estimated probability. If this value is less than 50%, the variant might as well just be the result of sequencing errors, and it is not reported at all. Forward
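The list of configurations containing G, and the summation over them, can be illustrated with a small Python sketch. It is not the plugin's code: the function names are hypothetical, the probabilities in the usage example are made-up numbers, and configurations containing deletions are left out for brevity even though the manual says they are included.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def configurations_containing(allele, alphabet=NUCLEOTIDES):
    """List every allele combination (site explanation) that includes `allele`.

    Deletions are ignored here, which is a simplification of this sketch.
    """
    configs = []
    for size in range(1, len(alphabet) + 1):
        for combo in combinations(alphabet, size):
            if allele in combo:
                configs.append("/".join(combo))
    return configs


def variant_probability(allele, config_probabilities):
    """Sum the probabilities of all configurations that contain `allele`."""
    return sum(
        probability
        for config, probability in config_probabilities.items()
        if allele in config.split("/")
    )


print(configurations_containing("G"))
# ['G', 'A/G', 'C/G', 'G/T', 'A/C/G', 'A/G/T', 'C/G/T', 'A/C/G/T']

# Hypothetical probabilities for one site, for illustration only.
example = {"A": 0.10, "A/G": 0.60, "G": 0.25, "C/T": 0.05}
print(round(variant_probability("G", example), 2))  # 0.85
```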
24. any reads carry the combinations of the variants. [Figure 1.1 schematic, continued; legible text: Join 1 bp variants into longer variants: Examine reads and join neighboring 1 bp variants that occur in the same reads into >1 bp variants. Merge the counts of a combination into another, higher count combination if the low count could have arisen as a sequencing error of the high count combination. Calculate probabilities of combined (joined) variants and remove those that are likely to be the result of sequencing errors. Filters: General filters (min count, min coverage and frequency), Probability/significance. Output final variants.] Figure 1.1: A schematic representation of the variant calling procedure of the three variant callers. The Fixed Ploidy Variant Caller will call a subset of the variants called by the Low Frequency Variant caller. The variants called by the Low Frequency Variant Caller that the Fixed Ploidy Variant Caller will NOT call are those that, according to the assumed ploidy of the sample analyzed and the error model that the Fixed Ploidy Variant Caller estimates from the data, are likely to have been caused by either mapping errors or by sequencing errors. [Track list screenshot: chr2 genome with variant tracks basicV2 (884 variants), lowFreq (796 variants) and fixedV2 (233 variants), and comparison tracks inFixedV2 notinLowFreqV2 (0 variants) and inFixedV2 notinBasicV2.]
25. ed variant table and a report (Figure 1.19). The report contains information on the estimated error model. As only the Fixed Ploidy and the Low Frequency variant callers use an error model, the report is available only for those and not for the Basic Variant caller. The outputs are described below. 1.4.1 The variant track output. The variant track contains information on each of the variants called. When opened in the table view, there are a number of columns for each of the variants (see figure 1.20). The contents of these are: Chromosome: The name of the reference sequence on which the variant is located. Region: The region on the reference sequence at which the variant is located. The region may be either a single position, a region or a between-position region. Variant type: The type of variant. This can be SNV (single-nucleotide variant), MNV (multi-nucleotide variant), insertion, deletion or replacement. Reference: The reference sequence at the position of the variant. Allele: The allele sequence of the variant. Reference allele: Describes whether the variant is identical to the reference. This will be the case for one of the alleles for most, but not all, detected heterozygous variants, e.g. the [Figure residue: Homo sapiens hg19 reference sequence and a locally realigned read mapping track]
26. ely on any assumptions on the data and does not estimate any error models. It can be used on any type of sample. It will call a variant if it satisfies the requirements that you specify when you set the filters (see Section 1.3). The tool has a single parameter (Figure 1.7) that is specific to this tool: the user is asked to specify the ploidy of the sample that is being analyzed. The value of this parameter does not affect which variants are called; it merely determines the contents of the hyper-allelic column that is added to the variant track table: variants that occur in positions with more variants than expected given the specified ploidy will have Yes in this column, other variants will have No. [Dialog: Basic Variant Parameters; Ploidy: 2] Figure 1.7: The Basic Variant Detection parameters. 1.2.2 Fixed Ploidy Variant Detection. The Fixed Ploidy Variant Detection tool relies on two models: (1) a model for the possible site types and (2) a model for the sequencing errors. The model for the possible site types (i) depends on the user-specified ploidy parameter. For a diploid organism there are two alleles and thus the site types are A/A, A/C, A/G, A/T, A/-, C/C, a
27. eport. In addition to the estimated error rates of the different types of errors (shown in figure 1.10), the report contains information on the total error rates for each quality score, as well as a distribution of the qualities of the individual bases in the reads in the read mapping at the sites that were examined for variants (see figure 1.21). [Report plots: 1.1 Error rates for quality categories (error rate against quality); 1.2 Qualities of examined sites (counts against quality)] Figure 1.21: Part of the contents of the report on the variant calling. Chapter 2 Installation of the Combined Variant Detection. The Combined Variant Detection is installed as a plugin. Plugins are installed using the plugin manager: Help in the Menu Bar | Plugins and Resources... or Plugins in the Toolbar. The plugin manager has three tabs at the top: • Manage Plugins: This is an overview of plugins that are installed. • Download Plugins: This is an overview of available plugins on CLC bio's server. • Manage Resources: This is an overview of resources that are installed. To install a plugi
28. h coverage position and is, under the assumed ploidy, most likely to have been caused by mapping error. Here you are presented with the three tools (see figure 1.5). [Toolbox folder: Variant Detectors (beta), containing Basic Variant Detection, Fixed Ploidy Variant Detection and Low Frequency Variant Detection] Figure 1.5: The Variant Detectors. When double-clicking one of the tools, a dialog is opened where you select the reads track or read mapping you want to analyze. [Dialog: Fixed Ploidy Variant Detection, step 2 Select read mappings, with the Navigation Area on the left and the Selected elements list on the right] Figure 1.6: Select the read mapping that you want to analyze. Click Next when the reads track is listed in the right-hand side of the dialog. The user is next asked to set the parameters that are specific to the variant detection tool. The three tools, their assumptions and the tool-specific parameters are described here. 1.2.1 Basic Variant Detection. The Basic Variant Detection tool does not r
29. if they do not stem from repeat regions. Coverage and count filters. These filters specify absolute requirements for the variants to be called. Note that suitable values for these filters are highly dependent on the coverage in the sample being analyzed. • Minimum coverage: Only variants in regions covered by at least this many reads are called. • Minimum count: Only variants that are present in at least this many reads are called. • Minimum frequency: Only variants that are present at at least this frequency (calculated as count/coverage) are called. These values are calculated for each of the detected candidate variants. If the candidate variant meets the specified requirements, it is called. Note that when the values are calculated, only the countable reads are considered. The countable reads are those that the user has not chosen to ignore. This means that if the user has specified in the read filter that reads from broken pairs should be ignored, broken-pair reads will not be countable. The same goes for non-specific reads. Also note that overlapping paired reads count as only one read, since they represent only one fragment. 1.3.2 Noise filters. The Noise filters examine each candidate variant at a more detailed level. They are intended as a means of filtering out variants that are likely to be the result of various types of systematic errors and/or
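The three coverage and count requirements can be sketched as a single predicate. This is a hedged illustration only: the function name is hypothetical, the default thresholds (10, 2 and 20.0) are the values visible in the General filters dialog, and expressing the frequency as a percentage is an assumption based on the frequency values shown in the variant table.

```python
def passes_general_filters(count, coverage,
                           min_coverage=10, min_count=2, min_frequency=20.0):
    """Return True if a candidate variant meets all three requirements.

    `count` and `coverage` are assumed to be computed from countable reads
    only, with overlapping paired reads counted as one fragment.
    """
    if coverage <= 0 or coverage < min_coverage:
        return False
    if count < min_count:
        return False
    frequency = 100.0 * count / coverage  # percent, as in the variant table
    return frequency >= min_frequency

# 5 supporting fragments out of 24 gives 20.83%, which passes at 20.0%.
example_called = passes_general_filters(count=5, coverage=24)
# 29 out of 204 gives 14.22%, which fails at 20.0% but passes at 10.0%.
example_low_frequency = passes_general_filters(count=29, coverage=204)
```

Because suitable thresholds depend strongly on the coverage of the sample, the defaults above should be read as placeholders rather than recommendations.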
30. l. If you do not wish to completely uninstall the plugin, but you do not want it to be used the next time you start the Workbench, click the Disable button. When you close the dialog, you will be asked whether you wish to restart the Workbench. The plugin will not be uninstalled until the Workbench is restarted.
31. n, click the Download Plugins tab. This will display an overview of the plugins that are available for download and installation (see figure 2.1). Clicking a plugin will display additional information at the right side of the dialog. This will also display a button: Download and Install. Click the Combined Variant Detection and press Download and Install. A dialog displaying progress is now shown, and the plugin is downloaded and installed. If the Combined Variant Detection is not shown on the server, and you have it on your computer (e.g. if you have downloaded it from our web site), you can install it by clicking the Install from File button at the bottom of the dialog. This will open a dialog where you can browse for the plugin. The plugin file should be a file of the type .cpa. When you close the dialog, you will be asked whether you wish to restart the CLC Genomics Workbench. The plugin will not be ready for use until you have restarted. Footnote: In order to install plugins on Windows Vista, the Workbench must be run in administrator mode: right-click the program shortcut and choose Run as Administrator. Then follow the procedure described below. [Dialog residue: Manage Plugins and Resources, showing the Additional Alignments plugin entry]
32. nd so on until -/-. The error model (ii) specifies the probabilities of the analyzed sample having a certain base in the sequenced position but a different base being called in a read at that position. The error model is estimated from the data prior to calling the variants (see Section 1.2.4). The Fixed Ploidy algorithm will, given the estimated error model and the data observed in the site, calculate the probabilities of each of the site types. One of those site types is the site that is homozygous for the reference, that is, it stipulates that whatever differences from the reference nucleotide are observed in the reads are due to sequencing errors. The remaining site types are those which stipulate that at least one of the alleles in the sample is different from the reference. The sum of the probabilities for these latter site types is the posterior probability that the sample contains at least one allele that differs from the reference at this site. The Fixed Ploidy Variant Detection tool has two parameters: the Ploidy and the Variant probability parameters (Figure 1.8). • The ploidy is the ploidy of the analyzed sample. The value that the user sets for this parameter determines the site types that are considered in the model. • The variant probability is the minimum value required for the posterior probability that the sample contains at least one allele that differs fr
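The site-type model and the posterior probability described above can be illustrated with a toy calculation. This is a sketch under strong simplifying assumptions, none of which come from the manual: a uniform prior over site types, a single symmetric error rate instead of the per-quality error matrices the tool estimates, and hypothetical function names throughout.

```python
from itertools import combinations_with_replacement
from math import prod

SYMBOLS = "ACGT-"  # four nucleotides plus a gap

def site_types(ploidy):
    """Unordered allele combinations, e.g. A/A, A/C, ... for ploidy 2."""
    return list(combinations_with_replacement(SYMBOLS, ploidy))

def read_likelihood(base, site, error_rate):
    """P(observed base | site type) under a toy symmetric error model."""
    wrong = error_rate / (len(SYMBOLS) - 1)
    per_allele = [(1 - error_rate) if base == allele else wrong
                  for allele in site]
    return sum(per_allele) / len(site)  # each allele equally likely to be read

def variant_posterior(observed, reference, ploidy=2, error_rate=0.01):
    """Posterior probability that the site is NOT homozygous for the reference."""
    weights = {site: prod(read_likelihood(b, site, error_rate) for b in observed)
               for site in site_types(ploidy)}  # uniform prior (assumption)
    hom_ref = tuple(reference for _ in range(ploidy))
    return 1 - weights[hom_ref] / sum(weights.values())

clean_site = variant_posterior("A" * 20, reference="A")
mixed_site = variant_posterior("A" * 10 + "G" * 10, reference="A")
```

In this toy model a site covered only by reference bases gets a posterior close to 0, while an even A/G mix gets a posterior close to 1; a variant would be called only when the posterior exceeds the required variant probability.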
33. nnect paired reads option in the side panel of the reads track. Please see the column Coverage above for a column that reports the value for fragments rather than for reads. Unique end positions: The number of reads with different end positions that support the variant. BaseQRankSum: The BaseQRankSum column in the variant table contains an evaluation of the quality scores in the reads that have a called variant compared with the quality scores of the reference allele. Variants for which no corresponding reference allele is called do not have a BaseQRankSum value. Likewise, no values are calculated for reference alleles. The score is a Z-score, so a value of -2.0 means that the observed qualities for the variant are two standard deviations below the qualities for the reference allele. The scoring is performed using a Mann-Whitney U test for comparing the two sets of quality scores from the reference allele and the variant. Homopolymer: The column contains Yes if the variant is likely to be a homopolymer error and No if not. This is assessed by inspecting all variants in homopolymeric regions longer than 2. A variant will get the mark Yes if it is a homopolymeric length variation of the reference allele, or a length variation of another variant that is a homopolymeric variation of the reference allele. 1.4.2 The annotated table output. The Annotated table output contains an old-style variant format output. 1.4.3 The r
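A Z-score of the kind described for the BaseQRankSum column can be sketched with the textbook normal approximation to the Mann-Whitney U statistic. The manual does not say how the plugin handles ties or whether it applies a continuity correction, so this is an assumption-laden illustration with a hypothetical function name, not the plugin's scoring code.

```python
from math import sqrt

def rank_sum_z(variant_quals, reference_quals):
    """Z-score of the Mann-Whitney U statistic (normal approximation).

    Negative values mean the variant-supporting bases tend to have lower
    quality than the reference-supporting bases. Returns None when either
    group is empty, mirroring the rule that no value is reported without
    a corresponding reference allele.
    """
    n1, n2 = len(variant_quals), len(reference_quals)
    if n1 == 0 or n2 == 0:
        return None
    # U counts the pairs in which the variant quality is higher; ties count 0.5.
    u = sum(1.0 if v > r else 0.5 if v == r else 0.0
            for v in variant_quals for r in reference_quals)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    return (u - mean_u) / sd_u

z_low_quality_variant = rank_sum_z([10, 12, 11], [35, 36, 37, 38])
z_same_quality = rank_sum_z([30, 30], [30, 30])
```

For large read sets, a library routine such as SciPy's `scipy.stats.mannwhitneyu` would be the more usual choice than this quadratic pairwise loop.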
34. om the reference at this site before calling a variant. Only variants with a probability higher than the specified value will be called. That means that the higher the value you set, the fewer variants are called. As the Fixed Ploidy Variant Detection tool strongly depends on the model assumed for the ploidy, the user should carefully consider the validity of the ploidy assumption made for the sample. The tool allows ploidy values up to and including 4 (tetraploids). For higher ploidy values the number of possible site types is too large for estimation and computation to be feasible, and the user should use the Low Frequency or Basic Variant Detection tools. [Dialog: Fixed ploidy variant parameters; Ploidy: 2; Required variant probability: 90.0] Figure 1.8: The Fixed Ploidy Variant Detection parameters. 1.2.3 Low Frequency Variant Detection. Like the Fixed Ploidy Variant Detection tool, the Low Frequency Variant Detection tool relies on (1) a statistical model for the analyzed sample and (2) a model for the sequencing errors. The method employed in the Low Frequency Variant Detection tool for estimating the sequencing error rates is similar to that of the Fixed Ploidy Variant Detection tool, see Section 1.2
35. onding to allele G would have No. Had the variant caller called the two variants C and G at the position, both would have had No in the Reference allele column. Length: The length of the variant. The length is 1 for SNVs, and for MNVs it is the number of allele or reference bases (which will always be the same). For deletions it is the length of the deleted sequence, and for insertions it is the length of the inserted sequence. For replacements, both the length of the replaced reference sequence and the length of the inserted sequence are considered, and the longer of those two is reported. Zygosity: The zygosity of the variant called, as determined by the variant caller. This will be either Homozygous, where there is only one variant called at that position, or Heterozygous, where more than one variant was called at that position. Count: The number of countable fragments supporting the allele. The countable fragments are those that are used by the variant caller when calling the variant. Which fragments are countable depends on the user settings when the variant calling is performed: if, for example, the user has chosen Ignore broken pairs, reads belonging to broken pairs are not countable. Note that although overlapping paired reads have two reads in their overlap region, they represent only one fragment and are counted only as one. Please
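The rules for the Length and Zygosity columns can be written down compactly. The sketch below is illustrative only: the function names and the type strings are hypothetical, and it assumes that the deleted or inserted sequence is available as a plain string.

```python
def variant_length(variant_type, reference, allele):
    """Length column: rules per variant type as described in the text."""
    if variant_type == "SNV":
        return 1
    if variant_type == "MNV":
        return len(allele)  # same as len(reference) for an MNV
    if variant_type == "Deletion":
        return len(reference)  # length of the deleted sequence
    if variant_type == "Insertion":
        return len(allele)  # length of the inserted sequence
    if variant_type == "Replacement":
        return max(len(reference), len(allele))
    raise ValueError(f"unknown variant type: {variant_type}")

def zygosity(variants_called_at_position):
    """Homozygous if exactly one variant is called at the position."""
    return "Homozygous" if variants_called_at_position == 1 else "Heterozygous"

deletion_length = variant_length("Deletion", reference="CC", allele="")
replacement_length = variant_length("Replacement", reference="AC", allele="GTTA")
```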
36. read the nucleotide is found: each read is divided into five parts along its length, and the part number of the nucleotide is recorded. This gives a total of ten categories for each sequenced nucleotide, and a given site will have a distribution between these ten categories for the reads covering the site. If a variant is present in the site, you would expect the variant nucleotides to follow the same distribution. If the read position distribution of the variant nucleotides differs significantly from the expected, the variant is filtered out. The filter has one parameter: Significance: Variants whose read position distribution is significantly different from the expected with a test at this level are removed. Figure 1.15 shows an example of a variant that is removed by the Read direction filter. [Figure residue: Homo sapiens hg19 sequence; variant tracks "with read direction filter" (3 variants) and "no read direction filter" (4 variants); read mapping of 344 reads]
37. requency below: Only insertion or deletion variants whose frequency (ignoring all non-reference and non-homopolymer variant reads) is lower than this threshold will be removed. Note that the higher you set the With maximum frequency parameter, the more variants will be removed. Figure 1.18 shows an example of a variant that is called when the pyro error filter is used with minimum length setting 3 and frequency setting 0.5, but that is filtered out when the frequency setting is increased to 0.8. The variant has a frequency of 55.71%. [Figure residue: Homo sapiens hg19 sequence; variant tracks "removed by read direction filter not by read position" and "removed by read position filter not by read direction"; locally realigned paired reads track; table view with 219 rows] Figure 1.17: A variant that is filtered out by the Read position filter but not by the Read direction filter. 1.4 Output options. The Variant Detection tools have the following outputs: a variant track, an annotat
38. ror model and has decided that the three reads carrying the variant are likely to be the result of sequencing errors rather than the result of a true variant. Figure 1.4 highlights a variant that is detected by both the Basic and the Low Frequency Variant Callers but not by the Fixed Ploidy Variant Caller. The variant is present at a higher frequency (14.22%) in a high-coverage region (coverage 204). Observing the variant in 29 out of 204 reads is not likely to be due to sequencing errors. However, observing 29 reads from one allele and the remaining from the other in a diploid sample is highly unlikely, and the Fixed Ploidy Variant Caller judges that this variant is most likely caused by mapping errors (that is, a subset of the reads in the region being mapped there spuriously) and filters out this variant. Below, we first describe the three variant detection tools (Section 1.2). Each of the tools has a set of parameters that are specific to that tool. Second, we describe the filtering and output options that are shared among the tools (Section 1.3). 1.2 The Variant Detection tools. To run the Variant Detection tools in the Combined Variant Detection plugin, go to: Toolbox | Resequencing Analysis | Variant Detectors (beta). Track list shown in the figure: chr2 Genome; basicV2
39. s. Note that near the end of the reads, eleven nucleotides are still considered, by changing the region offset relative to the nucleotide in question. Minimum central quality: Reads whose central base has a quality below this value are ignored. Minimum neighborhood quality: Reads for which the minimum quality of the bases within the specified neighborhood radius is below this value are ignored. Figure 1.13 gives an example of a variant that is called when the base quality filter is NOT applied and not called when it is. In figure 1.14 the same data is shown as in figure 1.13; however, now the Show quality scores option in the side panel of the reads track is switched on. This reveals that the reads that carry the potential G variant tend to have poor quality. As all reads that have a base with quality less than 20 in this potential variant position are ignored when the Base quality filter is turned on, no variant is called, most likely because it now does not meet the requirements of either the Minimum coverage, Minimum count or Minimum frequency filters. Note that the error in the example shown is a typical Illumina error: the reference has a T that is surrounded by stretches of G. The G signals drown the signal of the T. [Figure residue: Homo sapiens hg19 reference sequence track]
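The way the neighborhood window is shifted near the ends of a read can be sketched as follows. The function names are hypothetical. A radius of 5 is assumed because the text mentions eleven nucleotides, the central threshold of 20 is taken from the example discussed in the text, and the neighborhood threshold of 15 is an arbitrary placeholder.

```python
def neighborhood(quals, index, radius=5):
    """Qualities in a window of 2*radius + 1 bases around `index`.

    Near the ends of the read the window is shifted rather than shortened,
    so the same number of bases is still considered.
    """
    size = 2 * radius + 1
    if len(quals) <= size:
        return list(quals)  # read shorter than the window: use all of it
    start = min(max(index - radius, 0), len(quals) - size)
    return list(quals[start:start + size])

def read_passes_quality_filter(quals, index, radius=5,
                               min_central=20, min_neighborhood=15):
    """Keep a read only if the central base and its neighborhood are good enough."""
    if quals[index] < min_central:
        return False
    return min(neighborhood(quals, index, radius)) >= min_neighborhood

good_read = [30] * 20
window_at_read_start = neighborhood(good_read, 0)   # bases 0 through 10
window_at_read_end = neighborhood(good_read, 19)    # bases 9 through 19
```

Reads rejected by this check would simply not contribute to the count and coverage of a candidate variant, which is how a variant can drop below the General filter thresholds once the filter is enabled.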
40. s may arise for a number of reasons, one being erroneous mapping of the reads. In general, variants based on broken-pair reads are likely to be less reliable, so ignoring them may reduce the number of spurious variants called. However, broken pairs may also arise for biological reasons (e.g. due to structural variants), and if they are ignored some true variants may go undetected. • Non-specific match filter: Non-specific matches are likely to come from some type of repeat region, and their exact mapping location is uncertain. In general, variants based on non-specific matches are likely to be less reliable. However, as there are regions in the genome that are entirely perfect repeats, ignoring non-specific matches may have the effect that true variants go undetected in these regions. There are three options for specifying to what extent the non-specific matches should be ignored: No: when this option is chosen, they are not ignored. Reads: when this option is chosen, they are ignored. Region: when this option is chosen, no variants are called in regions covered by at least one non-specific match. When ignoring regions containing a non-specific match (the last of the options mentioned above), the minimum length of reads that are allowed to trigger this effect has to be stated. The reason is that we want to avoid really short reads triggering the effect, as really short reads will usually be non-specific even
41. see the column Read count below for a column that reports the value for reads rather than for fragments. Coverage: The read coverage at this position. Only countable reads are considered (see under Count above for an explanation of countable reads). Note that although overlapping paired reads have two reads in their overlap region, they represent only one fragment, and overlapping paired reads contribute only 1 to the coverage. Please see the column Read coverage below for a column that reports the value for reads rather than for fragments. Frequency: Count divided by Coverage. Probability: The contents of the Probability column (for the Low Frequency and Fixed Ploidy variant callers only) depend on the variant caller that produced it and the type of variant. • In the Fixed Ploidy Variant Detection tool, the probability in the resulting variant track's Probability column is NOT the probability referred to in the wizard. The probability referred to in the wizard is the required minimum posterior probability that the site is NOT homozygous for the reference. The probability in the variant track Probability column is the posterior probability of the particular site type called. The fixed ploidy tool calculates the probability of the different possible configurations at each site. So, using this tool, for single-site variants the probability column just contains this quantity for vari
42. uenced bases with quality scores 30-39: 7 400 088 878. Figure 1.10: Example of estimated error rates. The figure shows average estimated error rates across bases in the given quality score intervals (20-29 and 30-39, respectively). Higher error rates are estimated for bases with lower quality scores. An example of error rates estimated from a whole-exome sequencing Illumina data set is shown in figure 1.10. As expected, the estimated error rates (that is, the off-diagonal elements in the matrices in the figure) are higher for the lower-quality nucleotides than for the higher-quality ones. Note that although the matrices in the figure show error rates of bases within ranges of quality scores, a separate matrix is estimated for each quality score in the error model estimation. 1.3 The filters. The variant callers offer a number of filters. These relate both to which reads should be used and to how much evidence should be required for a variant to be called. The user is asked to set the values of these filters in two wizard steps: the General filters step (Figure 1.11) and the Noise filters step (Figure 1.12). The filters are described below. 1.3.1 General filters. The General filters relate to the regions and reads in the read mappings that should be considered, and to the amount of evidence the user wants to require for a variant to be called. [Dialog: General filters. Reference masking: Ignore positions with coverage above 100000; Restrict calling to target regions. Read filters: Ignore broken pairs; Ignore non-specific matches: Reads; Minimum read length. Coverage and count filters: Minimum coverage 10; Minimum count 2; Minimum frequency 20.0.]
43. Figure 1.11: General filters. Reference masking. The Reference masking filters allow the user to perform variant calling (including error model estimation) only in specific regions. There are two parameters to specify this: • Ignore positions with coverage above: All positions with coverage above this value will be ignored when inspecting the read mapping for variants. • Restrict calling to target regions: Only positions in the regions specified will be inspected for variants. Note that the Ignore positions with coverage above parameter is extremely powerful: no matter how much evidence you have for a variant, it will NOT be called if the coverage at the position of this variant is higher than the specified value. Also note that the Restrict calling to target regions parameter is optional. When it is not specified, the full read mapping will be examined. Read filters. The Read filters determine which reads or regions should be considered when calling the variants. • Ignore broken pairs: When ticked, reads from broken pairs are ignored. Broken pair