User Manual - The University of Hong Kong

Contents

1. [Figure residue: "Build analysis genome" dialog. Legible settings: Connect SNPs in high LD (r2 > 0.9); Ignore low LD (r2 < 0.005); Adjust by genomic control (divided by chi-square median); Gene definition database: RefGene; LD data: Haplotypes (VCF format) or Genotypes (Plink format); Genome coordinates version: hg19; LD files: per-chromosome 1000 Genomes EUR haplotype VCF files (1kg phase 1 v3 shapeit2 eur hg19); Genome name: genome_crohn; P-value file columns: SNP, CHR, POS, RISK, NONRISK, META Z, META P.]
2. [Figure residue: project tree. Legible nodes: Genome genome_crohn (Gene database: RefGene; Used columns: META; Gene 5' extension 5.0; Gene 3' extension 5.0; SNP LD: 1KG Haplotypes VCF; Adjust P Value: No); Gene Scan genescan_crohn (Multivariate Test: No; P-value sources: META P; Test: GATES; Candidate Genes Set (Optional): candidategeneset_crohn); Pathway GeneSet Scan (Gene scan: genescan_crohn; Pathway DB: c2 cp v4 0 symbols gmt; multiple testing by Benjamini & Hochberg (1995), error rate 0.05; filter pathway by p-value 0.05; filter gene by p-value 0.05); Export and Visualize (Format: Excel xls).]
3. Hochberg (1995) test for the overall error rate 0.05: SLC22A4 CX3CR1 PRKAA1 SLC22A5 SLC35D1 UBA7 CUL2 HEXA CREM PIK3R2 PPP1R3A IFI30 AGPAT1 INF ICAM1 C4B STIP1 C4A PPP3CB INPP5D DAG1 IL12B RPL37 JAK2 HLA-DOB MICA HLA-DRB5 TAP2 TAP1 ABCA7 RHOA PSMB9 SELP RNF123 IL3 NCR3 TUBB2B CDC34 TEC NDUFS7 IRF1 PPP1R1B SLCO2A1 LIA TCAP TAB2 LTB PTPN2
Elapsed time: 3 min 42 sec.
Analysis Log Window: INFO 2014-11-28 11:30:24 Variants > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5

Figure 5.7.2 The display after pathway-based association scan

Step 8: For more detailed information on the result, click the node Show Detailed Results (Figure 5.8). You can also change the multiple-testing method and export the results you want in this tab.

[Figure residue: KGG V3.5 ShowPathways tab with a Gene Info table (columns Symbol, PValue, Chromosome, Start, Length); legible rows include TNF 4.36E-4 and HLA-DMB 1.90E-3.]
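The Benjamini & Hochberg (1995) cutoff that appears in the log lines and in the multiple-testing options can be illustrated with a short sketch. This is a generic textbook version of the step-up procedure written for this manual, not KGG's own code; the function names are hypothetical, and KGG may define or report its threshold differently.

```python
def bh_threshold(pvalues, error_rate=0.05):
    """Return the largest p-value passing the Benjamini-Hochberg step-up rule.

    Returns None when no p-value passes. Generic sketch, not KGG's code.
    """
    ordered = sorted(pvalues)
    m = len(ordered)
    cutoff = None
    for rank, p in enumerate(ordered, start=1):
        # Step-up rule: keep the largest p(rank) with p(rank) <= rank/m * error_rate.
        if p <= rank / m * error_rate:
            cutoff = p
    return cutoff


def bh_significant(pvalues, error_rate=0.05):
    """Flag each p-value as significant (True) or not (False)."""
    cutoff = bh_threshold(pvalues, error_rate)
    if cutoff is None:
        return [False] * len(pvalues)
    return [p <= cutoff for p in pvalues]


# Illustrative p-values only (not taken from the Crohn's disease data).
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_threshold(example))         # 0.008
print(sum(bh_significant(example)))  # 2
```

Because the rule is step-up, a p-value can pass even when a smaller one fails its own rank-specific bound; the loop therefore keeps scanning to the end instead of stopping at the first failure.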
4. [Figure residue: ShowGenes tab. Legible controls: Search gene; Multiple Test Method: Benjamini & Hochberg (1995); Error Rate 0.05; Export Setting (Content: Variants inside genes; Format: Excel xlsx; Gene p-values <; Variants p-values < 5E-2). Genome browser panel for Gene PTPN2 (1.57E-9), positions 12785476 to 12884334, y-axis -log10(p-value), LD colour scale from 0.0 to 1.0, variant classes Exon, Intron, UTR, UpDownStream, ncRNA, InterGene, Others.]
5. > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5

Figure 5.9.2 The display after running PPI-based association scan

Step 10: Click the node Show Detailed Results to get the graph of the PPI network. You can also export the results you want in this tab.

[Figure residue: KGG project tree for CrohnDisease with a PPI Scans panel. Legible settings: P-value sources: META P; Interaction sets: STRINGPPIV905 txt gz; Candidate Genes: candidategeneset_crohn; Multiple testing: Standard Bonferroni, error rate 0.05; Layout: KKLayout; Test: GATES; options to hide p-values and to show other interesting genes in gray.]
6. Li MX, Gui HS, Kwan JS, Sham PC. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011 Mar 11;88(3):283-293.
2. Li MX, Kwan JS, Sham PC. HYST: a hybrid set-based test for genome-wide association studies, with application to protein-protein interaction-based association analysis. Am J Hum Genet. 2012 Sep 7;91(3):478-88.
3. Sluis et al. MGAS: a powerful tool for multivariate gene-based genome-wide association analysis. Bioinformatics. In press.

2 Installation

2.1 Installation of Java Runtime Environment (JRE)

The Java Runtime Environment (JRE) v1.7 or higher is required to run KGG3 on any operating system (OS). It can be downloaded for free from http://java.sun.com/javase/downloads/index.jsp. Installing the JRE is straightforward on Windows and Mac OS X; on Linux it takes more work. Details of the Linux installation can be found at http://www.java.com/en/download/help/linux_install.xml.

On Ubuntu, if you see an error message like "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException", please install the Sun Java Runtime Environment (JRE) first. To install the Sun JRE on Ubuntu 10.04, use the following commands:

    sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
    sudo apt-get update
    sudo apt-get install sun-java7-jre sun-java7-plugin sun-java7-fonts

A detailed explanation of the above commands can be found at h
7. [Figure residue: log line INFO 2014-11-28 11:30:24 Variants > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5]

Figure 5.8 Function of displaying the results of pathway-based analysis

Step 9: Search PPIs between significant genes. The significant genes can be selected according to the gene p-values and SNP p-values; the settings are shown in Figure 5.9.1 and the output in Figure 5.9.2.

[Figure residue: "Gene pair-based association scan" dialog. Scan Name: PPIBasedAssociationScan_Crohn; Gene-pair DB: STRING V9.05 PPI; STRING Confidence > 0.6; Gene Association Set: genescan_crohn; gene-pair association test by HYST (Li MX, Kwan JS, Sham PC. Am J Hum Genet. 2012 Sep 7;91(3):478-88).]

Log output:
Calculating local pair-wise LD of SNP within genes on chromosome 21
Reading haplotypes on chromosome 22
Warning: 191 variants within genes (out of 194) on chromosome 22 have NO haplotype information and will be assumed to be independent of others. Detailed information of these variants is saved in D KGG CrohnDisease PP
8. [Figure residue: "Build analysis genome" dialog, continued. Legible settings: Extended gene region length: 5 kb at 5', 5 kb at 3'; Correlation matrix of phenotypes for multivariate analysis; File Setting: Chromosome Column: CHR; Marker Position Column: POS; Marker ID Column: SNP; Marker Position Version: hg19; Imputation Quality Column (Optional); No positions (get positions of SNPs by SNPTracker); Input type: p-values; Input format: Single test per column; Missing data label: NA; Has Title Row in the association file; Exclude regions (in chromosome, from bp to bp).]

Figure 5.4.1 Select META P to build analysis genome and name the genome as genome_crohn

[Figure residue: KGG V3.5 project tree for CrohnDisease (P-value file CrohnDiseaseSNP.txt; candidate gene file candidategeneset_crohn; Genome genome_crohn, version hg19, gene database RefGene) with a plot of observed -log10(P) and the log line INFO 2014-11-28 11:30:24 Variants > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5]

Figure 5.4.2 Th
9. [Figure residue: Show PPIs tab. Legible settings: Gene pair Scan: PPIBasedAssociationScan_C...; Multiple testing: Benjamini & Hochberg (1995), error rate 0.05; Gene scan: genescan_crohn; P-value sources: META; options: Hide gene symbols, Zoom, Only visualize significant gene pairs in graphs, Only export significant gene pairs in file.]

Figure 5.10 Function of displaying the results of PPI-based association scan

Step 12: View results of Crohn's Disease
> By text file or Excel file: open the text or Excel file for SNP-based or gene-based analysis from the local computer.
> By graphs: check the QQ plots and Manhattan plots saved in the htmlLog folder.
> By KGG interface: visualize the pathway and PPI network output on the KGG interface.

6 Power estimation of set-based tests by SPS

Step 1: Open the software and enter the main user interface of KGG V3.5: Tools > Power Estimation. The interface is divided into two parts. The left one is used to set the basic parameters a
10. Model: Additive Model; Position of Risk Variants: Random, or Start from 1 (separated by space or comma)

Figure 6.3 Set parameters about risk variants

Table of parameters:

Parameter: Risk SNPs
Description: The number of risk SNPs. This parameter can increase from a smaller to a larger value step by step.

Parameter: Odds Ratio
Description: The value used to quantify the association between risk SNPs and disease. This parameter can increase from a smaller to a larger value step by step.

Parameter: Disease Prevalence
Description: The proportion of a population found to suffer from the disease. This is used in the genetic model.

Parameter: Genetic Model
Description: The genetic model of the risk loci. The additive model and the multiplicative model are the candidates in SPS.

Parameter: Position of Risk Variants
Description: The location of the risk variants within the total variants. Users can click the Random button for automatic setting or set the positions themselves.

Step 4: Set population and sample. A larger population size and larger numbers of cases and controls are recommended because they make the result more accurate and stable, but the run takes correspondingly longer, so the user should balance accuracy against running time.

[Figure residue: Population & Sample panel. Population Size: 500000; Number of Case: 500; Number of Control: 500.]

Figure 6.4 Set population and sample

Table of parameters:

Parameter: Population Size
Description: The number of individuals in a population generated by simulation according to the certai
11. Variants > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5

Figure 5.6 Function of displaying the gene-based association scan result

Step 7: Perform pathway enrichment exploration by gene p-values, with the settings shown in Figure 5.7.1 and the output in Figure 5.7.2.

[Figure residue: "Pathway-based association scan" dialog. Scan Name: PathwayGeneSetScan_Crohn; Candidate Genes Set (Optional): candidategen...; option "Only seed genes used"; MSigDB GeneSet V4.0, C2: canonical pathways from the pathway databases (1320 sets); Pathway DB in file; Pathway size: more than 10; gene-set-based association test by HYST (Li MX, Kwan JS, Sham PC. Am J Hum Genet. 2012 Sep 7;91(3):478-88).]

Figure 5.7.1 Pathway enrichment exploration by gene p-values

[Figure residue: KGG V3.5 result viewer with the log line "Calculating local pair-wise LD of SNP within genes on chromosome 22" and a QQ plot legend: HYST set-based test; Gene-based t]
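Enrichment of a pathway with risk genes is commonly assessed with a hypergeometric upper-tail probability. The sketch below shows that standard calculation under stated assumptions: the function name and all four arguments are hypothetical, and the manual does not specify how KGG defines the background gene population or the risk-gene list.

```python
from math import comb


def hypergeom_enrichment_p(population, set_size, risk_total, risk_in_set):
    """Upper-tail hypergeometric probability P(X >= risk_in_set).

    population  : number of background genes (assumed definition)
    set_size    : number of genes in the pathway / gene set
    risk_total  : number of risk genes in the background
    risk_in_set : number of risk genes observed in the gene set
    """
    upper = min(set_size, risk_total)
    total = comb(population, set_size)
    # Sum the probabilities of observing k = risk_in_set, ..., upper risk genes.
    tail = sum(
        comb(risk_total, k) * comb(population - risk_total, set_size - k)
        for k in range(risk_in_set, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 of them risk genes, a set of 3 genes
# containing 2 risk genes.
print(hypergeom_enrichment_p(10, 3, 4, 2))  # (36 + 4) / 120 = 0.333...
```

Exact integer binomials keep the toy example transparent; for genome-scale inputs a library routine for the hypergeometric survival function would be the more practical choice.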
12. for analytical procedures involved

[Figure residue: Gene Association Results]

Main steps involved:
1. Build an analysis genome: generate an intermediate dataset that integrates the original GWAS p-values, SNP annotation, gene annotation, and LD between SNPs within genes. It is a unified dataset that is used for all kinds of analyses in KGG.
2. Conduct a multivariate gene-based association test: calculate gene-based p-values of multiple phenotypes by a method.
3. Conduct a gene-based association test: calculate gene-based p-values of a single phenotype by GATES or HYST.
4. Explore pathways that are significantly associated (by HYST) and pathways that are enriched with susceptibility genes (by the hypergeometric distribution test). One can use either the integrated pathways/gene sets from MSigDB (http://www.broadinstitute.org/gsea/msigdb) or self-customized pathways in KGG.
5. Explore statistically significantly associated PPI pairs by HYST; such pairs may work together to contribute to the development of the disease or traits. Again, one can use either the integrated PPI pairs from STRING (http://string-db.org) or self-customized PPI pairs in KGG.
6. Annotate and export significant SNPs, genes, pathways and PPIs.
7. View external bioinformatics annotation results of statistically significant SNPs, genes and pathways.

Other plug-in:
1. SPS: a simulation tool for calculating the power of set-based genetic association tests.

References
1.
13. [Figure residue: genome browser position axis and Analysis Log Window line: INFO 2014-11-28 11:30:24 Variants > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5]

Figure 3.1 A typical KGG interface
Frame 2: view of input data or output results
Frame 3: running log of KGG analysis

The graphic dialogs of KGG are self-explanatory, so we do not elaborate on the function of each button.

3.1 Project
> Create project: create a new KGG project
> Open project: open an existing KGG project
> Close project: close the current project
> Exit: exit the KGG application

3.2 Data
> Load P value file: import your association summary results (e.g. the plink output)
> Define seed genes: tell KGG the known causal genes of the disease you are studying
> Build analysis genome: build an analysis genome in which KGG maps all SNPs to their gene features and calculates the r-square or genotypic correlation of SNPs within genes

3.4 Gene
> Gene based association scan: conduct the gene-based association scans
> View genes: view and export gene-based association results

3.5 Module
> PPI based association scan: conduct PPI-based association scan
> View PPIs: view significant PPI pairs
> Pathway based association scan: conduct pathway based association sca
14. [Figure residue: Manhattan plot (x-axis: chromosomes) and QQ plot (observed vs expected -log10(P); series: Gene p-values, All SNPs, SNPs inside of gene, SNPs outside of gene) beside the project tree (Genome genome_crohn; Gene Scan genescan_crohn; Test: GATES; P-value sources: META P). Legible log lines: "53.51% of SNPs are inside genes on the whole genome"; "Gene based association scan has been finished for META P"; "Elapsed time 0 min 10 sec"; INFO 2014-11-28 11:30:24 Variants > Benjamini & Hochberg (1995) FDR > The threshold is 4.709124E-5]

Figure 5.5.2 The display after gene-based scan

Step 6: Click the Show Detailed Results node under Genome Scan; a new tab, ShowGenes, is created to provide more information about the result (Figure 5.6). You can also export the results you want in this tab.

[Figure residue: KGG V3.5 ShowGenes tab with Gene Info and SNP Info tables.]
15. 5316         0.002688   0.007688   0.4893
4   rs1841043    0.01115    0.006112   0.119
4   rs11726946   0.005892   0.4893     0

Example: input format of KGG with only positions

CHR  SNPID  SNPPOS  P-value1  P-value2  P-value3
4    Snp1   100001  0.02301   0.8815    0.007688
4    Snp2   110011  0.4384    0.9575    0.006112
4    Snp3   120001  0.002688  0.007688  0.4893
4    Snp4   130011  0.01115   0.006112  0.119
4    Snp5   140001  0.005892  0.4893    0

Moreover, a p-value column can include values from different models. KGG recognizes this format if you select the input format "multiple tests per column" when building the analysis genome.

Example: a more complex input format of KGG

CHR  SNP        P-value1  TestMode   P-value2
4    rs1513559  0.02301   additive   0.007688
4    rs1513559  0.4384    recessive  0.006112
4    rs1513559  0.002688  dominant   0.4893
4    rs1841043  0.01115   additive   0.119
4    rs1841043  0.005892  recessive  0

4.2 Input file 2: Candidate Gene list

Candidate genes can be loaded one by one or imported from a TXT file. The input file has a single column with no header; each row contains one gene symbol or ID.

5 Set-based association analysis tutorial

Step 1: Create a new project named CrohnDisease and set the project path to C:\KGG\Tutorial or another path of your choice.

[Figure residue: "Create KGG Project" dialog. Project Name: CrohnDisease; Working Folder: D:\KGG; Description: The knowledge-based downstream genetic/genomic statistical analysis for Crohn's disease.]

Figure 5.1 Create project

Step 2: s
16. [Figure residue: ShowGenes tab in the typical KGG interface. Gene and SNP tables (columns include gene symbol, p-value, chromosome, start position, length, SNP position, gene feature such as intronic or downstream, META p-value); controls: Search gene; Multiple Test Method: Benjamini & Hochberg (1995); Error Rate 0.05; Export Setting (Content: Variants inside genes; Format: Excel xlsx; Gene p-values < 5E-2; Variants p-values < 5E-2); Genome Browser for Gene WDR78 (3.68E-26), y-axis -log10(p-value), LD colour scale from 0.0 to 1.0, variant classes Exon, Intron, UTR, UpDownStream, ncRNA, InterGene, Others.]
17. IBasedAssociationScan_Crohn STRINGPPIV905 txt gz NoLDSNPs 22 txt
Calculating local pair-wise LD of SNP within genes on chromosome 22
Sub-network size: 2
Number of sub-networks: 178636
Finished gene network scan on the genome.
The significance level for Bonferroni correction to control the familywise error rate 0.05 on the whole genome is 2.80E-7 (0.05/178636) for all gene pairs.
The significance level for Bonferroni correction to control the familywise error rate 0.05 on the whole genome is 2.26E-6 (0.05/22101) for PPI genes.
Elapsed time: 3 min 58 sec.

[Figure residue: QQ plot (observed vs expected -log10(P); series: Gene-pair p-values, Gene p-values) beside the project tree. Legible nodes: Gene Scan genescan_crohn (Test: GATES; P-value sources: META P); Gene pair Scan PPIBasedAssociationScan_C... (Gene pair: STRINGPPIV905 txt gz; Confidence 0.6; Test: No weights); Pathway GeneSet Scan (Pathway DB: c2 cp v4 0 symbols gmt; Candidate Gene: candidategeneset_cro...).]
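The two Bonferroni significance levels in the log above follow directly from dividing the error rate by the number of tests. A minimal check of that arithmetic (the function name is hypothetical; the counts 178636 and 22101 are the ones reported in the log):

```python
def bonferroni_level(error_rate, n_tests):
    """Per-test significance level under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return error_rate / n_tests


# Counts taken from the log output above.
for label, n_tests in (("all gene pairs", 178636), ("PPI genes", 22101)):
    print(f"{label}: {bonferroni_level(0.05, n_tests):.2E}")
# all gene pairs: 2.80E-07
# PPI genes: 2.26E-06
```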
18. KGG: A systematic biological Knowledge-based mining system for Genome-wide Genetic studies
Version 3.5
User Manual

Miao-Xin Li, Jiang Li
Department of Psychiatry, Centre for Genomic Sciences, Department of Biochemistry, The University of Hong Kong, Pokfulam, Hong Kong SAR, China

Content
1 Introduction and general pipeline
References
2 Installation
2.1 Installation of Java Runtime Environment (JRE)
2.2 Installation of KGG
[Entries for Section 3 and the Section 4 heading are illegible]
4.1 Input file 1: GWAS results
4.2 Input file 2: Candidate Gene list
5 Set-based association analysis tutorial
6 Power estimation of set-based tests by SPS
7 Update from KGG 3.0 to KGG 3.5

Hints for large GWAS dataset, around or over 2.5 mil
19. [Figure residue: project tree of the typical KGG interface. Legible nodes: Gene Scan genescan_crohn (Genome: genome_crohn; Multivariate Test: No; P-value sources: META; Test: GATES); Gene pair Scan PPIBasedAssociationScan_C... (Gene pair: STRINGPPIV905 txt gz; Confidence 0.6; Test: No weights); Pathway GeneSet Scan (Pathway DB: c2 cp v4 0 symbols gmt; Candidate Gene: candidategeneset_cro...).]

Illustration:
Frame 1: tree-structured branches to manage input data and analysis results of a KGG project

[Figure residue: View genes tab with a Gene Info table (Symbol, NominalP, Chromosome, Start position, Length) and an SNP Info table (SNP, Position, Gene Feature, META); tabs RunningResultViewer, PassedResultViewer, TableViewer Window, ShowPathways, Show PPIs.]
20. [Figure residue: run settings; legible fields: Times 500, P Value Threshold]

Figure 6.6 Run the program

Step 7: Save the result. The user can review the power in two tables, at the SNP level and at the set level. A line chart is drawn to show how the power varies across odds ratios for the given MAF and LD information. The user can change the MAF and LD values to update the chart. Right-click the tables to save the results as an Excel or TXT file; the chart can also be saved by right-clicking.

[Figure residue: SPS output with a power table and a line chart of power against odds ratio (series: GATES); Sampling Times 500.]

Figure 6.7.1 The output of SPS
Figure 6.7.2 The saved table of set-based power
Figure 6.7.3 The saved table of variant-based power

7 Update from KGG 3.0 to KGG 3.5

Much progress was made from KGG 3.0 to KGG 3.5, mainly including:
1. Multivariate gene-based association analysis
2. Direct links to multiple bioinformatics annotation databases
3. Simplified operation and better plotting functions
4. Integration of regulatory information to prioritize risk genes (under development)
5. Inclusion of the SPS plug-in
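For readers unfamiliar with simulation-based power, the power tables described in Section 6 can be read as the share of simulated replicates whose test p-value falls below the chosen threshold. The sketch below shows only that final counting step. It assumes a strict "less than" comparison and a hypothetical function name; it does not reproduce how SPS simulates genotypes or runs the tests.

```python
def empirical_power(pvalues, threshold=0.05):
    """Fraction of simulated replicates with p-value below the threshold.

    pvalues   : one test p-value per simulated replicate
                (e.g. 500 values for "Sampling Times 500")
    threshold : the p-value threshold (assumed strict comparison)
    """
    if not pvalues:
        raise ValueError("at least one replicate is required")
    hits = sum(1 for p in pvalues if p < threshold)
    return hits / len(pvalues)


# Four illustrative replicates, two of which pass the 0.05 threshold.
print(empirical_power([0.01, 0.2, 0.03, 0.5], threshold=0.05))  # 0.5
```

With 500 replicates the estimate moves in steps of 0.002, which is one reason larger sampling counts give smoother power curves at the cost of running time.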
21. e display after building analysis genome

Step 5: Do a gene-based association scan using the SNP p-values integrated in the analysis genome named genome_crohn. Select the method "Extended Simes test (GATES): more powerful for a gene with one or a few independent causal variants", set the parameters as in Figure 5.5.1, and name the result genescan_crohn. The exported Manhattan plots and QQ plots are shown in the Running Result Viewer window (Figure 5.5.2).

[Figure residue: "Gene-based association scan" dialog. Scan Name: genescan_crohn. Manhattan plot display: label genes with p-values < 1E-6; label SNPs with p-values < 5E-8; width 1200; height 500; minimal p-value 1E-10; option to plot SNPs outside genes. QQ plot display: SNPs inside genes; SNPs outside genes; width 600; height 400; minimal p-value 1E-10. Note shown in the dialog: "This analysis genome has NO phenotype correlation matrix". Method: Extended Simes test (GATES), more powerful for a gene with one or a few independent risk variants (Li MX, Gui HS, Kwan JS, Sham PC. Am J Hum Genet. 2011 Mar 11;88(3):283-293).]
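To give a feel for what an "extended Simes" gene-based test builds on, the sketch below implements the ordinary Simes combination of SNP p-values, which treats the SNPs as independent. As described in the GATES paper cited in this manual, GATES replaces the raw SNP counts with effective numbers of independent tests estimated from LD; that adjustment is deliberately not reproduced here, so this is a simplified stand-in and not KGG's implementation. The function name is hypothetical.

```python
def simes_p(pvalues):
    """Ordinary Simes combined p-value: min over j of m * p(j) / j.

    p(j) is the j-th smallest of the m SNP p-values in the gene.
    Assumes independent SNPs (no LD correction).
    """
    ordered = sorted(pvalues)
    m = len(ordered)
    if m == 0:
        raise ValueError("at least one p-value is required")
    combined = min(m * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)


# Three illustrative SNP p-values within one gene.
print(simes_p([0.01, 0.04, 0.03]))  # about 0.03, from the smallest p-value
```

Because the combined value is driven by the best-ranked p-values, a gene with a single strong SNP keeps most of its signal, which matches the dialog's note that the method suits genes with one or a few independent risk variants.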
22. elect the menu Data > Load P Value File and choose the file CrohnDiseaseSNP.txt, which contains the whole-genome association p-values for Crohn's disease at the SNP level. This dataset was downloaded from the public release by Barrett et al. (2008). It includes 7 columns: SNP, CHR, POS, RISK, NONRISK, META Z and META P.

Figure 5.2 Input GWAS original result file

Step 3: Import the file CrohnCandidateGeneSet.txt as the candidate gene input, defining ATG16L1, CARDS, IBD5, IL23R, NOD2 and TNFSF15 as seed genes. Then save it as candidategeneset_crohn.

[Figure residue: "Define seed genes" dialog listing gene descriptions and chromosome bands, e.g. 1p31, 2q37.1, 16q12.]

Figure 5.3 Input candidate gene set for Crohn's disease

Step 4: Select META P for building the analysis genome, extend each gene region by 5 kb on both sides, and use LD coefficients between SNPs from the 1000 Genomes Project to adjust LD
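Before loading a summary file into KGG it can help to check that the p-value column parses cleanly. The sketch below reads a whitespace-delimited table with a title row and skips entries carrying the missing-data label NA. The header spellings META_Z and META_P, the whitespace delimiter, the function name and the sample rows are all assumptions made for illustration; check them against the actual CrohnDiseaseSNP.txt file.

```python
def read_pvalues(lines, snp_col="SNP", p_col="META_P", missing="NA"):
    """Map SNP ID to p-value from a whitespace-delimited table with a title row.

    Column names and the missing-data label are assumptions; adjust as needed.
    """
    rows = iter(lines)
    header = next(rows).split()
    snp_idx = header.index(snp_col)
    p_idx = header.index(p_col)
    pvalues = {}
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[p_idx] == missing:
            continue  # skip SNPs without a p-value
        pvalues[fields[snp_idx]] = float(fields[p_idx])
    return pvalues


# Made-up rows that only mimic the seven-column layout described above.
sample = """SNP CHR POS RISK NONRISK META_Z META_P
snpA 1 1000 A G 1.20 0.23
snpB 1 2000 T C NA NA
snpC 1 3000 C T 2.66 0.0077
""".splitlines()

print(read_pvalues(sample))  # {'snpA': 0.23, 'snpC': 0.0077}
```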
23. [Figure residue: project tree (Genome genome_crohn; Gene Scan genescan_crohn; Pathway GeneSet Scan with Pathway DB c2 cp v4 0 symbols gmt and Candidate Gene candidategeneset_cro...) and a QQ plot of observed vs expected -log10(P).]

Log output:
For association test based on HYbrid set-based test of GATES and scaled chi-square (Test: HYST): the significance level for the Benjamini & Hochberg (1995) FDR test to control error rate 0.05 for gene sets is 1.22E-2. There are 314 significant gene sets involving 4133 unique genes in total.
For enrichment test of selected risk genes based on the hypergeometric distribution test: the fixed p-value threshold 0.05 for gene sets is 5.00E-2.
For enrichment test of risk genes based on the Wilcoxon signed rank test (one-sided): the fixed p-value threshold 0.05 for gene sets is 5.00E-2.
In all of the significantly associated gene sets by HYST and enriched gene sets by Wilcoxon test, 48 genes are significant (p-value cutoff 7.36E-4) according to Benjamini
24. he MAF can increase from an initial value to a terminal value according to a step value set in the GUI. (Parameter: Minor Allele Frequency)

Parameter: SNP Dependence
Description: The relationship between SNPs. If the SNPs are dependent, the user should set the LD value r; otherwise 0 is used by default. The LD information can also be read from real data, in which case it is calculated from the allele frequency.

Parameter: Linkage Disequilibrium (LD) r
Description: The r score used to represent LD information. SNPs in the same block are dependent and share the same r value, while SNPs in different blocks are independent of each other, with the r value set to 0. The r value can also increase from an initial value to a final value by a step value.

Parameter: Family File, Map File, BED File
Description: The paths of the Plink files. A valid file path can be entered with the button on the right. If the three files share the same file prefix and are stored in the same directory, the other file paths are filled in automatically when one file is set.

Parameter: Consider the first several SNPs
Description: The number of SNPs taken from the real data. Real data usually include a large number of SNPs, which is unnecessary for the simulation, so only the first several SNPs are used as study objects.

Parameter: VCF File
Description: The path of a VCF file.

Step 3: Set parameters of risk variants

Risk variants panel (figure residue): Risk SNPs: from 1 to 3, step 1; Odds Ratio: from 1.8 to 2.2, step 0.05; Disease Prevalence: 0.05; Genetic
25. lion SNPs: set or change to a large memory allocation for KGG3 (say 2000 MB) via Tools > Set System Memory.

1 Introduction and general pipeline

KGG (Knowledge-based mining system for Genome-wide Genetic studies) is a software tool to perform knowledge-based analysis for genome wide assoc