User Manual - The University of Hong Kong

1. Figure 1.1 Pipeline chart of KGG analysis (version 2.5). Notes: circle nodes stand for data and files (input/output); single-directional arrows stand for the analytical procedures involved. Details for each node and procedure are given below.
Steps involved:
(1) Build an analysis genome: generate an intermediate dataset that integrates the original GWAS p-values, SNP annotation, gene annotation and LD between SNPs within genes. It is a unified dataset that is used for all kinds of analyses in KGG.
(2) Weight SNP p-values: produce weighted p-values according to the prior knowledge of the tested SNPs by a weighting method.
(3) Conduct gene-based association tests: calculate gene-based p-values from the tested SNPs within or around the genes by GATES or HYST.
(4) Explore pathways that are significantly associated (by HYST) and enriched with susceptibility genes (by the hypergeometric distribution test).
(5) Explore statistically significant PPI pairs by HYST, which may work together to contribute to the development of the disease or traits.
(6) Select significant genes in a functional gene set cluster: expand a cluster of genes that share the same pathways or have PPI with seed genes proposed by users. Genes are separated into two exclusive subsets, the functional cluster and the remainder. Use multiple testing methods to pick up the significant genes in the subsets.
(7) Annotate and export significant SN
2. Figure 3.11 Dialog to set parameters for genome building with variant position. File setting: settings for Marker position column, Marker position version and Reference genome are included in order to map the positions correctly onto the genome (build 18 or 19). Finally, you can click the Build button to build a new analysis genome for your original data. The whole process takes some time to run; you can follow it in the log frame if interested. Once finished, KGG will be shown as in Figure 3.12. A node named genome set will be created, and the parameters set for this genome can be viewed by clicking the symbol. Figure 3.12 KGG view after genome building; the red box shows the node created for this analysis.
3.3 SNP
> Weight SNPs
> Annotation & export
1. Weight SNPs: After building genome set you can sta
3. 3. Build analysis genome by RSID: After the original GWAS result file (which includes an rsID for each SNP) is imported, you need to build an analysis genome, which is necessary for the following SNP-based, gene-based, pathway and PPI analyses. Just click the menu Data > Build Analysis Genome by RSID, or the accelerator, to open the following dialog. Figure 3.10 Dialog to set parameters for genome building with RSID.
Genome Name: the genome name prepared for KGG analysis, defined by the user.
Original association file: choose a project file to open; if only one file is offered, KGG will load this file automatically. Select the column of p value or chi square in original GWAS for following analy
4. KGG: A systematic biological Knowledge-based mining system for Genome-wide Genetic studies (Version 2.5). User Manual. Miao-Xin Li, Hong-sheng Gui, Pak C. Sham and You-Qiang Song. Department of Psychiatry, Department of Biochemistry, The University of Hong Kong, Pokfulam, Hong Kong SAR, China.
Content
1 Introduction and general pipeline
2 Installation
2.1 Installation of Java Runtime Environment (JRE)
2.2 Installation of KGG
2.3 KGG directory
3 Interface and functions
3.1 Project
3.2 Data
3.3 SNP
3.4 Gene
3.5 Module
3.6 Tools
4 Input & output files
4.1 Input file 1: GWAS results
4.2 Input file 2: Candidate gene list
4.3 Output file 1: log file
4.4 Output file 2: Annotated SNP result
4.5 Output file 3: Annotated gene result
4.6 Output file 4: Enriched pathway result
4.7 Output file 5: Enriched PPI network result
4.8 Output file 6: Graphs in htmlLog directory
5 Tutorial
6 Update from KGG 1 to KGG 2.0
7 References
Hints for large GWAS datasets (around or over 2.5 million SNPs): 1. Maximize your Java heap size with -Xmx1500m or a number larger than 1500. 2. Only annotate or export a small set of genes you are interested in, by choosing Gene > Annotate & Export or SNP > Annotate & Export.
1 Introduction and general pipeline: KGG (Knowledge b
5. Figure 3.7 Dialog to define the candidate genes by searching tissue-specifically expressed genes. Genes in the lower table of this dialog are the candidate genes and will be used as a reference to generate the optimal weights for the p-values of SNPs or genes, or for other functional analysis. However, they are treated differently according to the As Seed property. Genes with As Seed set to true are used as seed candidate genes to extend a larger set of candidate genes, which share the same biological pathway and have protein-protein interaction with the seed ones, at the beginning of the weighting procedure. When the As Seed property is false, the genes will not be used as seed candidate genes to infer others. Figure 3.8 Dialog to define candidate gene set and seed genes. Figure 3.9 KGG view after candidate gene set input. Finally, you can save the chosen genes in the bottom table into your project by clicking the Save As button. A name is required for the candidate gene set to be saved. A new branch will then be created in the projects frame, and the input genes can also be viewed in the data viewer frame.
6. Input genes by user, select OMIM genes, and select tissue-specific genes. To select genes as candidate genes, you need to move the selected genes from this table into the bottom table by clicking the Add button. Remember that this step is optional: all functions except the Functional gene cluster analysis can run without the definition of candidate genes. However, we suggest using some important candidate genes, if available, as seed genes, which may introduce additional information about the disease into the analysis. Figure 3.5 Dialog to directly define the seed candidate genes. The second way is to define candidate genes according to the OMIM database. The OMIM dataset has been integrated into KGG. You can easily retrieve OMIM genes by OMIM ID or disease name by clicking the Search button (Figure 3.6). Retrieved genes can be further selected by ticking their checkboxes. Please also move the selected genes into the bottom table by clicking the Add button.
7. Figure 4.6.2 Manhattan plot of p-values of SNPs around or within genes (chr 9 to 22).
Step 1: create a new project named CrohnDisease and set the project path to C:\KGG\tutorial or another path of your choice. Figure 5.1 Create project.
Step 2: select the menu Data > Input original association file and choose the CrohnGWASresult.txt file, which contains the whole-genome association p-values for Crohn's disease at the SNP level. This dataset was downloaded from a public domain released by Barrett et al. 2008. It includes 7 columns: SNP, CHR, POS, RISK, NONRISK, META Z and META P. Figure 5.2 Input GWAS original result file.
Step 3: import the file CrohnCandidateGeneSet.txt as the candidate gene input; define ATG16L1, CARD9, IBD5, IL23R, NOD2 and TNFSF15 as seed genes. Then save it as candidategeneset_crohn. Figure 5.3 Input candidate gene set for Crohn's disease.
Step 4: select the column META P for building the analysis genome, extend each gene region to its flanking 10 kb region on both sides, and use HapMap LD SNP coefficients to adjust for LD.
8. The expanded SNP list was classified into different categories according to their features. See Table 1 for details. The main resource was the gene features defined by the dbSNP database (http www ncbi nlm nih gov bookshelf br fegi book helpsnpfaq amp part Build The_dbSNP_Mapping_Pr Build _Annotation_of_SNI). We summarized the proportion of SNPs belonging to each category among all SNPs of genes, where the gene region could be slightly expanded by a certain distance (say 3 kb) on both sides. We also collected all SNPs that were included in one of the popular Affymetrix and Illumina high-throughput genotyping platforms, including the Illumina HumanHap M and Affymetrix GenomeWide_Human_SNP_Array_6.0. The number of unique SNPs was 1,778,780. These SNPs were also partitioned into the categories indicated in Table 1. Similarly, we counted the proportion of SNPs of genes in these defined categories. The latter proportions were regarded as baselines. The ratio of the proportion of GWAS hits to the proportion of SNPs in the genotyping platforms was calculated for each category.
Table 1 Criteria to categorize SNPs
I. According to the gene features where SNPs are located
Feature | Description | Category
adjacent | Beyond 2 kb 5' (500 bp 3') but less than x bp of a gene. The default value of x is 3 kb. Users can customize it. | 1
near-gene-5 | Within 2 kb 5' of a gene on either strand, but the variation is not in the transcript for the gene | 2
near-gene-3
9. Figure 3.6 Dialog to define the seed candidate genes by searching the OMIM dataset. The third way is to choose tissue-specifically expressed genes as the seed candidate genes (Figure 3.7). These genes were proposed by Greco et al. 2008, where 1601 genes were identified as selectively expressed in one or more human normal tissues. This tissue-specific expression feature might be important for some complex diseases. For instance, genes exclusively expressed in adult brain might be interesting candidates for Alzheimer disease. There are 77 tissues in total, listed in the top left table of this dialog. Their selectively expressed genes can be shown in the top right table by clicking the View button. Again, please remember to transfer the selected genes into the bottom table by clicking the Add button.
10. Figure 5.11.1 PPI association scan by gene-based p-values.
Step 12: View the results for Crohn's disease.
> By text file or Excel file: open the text or Excel file for SNP-based or gene-based analysis from the local computer.
> By graphs: check the QQ plots and Manhattan plots saved in the htmlLog folder.
> By KGG interface: visualize the pathway and PPI network output on the KGG interface.
6 Update from KGG 1 to KGG 2.0
Much progress was made from KGG 1.0 to KGG 2.0, mainly as follows: 1. Gene-based analysis is included. 2. The design is more structured. 3. The computational burden is lower. 4. Outputs from KGG are more illustrative and more easily interpreted.
7 References
[1] Li MX, Sham PC, Cherny SS, Song YQ. A knowledge-based weighting framework to boost the power of genome-wide association studies. PLoS One. 2010 Dec 31;5(12):e14480.
[2] Li MX, Gui HS, Kwan JS, Sham PC. GATES: A rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011 Mar 11;88(3):283-293.
[3] Li MX, Kwan JS, Sham PC. HYST: A hybrid set-based test for genome-wide association studies, with application to protein-protein interaction-based association analysis. Am J Hum Genet. 2012 Sep 7;91(3):478-88.
11. important as other SNPs of a gene, say the coding synonymous SNPs mainly indicated as. A somewhat unexpected point is the ratio at C11, which is mainly made up of genomic variants like frame shift or nonsense. The GWAS hits are not enriched in this category, although such variants might have a large biological impact once mutated. One possible explanation is that the nonsense polymorphisms we observe as SNPs may not contribute excessively to the development of common complex diseases. This is different from what we learned from the study of Mendelian disorders (Altshuler et al. 2008; Antonarakis and Beckmann 2006). The weight for a category is equal to 10 raised to the power of its ratio, which is then standardized so that the weights sum to the number of SNPs within the gene being tested. In this way, the ratios of the lowest weight to the largest weight are 1:5.05, 1:3.62 and 1:6.85 for the three HapMap populations CEU, JPT and YRI. They could roughly cover the effective region we observed in the empirical simulation (see more in the result section).
Table 2 Ratios of SNP proportions at each category for the GWAS hits expanded according to three HapMap samples
Sample | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11
CEU | [illegible] | [illegible] | 1.429 | [illegible] | 1.363 | 1.185 | 1.288 | 1.487 | 1.339 | 0.778 | 0.884
CHB+JPT | 1.026 | 1.137 | 1.347 | [illegible] | 1.311 | 1.254 | 1.200 | 1.444 | [illegible] | 0.788 | 0.936
YRI | 1.165 | 1.274 | 1.431 | 0.733 | 1.353 | 1.024 | 1.442 | 1.087 | 1.569 | 1.167 | 0.958
2 Cluster analysis: Cluster analy
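The standardization step described in this excerpt can be illustrated with a minimal sketch. It assumes that the raw weight of a category is 10 raised to the power of its ratio (the exponent is not legible in this copy, so this is an assumption) and it uses three ratios from the YRI row of Table 2 for a hypothetical gene with three SNPs. The function name `standardize_weights` is illustrative and is not part of KGG.

```python
def standardize_weights(raw_weights):
    """Rescale raw weights so that they sum to the number of SNPs in the gene."""
    n = len(raw_weights)
    total = sum(raw_weights)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive weight")
    return [w * n / total for w in raw_weights]


# Hypothetical gene with three SNPs falling in categories C1, C4 and C9 (YRI ratios).
ratios = [1.165, 0.733, 1.569]
# Assumption: raw weight = 10 ** ratio.
raw = [10 ** r for r in ratios]
weights = standardize_weights(raw)
```

Because the rescaling multiplies every weight by the same factor, the ratio between the largest and the smallest weight is unchanged; only the sum is fixed to the number of SNPs.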
12. of previous projects. Close project: close the current project. Exit: exit the KGG application.
The first step in using KGG is to create a KGG project. You can click Project > Create Project, or the accelerator, to open the dialog Create KGG Project (Figure 3.2). Figure 3.3 Dialog to open an existing KGG project.
3.2 Data
> Import original association file
> Define candidate genes
> Build analysis genome by RSID
> Build analysis genome by position
1. Import original association file: After the creation of the project, the first thing you need to do is to import your GWAS results, such as the output of PLINK. To do this, just click Data > Import original association files, or the accelerator. Figure 3.4.2 KGG view after GWAS original value input. The red box shows a node created by KGG after input; the red arrow shows the GWAS result.
2. Define candidate genes: At the second step, you can define a set of candidate genes of the diseases or traits being studied, for the knowledge-based weighting analysis and other functional analysis. Just click the menu Data > Define Candidate Genes to open the Define Candidate Genes dialog. In this dialog, the candidate genes can be defined in three different ways
13. the number of typed SNPs within a gene set and marker dependency. So they have unbeatable speed in handling millions of SNPs while having comparable or even higher power than existing gene-set-based methods. KGG currently integrates eight SNP- and gene-related biological resources, including the SNP annotation from NCBI dbSNP and pathway and PPI information from multiple databases:
1. SNP gene features (e.g. intron, missense, splicing site, etc.) from dbSNP (http://www.ncbi.nlm.nih.gov/SNP);
2. conservation scores from the UCSC Genome Browser website (http://hgdownload.cse.ucsc.edu);
3. positive selection scores of SNPs from http://haplotter.uchicago.edu/selection;
4. human microRNA target gene binding site information from Sanger's miRBase (http://microrna.sanger.ac.uk);
5. disease genes from OMIM (http://www.ncbi.nlm.nih.gov/omim);
6. tissue-specific expression genes from an analyzed dataset of mRNA expression arrays by Greco et al. 2008;
7. biological pathways from MsigDB (http://www.broadinstitute.org/gsea/msigdb/index.jsp); and
8. protein-protein interaction information from the String database (http://string-db.org).
Files or data involved:
GWAS summary results: a file that contains SNP-based p-values, chi-square statistics or Z-scores (see more information on the input file in chapter 4).
Genome Set: a compiled intermediate dataset; it integrates the original GWAS p-values, SNP annotation and gene annotation, as well as LD information from HapMa
14. update the resource data. You will not need to wait for the downloading later when you analyze your data.
2.3 KGG directory
After correctly installing the JRE and the KGG package, the KGG directory should look like Figure 2.1. Explanations for each file or folder are given as follows:
1. KGG.jar: the main KGG program.
2. kgg ini xmi: the initiation settings of the KGG application.
3. run linux sh, run mac sh, run win bat: example command files for Linux, Mac and Windows respectively.
4. htmlLog folder: intermediate analysis results, including figures and the log file.
5. lib folder: functional library files of the KGG application.
6. resources folder: TXT files for the SNP annotation database and the gene annotation database.
7. Tutorial folder: stores input data for tutorial practice (see chapter 5).
Figure 2.1 Structure of KGG directory.
3 Interface and functions
Once you start KGG, a window will pop up as in Figure 3.1.1; the window will look like Figure 3.1.2 when a project is created and executed by KGG. Figure 3.1.2 KGG interface after project. Frame 1: tree-structured branches for the different sub-analyses. Frame 2: view of input data or output results. Frame 3: log output of KGG analysis and some results. Frame 4: resources of KGG.
3.1 Project
Create project: create a new KGG project. Open project: open an existing KGG project. Latest projects: quickly open one
15. Figure 3.15.3 Dialog to set parameters for SNP annotated results by genes. Figure 3.15.4 Dialog to set parameters for SNP annotated results by regions. You can check the exported file in your defined output path. See chapter 4 for more information on output file interpretation.
3.4 Gene
> Association scan
> Cluster analysis
> Cross validation
> Annotation & export
> LD plot annotation
1. Association scan: In parallel with the weighting SNP analysis, gene-based association analysis can also be performed on the analysis genome built just now. The original p-values are used to compute gene-based p-values by the algorithm described in the KGG paper [2]. Multiple outputs will be created, which include Manhattan plots as well as QQ plots of the gene-based and SNP-based p-values. You can click the Scan button to calculate gene-based p-values on the whole genome and check the output graphs. Once finished, KGG will be shown as in Figure 3.17. A node named Gene Association Set will be created an
16. Association Set: select one previously completed gene-based analysis set, which produced gene-level test statistics and p-values.
PPI DB: the PPI String database (http://string-db.org).
PPI DB in File: the customized PPI dataset in a text file.
PPI pair association test: choose to use either HYST or HYST with prior weights in the analysis. The weights were derived from a number of properties of the genes in the PPI network (unpublished data), implying their tendency to be susceptibility genes of complex diseases. Note: while the result of HYST will be reliable, as we declared in the paper, the HYST with prior weights is currently an unpublished method, which you may try only to see how it works on your dataset.
Multiple testing by: methods of multiple testing for PPI-based p-values.
Keep PPI pair:
Heterogeneity: exclude significant PPI pairs in which the gene-based p-values are significantly different. In this case, there is usually only one true susceptibility gene, which is highly significant and dominates the resulting PPI-based p-value.
Always keep significant PPI pairs in which both genes are genome-wide significant: the heterogeneity test may exclude PPI pairs in which one gene has an extremely significant p-value and the other has a significant p-value, although both are genome-wide significant. After the filtration by the heterogeneity test, KGG will bring these PPI pairs back.
LD files: the LD data to account for correlation between nearby g
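To illustrate what a multiple-testing setting with an error rate does to a list of PPI-based p-values, here is a minimal sketch of two standard procedures, the Bonferroni correction and the Benjamini-Hochberg step-up procedure. This excerpt does not list which methods the KGG dialog offers or how they are implemented, so the function names and the example p-values below are illustrative assumptions only.

```python
def bonferroni_significant(p_values, error_rate=0.05):
    """Flag p-values that pass the Bonferroni threshold error_rate / m."""
    m = len(p_values)
    return [p <= error_rate / m for p in p_values]


def benjamini_hochberg_significant(p_values, error_rate=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k * q / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * error_rate / m:
            largest_rank = rank
    significant = [False] * m
    for idx in order[:largest_rank]:
        significant[idx] = True
    return significant


# Hypothetical PPI-pair p-values, for illustration only.
example_p = [0.001, 0.02, 0.03, 0.5]
bonf = bonferroni_significant(example_p)
bh = benjamini_hochberg_significant(example_p)
```

With these four values, Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the three smallest, which shows why the choice of method matters for how many PPI pairs are reported.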
17. E 5.0, or change the default version of Java. Similar to the Linux OS, the Java Home environment variable has to be configured to start KGG. Hint: we have prepared a default configuration for Mac OS users to change the Java version in the file run mac sh.
2.2 Installation of KGG
KGG does not have an installation wizard so far. After being downloaded from our website and decompressed, it can be launched through the command java -Xms256m -Xmx512m -jar KGG.jar in a command prompt window provided by the OS. In the command, -Xms<size> and -Xmx<size> set the initial and maximum Java heap sizes for KGG respectively. A larger maximum heap size can speed up the analysis. A higher setting like -Xmx768m is suggested when dealing with a large number of SNPs, say more than 1,000,000. The number, however, should be less than the size of the physical memory. We also prepared three command files, run linux sh, run win bat and run mac sh, for Linux, Windows and Mac OS respectively. In a Microsoft Windows command-line terminal, KGG.jar can be started by typing run. In the Linux and Mac terminals, users can type sh run linux sh and sh run mac sh to run KGG. Hint: in run mac sh, you must ensure that JAVA HOME is correct for your machine. The KGG package downloaded from the website does not include the resource data. We suggest launching KGG once you have downloaded and unzipped the KGG package. KGG will automatically download and
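As a small illustration of the launch command, the following sketch assembles the argument list from heap sizes given in megabytes. The helper name `build_kgg_command` and the default jar path are assumptions made for this example; only the `java`, `-Xms`, `-Xmx` and `-jar` parts come from the manual text.

```python
def build_kgg_command(initial_heap_mb=256, max_heap_mb=512, jar_path="KGG.jar"):
    """Return the argument list for launching KGG with the given Java heap sizes."""
    if initial_heap_mb > max_heap_mb:
        raise ValueError("initial heap size must not exceed the maximum heap size")
    return [
        "java",
        f"-Xms{initial_heap_mb}m",  # initial Java heap size
        f"-Xmx{max_heap_mb}m",      # maximum Java heap size
        "-jar",
        jar_path,
    ]


default_command = build_kgg_command()
large_dataset_command = build_kgg_command(max_heap_mb=1500)
# To actually start KGG from Python, one could pass the list to
# subprocess.run(default_command, check=True) in the KGG directory.
```

Keeping the command as a list rather than a single string avoids shell quoting problems if the jar path contains spaces.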
18. Ps, genes, pathways and PPIs. The significant genes are selected according to SNP-based p-values, either the original or the weighted.
2 Installation
2.1 Installation of Java Runtime Environment (JRE)
The JRE is required to run KGG on any operating system (OS). It can be downloaded from http://java.sun.com/javase/downloads/index.jsp for free. The JRE version required by KGG is 1.6 or higher. Installing the JRE is very easy on Windows and Mac OS X. On Linux, you have more work to do. Details of the installation can be found at http://www.java.com/en/download/help/linux_install.xml. On Ubuntu, if you get an error message like Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException, then please install the Sun Java Runtime Environment (JRE) first. To install the Sun JRE on Ubuntu 10.04, please use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
A detailed explanation of the above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-Lynx.html.
For Mac OS, the JRE 1.6 has been available at http://developer.apple.com/java/download since April 2008. Mac OS users may need to update Java to run KGG. A potential problem is that this update does not replace the existing installation of J2S
19. SNPs and seed candidate genes respectively. Here, 2-level means that the minimal path length from a tested gene to a seed candidate gene is 2 edges. The intermediate genes (in gray on the plot) between the tested gene and the seed candidate genes have PPI with both.
4.8 Output file 6: Graphs in htmlLog directory
When KGG is running, some graphs are automatically saved in the htmlLog directory as part of the results. These graphs include QQ plots and Manhattan plots for SNPs, genes or PPI networks. They help users to understand and interpret the SNP-based and gene-based analyses done by KGG. Some examples are given in the following three graphs. Figure 4.5.1 QQ plot of three sets of SNP p-values after weighting SNP analysis. Figure 4.5.2 QQ plot of gene-based p-values and the original SNP p-values after gene association analysis. Figure 4.6.1 Manhattan plot of gene-based p-values (chr 9 to 22). Note: users can determine how many genes and SNPs are labeled on the plot by setting different p-value thresholds. In order to present the whole genome clearly, two plots are drawn, for chromosomes 1 to 8 and chromosomes 9 to 22 respectively.
5 Tutorial
20. near-gene-3 | Within 500 bp 3' of a gene on either strand, but the variation is not in the transcript for the gene | 3
intron | In the intron of a gene, but not in the first two or last two bases of the intron | 4
5' untranslated region | In the 5' transcript of a gene, but will not be translated | 5
3' untranslated region | In the 3' transcript of a gene, but will not be translated | 6
coding synonymous | Within the coding region of a gene, but does not change the amino acid sequence of a gene | 7
splice-5 | In the first two bases at the 5' end of the intron | 8
splice-3 | In the last two bases at the 3' end of the intron | 9
missense | Within the coding region of a gene and changes the amino acid sequence of a gene | 10
frame shift, nonsense | Insertion or deletion disrupting the reading frame; mutations resulting in a premature stop codon | 11
II. According to non-gene properties
If UCSC Conservation Score > X (0.8 as default), then move it to the next category.
If Selection Score > Y (2.0 as default), then move it to the next category.
If miRNA binding site, then move it to the next category.
The following table (Table 2) shows the ratios according to the three HapMap samples. Although the magnitudes of the ratios in the three different HapMap samples are not exactly identical at each category, the general trend is consistent. For example, Category C4 has a ratio less than 1 for all three samples. This makes sense because most of the SNPs in C4 belong to the introns of genes. Biologically, the intronic SNPs may not be as functionally
21. Figure 5.10 Pathway enrichment exploration by gene p-values.
Step 11: search PPIs between significant genes. The significant genes can be picked up according to the gene p-values and SNP p-values, set as in Figure 5.11.
22. archical tree (Figure 4.3). The p-value of the pathway sharing is calculated according to the hypergeometric distribution or our method.
[Figure 4.3: tree view of enriched pathways. The example shows the KEGG pathway "Alzheimer's disease" (p-value 5.620E-13) with its candidate genes (including APOE, APP, IDE, NCSTN, PSEN1, PSEN2 and TNF) and the tested genes, each with its SNP, original p-value, weighted p-value and risk.]
Figure 4.3 Visualization of shared pathways.
4.7 Output file 5: Enriched PPI network result
The enriched PPI sub-networks are visualized by an open-source Java package, JUNG (http://jung.sourceforge.net) (Figure 4.4). It is a directed graph. The genes of the selected SNPs are the starting nodes. The end nodes are the seed candidate genes. The depth of a sub-network is the largest number of edges from a starting node to an end node. Figure 4.4 Visualization of a 2-level sub-PPI network enriched by both seed candidate genes and genes with significant SNPs through our tool KGG. Each node denotes a gene, labeled by gene symbol. An edge indicates a PPI between two genes. The red and green nodes denote tested genes with significant
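For readers unfamiliar with the hypergeometric calculation mentioned for pathway sharing, the following is a minimal sketch of the usual upper-tail enrichment p-value. It is a generic textbook formulation, not KGG's own code; the function name, parameter names and the numbers in the example are illustrative assumptions.

```python
from math import comb


def hypergeometric_enrichment_p(population, pathway_size, selected, overlap):
    """Upper-tail probability P(X >= overlap) under the hypergeometric distribution.

    population:   total number of genes considered
    pathway_size: number of those genes that belong to the pathway
    selected:     number of significant genes drawn from the population
    overlap:      number of significant genes that fall in the pathway
    """
    upper = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes in total, 4 in the pathway, 3 significant, 2 of them in the pathway.
p_toy = hypergeometric_enrichment_p(10, 4, 3, 2)
```

In the toy example, the probability of seeing two or more pathway genes among three significant genes is (36 + 4) / 120, that is, one third.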
23. ased mining system for Genome-wide Genetic studies) is a software tool to perform knowledge-based analysis for genome-wide association studies (GWAS). At present it has three major functions: 1. prioritizing SNPs through a knowledge-based weighting method; 2. conducting gene-based association tests using SNP p-values from GWAS; 3. advanced biological module-level association analysis (pathway enrichment and protein-protein interaction (PPI) network association) by a set-based test using gene-based p-values. The knowledge-based weighting method for SNPs has been described in our paper [1]. It combines both biological knowledge and statistical association p-values to produce optimal weights, which can maximize the potential power of association tests while controlling false positive discoveries. The method showed excellent performance in systematic investigations covering theoretical calculation, computer simulation and practical application. The gene-based and set-based genome-wide association analysis methods have been described in our papers [2] and [3]. These methods do not require the genotypes of SNPs but focus on quickly combining the available SNP p-values for association. Compared with existing methods proposed for gene-based and/or set-based association, a nice characteristic of these methods is that they do not resort to any computation-intensive procedure like permutation to address the issue of varied gene set size or
24. can results can also be annotated and exported, but only by p-value threshold and by genes. Figure 3.20.1 Dialog to set parameters for cross-validating gene association. Figure 3.20.2 Dialog to set parameters for cross-validating gene association.
5. LD plot annotation: Once the analysis genome has been built with integration of LD information, KGG provides an LD plot presenting the gene-level p-value, the SNP-level p-values and the LD structure across the variants (see Figure 3.21). Figure 3.21 Dialog to set parameters for LD plot annotation.
3.5 Module
> Pathway-based association
> PPI-based association
At the higher level of biological knowledge, KGG currently provides pathway-based and PPI-based association analysis. More biological module-based analysis methods may be added in the future. The pathway-based association analysis aims to explore enriched and associated pathways in the MsigDB pathways, a secondary pathway dataset curated from KEGG Reacto
25. clicked.
2. PPI-based association scan: Genes are combined, using PPI pairs as analysis units, for a set-based association test by HYST. Statistically significant PPI pairs and the genes involved will be reported. This function has the potential to detect genes with moderate effects, which normally cannot pass multiple testing using SNP-based and/or gene-based p-values. Figure 3.24 PPI association scan setting in KGG.
Scan Name: give a unique name for the PPI association scan.
Gene
26. d parameters set for this genome could be viewed by clicking the symbol. Figure 3.16.1 Dialog to set parameters for Gene-based Scan using the improved Simes test method.
Genome Set: select the analysis genome built previously; all SNP p-values integrated in the analysis genome will be used.
Manhattan plot setting: includes settings for both genes and SNPs; the default threshold is 1E-4 for genes and 1E-6 for SNPs. Users can reset the thresholds according to the local data.
QQ plot setting: includes settings for both genes and SNPs.
Methods: four different methods are offered: the improved Simes test (GATES), the hybrid set-based test (HYST), and their corresponding weighted versions. Details, including the advantages and disadvantages of the methods, are described in the corresponding papers.
Weight Settings: once the improved Simes test with weights is selected, settings for the conservation score threshold, the natural selection score threshold and the reference population are shown. The following paragraph illustrates how the weights are constructed. See illustration 2.
Confirmation frame: once all SNPs
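As background for the "improved Simes test", the sketch below implements the classical Simes combination of the SNP p-values within one gene. It is not GATES itself: according to the GATES paper title, that method extends the Simes procedure, and its handling of LD between SNPs is not reproduced here. The function name and the example p-values are illustrative.

```python
def simes_p_value(p_values):
    """Classical Simes combined p-value: min over ranks j of m * p_(j) / j."""
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical SNP p-values within one gene.
gene_p = simes_p_value([0.01, 0.04, 0.03])
```

For the three example values, the candidates are 3 x 0.01 / 1, 3 x 0.03 / 2 and 3 x 0.04 / 3, so the combined p-value is 0.03.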
27. enes on the same chromosome. Note: the built analysis genome only contains LD of SNPs within genes. Export: KGG can save the PPI association scan results to local disk and also automatically display them in the PPI viewer (Figure 3.25). Figure 3.25: KGG view after PPI association scan. Selected Models: includes three models: transforming, picking and annotating. Layout: includes 7 layout modes: KKlayout, FRlayout, FRlayout2, Circlelayout, Springlayout, Springlayout2 and ISOMlayout. 3.6 Tools. Currently KGG offers a tool for automatically downloading HapMap LD data (see Figure 3.26). Figure 3.26: KGG tool frame to download HapMap linkage disequilibrium data. 4 Input & output files. 4.1 Input file 1: GWAS results. KGG focuses on the downstream analysis of GWA studies, where statistical association p-values or chi-square values at SNPs have been generated by conventional statistical genetic tools such as PLINK. Therefore, the association p-values are the major input of KGG. KGG flexibly supports a user-customized format for the association p-values: provided that three columns of information (chromosome number, SNP ID or physical position, and p-values) are available in a file, you can define the column order yourself in KGG. The
28. enome set. Once finished, KGG will be shown as in Figure 3.14. A node named Weighted SNP Set will be created, and the parameters set for this genome can be viewed by clicking the symbol. 2. Annotation & Export: in the same project you have just created, you can retrieve interesting biological knowledge for SNPs after weighting them. The knowledge may provide important hints for understanding the statistical significance and proposing functional hypotheses. KGG now has four ways to pick the SNPs you are interested in (By P-Values, By SNPs, By Genes and By Regions); you can try any one according to your purpose. Figure 3.13.1: Dialog to set parameters for SNP annotated results by p-values. Dialog hint: put SNPs with positive association in previous studies here to validate your association results. Figure 3.15.2: Dialog to set parameters for SNP annotated results by SNP ID. Dialog hint: put genes here
29. igure 5.4: Select META P to build the analysis genome, and name the genome genome_crohn. Step 5: weight the SNP p-values in the genome_crohn genome just built. Set the parameters as in Figure 5.5 and name the result weight_crohn. Figure 5.5: Settings for weight SNPs analysis. Step 6: export the weighted SNP result by p-value threshold and save it to the local computer in Excel format. Figure 5.6: Annotate weighted SNPs by p-values. Step 7: do a gene-based scan using the SNP p-values integrated in the analysis genome named genome_crohn; select the GATES method (more powerful for a gene with one or a few independent causal variants). Set the parameters as in Figure 5.7 and name the result genescan_crohn. Remember that exported Manhattan plots and QQ plots will be saved in the htmlLog folder. Figure 5.7: Settings for gene-based
30. iled information on how the analyses are conducted in KGG is recorded in this log file. 4.4 Output file 2: Annotated SNP result. Results of weighted SNPs are saved in the path defined by the user. Opened in a text editor, the file looks like the following figure. Figure 4.1: Text format of weighted SNP result. Illustration of columns in the file: SNP: rs ID of the SNP in dbSNP. PValue: original statistical significance. Weighted PValue: weighted statistical significance. IsSignificant: whether the SNP is significant according to the weighted p-values. Chromosome: the SNP's chromosome number. Position: physical location on the chromosome. Gene_Symbol: approved official gene symbol for the SNP. Entrez_GeneID: NCBI Entrez gene ID. Gene Feature: gene feature where the SNP is located. Conservation Score: conservation score of this SNP generated by UCSC (http://genome.ucsc.edu). miRNA Binding Site: whether the SNP is within the target binding site of a miRNA according to Sanger's miRBase (http://microrna.sanger.ac.uk). Selection Score CHBJPT: nature selection score of this SNP in the HapMap CHB+JPT population chosen. Selection Score CEU: nature selection score of this SNP in the HapMap CEU population chosen. Selection Score YRI: nature selection score of this SNP in the HapMap YRI population chosen. 4.5 Output file 3: Annotated gene result. Results of gene-based p-val
31. input file can include more than one p-value column. The following is an example. Example input format of KGG with rsID (columns: CHR, SNP, P-value1, P-value2, P-value3; one row per SNP): 4 151513589 002301 08815 0 007688; 4 15294755 0434 09575 0 006112; 4 1835316 0 002688 0 007688 04893; 4 251841043 00115 0 006112 0 119; 4 1511726946 0005892 0 4893 o 0. Example input format of KGG with only position (columns: CHR, SNPID, SNPPOS, P-value1, P-value2): 4 Supt 100001 002301 08815 0 007688; 4 110011 04384 09575 0 006112; 4 Snp3 120001 0 002688 0007688 0 4893; 4 13002 0006112 0119; 4 SnpS 140001 0 005802 0 4893. Moreover, a p-value column can include values from different test models. KGG will recognize this format if you select the input format "multiple tests per column" when building the analysis genome. Example of a more complex input format of KGG (columns: CHR, SNP, P-value1, Test Mode, P-value2): 4 151513559 000301 additive 0 007688; 4 151513559 0 4384 recessive 0006112; 4 151513550 0 002688 dominant 0 4803; 4 1184143 001115 additive 0119; 4 151841043 0 005892 recessive. 4.2 Input file 2: Candidate gene list. Candidate genes can be loaded one by one or imported from a TXT file. The input file has only one column and no header; each row contains one gene symbol or ID. 4.3 Output file 1: log file. The output in the log frame will be saved in the log html file (htmlLog directory). Deta
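To make the column layout described above concrete, here is a minimal sketch of reading a whitespace-delimited association file with a header row, one chromosome column, one SNP ID column and one or more p-value columns. It is an illustration of the file shape only, not KGG's own parser; the function name `read_gwas_pvalues`, the column names `P1` and `P2`, and the SNP IDs and p-values in the sample are all hypothetical.

```python
import io


def read_gwas_pvalues(handle, chr_col="CHR", snp_col="SNP"):
    """Read a whitespace-delimited association file with a header row.

    Every column other than the chromosome and SNP columns is treated
    as a p-value column (an assumption made for this sketch).
    """
    header = handle.readline().split()
    p_cols = [name for name in header if name not in (chr_col, snp_col)]
    records = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        records.append({
            "chr": row[chr_col],
            "snp": row[snp_col],
            # A row may lack trailing p-values; keep only those present.
            "pvalues": {name: float(row[name]) for name in p_cols if name in row},
        })
    return records


# Hypothetical sample data, not taken from the manual.
sample = io.StringIO(
    "CHR SNP P1 P2\n"
    "4 rs0000001 0.02 0.88\n"
    "4 rs0000002 0.43 0.95\n"
)
records = read_gwas_pvalues(sample)
print(records[0])
```

Because the column order is user-defined in KGG, the sketch looks columns up by header name rather than by position.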
32. me database, etc., by significant genes or seed genes. PPI-based association: protein-protein interaction analysis aims to explore significantly associated PPIs in the integrated PPI databases for the disease in question. 1. Pathway-based association scan. Figure 3.22.1: Dialog to set parameters for pathway-based association analysis. Each pathway is interrogated by two tests for its implication in the disease or trait in question: HYST and the hypergeometric distribution test. HYST combines all gene-based p-values for association, with correction for LD between genes, while the h
33. p reference population or raw genotype data. Candidate Genes & Seed Genes: a list of gene symbols or gene IDs from a previous GWAS or meta-analysis. Candidate genes are genes with suggestive evidence of involvement in the development of the traits or diseases; seed genes are genes with very strong evidence of such involvement according to previous studies. This file is optional; if it is not provided by the user, KGG will automatically select the top N significant genes as seed genes, ONLY when generating optimal weights for SNPs. SNP weighted results: weighted p-values for SNPs generated by the algorithm referred to in KGG paper 1. Gene-based results: gene-based p-values for statistical association generated by the algorithms referred to in KGG papers 2 and 3. PPI Network: PPI-based p-values for association generated by the method described in KGG paper 3. The STRING PPI database (http://string-db.org) is used here. Enriched Pathway: pathway-based p-values for association generated by the method described in KGG paper 3. The pathway sets from MSigDB (http://www.broadinstitute.org/gsea/msigdb) are used here. Gene Clusters: subsets of genes that have functional correlations with seed candidate genes of the diseases or traits being studied. This function requires pre-set Candidate Genes & Seed Genes. Annotated results: significant SNPs or genes with genomic annotations
34. rt to perform SNP-based analysis now. The Weight SNPs dialog (Figure 3.13) can be shown by clicking the menu SNP > Weight SNPs or the accelerator. Figure 3.13: Dialog to set parameters for SNP weighting. Illustration of parameters in Figure 3.13: Weight Set Name: define a name for the weight set, which records the parameters of the weighting procedure. Genome Set: select the correct genome set for this analysis when you have more than one genome set. Classification Settings: Conservation Score Threshold: a conservation score cutoff to define increased disease risk of SNPs. The score ranges from 0 to 1, and the default cutoff is 0.8. We assume that the higher the score, the higher the risk. Nature Selection Reference Population: choose a reference population with nature selection scores. At present only three are available (CEU, combined CHB and JPT, and YRI), as defined by the HapMap project (http://www.hapmap.org). You can choose the one population closest to the sample being tested. Nature Selection Score Threshold: a nature selection score cutoff to define increased disease risk of SNPs. The scores are calculated by Voight et al. (2006). Their range is from … to … According to Voigh
35. scans. Step 8: select genescan_crohn as the subject, perform cluster analysis for candidate seed genes and their expanded genes by pathway or PPI network, and export the clustering result to the local computer in Excel format. Figure 5.8: Settings for group and cluster gene association. Step 9: select genescan_crohn as the subject and export the gene association results (no SNP information needed) to the local computer in Excel format. Figure 5.9: Annotate gene-based association result by p-values. Step 10: perform pathway enrichment exploration by gene p-values, with settings as in Figure 5.10.
36. sis setting: select the chromosome and marker ID columns in the original GWAS file. Input type: includes p-values and chi-square values. Input format: includes single test per column and multiple tests per column (a common format for PLINK output). Exclude regions: some regions might include false-positive signals, too much noise or outliers; you can remove them at this step. The same LD as available genome: once a previous genome has been built with LD correction, KGG stores this LD information in the system. A following genome can then be easily corrected as long as the same population is used as for the previous one. LD files & Columns in LD files: KGG offers three ways of integrating LD information: HapMap LD SNP coefficients downloaded from the HapMap FTP site, genotypes of user samples (PLINK bed file format), or available local LD SNP coefficients. If LD coefficients are loaded, users also need to define the corresponding columns in the Columns in LD files panel. 4. Build analysis genome by position: if the input association file does not contain rsIDs for many variants (see Chapter 4 for an introduction to the input files), the analysis genome can alternatively be built with Build analysis genome by position. Figure 3.11 shows the frame for setting detailed parameters; additional annotations not covered in part 3 are also given.
37. sis is used to examine significant genes in a set of candidate genes defined by the user. Optionally, the user can expand the seed genes in the candidate gene set by including genes that share a pathway or have PPIs with the seed genes. A detailed explanation of each parameter can be found in the Weight SNPs or gene association scan parts. Figure 3.18: Dialog to set parameters for Group and cluster gene association. 3. Cross validation: if more than one gene-based association p-value set based on samples has been created, cross-validate gene association can be used to examine significant genes that appear in multiple samples. Accumulating evidence of significance increases the likelihood of a true association. Figure 3.19: Dialog to set parameters for cross-validate gene association. 4. Annotation & Export: similar to weight SNP annotation & export, association gene s
38. t et al. (2006), 2.0 is suggested as the default threshold of significant selection. The higher the absolute value of the selection score, the more significant the nature selection and thus the higher the risk. Negative scores indicate negative selection and positive ones indicate positive selection. Candidate Genes Set: the candidate gene set defined previously; it is optional. PPI Level for Candidate Gene Expansion: the exploration depth used to extend candidate genes. The default level of 2 means that the extended candidate gene set will include genes having indirect PPIs with a seed candidate gene. Use Pathway Information: optional; once you select it, the pathway size can also be limited (default is from 10 to 300 genes per pathway). Weight Settings: Multiple comparison methods: choose a multiple-testing method to calculate the significance of tested SNPs. Error Rate: set the nominal false-positive error rate. Top n as the derived seed candidate genes: the top n genes (n = 20 by default) according to the newly weighted p-values are chosen to form a new set of seed candidate genes. Generate weight by iteration: KGG will compute the weighted SNP p-values iteratively according to seed gene. Detailed information can be seen in the log frame when KGG is running. Figure 3.14: KGG view after SNP weighting. Finally, you can click the weight button to calculate weighted SNP p-values for your g
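The two classification thresholds described for the Weight SNPs dialog (conservation score cutoff, default 0.8, and nature selection score cutoff, default 2.0 on the absolute value) can be illustrated with a small sketch. How KGG combines these flags into a weight is not stated in this section, so the sketch only reports the two flags separately; the function name `classify_snp` and the use of inclusive comparisons (`>=`) are assumptions.

```python
def classify_snp(conservation, selection,
                 conservation_cutoff=0.8, selection_cutoff=2.0):
    """Flag a SNP against the two thresholds named in the manual.

    Assumptions: comparisons are inclusive, and a missing score (None)
    never passes its threshold. The flags are returned separately because
    the rule for combining them into a weight is not given here.
    """
    conserved = conservation is not None and conservation >= conservation_cutoff
    # The manual uses the absolute value: both strongly negative and
    # strongly positive selection scores count as significant selection.
    under_selection = selection is not None and abs(selection) >= selection_cutoff
    return {"conserved": conserved, "under_selection": under_selection}


# Hypothetical scores for two SNPs.
print(classify_snp(0.9, -2.5))
print(classify_snp(0.5, 1.0))
```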
39. test selected, this frame pops up and gives the user some information; usually click Yes to continue with the analysis. Figure 3.16.2: Dialog to set parameters for gene-based scan using the HYST method, a hybrid of the extended Simes test (GATES) and the scaled chi-square test. Figure 3.17: KGG view after gene association scan. Illustration 2: Construction of weights to prioritize SNPs. We used categorical information on available GWAS hits to construct weights for the gene-based test. This is a simple and feasible alternative for weight construction. Presumably, the weights reflect the a priori likelihood of a particular class of SNP being found to be associated with disease. We downloaded all reported GWAS hits contained in a comprehensive GWAS database, HuGE (http://hugenavigator.net/HuGENavigator/gWAHitStartPage.do; Hindorff et al., 2009). The number of significant association hits was 2908 as of March 12, 2010. This set was then expanded by including SNPs in strong LD (r² ≥ 0.8) with them in the HapMap database http://hapmap.ncbi.nlm.nih.gov
40. ues are saved in the path defined by the user. Depending on whether the user chose to include SNPs in the result, the file opened in Microsoft Excel looks like Figure 4.2.1 or Figure 4.2.2. Figure 4.2.1: Excel format of gene-based p-value result (SNPs included). Gene Symbol: official gene symbol in NCBI. Gene PValues: gene-based p-values combined from several SNPs within and around the gene. IsSignificant: whether significant or not according to a p-value threshold for multiple testing. Entrez GeneID: gene ID recorded by NCBI Entrez. Chromosome: chromosome of the gene. Start_Position: start physical position of the gene. Length: length of the gene in base pairs. SNP: SNP ID (rs) contributing to the gene. Position: physical location of the SNP. Gene Feature: the SNP's functional region relative to the gene. Conservation Score, Binding Site, Selection_Score_CHBJPT, Selection Score CEU, Selection Score YRI: please see the illustration for output file 2. Figure 4.2.2: Excel format of gene-based p-value result (SNPs excluded). Note: the columns are similar to those of the result file that includes SNPs, except SNP, which here means the number of SNPs in the gene. 4.6 Output file 4: Enriched pathway result. The shared pathways are presented in a tree structure. Pathway sources, names, involved genes and SNPs are nicely visualized in a hier
41. ypergeometric distribution test examines the statistical significance of enrichment of the pathway for genes with promising p-values and pre-set important candidate genes of the disease or trait. The pathways are sorted according to the sum of their ranks in the two tests. Illustration of parameters in Figure 3.22.1: Exploration Name: set a name for this pathway enrichment exploration. MsigDB pathway: pathway datasets provided by MSigDB (http://www.broadinstitute.org/gsea/msigdb). Pathway size: exclude pathways with too few or too many genes. Statistics: parameters for how to select genes into pathways for the hypergeometric distribution test. LD files: the LD data used to account for correlation between nearby genes on the same chromosome. Note: the built analysis genome only contains LD of SNPs within genes. Export: the path and format for exporting pathway enrichment results. Finally, you can click the Explore button to perform the pathway enrichment exploration. Once finished, KGG will be shown as in Figure 3.23. A node named Enriched pathways will be created. Enriched pathways can be viewed in the Data view frame. You can also check the output result file (Excel or TXT format) in your defined output path. See Chapter 4 for more on output files. Figure 3.23: KGG view after pathway enrichment exploration. The red arrow shows an enriched pathway, which can also be viewed online after right
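The hypergeometric distribution test mentioned above can be sketched as a one-sided enrichment p-value: the probability of seeing at least the observed number of selected genes inside a pathway by chance. This is a generic textbook formulation under stated assumptions, not KGG's implementation; the function name `hypergeom_enrichment_p` and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, selected_genes, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    total_genes:    number of genes in the background
    pathway_genes:  number of background genes that belong to the pathway
    selected_genes: number of genes picked as promising (e.g. by p-value)
    overlap:        number of selected genes that fall in the pathway
    """
    max_overlap = min(pathway_genes, selected_genes)
    denominator = comb(total_genes, selected_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, selected_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denominator


# Hypothetical counts: 10 background genes, 3 in the pathway,
# 2 genes selected, 1 of which is in the pathway.
p = hypergeom_enrichment_p(10, 3, 2, 1)
print(p)
```

Exact integer binomial coefficients are used so that small examples can be checked by hand; for genome-scale counts a statistics library would be the more practical choice.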
