Introduction to Bioinformatics for Biological Sciences

1. [Screenshot: NCBI BLAST home page menu, listing translated searches (translated query vs. protein database, blastx; protein query vs. translated database, tblastn; translated query vs. translated database, tblastx), genome-specific BLAST pages, and special and meta searches (GEO, retrieve results, bl2seq, VecScreen, IgBlast, SNP BLAST).] BLAST can be found on the NCBI website (http://www.ncbi.nih.gov/blast). When you enter the BLAST website, you will be given a choice of different BLAST programs. For our purposes we will be using blastn, blastp and blastx. • blastn is used to compare an unknown nucleotide sequence to the NCBI nucleotide database; • blastp compares a protein sequence to a protein database; • and blastx takes a nucleotide sequence, translates it into protein in all six reading frames, and compares the translations to the protein database. Note: Just remember that you have to compare nucleotides with nucleotides and proteins with proteins, unless you are using blastx. 22 AN INTRODUCTION TO BIOINFORMATICS FOR BIOLOGICAL SCIENCES STUDENTS Part 2: The BLAST form. Once you have chosen the appropriate BLAST program, you will see the BLAST input window. You can enter the query sequ
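The rule in the note above (match the query type to the database type, with blastx as the exception) can be sketched as a small lookup. This is only an illustration: the function name `choose_blast_program` and the type labels "nucleotide" and "protein" are hypothetical and are not part of any NCBI tool.

```python
def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program matching a query/database pairing.

    Only the three programs used in this tutorial are covered; the
    labels 'nucleotide' and 'protein' are illustrative assumptions.
    """
    programs = {
        ("nucleotide", "nucleotide"): "blastn",  # nucleotide vs nucleotide
        ("protein", "protein"): "blastp",        # protein vs protein
        ("nucleotide", "protein"): "blastx",     # translated query vs protein
    }
    key = (query_type.lower(), database_type.lower())
    if key not in programs:
        raise ValueError(f"No tutorial program for {key[0]} query vs {key[1]} database")
    return programs[key]


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```

Other pairings (for example a protein query against a translated nucleotide database, which the menu lists under tblastn) are deliberately left out and raise an error here.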
2. [Figure: chart of PDB holdings; axis values not recoverable.] Caption: As of March 2006, the PDB is holding a total of 35,579 3-D structures (27,204 in Sept. 2004), among which 32,519 are proteins, peptides or viruses; 1,448 are protein/nucleic acid complexes; 1,510 are nucleic acids only; and 102 are other compounds. Determining the 3-dimensional structure of macromolecules, in particular of proteins, is a daunting task involving X-ray crystallography or NMR spectroscopy. The success of these experimental techniques is difficult to predict, and structure determination is often likened to an art. X-ray crystallography, for instance, requires the growth of a protein crystal up to 1 mm in size from a highly purified protein source. Contents: What information is contained in each entry of PDB? "A variety of information associated with each structure is available, including sequence details, atomic coordinates, crystallization conditions, 3-D structure neighbors computed using various methods, derived geometric data, structure factors, 3-D images and a variety of links to other resources." (PDB website) Two file formats are available to represent the structural data contained in a PDB entry and other information such as name of molecule, references, etc. They are the PDB format and the macromolecular Crystallographic Information File (mmCIF) format, which consist essentially of plain text specifying spatial coordina
3. s full name, a description, the species. References: Articles referring to this protein. Comments: Combination of various fields concerning that protein, like a description of the protein's function, etc. Database Cross References: Links to other databases concerning the protein of interest, such as domains it contains. 6. Features: A description of the domains, disulfide bonds, transmembrane regions, etc., with begin/end position and length. 7. Sequence: The peptide sequence in plain text. The default view from UniProt/PIR is of course the PIR view. Probably because UniProt is still in its infancy, the EBI format (SRS) and the SIB format (NiceProt) are also offered as alternatives. All views show the same information, with fields ordered slightly differently. UniProt website: http://www.uniprot.org 2.4 Protein Families and Domains Databases. Before talking about protein families and domains databases, it is important to outline some of the concepts of molecular evolution, itself a major field of study in bioinformatics and biomathematics. "A protein family is a group of evolutionarily related proteins." (Wikipedia) Evolution is an expensive process, in the sense that if an enzyme doesn't work, you die. Most mutations will appear as neutral (there's a nucleotide change, but it's either in non-coding regions or it doesn't change the amino acid the codon ultimately coded for) or as having delet
4. BLAST (Basic Local Alignment Search Tool) is a bioinformatics tool that is used to compare an unknown sequence (from now on we will call this sequence a query sequence) to millions of known sequences in a database. Therefore, the completeness and integrity of the chosen database are essential to a BLAST search. BLAST, hosted by NCBI, works by comparing a query sequence to all the sequences in the NCBI databases. It does so by looking for regions of similarity between the query sequence and sequences contained in the database. Part 1: Using the web-based version of BLAST. [Screenshot: NCBI BLAST home page (news item: 28 August 2005, BLAST 2.2.12 released), with menus for nucleotide BLAST (megablast, discontiguous megablast, blastn, short nearly exact matches, trace archives), protein BLAST (blastp, PSI- and PHI-BLAST, short nearly exact matches, rpsblast conserved domain search, cdart), translated BLAST, and genome BLAST pages (human, mouse, rat, chimp, cow, pig
5. Matrix: A key element in evaluating the quality of a pairwise sequence alignment is the substitution matrix, which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching. Part 5: Analyzing Conserved Domains using Blastp. If you are lucky enough to have a sequence that is highly annotated, you may be able to determine the protein function of specific open reading frames through the use of conserved domains, using blastp. A conserved domain is a region in a protein sequence that is retained in the 3-D structure of a protein and confers a special function on the protein (e.g. zinc finger domain, ribonuclease domain). [Screenshot: NCBI formatting BLAST page for a 903-letter query, reporting that the request was submitted to the BLAST queue and that putative conserved domains have been detected (click on the image for detailed results).] Clicking on the colored conserved domains above will open a more detailed view of the various domains and their positions within your ORF. CHAPTER 3 TUTORIALS 27 The domain relatives button looks for similarity of domain
6. Repetitive elements provide important clues about chromosome dynamics, evolutionary forces, and mechanisms for exchange of genetic information between organisms. The most ubiquitous class of repetitive elements in the DNA sequence of primate genomes is the Alu family of interspersed repeats, which have arisen in the last 65 million years of evolution. Alu repeats belong to a class of sequences defined as short interspersed elements (SINEs). Approximately 500,000 Alu SINEs exist within the human genome, representing about 5% of the genome by mass. S Selectivity: Selectivity of bioinformatics similarity search algorithms is defined as the significance threshold for reporting database sequence matches. As an example, for BLAST searches the parameter E is interpreted as the upper bound on the expected frequency of chance occurrence of a match within the context of the entire database search. E may be thought of as the number of matches one expects to observe by chance alone during the database search. Sensitivity: Sensitivity of bioinformatics similarity search algorithms centers around two areas. First, how well can the method detect biologically meaningful relationships between two related sequences in the presence of mutations and sequencing errors? Secondly, how does the heuristic nature of the algorithm affect the probability that a matching sequence will not be detected? At the user's discretion, the speed of most similarity search programs ca
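To make the meaning of E concrete, here is a minimal sketch of the standard Karlin-Altschul relationship between a normalized bit score and the expected number of chance matches, E = m × n × 2^(−S'), where m is the query length, n is the total database length and S' is the bit score. The function name `expected_chance_hits` is hypothetical, and real BLAST implementations use corrected "effective" lengths, so values reported by NCBI will differ somewhat.

```python
import math


def expected_chance_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate E value: number of alignments with at least this bit
    score expected by chance in a search space of query_length x database_length.

    Assumption: raw lengths are used instead of BLAST's effective lengths.
    """
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Each additional bit of score halves the number of expected chance hits.
    e40 = expected_chance_hits(40.0, 300, 1_000_000)
    e41 = expected_chance_hits(41.0, 300, 1_000_000)
    print(e40, e41, math.isclose(e41 / e40, 0.5))
```

The sketch also shows why the same score is less significant against a larger database: E grows in proportion to the size of the search space.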
7. [Figure: excerpt of the ClustalW multiple sequence alignment and pairwise scores; sequence text not reliably recoverable.] The guide tree is constructed by ClustalW to infer an MSA. It is based on pairwise alignments and is not a valid substitute for a true phylogenetic tree, itself built from an MSA. Part 4: Building the phylogenetic tree. Now save your multiple alignment, because we need it for the next step. Depending on the system you work on, you may open your .aln file directly from your browser window. Copy its contents to the Clipboard or Notepad if you're afraid of losing it. Otherwise, you should save your .aln file (right-click on the link at the top of the page). Alternatively, you may also copy the alignment as seen on the results webpage. ClustalW is smart and will interpret it, but only if you didn't copy any junk before and after the alignment. Now return to your original ClustalW form (http://www.ebi.ac.uk/clustalw) and paste your multiple sequence alignment, as ugly as it might be. Or choose to upload the .aln file you saved (it's always a good idea to save every file you use in a safe place). Just as a good experiment in a real lab requires you to keep track of anything you do in a
8. and the DNA Data Bank of Japan (DDBJ, Japan) are the three biggest nucleotide sequence databases in the world. Their main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The NCBI hosts the most well-known database, GenBank. As a result of the International Nucleotide Sequence Database (INSD) Collaboration between the NCBI, EMBL and DDBJ, new submissions are shared between databases, leading them to have similar content, although the annotations can differ. This collaboration between the three institutes has existed for 16 years. 2.2.2 Entrez: NCBI's multi-purpose search engine. Entrez can be used to search any of the NCBI-hosted databases. PubMed is one of NCBI's databases; it is the scientific publications database. The NCBI website is not easy to navigate and takes a lot of fooling around before one can safely sail from place to place. You can use Entrez directly from NCBI's homepage (http://www.ncbi.nih.gov), but you will be missing out on many of the search options. If you want to search PubMed or another database, just click on the upper bar link, with Entrez being the cross-database search (useful when you want all the information about a specific gene). Refine your search: Some parameters can be used to refine your search. In general, you might want to start by limiting your searches. So these options are only accessible through each specific Entre
9. ncbi.nih.gov/gorf "The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database. This tool identifies all open reading frames using the standard or alternative genetic codes. The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the WWW BLAST server." (NCBI website) [Screenshot: ORF Finder input form (Enter GI or ACCESSION, or sequence in FASTA format; FROM/TO range; genetic code selector), with callouts:] Enter the accession of the sequence you would like to search within, or paste the nucleotide sequence in FASTA format in the box below. To limit your search to a specific genetic code, select it from the toolbar below. [Screenshot: ORF Finder results for HIV_genome unknown_isolate, with callout:] The ORF Finder results page lists the largest open reading frames along with a graphical view of their relative positions. ORFs listed (frame, from..to, length): 3, 1725..4436, 2712; 3, 5574..8072, 2499; 1, 145..1620, 1476; 1, 8074..8691, 618; 1, 4381..4959, 579; 2, 6692..7219, 528; 3, 4899..5189, 291; 2, 2156..2425, 270; 2, 1592..1840, 249; 2, 5411..5656, 246; 1, 5170..5388, 219; 2, 1862..2074, 213; 2, 151..343, 213; 1, 8451..8654, 204. Program: blastp. Database:
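The idea behind an ORF search (scan every reading frame on both strands for a start codon followed in-frame by a stop codon, and keep hits above a minimum size) can be sketched as follows. This is a simplified illustration under stated assumptions, not NCBI's implementation: it uses only the standard stop codons, reports only ORFs that have both a start and a stop, and gives minus-strand coordinates relative to the reverse-complemented sequence. The names `find_orfs` and `reverse_complement` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code only
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(sequence: str, min_length: int = 100, start_codon: str = "ATG"):
    """Return (frame, start, end, length) tuples, longest first.

    frame is +1..+3 for the given strand and -1..-3 for the reverse
    complement; start and end are 1-based positions on the scanned strand,
    and length includes the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for strand, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for pos in range(offset, len(strand_seq) - 2, 3):
                codon = strand_seq[pos:pos + 3]
                if start is None and codon == start_codon:
                    start = pos                      # first in-frame start codon
                elif start is not None and codon in STOP_CODONS:
                    length = pos + 3 - start
                    if length >= min_length:
                        orfs.append((strand * (offset + 1), start + 1, pos + 3, length))
                    start = None                     # look for the next ORF
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG", min_length=9))
```

Passing a different `start_codon` mimics the option of examining alternative initiation codons; supporting alternative genetic codes would mean swapping the stop codon set.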
10. nr with parameters. [Screenshot: ORF Finder graphical view and ORF list, with the nucleotide sequence and translation of the selected ORF starting at position 1725.] The graphical view shows all three reading frames, in both the positive and the negative directions. Clicking on an ORF will highlight it in the list and present you with its sequence. You can also examine alternative initiation codons, as opposed to the default ATG codon that ORF Finder uses. Tutorial: How to use ClustalW to perform multiple sequence alignments and build phylogenetic trees. By Cedric Sam <cedric.sam@elf.mcgill.ca>. Version 2.0, September 2005. Part 1: Using the web-based version of ClustalW. For this tutorial, we will be showing you how to use the web interface for ClustalW hosted by the European Bioinformatics Institute. If you can't remembe
11. [Screenshot: InterPro Table View entry details. Parent: no parent. Children: no children. Found in: IPR008063, IPR011172, IPR011366. Contains: no entries. GO terms: Molecular Function: receptor activity (GO:0004872).] The ID next to each row represents the families found using each different program. Each link leads you to a description of the domain found. To simplify things, we can limit ourselves to the InterPro description (ID starting with IPR), since all domains listed in one block are equivalent. Numbers to the right represent the location of the domain within the sequence, and the letters beside the numbers signify the status of the hit: T for True, F for False positive, or ? for unknown. For hits with Negative (N) and partial (P) status, the positions are undefined and cannot be shown in graphical view. Part 3: Gathering information from the PFAM database. "Pfam is a database of protein domain families. Pfam contains curated multiple sequence alignments for each family, as well as profile hidden Markov models (profile HMMs) for finding these domains in new sequences. Pfam contains functional annotation, literature references and database links for each family. Pfam is a member of the InterPro consortium and has, like the other member databases, contributed annotation and families to the InterPro project. InterPro aims to provide an integrated view of the diverse protein family data
12. [Screenshot: BLAST hit list for an influenza A virus query, with callouts labelling the name of the sequence, the score, and the E value of each hit. The hits shown are Influenza A virus isolates (mostly from Thailand, from hosts including bird, chicken, quail, tiger, leopard, pheasant, open-bill stork, peafowl and crow), with scores from 2587 down to 2535 and E values of 0.0.]
13. [Screenshot: PDB entry summary showing the molecular description (mutations; Polymer 3: nonapeptide from rat NADH dehydrogenase, chains C, F; functional class: histocompatibility antigen/peptide) and source (Polymer 1: scientific name Mus musculus, expression system Mus musculus); the remaining text is illegible.] • Title: A description of the structure. • Primary citation: Reference published when this structure was submitted. • Molecular Description: A summary of the structure's chains (a single structure can be made of several polypeptide chains). • Source: The organism from which the protein originally comes, how it was amplified for crystallization, etc. • SCOP Classification: A manual classification of similar structures into hierarchical categories. Step 4: The Structure Explorer bar. At the top of each PDB entry page, you will also find the Structure Explorer bar, which you will use to find more information about a structure, as well as to download the structure for viewing in RasMol, an external program which allows you to manipulate a structure and make cosmetic changes to it. [Structure Explorer tabs: Structure Summary, Biology & Chemistry, Materials & Methods, Sequence Details, Geometry] Step 5: Viewing the structure's sequence. Before we go download the structure, we will look over some of the features of the St
14. Sequences in any supported format. What sort of data does ClustalW take? Many formats are supported by ClustalW, but we will use the format called FASTA (the name of another alignment program), a fairly standard and simple format to use. The FASTA format looks like this: [Screenshot: Notepad window showing the file clustalw tutorial sequences.txt, beginning with the FASTA record >SARS coronavirus_spike followed by its amino acid sequence; the sequence text is not reliably recoverable.]
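Before pasting a file like this into ClustalW, it can help to check that it really is well-formed FASTA. The sketch below reads FASTA text into (description, sequence) pairs; the function name `parse_fasta` and the toy records in the usage example are made up for illustration and are not the tutorial's sequences.

```python
def parse_fasta(text: str):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs.

    A record starts with a line beginning with '>'; all following lines up
    to the next '>' line are joined into one sequence string.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence line of the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    toy = ">toy_protein_1 made-up example\nMKVLAA\nGHW\n>toy_protein_2\nMKILSA\n"
    for description, sequence in parse_fasta(toy):
        print(description, len(sequence))
```

Checking the number of records and their lengths this way is a quick guard against the "junk before and after" problem when copying sequences by hand.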
15. [Screenshot: BLAST pairwise alignment of Query and Sbjct nucleotide sequences shown in 60-residue rows, with callouts:] are and where exactly they are located. • The query sequence is usually on top and the database match is usually on the bottom. • The numbers on each side of the sequence represent residue numbers, e.g. the first line of the alignment shows residues from 1 through 60. Part 3: Interpreting the BLAST results page. Scores: Scores in BLAST represent the extent of similarity between the query sequence and a database sequence. They are based on the percent identity/conservation observed when the sequences are optimally aligned against each other. Naturally, the higher the score, the more similar the sequences. E values: E values, also called Expect values, are the measures of t
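As a rough illustration of the "percent identity" part of a score, the sketch below counts identical columns in two already-aligned sequences (gaps written as "-"). It is an assumption-laden simplification: actual BLAST scores are computed from a substitution matrix and gap penalties, not from identity alone, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_query: str, aligned_subject: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the
    same residue. Gap columns count toward the alignment length but never
    as matches."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != gap
    )
    return 100.0 * matches / len(aligned_query)


if __name__ == "__main__":
    # Toy alignment: 4 identical columns out of 6.
    print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```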
16. alignments representing the family. • Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and also to take a set of sequences and realign them to the model. One HMM is an ls-mode (global) model; the other is an fs-mode (local) model. • A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences. • Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data to record how the family was constructed. What can I do with Pfam? With full alignments and hidden Markov models, one has a lot of raw information representing a family of proteins. Keeping with the MHC I family of proteins: some viruses, like HCMV, encode MHC I-like proteins in order to evade killing by Natural Killer cells. Using the data contained in the MHC I Pfam entry, one could then build programs to scan viral protein databases for novel candidates in viral host resistance. This is a first step in an experiment that will necessarily include a wet-lab component. 2.5 3-D Structure Databases. From the protein sequence, the ultimate goal would be to decipher the function based on the sequence alone. While sequence comparisons are somewhat useful in this manner, knowing the three-dimensional structure can get us a step closer to this goal. This is done in part by elucidating the interaction between macromol
17. domain and SignalPHMM predicts the presence of signal peptides. Part 2: Gathering the results. After InterProScan has looked through the database using the programs you selected, you will get a set of results as shown below. [Screenshot: InterProScan graphical results for a 139 aa sequence, showing hits for InterPro entry IPR001368 (TNFR/CD27/30/40/95 cysteine-rich region) from PF00020, SM00208, PS50050 and PS00652, hits for IPR006209 (EGF-like domain), and unintegrated hits, with buttons for Table View, XML Output, Original Sequences and Submit Another Job. Callouts: First block, which contains a set of hits for the same motif from different databases. First set of hits, which found a TNFR domain in the query. Button for table view.] This picture shows the default graphical view of the results. Each block represents a set of hits from several programs/databases for one documented protein domain
18. evolutionary changes. Proteins are all somehow evolutionarily related, and the information obtained from protein families and domains databases is crucial to understand the relationships between proteins, and to infer the function of newly discovered proteins and the biological importance of certain protein domains. 2.4.1 PROSITE. Prosite is both a database and a collection of tools. As a database, it serves to collect the amino acid sequence patterns for different peptide motifs. The collection of motifs is drawn from analysis of the amino acid sequences in the Swiss-Prot/TrEMBL database. The main tool of interest to the user is the peptide scan function of ScanProsite, which detects the presence of motifs from the database in any amino acid sequence of interest. Other tools available but not covered in this manual include the motif scan function of ScanProsite, tools which scan against other motif databases, and tools which allow the user to scan various databases in search of as yet unnoted motifs and create profiles for them. Prosite was written by L. Falquet, M. Pagni, P. Bucher, N. Hulo, C. J. Sigrist, K. Hoffmann and A. Bairoch, was produced by a collaboration between the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI), and is hosted on ExPASy (Expert Protein Analysis System), the proteomics server of SIB. It is available in Canada via the mirror site at http://ca.expasy.org/prosite [Figure: the PROSITE logo] What
19. family. In this example, the first hit is for TNFR cysteine-rich domains, which are said in the literature to be repeated four times in members of the TNFR superfamily of receptors, which we used here in our example. Boxes show the relative location of each conserved domain. We only see three repeated domains; the fourth does not appear, probably because this is a truncated version of the protein. If you are using Internet Explorer, you may hover over each rectangle to obtain numerical values for the start/end amino acids of the hit, as well as an E value indicating the goodness of the hit (lower is better). If you use a different browser, you may need to click the Table View button (see figure below) to see these details. The hits in Table View are sorted by the InterPro accession number, which has the form IPRXXXXXX, with X being a digit. [Screenshot: InterProScan Table View for a 162 aa sequence, with a callout marking the motif position. It lists hits for the InterPro entry TNFR/CD27/30/40/95 cysteine-rich region from PFAM (PF00020), SMART (SM00208), PROFILE (PS50050) and PROSITE (PS00652), each with an E value, start and end positions, and status T.]
20. from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. (See also Paralogs.) P PAM matrix: PAM (percent accepted mutation) and BLOSUM (blocks substitution matrix) are matrices that define scores for each of the 210 possible amino acid substitutions. The scores are based on empirical substitution frequencies observed in alignments of database sequences and in general reflect similar physicochemical properties (e.g. a substitution of leucine for isoleucine, two amino acids of similar hydrophobicity and size, will score higher than a substitution of leucine for glutamine). Paralog: Paralogs are genes related by duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one. Parameters: Parameters are user-selectable values, typically experimentally determined, that govern the boundaries of an algorithm or program. For instance, selection of the appropriate input parameters governs the success of a search algorithm. Some of the most common search parameters in bioinformatics tools include the stringency of an alignment search tool and the weights (penalties) provided for mismatches and gaps. Protein families: Sets of proteins that share a common evolutionary orig
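The figure of 210 in the PAM matrix entry follows from counting unordered pairs of the 20 standard amino acids, including each residue paired with itself (190 mixed pairs plus 20 identical pairs). The sketch below checks that count and looks up the leucine examples in a three-entry excerpt; the excerpt values are the BLOSUM62 scores as commonly published, and the names `BLOSUM62_EXCERPT` and `substitution_score` are hypothetical.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

# Every unordered pair, including a residue paired with itself.
ALL_PAIRS = list(combinations_with_replacement(AMINO_ACIDS, 2))

# Tiny excerpt of a symmetric substitution matrix (BLOSUM62 values);
# a full matrix would hold one score for each entry of ALL_PAIRS.
BLOSUM62_EXCERPT = {
    frozenset("L"): 4,     # leucine aligned with leucine
    frozenset("LI"): 2,    # leucine <-> isoleucine (similar size and hydrophobicity)
    frozenset("LQ"): -2,   # leucine <-> glutamine
}


def substitution_score(a: str, b: str) -> int:
    """Look up the score for aligning residues a and b (order-independent)."""
    return BLOSUM62_EXCERPT[frozenset((a.upper(), b.upper()))]


if __name__ == "__main__":
    print(len(ALL_PAIRS))
    print(substitution_score("L", "I"), substitution_score("L", "Q"))
```

Using a `frozenset` as the key makes the lookup symmetric, which mirrors the fact that these matrices score a pair the same way regardless of which sequence carries which residue.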
21. get our results. For longer queries, it is possible that the query takes up to 20-30 minutes, especially if you send it at peak periods (daytime for North American time zones). From the ClustalW form, it's possible to change the Results field at the top from the default, interactive, to e-mail, where a link to your results is sent to you when your query has been processed. Part 3: Interpreting the Multiple Sequence Alignment (MSA). After processing, you get your first set of results: the MSA. Along with other data, it is displayed in your browser window as follows. [Screenshot: ClustalW results page. Results of search: number of sequences 3; alignment score 63299; sequence format Pearson; sequence type aa; with links to the output file (.output), alignment file (.aln), guide tree file (.dnd) and your input file (.input), plus JalView and Submit Another Job buttons. Callouts: .output, what the program outputs as it runs; .aln, the multiple alignment file; .dnd, guide tree (NOT phylogenetic tree).] We're not showing the whole page here, but be aware of the output you get. The Output file shows what the program outputs as it runs. ClustalW first does a pairwise alignment between each pair of input sequences and then puts them together for the multiple alignment. This file contains important data about the identity between each pair of sequences. The Ali
22. identifiable features found in known proteins can be applied to unknown protein sequences. Part 1: Using InterProScan to search InterPro. Go to http://www.ebi.ac.uk/InterProScan (case sensitive). [Screenshot: InterProScan sequence search form ("This form allows you to query your sequence against InterPro..."), with fields for your e-mail, results, applications to run (BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, ScanRegExp, SuperFamily, SignalPHMM, TMHMM), translation table and minimum open reading frame size (DNA/RNA only), and a box to enter or paste a protein sequence in any format.] • Enter an e-mail address if you want the results sent to your inbox. InterPro integrates data from various protein family databases, the most notable of which are ProSite, a product of the Swiss Institute of Bioinformatics, and Pfam, originally developed by the Sanger Institute in the UK. It's OK to choose the default options. This is where you paste your sequence. You would typically use a protein sequence, but the system will take a nucleotide sequence or even multiple protein sequences. You may also use a file containing all sequences already. HMMPfam looks in Pfam, ScanRegExp looks in Prosite, TMHMM predicts transmembrane
23. lab book. [Screenshot: ClustalW submission form (options for matrix, gap open, end gaps, gap extension, gap distances, output format and order, and the phylogenetic tree section with tree type, correct dist. and ignore gaps), with a CLUSTAL W multiple sequence alignment of coronavirus spike proteins pasted into the sequence box.] The only modification you have to make is at the level of tree type in the Phylogenetic Tree section. This will tell ClustalW that we don't want the default MSA, but rather a phylogenetic tree, as the output. Phylip is one of the existing tree formats, which we'll show you briefly on the next page. Press run, and after the usual wait screen you will get the following results page. [Screenshot: ClustalW results page. Results of search: number of sequences 3; sequence format Clustal; sequence type aa; with links to the output file (.output), Phylip tree file (.ph) and your input file (.input).] Again, the output file is a semi-misnomer: it is what the program ClustalW outputs while it runs. Here nothing really useful comes out of it but the length of the sequences and the name of the input format. The .ph file is what really interests us. Every ClustalW
24. molecule describing its geometry and hence its molecular function. Consensus sequence: A single sequence delineated from an alignment of multiple constituent sequences that represents a best fit for all those sequences. A voting or other selection procedure is used to determine which residue (nucleotide or amino acid) is placed at a given position in the event that not all of the constituent sequences have the identical residue at that position. Database: Any file system by which data gets stored following a logical process. Deletion: A chromosomal alteration in which a portion of the chromosome or the underlying DNA is lost. Domain (protein): A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences and that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics. E FASTA format: A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is: >gi|532319|pir|TVFV2E|TVFV2E envelope protein E
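The "voting procedure" mentioned in the consensus sequence entry can be sketched as a simple per-column majority vote over aligned sequences. This is one possible selection rule among many, shown under stated assumptions: all sequences have already been aligned to the same length, and ties are broken alphabetically, which is an arbitrary choice. The function name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_sequences):
    """Build a consensus by taking the most frequent residue in each
    alignment column; ties go to the alphabetically first residue."""
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        # sorted() fixes the tie-break order; max() keeps the first maximum.
        consensus.append(max(sorted(counts), key=counts.get))
    return "".join(consensus)


if __name__ == "__main__":
    print(majority_consensus(["ACGT", "ACGA", "TCGA"]))
```

Real tools often use weighted votes or ambiguity codes at poorly conserved positions rather than a plain majority.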
25. search time and the ID of your request. To continue, click the Format button. This will open a window where the results of your query will be displayed once the BLAST program has processed it. [Screenshot: NCBI "formatting BLAST" page. It confirms that the request (query of 1333 letters) was submitted and put into the BLAST queue, shows the request ID and an estimated wait of 37 seconds, and lists the formatting options: Graphical Overview, Linkout, Sequence Retrieval, NCBI-gi, alignment in HTML format, masking character and colour, number of descriptions (100) and alignments (50).] Part 3: The BLAST results page. The wait may be quite long in the case of long query sequences or during peak hours. Once the search is over you will see the BLAST Results window. If you scroll down you will see a picture representation of your search, which will look something like this: [Figure: Color Key for Alignment Scores] • Each of
26. ssssssssssssssssesseeeeennrnnnnnr nennen nennen 16 Mat SP PADO hoa Sesuen tee cio adie tva o eio bett estate n bend undae qaaa eoa ouod to ao eau dao o ante feeit tofu oput 16 Whata Ido qt P TOR Sarat cds dun od oo ON SAPE ean bo oto bu t tese eoa c tout mb miel Sata VOU Dum diamenesetinduays edo A 16 De eS UCC I ala S E EE T STD 17 2 9 LPDB Phe Protein ata Bank eo a D bii boe bate ada b da ead ania na uet s AL 17 CONEA Ap M 18 SEACE gt Bb ane ee ee erg eo a ey 18 Ami example EVO DODITI ccu eL o RED ERR esa aceon as 19 Out MUO Alc denos ai Pepe eee ciate daa ttu dante Adae obersten M ooo t oie ta ol aoe da esata de 20 FOB Sie EPOC ei Els vei eecaftatute Duobus dro uata ol a acu cL MM Mari Meer ueDE 20 2 3 2 VIE WING iS CEHCEUTES with Ras MOL ehe te tor etisfiqi ar ep tuas temone uultu s RA UR MIU II 20 CCUStomiizhng SUEDCIUH GS s odeeietenree oO cue a ee oes Ree 21 Using select to change the display options for specific reSidueS cccccesssssseccceececceecccceeeeeeeeeeeseeesesessaaaeaeaaas zl 2 9 thier C Al AD AC Sesto foedus tenideoccmiodo tum vefte Edo maaan Radices ud mue edi Sene dedo esta eee tdv Rad natal uero dinis 22 PM ese Ue INOS sa ee es ie TD D D DL 22 Chapter M Tutorial Sresi R HR 23 Tutorial How to use BLAST to search for homologous sequences and using NCBI ORF finder 23 Tutorial How to use Cl
27. [Screenshot: RCSB PDB homepage, showing a feature article on blood clotting and news items dated 14 Feb to 14 Mar 2006.] THE PROTEIN DATA BANK HOMEPAGE (HTTP://WWW.PDB.ORG) 2.5.2 Viewing Structures with RasMol and derivatives. RasMol is considered the grandfather of many molecular visualization tools out there. Its first version was released by Roger Sayle at the University of Massachusett
28. the lines on the picture represents a match between a database sequence and a query sequence. • There can be numerous matches for one query sequence. Only the top few matches are shown in the picture representation. • The picture is colour coded; you can see the colour map on top of the picture. Red lines represent matches with the highest scores (> 200), green lines are for the lowest scores, and so on. • If you click on any line in the picture, you will be taken to a page that shows the alignment of the matching sequence with the query sequence. • If you scroll down further, you will see a list of all the matching sequences in the database. • On the left is the gi number; it is a unique identifier for a sequence within a database. • Clicking on the gi identifier will open a new page with a complete description of the sequence provided by GenBank. • On the right you can see the scores and E values. • Clicking on the score takes you to the alignment of the database match with the query. [Table: "Sequences producing significant alignments", listing each hit's gi identifier and description with its score (bits) and E value. The top hit is gi|56792951|gb|AY842936.1| with a score of 2642 and an E value of 0.0, followed by several influenza A virus hits scoring 2587.]
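The identifiers in the hit list follow a pipe-separated layout (gi, number, source database tag, accession). A minimal sketch of how such an identifier could be split into its parts is given below; the function name `parse_gi_identifier` and the dictionary keys are chosen for this example only.

```python
def parse_gi_identifier(identifier):
    """Split an identifier such as 'gi|56792951|gb|AY842936.1|' into fields.

    The field names used in the returned dictionary are illustrative.
    """
    fields = identifier.strip().strip("|").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("not a gi-style identifier: %r" % identifier)
    return {
        "gi": int(fields[1]),      # the gi number shown on the left of the hit list
        "database": fields[2],     # source database tag, e.g. 'gb' for GenBank
        "accession": fields[3],    # accession with its version suffix
    }


if __name__ == "__main__":
    print(parse_gi_identifier("gi|56792951|gb|AY842936.1|"))
```

Keeping the gi number as an integer and the accession as text makes it easy to sort or de-duplicate hits copied from a results page.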
29. to shift from the normal series of triplets. G Gaps (affine gaps): A gap is defined as any maximal, consecutive run of spaces in a single string of a given alignment. Gaps help create alignments that better conform to underlying biological models and more closely fit patterns that one expects to find in meaningful alignment. The idea is to take into account the number of continuous gaps, and not only the number of spaces, when calculating an alignment. Affine gaps contain a component for gap insertion and a component for gap extension, where the extension penalty is usually much lower than the insertion penalty. This mimics biological reality, as multiple gaps would imply multiple mutations, but a single mutation can lead to a long gap quite easily. Gap penalties: The penalty applied to a similarity score for the introduction of an insertion or deletion gap, the extension of a gap, or both. Gap penalties are usually subtracted from a cumulative score being determined for the comparison of two or more sequences via an optimization algorithm that attempts to maximize that score. Gene: Classically, a unit of inheritance. In practice, a gene is a segment of DNA on a chromosome that encodes a protein and all the regulatory sequences (promoter) required to control expression of that protein. Gene families: Subsets of genes containing homologous sequences which usually correlate with a common function. Heterodimer: Protein composed of 2 different c
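The difference between counting spaces and counting gaps is easiest to see with numbers. The sketch below scores the gaps in one row of an alignment with an affine penalty; the convention used (the first space of a gap costs `gap_open`, every further space costs `gap_extend`) and the values 10 and 1 are assumptions for illustration, since programs differ in how they define and set these parameters.

```python
import re


def gap_lengths(aligned_row):
    """Lengths of every maximal run of '-' in one row of an alignment."""
    return [len(match.group()) for match in re.finditer(r"-+", aligned_row)]


def affine_penalty(aligned_row, gap_open=10, gap_extend=1):
    """Total affine gap penalty for one aligned row.

    Assumed convention: opening a gap costs gap_open, and each additional
    space in the same gap costs gap_extend.
    """
    return sum(gap_open + gap_extend * (n - 1) for n in gap_lengths(aligned_row))


if __name__ == "__main__":
    # Both rows contain five spaces, but the penalties differ a lot.
    print(affine_penalty("AC-----GT"))    # one gap of length five
    print(affine_penalty("A-C-G-T-A-C"))  # five gaps of length one
```

With these example values, one gap of five spaces costs 10 + 4 × 1 = 14, while five separate single-space gaps cost 5 × 10 = 50, which matches the entry's point that one long gap is a cheaper explanation than many short ones.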
30. 15 million citations for biomedical articles back to the 1950's. These citations are from MEDLINE and additional life science journals. PubMed includes links to many sites providing full text articles and other related resources. NCBI'S PUBMED (HTTP://WWW.NCBI.NIH.GOV/PUBMED) EBI: The European Bioinformatics Institute. The European Bioinformatics Institute (EBI) is a non-profit academic organization that forms part of the European Molecular Biology Laboratory (EMBL). EBI website: http://www.ebi.ac.uk The EBI is located in Cambridgeshire, United Kingdom, and was established in 1992. It is the European equivalent of the NCBI. In 2004, EBI was funded primarily by the EMBL (45%) and the European Union (25%), but also by the National Institutes of Health (NIH) in the USA, accounting for about 10%. Many applications are available from EBI through a web interface. Here are some examples: Homology & Similarity: the BLAST or Fasta programs can be used to look for sequence similarity. Note: The BLAST provided by EBI is different from the one provided by NCBI; it is WU-BLAST, by Washington University in St. Louis, rather than NCBI BLAST. Protein Functional Analysis: InterProScan can be used to search for motifs in your protein sequence. Sequence Analysis: ClustalW, a sequence alignment tool. Structural Analysis: MSDfold or DALI can be used to query your protein structure a
31. An Introduction to Bioinformatics for Biological Sciences Students. Department of Microbiology and Immunology, McGill University. Version 2.5, for the BIOC 300 lab, March 2006. Contributors: The first edition of the Introduction to Bioinformatics for biological sciences students was written during the summer of 2004 at McGill University for the Bioinformatics Project (BIP), as part of the U2 undergraduate laboratory in Microbiology and Immunology (MIMM 386). What you are holding in your hands is the second edition of the manual, put together by a new group of students during the summer of 2005. From the first edition, it contains only the section on biological databases and the main institutes that develop and maintain them; other parts were included in an extended version of the manual instead. The biggest change from the first edition is that the manual now includes the exercise sheets and tutorials written during the course of the BIP's first year of existence, making this volume the comprehensive resource students need to understand the material covered in the BIP, but also to perform the exercises. This version has been adapted for the BIOC 300D (Laboratory in Biochemistry) course lab on bioinformatics. Main contributors to the abridged version: • Cédric Sam (cedric sam elf mcgill ca) • Oksana Kapoustina • Abrar Khan. Contributors to all sections includ
32. [Screenshot, continued: BLAST results header. The database description ends "+PDB sequences (but no EST, STS, environmental samples or phase 0, 1 or 2 HTGS sequences)", with 3,413,089 sequences. Below it are a pointer to the BLAST FAQs, the Taxonomy reports link, and the graphic "Distribution of 1105 Blast Hits on the Query Sequence" with the colour key for alignment scores: <40, 40-50, 50-80, 80-200, >=200.] • The Lineage Report gives a simplified view of the relationships between the organisms generating database hits to the query sequence, by showing how closely these organisms are related to a focus organism according to the taxonomy database. This focus organism is the organism giving the strongest BLAST hit, and this will often be the source organism of the query sequence. • In the Organism Report, the BLAST results are grouped into blocks by species. Within each species block, the records are sorted by BLAST score. The order of the species blocks themselves is based on the BLAST score of the best hit within the block. • The Taxonomy Report summarizes the relationships among all of the organisms found in the BLAST results. Using this report, it is easy to see how many records are found within broad taxonomic groups such as the mammalia or the archaea. Part 7: Using NCBI's ORF Finder. NCBI ORF finder website: http
33.
LRLRYCAPAGFALLKCNDADYDGFKTINCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWOKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQOKYNLRLROAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFOROWGDPETANLWFENCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVORTYVACHIRSVIIWLETISKK
IYAPPREGHLECTSTVTGMTVELNYIPKNRINVTLSPQIESIWAAELDRYKLVEITPIGFE
APTEVRRYTGGHEROKRVPPVXXXXXXXXXXXXXXXXXXXXXXVOSOHLLAGILOOOKNL
LAAVEAQOOMLKLTIWGVK
A FASTA file can also contain multiple sequences:
>VECTOR32 Synthetic vector sequence 32
ATGAGCGGCGGCCCCATGGGCGGCAGGCCCGGCGGCAGGGGCGCCCCCGCCGTGCAGCAG
AACATCCCCAGCACCCTGCTGCAGGACCACGAGAACCAGAGGCTGTTCGAGATGCTGGGC
>VECTOR33 Synthetic vector sequence 33
ACGAGCGGCGGTCCCATGGGCGCCAGGCCCGGCGGCAGGGGCGCTGCCGCCGTGCAGCAC
ATCATCCCCAGCACCCTGCAGCAGGACCACGAGTACCAGAGGCTGTTCGAGATGCTGGGC
>VECTOR34 Synthetic vector sequence 34
GTGAGCGGCGGCTACTTGGGCGGCAGGCCCGGCGGCAGGGGCGCCCACGCCGTGCAGCAG
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Invalid characters (digits, blanks) are automatically removed. Frameshift: A deletion, substitution or duplication of one or more bases that causes the reading frame of a structural gene
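The FASTA layout described above (a description line starting with >, followed by one or more lines of sequence, repeated for each sequence in the file) is simple enough to read with a short Python function. The sketch below is a minimal reader written for this manual, not part of any BLAST or ClustalW tool; the name `parse_fasta` is made up for the example, and it mirrors only one of the rules quoted above, the mapping of lower-case letters to upper-case.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # a new '>' closes the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line.upper())   # lower-case letters map to upper-case
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = """>VECTOR32 Synthetic vector sequence 32
ATGAGCGGCGGCCCCATGGGC
aacatccccagcaccctgctg
>VECTOR33 Synthetic vector sequence 33
ACGAGCGGCGGTCCCATGGGC
"""
    for description, sequence in parse_fasta(example):
        print(description, len(sequence))
```

Joining the sequence lines into one string per record is the usual first step before counting residues or pasting a single sequence into a BLAST form.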
34. NBANK ARE FORMATTED FOR THE SEQUENCE VIEW. A PULL-DOWN MENU ALLOWS YOU TO SELECT THE VIEW FITTING YOUR NEEDS, SUCH AS FASTA, ONE OF THE POPULAR FORMATS ACCEPTED BY SEQUENCE ANALYSIS PROGRAMS. dbEST and UniGene. Complementary DNA (cDNA) is single-stranded DNA synthesized from a mature mRNA template. Now, what are Expressed Sequence Tags (ESTs)? They are short sequences generated by sequencing the ends of these cDNA molecules. ESTs are important gene mapping and discovery tools because they can be used as primers to amplify genomic DNA spanning a region presumably bounded on one side by the EST. The EST database (dbEST) is one of the many divisions of GenBank, the NCBI nucleotide database. As of September 2004, there were some 5.7 million Homo sapiens ESTs in dbEST. By design, ESTs in dbEST may be redundant, as several different ESTs might be derived from mRNA expressed by the same gene. This is where UniGene comes into play. What does UniGene contain exactly? UniGene regroups ESTs, mRNAs, high-throughput cDNAs (HTC), etc. representing a unique gene into clusters. UniGene is an automated system and has so far reduced 4.6 million Homo sapiens sequences to some 107,014 gene clusters. Clusters are never stable: they can be merged together at any point, based on certain criteria, as new sequences are added to GenBank. Every cluster has its own webpage through UniGene's web interface. From that page, related information about the clu
35. [Screenshot: text editor showing the FASTA file of coronavirus spike protein sequences used in this tutorial, including the entry "Human coronavirus OC43 spike".] Each sequence is given as a block of text, with a description header on a single line starting with a "greater than" symbol (>). The first entry of the example given is called "SARS coronavirus spike" and the sequence goes like MFIFLL...VKLHYT. The second > symbol indicates the start of the
36. Part 5: Using TreeView to view tree files (.ph). The next step is to view .ph files in a program somewhat more flexible than the ClustalW webpage's Java applet. The program we use is called TreeView. While the Phylip suite contains a tree-viewing utility with more viewing options, TreeView is much easier to manipulate. TreeView has a Windows version that can be downloaded from this website: http://taxonomy.zoology.gla.ac.uk/rod/treeview.html [Screenshot: TreeView window showing the tree file clustalw-20041229-08575117.ph, with the leaves Porcine epidemic diarrhea viru, Human coronavirus NL63 spike, Human coronavirus 229E spike, Transmissible gastroenteritis, Avian infectious bronchitis vi, SARS coronavirus spike, Murine hepatitis virus spike, Human coronavirus OC43 spike and Bovine coronavirus spike.] Clicking one of the buttons on the top will allow you to change the view of the tree. Here, the same tree as before is now viewed as an unrooted tree, which is more appropriate for this example (different species of coronaviruses with no specified evolutionary ancestry). [Screenshot: TreeView window showing the same tree file as an unrooted tree, with a 0.1 scale bar.]
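A .ph file is plain text, so it can also be inspected without TreeView. The sketch below assumes the file uses the usual parenthesised Phylip (Newick) layout, in which each leaf is written as name:branch_length and groups are enclosed in brackets; the function name `leaf_names` and the toy tree in the example are made up for illustration.

```python
import re


def leaf_names(newick):
    """List the leaf (sequence) names of a parenthesised tree, in file order.

    A leaf name is taken to be any label that directly follows '(' or ','.
    Labels that follow ')' belong to internal nodes and are skipped.
    """
    pattern = r"[(,]\s*([^\s(),:;][^(),:;]*)"
    return [name.strip() for name in re.findall(pattern, newick)]


if __name__ == "__main__":
    # Toy tree with made-up branch lengths, laid out over several lines.
    tree = """(
    SARS_coronavirus_spike:0.30,
    Bovine_coronavirus_spike:0.05,
    (
    Human_coronavirus_229E_spike:0.20,
    Human_coronavirus_NL63_spike:0.22)
    :0.15);"""
    print(leaf_names(tree))
```

This is handy for checking that every sequence submitted to ClustalW made it into the tree before opening the file in a viewer.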
37. TrEMBL protein databases, the PROSITE protein families and domains database, and the SWISS-2DPAGE database of 2D gels, plus many other specialized databases. The SIB is also active in developing software tools like Melanie, for the analysis of 2-D gels; Swiss-PdbViewer, for the visualization of 3-D structures such as those found in the Protein Data Bank (or PDB) database; and SWISS-MODEL, a fully automated server which takes protein sequences and tries to model their 3-D structure from known 3-D structures found in the PDB. 1.3.1 How to access SIB's resources: ExPASy. EXPASY, THE SWISS PROTEOMICS SERVER (HTTP://CA.EXPASY.ORG, CANADIAN MIRROR) ExPASy (Expert Protein Analysis System, http://ca.expasy.org) is the SIB's proteomics web server. ExPASy is the website to use to access all of the aforementioned SIB databases and analytical tools, and Swiss Jokes (http www expasy org cgi bin sw jokes pl). 1.4 Bioinformatics in Canada. The website of the Canadian Bioinformatics Resource in Ottawa hosts well-known bioinformatics applications such as BLAST, ClustalW, and a web version of the popular phylogenetics program Phylip (http://cbr-rbc.nrc-cnrc.gc.ca). Chapter 2: Molecular Biology Databases. Databases are large collections of data arranged for ease of search and retrieval. Common databases such as GenBank, PDB or Swiss-Prot exist as extremely large files which can be downloaded from public sites for various data manip
38. abases of biological relevance, which takes more than a half of his already extensive links page (http://www.expasy.org/alinks.html). 2.7 References. • General references: 1. Introduction to Molecular Biology Databases, by Rolf Apweiler, EBI's Swiss-Prot coordinator. http://www.ebi.ac.uk/swissprot/Publications/mbd1.html 2. The NCBI Handbook. http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook • 3-D structure databases: 1. A reference article used when citing PDB: H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne. The Protein Data Bank. Nucleic Acids Research 28, pp. 235-242 (2000). 2. More publications from PDB are available on the PDB Info webpage (http://www.rcsb.org/pdb/info.html). It's information you can skim through during your spare time. 3. Examples of mmCIF, the file format used in PDB: http://ndbserver.rutgers.edu/mmcif/examples • Protein sequence databases: 1. UniProt User Manual. http://ca.expasy.org/sprot/userman.html 2. Protein Sequence Databases, by Apweiler, Bairoch and Wu. A short overview of the existing protein sequence databases and what differs between them. Curr Opin Chem Biol. Feb 2004; 8(1): 76-80. Chapter 3: Tutorials. Tutorial: How to use BLAST to search for homologous sequences and using NCBI ORF finder. By Oksana Kapoustina <oksana k gmail com> and Abrar Khan <abrar khan mail mcgill ca>; layout: Cedric Sam. Version 2.0, August 2005
39. acts. PubMed is NCBI's biomedical literature database, giving access to citations compiled in databases such as MEDLINE. To the average user, PubMed just equals MEDLINE, although a website describes the difference between both concepts (http://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html). What you need to know is that PubMed is a biomedical literature database giving access to the MEDLINE database, but also to certain non-medical articles featured in MEDLINE journals. What you read in a textbook today has almost always been published and debated through peer-reviewed journals. Reading review articles in prominent journals like Science or Nature is a good way to start familiarizing yourself with peer-reviewed journals. [Footnote 1: Source: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2004 Jan 1; 32 (Database issue): D35-40.] [Screenshot: Entrez PubMed search page. The help text reads: Enter one or more search terms, or click Preview/Index for advanced searching. Enter author names as smith jc. Initials are optional. Enter journal titles in full or as MEDLINE abbreviations. Use the Journals Database to find journal titles.] PubMed, a service of the National Library of Medicine, includes over
40. alue gi|56792951|gb|AY842936.1| Influenza A virus (A/tiger/Thail 2642 0.0 • From the name of the match, you can infer that the query sequence represents an Influenza A virus gene, or a part of it. This gene most likely belongs to the A/tiger/Thailand/CU-T3/2004 (H5N1) strain, and it encodes neuraminidase. • You can obtain all of this information by clicking on the score beside the gene name and examining the header of the alignment. You can get more information about the Influenza A gene from GenBank by clicking on the gi number beside it. Part 4: Using Blastp to search the protein databases. As stated previously, the BLAST search pages allow you to select from several different programs (blastn, blastp, blastx). The blastp program takes an amino acid input sequence and compares it with millions of protein sequences within the BLAST database. It then provides you with a list of the closest matches found. [Screenshot: NCBI protein-protein BLAST form, with the Set subsequence fields and the Choose database menu set to nr.] "Protein-protein" means you are in the right database. Depending on what you are looking for, you can modify the database to search within. The following is a list of some important databases used in blastp searches: nr (default): All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF. month: All new or revised GenBank CDS translati
41. [Screenshot: Pfam entry page, with a literature reference (Beck S, Barrell BG. Human cytomegalovirus encodes a glycoprotein homologous to MHC class-I antigens. Nature 1988; 331: 269-272), type definition "Domain" and alignment method of seed "Clustalw".] ALIGNMENTS, PHYLOGENETIC TREES, STRUCTURES AND OTHER RELEVANT INFORMATION CAN BE DOWNLOADED FROM A PFAM ENTRY PAGE. What's in Pfam? Pfam is divided in two sets of protein families: • Pfam-A families are based on curated multiple alignments. A certain number of proteins, ranging from around 10 to a few thousands, are chosen to form the seed group representing a protein family. An example of a protein family is the Class I Histocompatibility antigen, domains alpha 1 and 2 (code: MHC_I), which regroups MHC I-like proteins based on HMMs. • Pfam-B is based on an automated clustering of the proteins in UniProt not already in Pfam-A, from a database called ProDom. [Footnotes: 9. A hidden Markov model is essentially a statistical model which has found interesting applications in describing protein families as well as in computerized speech recognition. 10. http://www.sanger.ac.uk/Software/Pfam 11. http://pfam.wustl.edu] [Figure: diagram of a hidden Markov model with states x, transition probabilities a, output probabilities b and outputs y1, y2, y3.] HOW TO VISUALIZE A MARKOV MODEL. X: STATES OF THE MARKOV MODEL; A: TRANSITION PROBABILITIES; B: OUTPUT PROBABILITIES; Y: OBSERVABLE OUTPUTS. PICTURE: WIKIPEDIA. The data in a Pfam entry will include the following: • A seed alignment, which is a hand-edited multiple
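To make the figure's four ingredients concrete (states x, transition probabilities a, output probabilities b, observable outputs y), here is a small sketch of the standard forward calculation, which gives the probability that a model produces a given series of outputs. The two-state model and all of its numbers are invented for this example, and the starting probabilities are an extra ingredient that the figure caption does not mention. Pfam's actual profile HMMs are far larger and are built by dedicated software.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of the observed outputs, summed over every path of states."""
    if not observations:
        raise ValueError("need at least one observation")
    # alpha[s]: probability of the outputs seen so far, with the model now in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: every number below is made up for illustration.
STATES = ("x1", "x2")
START = {"x1": 0.6, "x2": 0.4}                       # assumed starting probabilities
A = {"x1": {"x1": 0.7, "x2": 0.3},                   # transition probabilities (a)
     "x2": {"x1": 0.4, "x2": 0.6}}
B = {"x1": {"y1": 0.9, "y2": 0.1},                   # output probabilities (b)
     "x2": {"y1": 0.2, "y2": 0.8}}

if __name__ == "__main__":
    print(forward_probability(["y1", "y2", "y1"], STATES, START, A, B))
```

For a single output y1 the result can be checked by hand: 0.6 × 0.9 + 0.4 × 0.2 = 0.62. A protein family model works on the same principle, with amino acids as the outputs and alignment positions as the states.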
42. [Screenshot: RCSB PDB homepage, http://www.pdb.org/pdb/Welcome.do] A member of the wwPDB. An Information Portal to Biological Macromolecular Structures. As of Tuesday Mar 14, 2006 there are 35579 structures. Search by PDB ID or keyword, or by author; Advanced Search. Welcome to the RCSB PDB. The RCSB PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease. The RCSB is a member of the wwPDB, whose mission is to ensure that the PDB archive remains an international resource with uniform data. This site offers tools for browsing, searching, and reporting that utilize the data resulting from ongoing efforts to create a more consistent and comprehensive archive. Information about compatible browsers can be found here. A narrated tutorial illustrates how to search, navigate, browse, generate reports and visualize structures using this new site. This requires the Macromedia Flash player. Comments: info@rcsb.org. Blood performs many essential jobs in your body: it transports oxygen and nutrients, it protects your cells from infection, and it carries hormones and other messages from place to place in your body. But since blood is a liquid that is pumped under pressure, we must protect ourselves from leaks. Fortunately, the blood has a built-in repair method that quickly stops up breaks in the blood circulatory system
43. anization and analysis of large amounts of biological data using networks of computers and databases (usually with reference to the genome project and DNA sequence information). 2. Computational biology sometimes is used interchangeably with the term. C Cluster: The grouping of similar objects in a multidimensional space. Clustering is used for constructing new features which are abstractions of the existing features of those objects. The quality of the clustering depends crucially on the distance metric in the space. In bioinformatics, clustering is performed on sequences, high-throughput expression and other experimental data. Clusters of partial or complete gene sequences can be used to identify the complete (contiguous) sequence and to better identify its function. Clustering expression data enables the researcher to discern patterns of co-regulation in groups of genes. Complexity (of gene sequence): The term "low-complexity sequence" may be thought of as synonymous with regions of locally biased amino acid composition. In these regions, the sequence composition deviates from the random model that underlies the calculation of the statistical significance (P-value) of an alignment. Such alignments among low-complexity sequences are statistically but not biologically significant, i.e., one cannot infer homology (common ancestry) or functional similarity. Conformation: The precise three-dimensional arrangement of atoms and bonds in a
44. as soon as they happen. You see these repairs in action whenever you cut yourself: the blood thickens and forms a gooey clot, which then dries into a scab that seals and protects the cut until it can heal. [News items on the homepage: 14 Mar 2006: RCSB PDB at Science Expo for NJ Students. On March 21, the RCSB PDB will take part in a Science Expo held at Princeton University for middle school students from New Jersey. 07 Mar 2006: RCSB PDB Focus: Frequently Asked Questions. 28 Feb 2006: RCSB PDB Exhibit News. 21 Feb 2006: Virtual Reality Environment Highlights PDB Structures. 14 Feb 2006: PDB Statistics: Structures Solved by Multiple Methods.] Step 2: Searching by PDB ID. Now that you are on the PDB website, use the main search form at the top of the page to find the structure you want to study. If you know the PDB ID (a unique four-character ID for all structures found in PDB), you may input it in the main search form. Otherwise, you may search the PDB database by keywords and browse the results for a suitable structure. [Screenshot: PDB search form with the PDB ID or keyword, Author, and Advanced Search options.] Step 3: The Structure Summary Page. Searching for 1MHC will lead you to the structure's webpage. Various information is given on the Structure Summary Page of each structure in PDB. [Screenshot: RCSB PDB Structure Summary page.]
45. at are represented by at least one nucleotide or protein sequence. LocusLink provides a single query interface to curated sequence and descriptive information about genetic loci. • BLAST: family of sequence-similarity search programs. • Resource for gene-level sequences: UniGene is a system which partitions GenBank sequences into a non-redundant set of gene-oriented clusters. There are many other specialized databases, for single nucleotide polymorphisms (dbSNP) and for information on the Major Histocompatibility Complex (dbMHC). • Resources for genome-scale analysis: Entrez Genomes provides access to genomic data and includes genomes spanning from microbes to multicellular organisms. • Eukaryotic Genomic Resources: Map Viewer displays genome assemblies using sets of aligned chromosomal maps. • Structural databases: The NCBI MMDB is built by processing entries from the Protein Data Bank. 1.1.2 PubMed: The ultimate biomedical literature database. MEDLINE is the NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences. MEDLINE contains bibliographic citations and author abstracts from more than 4,800 biomedical journals published in the United States and 70 other countries. The database contains over 12 million citations dating back to the mid-1960s. Coverage is worldwide, but most records are from English-language sources or have English abstr
46. bases, and one of its strengths is that a comprehensive set of annotations has been created through the merging of information from each member. [Screenshot: Pfam entry page ("Protein families database of alignments and HMMs"), showing the QuickGO function annotation "receptor activity (GO:0004872)", links to PROSITE documentation, the seed (37) and full (621) alignments with their format and zoom options, the domain architectures for 621 proteins, the species distribution views, and the phylogenetic tree download and ATV applet options. The trees were generated using Quicktree.] GO Terms: This link is a browser of the Gene Ontology at the EBI. It is a site that describes gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. Pfam Alignment: This link leads to the output alignment file that Pfam uses to determine the domains within a query sequence. Species Distribution Tree: This link leads to a level-based visualization of the phylogenetic
47. chitecture between different taxonomical groups. [Screenshot: domain relatives page listing similar domain architectures, with the number of sequences for each, for conserved domains such as dUTPase in Retroviridae Gag-Pol polyproteins.] Clicking on the conserved domains in the graphical view or the tabular view will direct you to the Pfam website and provide you with more information about the structure and nature of that domain. The domain relatives page above is useful in analyzing homology of the domains between evolutionary species. It also shows domains in close proximity within other species, which may be useful in defining its function. Part 6: Analyzing Taxonomy reports. An interesting feature of the Blastn output is the Taxonomy reports, which can provide you with valuable information on the taxonomic relationships among the records returned from a BLAST search. The Taxonomy reports link can be found just above the Blast Hits on the Results page. Clicking on the Taxonomy Reports link on the BLAST results page will generate taxonomy reports in three formats: a Lineage Report, an Organism Report and a Taxonomy Report. Database: All GenBank+EM
48. dia swissprot [Figure: the Swiss-Prot logo] Each entry of Swiss-Prot (http://ca.expasy.org/sprot) is carefully inspected by specialists from around the world to ensure the high quality of the information it contains. This is a long process, and more and more sequences are added to the database every day. That's where TrEMBL comes to the rescue. TrEMBL: Translation of EMBL nucleotide sequence database. TrEMBL is a computer-annotated supplement to Swiss-Prot, introduced in 1996 as a solution to preserve the high editorial standards of Swiss-Prot while making new sequences available to the public. TrEMBL contains translations of all coding regions in the DDBJ/EMBL/GenBank nucleotide databases, and protein sequences extracted from the literature or submitted to UniProt, which are not yet integrated into Swiss-Prot. TrEMBL allows these sequences to be made publicly available quickly without diluting the high-quality annotation found in Swiss-Prot. Searching Swiss-Prot/TrEMBL: In an effort to create a single source of protein information, the UniProt consortium was established. Searching and using Swiss-Prot/TrEMBL is similar to searching and using the UniProt databases, so this section is covered below. 2.3.3 PIR-PSD: The Protein Sequence Database. The Protein Information Resource (PIR), located at Georgetown University Medical Center, is an integrated public bioinformatics resource that supports genomic and proteomic research and
49. does the Prosite database contain? The Prosite database consists of only two files: the data file (PROSITE.DAT) and the documentation file (PROSITE.DOC). Both are text files, and both contain exactly one entry for each motif that has been identified by Prosite. The format of each data file entry depends on whether it represents a pattern-described or a profile-described motif. Both give a data file identification name, a data file accession number and a pointer to the motif's documentation entry, but a pattern entry gives a one-line pattern description while a profile entry gives a multiple-line weight matrix. A pattern description spells out the amino acids expected at each position of the motif, whereas a profile weight matrix gives score values for all the different amino acids at each site. The entries of the documentation file all conform to a single format. Each contains the documentation entry accession number, the corresponding data entry accession number and identification name, and any important documentation regarding the entry in free-format text (e.g. biochemical, taxonomic, anatomical and source information). Search PROSITE using ScanProsite: The aim of ScanProsite is to identify the occurrences of any motifs from the Prosite database in the sequence indicated by the user. To do this, the tool scans through the entire amino acid sequence once with each motif entry in the PROSITE.DAT file. The scanning process consists of progress
50. e most similar pairs and progressing to the least similar until there are no longer any sequence pairs remaining to be aligned. J Junk DNA: Term used to describe the excess DNA that is present in the genome beyond that required to encode proteins. A misleading term, since these regions are likely to be involved in gene regulation and other as yet unidentified functions. K L Library: A large collection of compounds, peptides, cDNAs or genes which may be screened in order to isolate cognate molecules. M Map unit: A measure of genetic distance between two linked genes that corresponds to a recombination frequency of 1%. Motif: A conserved element of a protein sequence alignment that usually correlates with a particular function. Motifs are generated from a local multiple protein sequence alignment corresponding to a region whose function or structure is known. It is sufficient that it is conserved, and is hence likely to be predictive of any subsequent occurrence of such a structural/functional region in any other novel protein sequence. Multigene family: A set of genes derived by duplication of an ancestral gene, followed by independent mutational events resulting in a series of independent genes either clustered together on a chromosome or dispersed throughout the genome. Multiple sequence alignment: A Multiple Alignment of k sequences is a rectangular array consisting of characters taken from th
51. e EBI. The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures. The mission of the EBI is to ensure that the growing body of information from molecular biology and genome research is placed in the public domain and is accessible freely to all facets of the scientific community in ways that promote scientific progress. [Figure: the European Bioinformatics Institute website, http://www.ebi.ac.uk] 1.3 SIB: The Swiss Institute of Bioinformatics. The SIB is an academic not-for-profit foundation established on March 30, 1998, whose mission is to promote research, the development of databanks and computer technologies, teaching and service activities in the field of bioinformatics in Switzerland, with international collaborations. SIB website: http://www.isb-sib.ch. The SIB maintains a number of important databases such as the Swiss-Prot
52. e alphabet A that satisfies the following conditions: there are exactly k rows; ignoring the gap character, row number i is exactly the sequence s_i; and each column contains at least one character different from the gap character "-". In practice, multiple sequence alignments include a cost/weight function that defines the penalty for the insertion of gaps (the "-" character) and weights identities and conservative substitutions accordingly. Multiple alignment algorithms attempt to create the optimal alignment, defined as the one with the lowest cost/weight score. Mutation: An inheritable alteration to the genome that includes genetic (point or single base) changes, or larger scale alterations such as chromosomal deletions or rearrangements. N Naked DNA: Pure, isolated DNA devoid of any proteins that may bind to it. O Open reading frame (ORF): Any stretch of DNA that potentially encodes a protein. Open reading frames start with a start codon and end with a termination codon. No termination codons may be present internally. The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene. Operator: A segment of DNA that interacts with the products of regulatory genes and facilitates the transcription of one or more structural genes. Operon: A unit of transcription consisting of one or more structural genes, an operator and a promoter. Ortholog: Orthologs are genes in different species that evolved
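The open reading frame entry above (a start codon, then codons read in frame with no internal termination codon, then a termination codon) can be illustrated with a short scan over a DNA string. This is only a minimal sketch and not a gene finder: the function name `find_orfs`, the use of ATG as the only start codon, the `min_codons` cut-off and the restriction to the forward strand are all assumptions made for the example.

```python
# The three termination codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) positions of ORFs on the forward strand.

    Positions are 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                  # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                   # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    sequence = "CCATGAAATAGGG"
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```

A stretch that starts with ATG but never reaches a termination codon is not reported, which matches the definition: the ORF must end with a termination codon.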
53. 2.3.2 Swiss-Prot and TrEMBL; TrEMBL: Translation of EMBL nucleotide sequence database; Searching Swiss-Prot/TrEMBL; 2.3.3 PIR-PSD: The Protein Sequence Database; [entry illegible]; Searching UniProt; [entry illegible]; [entry illegible]; 2.4 Protein Families and Domains Databases; [entry illegible]; What does the Prosite database contain; Search PROSITE using ScanProsite; Operation and interpretation of ScanProsite; 2.4.2 Pfam: Protein families database of alignments and HMMs
54. ecules and by comparing the spatial arrangement of the polypeptide chain. As well, 3-D structures have been of prime importance in the rational development of new drugs, as opposed to the good old trial-and-error approach. PDB is the most comprehensive structural database, and we will now look at it in more depth. 2.5.1 PDB: The Protein Data Bank. The Protein Data Bank (PDB, http://www.rcsb.org/pdb) is the single worldwide repository of, despite its name, protein, nucleic acid and other biologically relevant 3-D structures. These data, typically obtained by X-ray crystallography or NMR spectroscopy, are submitted by biologists and biochemists from around the world, are released into the public domain and can be accessed for free. The database is the central repository for biological structural data (Wikipedia). PDB is hosted and managed by three research centers based in the United States that are part of the Research Collaboratory for Structural Bioinformatics (RCSB) consortium. For reference, they are Rutgers University, the San Diego Supercomputer Center (SDSC) and the Center for Advanced Research in Biotechnology near Washington, DC. [Figure: Yearly Growth of Total Structures, a bar chart of the number of structures per year, with the axis running from 2,500 to 35,000.]
55. ence into the search window. The sequence can be in plain text format; FASTA (.fa) format will also be accepted. [Screenshot: the NCBI nucleotide-nucleotide BLAST page in a browser, with an empty search window, the "Set subsequence" fields and the "Choose database" menu.] Here we inserted a query sequence into the search window. In this case we are using blastn, and the sequence inserted is a nucleotide sequence. [Screenshot: the same form with the nucleotide query sequence pasted in and the "Choose database" list open, followed by the Options section ("Limit by entrez query", "Choose filter").]
56. erious effect (the aa sequence is changed, modifying the structure and function of the protein, and you die), and a few mutations will slightly change the structure of a protein, in the long run conferring a selective advantage to the organism carrying it. Within a protein, a structural domain ("domain") is an element of overall structure that is self-stabilizing and often folds independently of the rest of the protein chain. Many domains are not unique to the protein products of one gene or one gene family but instead appear in a variety of proteins. Domains often are named and singled out because they figure prominently in the biological function of the protein they belong to; for example, the calcium-binding domain of calmodulin. Because they are self-stabilizing, domains can be swapped by genetic engineering between one protein and another to make chimeras. A domain may be composed of one, more than one, or no structural motifs. (Wikipedia) But such change is slow. If evolution depended only on point mutations, we wouldn't be here today. Instead, we must see proteins as collections of domains. Protein domains themselves are made of sequences of the simplest secondary structures: α-helices, β-sheets and turns (segments between helices and sheets). Therefore, in an oversimplified conclusion, the swapping, deletion or duplication of these building blocks (entire genes, domains, secondary structures) are at the origin of most significant
57. es note of it, updates the sequence and keeps the old version. • Initially, the UniProt Knowledgebase (UniProt) consists of the merging of the Swiss-Prot, TrEMBL and PSD entries, but it will later be derived from the UniParc database. UniProt will retain the organization of the Swiss-Prot/TrEMBL duo (Swiss-Prot as a manually curated database and TrEMBL as a computer-annotated database) and integrate data from PIR-PSD that is not already in Swiss-Prot/TrEMBL. • The UniProt Non-redundant Reference (UniRef) is, as its name implies, a collection of non-redundant protein sequences. UniRef sequences come from the UniProt Knowledgebase, and the non-redundancy is generated automatically: sequences are compared with each other and, if there is sequence homology, they are merged together and added as a single entry in UniRef. Searching UniProt: The search interface is slightly different among the three UniProt associates, even if essentially the same tools are offered. Because all computers connecting from North America are redirected to the PIR's UniProt site, we will only consider that version of the search interface. PIR's UniProt website: http://www.pir.uniprot.org. The two main search tools are: 1. Text Search, which allows you to search in one field in particular, or in all of them at once, for plain query strings. One of UniProt's layers (databases), as discussed in the previous section, must be selected. Query boxes can be added as you go by clicking
58. gical Macromolecular Structures. As of Tuesday Mar 14, 2006 there are 35579 structures. [Screenshot: the PDB Structure Summary page for entry 1MHC. Title: Model of MHC class I H2-M3 with nonapeptide from rat ND1 refined at 2.3 Angstroms resolution. Primary citation: Wang C.R., Castano A.R., Peterson P.A., Slaughter C., Lindahl K.F., Deisenhofer J. Nonclassical binding of formylated peptide in crystal structure of the MHC class Ib molecule H2-M3. Cell v82, pp. 655-664, 1995. Deposition: 1995-08-23. Release: 1996-01-29. Experimental type: X-ray diffraction. Space group: P1. Unit cell angles: alpha 102.71, beta 96.28, gamma 110.19. Polymer 1: MHC class I antigen H2-M3, chains A and D.]
59. gnment file is the MSA itself. Symbols below each column of the MSA roughly indicate the level of identity for the aligned nucleotide or amino acid. No symbols are seen in the image because there is not enough identity in this section of the alignment. [Screenshot: CLUSTAL W (1.82) multiple sequence alignment of spike proteins from Human coronavirus NL63, Human coronavirus 229E, Porcine epidemic diarrhea virus, Transmissible gastroenteritis virus, Human coronavirus OC43, Bovine coronavirus, Murine hepatitis virus, SARS coronavirus and Avian infectious bronchitis virus.]
60. hains or subunits. Homeobox: A highly conserved region in a homeotic gene, composed of 180 bases (60 amino acids), that specifies a protein domain (the homeodomain) that serves as a master genetic regulatory element in cell differentiation during development in species as diverse as worms, fruit flies and humans. Homeodomain: A 60-amino-acid protein domain coded for by the homeobox region of a homeotic gene. Homology: (strict) Two or more biological species, systems or molecules that share a common evolutionary ancestor. (general) Two or more gene or protein sequences that share a significant degree of similarity, typically measured by the amount of identity (in the case of DNA) or conservative replacements (in the case of protein) that they register along their lengths. Sequence homology searches are typically performed with a query DNA or protein sequence to identify known genes or gene products that share significant similarity and hence might inform on the ancestry, heritage and possible function of the query gene. I in silico biology: (Lit. computer-mediated.) The use of computers to simulate, process or analyse a biological experiment. Iteration: A series of steps in an algorithm whereby the processing of data is performed repetitively until the result exceeds a particular threshold. Iteration is often used in multiple sequence alignments, whereby each set of pairwise alignments is compared with every other, starting with th
61. he background noise in an alignment. They represent the number of matches with equally high scores that can be expected to occur in the searched database purely by chance. For example, an E value of 1 means that about one sequence in the database is expected to reach such an alignment score (and so be reported as a match for the query) even though it is not really related to the query sequence; it has a high score purely by chance. Therefore, your goal is to find E values that are closest to 0. An E value of 0 means that a match this good is not expected to occur by chance at all, and therefore it is significant. If you are wondering why some unrelated sequences can have high alignment scores with the query sequence, refer to the BLAST section of the manual that you were given, but largely this question is beyond the scope of our course. Hint: For the purposes of this exercise we will concentrate on the scores rather than the E values, i.e. as long as your result has a very high score, it does not have to have a perfect E value. Choosing the best matching sequence: • Once the alignment is complete and you have examined the results, you can choose the sequence that matches the query sequence the best. Remember, when choosing the optimal matching sequence you want it to have the lowest E value and the highest alignment score. • In the previous example, the first sequence in the list would be a perfect match for the query, since it has the highest score and an E value of 0. Score E v
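The link between score and E value described above is usually written E = m x n x 2^(-S'), where S' is the bit score and m and n are the lengths of the query and of the database. The sketch below simply evaluates that relation. It is an approximation only: BLAST itself works with corrected "effective" lengths, and the function name `expect_value` and the example lengths are assumptions made for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value for a bit score: E = m * n * 2**(-S').

    query_length (m) and database_length (n) are in residues. BLAST uses
    corrected "effective" lengths, so reported E values will differ somewhat.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search space: a 1333-base query against 10**9 bases.
    for score in (20, 40, 80, 2642):
        print(score, expect_value(score, 1333, 10**9))
```

Each extra bit of score halves the expected number of chance matches, which is why a very high score and a very low E value go together. With a bit score in the thousands the estimate becomes too small to represent and is shown as 0.0.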
62. hromosome was [Screenshot: the BLAST Options section, with "or select from: All organisms" and the "Human repeats", "Mask for lookup table only" and "Mask lower case" filters.] Choosing a database: It is now time to choose the database that BLAST will search for matches to the query sequence. As already mentioned, there are a number of databases that BLAST can use. All of them contain sequences that have already been identified by researchers. If you click on "Choose database" and scroll down, you will notice that some databases are specific to an organism and some can only be used for nucleotides or for proteins. • The database fit for our purposes is the nr database. • For the complete list and description of BLAST databases, you can refer to the manual. • After entering the sequence and selecting a database, click on the BLAST button. [Screenshot: the NCBI nucleotide-nucleotide BLAST form with the query sequence entered, the nr database chosen and the BLAST button highlighted ("Click BLAST").] Now you should see the following screen. It will let you know the estimated
63. ick menu allows you to perform many manipulations on the appearance of the molecule. The context menu can also be accessed by single-clicking the Jmol logo. Step 2: manipulating the image. By default, all of the molecule (or sometimes just the protein chains) is selected. The application keeps in memory which molecules you have selected and performs the rendering commands on these only. By using the context menu, you may select a whole group of molecules under the Select sub-menu. However, for greater flexibility (for selecting a range of amino acids, for instance) you must use the Jmol Console instead. [Screenshot: the Jmol context menu, with the entries Select, Render, Labels, Color, Zoom, Spin, Animate, Measurements, Crystal, Options, Console and About Jmol.] Step 3: using the console. In our example, 1MHC, we know from the Sequence Details of the structure that it is made up of four polypeptide chains. By using the Console, we can select a particular chain and then perform various aesthetic changes on the selection. The syntax of the scripting commands is fully described on the Jmol documentation website: http://jmol.sourceforge.net/docs. The scripting documentation also shows interactive examples of some commands. For instance, if we wanted to select all of chain A and show the electromagnetic contours of
64. ility of the algorithm to discriminate them from non-pattern sequences. X Y
65. in reflected by their relatedness in function, which is usually reflected by similarities in sequence, or in primary, secondary or tertiary structure. Subsets of proteins with related structure and function. Q Query sequence: A DNA, RNA or protein sequence used to search a sequence database in order to identify close or remote family members (homologs) of known function, or sequences with similar active sites or regions (analogs), from which the function of the query may be deduced. R Reading frame: A sequence of codons beginning with an initiation codon and ending with a termination codon, typically of at least 150 bases (50 amino acids), coding for a polypeptide or protein chain (see ORF and URF). Repeats (repeat sequences): Repeat sequences and approximate repeats occur throughout the DNA of higher organisms (mammals). For example, the Alu sequences, of length about 300 characters, appear hundreds of thousands of times in human DNA with about 87% homology to a consensus Alu string. Some short substrings, such as TATA boxes, poly-A and TG, also appear more often than by chance. Repeat sequences may also occur within genes, as mutations or alterations to those genes. Repetitive sequences, especially mobile elements, have many applications in genetic research. DNA transposons and retroposons are routinely used for insertional mutagenesis, gene mapping, gene tagging and gene transfer in several model systems. Repetitive elements
66. ing those not in the abridged version: • Belinda Befort (PROSITE, Phylip) • Scott Bunnell (Editing) • Mansoureh Hakimi (Exercises review) • Oksana Kapoustina (BLAST, ClustalW) • Abrar Khan (Editing) • Francois Pepin (Editing) • Cédric Sam (Institutes, Databases, Editing, Exercises, original layout) • Sean Wiltshire (Introduction, editing). Faculty members who contributed to this manual: • Dr Silvia Vidal (silvia.vidal@mcgill.ca) • Dr Nicholas Acheson • Dr Malcolm Baines. Copyright 2004-2005 Department of Microbiology and Immunology of McGill University. All rights reserved. Table of Contents: Chapter 1: Bioinformatics Institutes; 1.1 NCBI: The National Center for Biotechnology Information (USA); 1.1.1 Database resources at the NCBI; 1.1.2 PubMed: The ultimate biomedical literature database; 1.2 EBI: The European Bioinformatics Institute; 1.3 SIB: The Swiss Institute of Bioinformatics; [entry illegible]; [entry illegible]
67. ively aligning the pattern or profile with different positions in the sequence. The first alignment matches the first position of the pattern or profile to the first position of the sequence and compares all the now-aligned sites. If the pattern finds a match or the profile score is high enough (both situations are called hits), the positions are marked as being the pattern's or profile's motif. The pattern's or profile's alignment head then moves forward one site in the sequence to begin the comparison all over again. Thus, identified instances of a specific motif may overlap. As this scanning process is done with all the different patterns and profiles, different identified motifs may also overlap. Because too many or too extensive overlaps are likely meaningless, once all possible motifs have been identified, Prosite implements an algorithm to select among them. Operation and interpretation of ScanProsite: The Quick Scan tool on the main page of Prosite performs exactly the same function as ScanProsite, with all but one default option set. The input can be a sequence in raw, FASTA or Swiss-Prot format, an accession number for a protein sequence in the Swiss-Prot/TrEMBL database, or an ID for a protein sequence in the Protein Data Bank (PDB). There is one option whose setting must be considered by the user: "Exclude patterns and profiles with a high probability of occurrence". 2.4.2 Pfam: Protein families database of alignments and HMMs. Pfam is a large collection
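The position-by-position scan described above can be imitated for pattern entries (not profiles) with a regular expression. The sketch below translates a PROSITE-style pattern into a Python regex and reports every hit, including overlapping ones. It is a hedged illustration rather than ScanProsite's actual code: the function names `prosite_to_regex` and `scan` and the test sequence are invented for the example, and only the basic pattern syntax is handled (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > for the sequence ends).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern, e.g. 'N-{P}-[ST]-{P}', to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x: any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P}: not P
        element = element.replace("(", "{").replace(")", "}")    # (2,4): repeat
        element = element.replace("<", "^").replace(">", "$")    # sequence ends
        parts.append(element)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (position, matched text) for every hit, 1-based positions.

    The lookahead lets the search restart one residue after each hit,
    so overlapping occurrences of the motif are all reported.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-{P}-[ST]-{P} is the N-glycosylation site pattern.
    print(scan("MNGSNNTSA", "N-{P}-[ST]-{P}"))
```

In the example sequence the hits at positions 5 and 6 overlap, which is the situation the text describes when the alignment head moves forward one site at a time.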
68. k Thailand 2535 0.0 [Screenshot: the list of BLAST hits, all influenza A virus sequences from chicken and duck isolates (Thailand, Viet Nam), with scores between 2508 and 2535 and E values of 0.0.] If you keep moving down the page, you will notice that most of it is taken up by the alignments, which look something like the next picture below. [Screenshot: the first alignment, Influenza A virus (A/tiger/Thailand/CU-T3/2004), partial cds, length 1333, with the annotations that follow.] Score = 2642 bits (1333), Expect = 0.0: this line lets you know the score and the E value of this specific alignment. Identities = 1333/1333 (100%), Strand = Plus/Plus: this line shows the number of residues in the query sequence and in the alignment that are identical. Alignments represent two sequences, the query sequence (Query) and the matching sequence (Sbjct), lined up against each other. This will help you determine how many mutations there
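The "Identities" figure in a BLAST alignment is a count of aligned positions where the query and the matching sequence carry the same residue. A minimal sketch of that count is given below, assuming the two rows of the alignment are available as equal-length strings with "-" for gaps. The function name `identities` and the short example rows are made up for illustration.

```python
def identities(query_row, subject_row):
    """Count identical positions between two rows of a pairwise alignment.

    Both rows must have the same length; "-" marks a gap and never counts
    as a match. Returns (matches, alignment_length, percent_identity).
    """
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    matches = sum(
        1 for q, s in zip(query_row, subject_row) if q == s and q != "-"
    )
    length = len(query_row)
    return matches, length, 100.0 * matches / length


if __name__ == "__main__":
    matches, length, percent = identities("ATGAATCC", "ATGACTCC")
    print("Identities = %d/%d (%.0f%%)" % (matches, length, percent))
```

The positions that do not match are the candidate mutations between the two sequences; a result such as 1333/1333 means there are none.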
69. n be sacrificed in exchange for greater sensitivity, with an emphasis on detecting lower-scoring matches. Similarity/homology search: Given a newly sequenced gene, there are two main approaches to the prediction of structure and function from the amino acid sequence. Homology methods are the most powerful and are based on the detection of significant extended sequence similarity to a protein of known structure, or of a sequence pattern characteristic of a protein family. Statistical methods are less successful but more general, and are based on the derivation of structural preference values for single residues, pairs of residues, short oligopeptides or short sequence patterns. The transfer of structure/function information to a potentially homologous protein is straightforward when the sequence similarity is high and extended in length, but the assessment of the structural significance of sequence similarity can be difficult when sequence similarity is weak or restricted to a short region. Structure prediction: Algorithms that predict the secondary, tertiary and sometimes even quaternary structure of proteins from their sequences. Determining protein structure from sequence has been dubbed "the second half of the Genetic Code", since it is the folded tertiary structure of a protein that governs how it functions as a gene product. As yet, most structure prediction methods are
70. navirus spike, Human coronavirus OC43 spike [labels from a PHYLIP tree figure]. Tutorial: How to use PDB and Jmol to find and manipulate three-dimensional structures. By Cédric Sam <cedric.sam@elf.mcgill.ca>. Version 2.5, March 2006. Part 1: Protein Data Bank (PDB) to find structures. PDB homepage: http://www.rcsb.org/pdb (or search "pdb beta" on Google). As of September 2005, this tutorial shows the use of the beta site of PDB found at http://pdbbeta.rcsb.org/pdb. PDB is a repository for 3-D structures of biological relevance. Although PDB means Protein Data Bank, it is also a database where you can find structures of nucleic acids and other macromolecules, although proteins are by far the best represented category. Just to illustrate the use of PDB and molecular visualization tools, we will use a major histocompatibility complex class I (MHC I) molecule from mice throughout this tutorial. Step 1: Getting started. Open a browser window and google PDB. Then click on the first link (the site's address is http://www.rcsb.org/pdb, but it's easier to find through Google). [Screenshot: the RCSB Protein Data Bank home page in a browser, with the menu entries Tutorial, About This Site, Getting Started, Acknowledgements, Frequently Asked Questions, Known Problems and Report Bugs/Comments.]
71. nd compare it to those in the Protein Data Bank (PDB). Tools, Miscellaneous: Expression Profiler, a set of tools for clustering, analysis and visualization of gene expression and other genomic data. As well as applications, the following are databases maintained by the EBI: • EMBL Nucleotide Database: Europe's primary collection of nucleotide sequences, maintained in collaboration with GenBank (USA) and DDBJ (Japan). Note: These are the three partners of the International Nucleotide Sequence Database Collaboration (INSD). See Brunak et al., Science 298 (5597): 1333. • UniProt Knowledgebase: a complete annotated protein sequence database. It is a central repository of protein sequence and function created in 2002 by joining the information contained in Swiss-Prot/TrEMBL (Switzerland/Europe) and PIR (USA). See Curr Opin Chem Biol 2004 Feb; 8(1): 76-80, a recent article on UniProt and protein sequence databases at large. • Macromolecular Structure Database: European project for the management and distribution of data on macromolecular structures. • ArrayExpress: for gene expression data. • Ensembl: providing up-to-date completed metazoan genomes and the best possible automatic annotation. 2 Source: EBI Services, http://www.ebi.ac.uk/services. 3 Source: EBI Databases, http://www.ebi.ac.uk/databases. [Screenshot: the EBI site navigation bar (Site search, Downloads, Submissions, Databases).]
72. of multiple sequence alignments and hidden Markov models (HMMs) covering many common protein domains and families. While there is only one Pfam database in circulation, there are many websites from which it is accessible (same database, different interfaces). These sites aren't mirrors of each other, but the services offered on them are equivalent. There are Pfam sites in Sweden, South Korea and France, but the main ones are the Sanger Institute's in the UK and Washington University's in St. Louis, MO. [Screenshot: a Pfam family page with sections for Domain organisation, Alignment (Seed: 25, Full: 4514), Species Distribution, Phylogenetic tree (trees generated using Quicktree), Database References and Literature References.]
73. on PDB, SwissProt, PIR released in the last 30 days. • swissprot: The last major release of the SWISS-PROT protein sequence database. • pat: Protein sequences derived from the Patent division of GenBank. • pdb: Sequences derived from the 3-dimensional structure Protein Data Bank. BLAST Options. • Limited to Entrez query: BLAST searches can be limited to the results of an Entrez query against the database chosen. This can be used to limit searches to subsets of the BLAST databases. • Filtering: Mask off segments of the query sequence that have low compositional complexity. • Expect: The statistical significance threshold for reporting matches against database sequences; the default value is 10, meaning that 10 matches are expected to be found merely by chance. • Word size: BLAST is a heuristic that works by finding word matches between the query and database sequences. One may think of this process as finding "hot spots" that BLAST can then use to initiate extensions that might lead to full-blown alignments. [Screenshot: the "Options for advanced blasting" form, showing Limit by Entrez query, Composition-based statistics, Choose filter (Low complexity, Mask for lookup table only, Mask lower case), Word Size 3, Matrix BLOSUM62, Gap Costs (Existence: 11, Extension: 1).]
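To make the "word size" option above more concrete, here is a minimal sketch of the hot-spot idea, assuming exact word matches only. It is not BLAST's actual implementation (BLAST also scores similar words and then extends the hits); the function names and the two short sequences are made up for illustration.

```python
from collections import defaultdict


def index_words(sequence, word_size):
    """Map every word (substring of length word_size) to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_hot_spots(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where the same word occurs in both.

    Simplifying assumption: only identical words count as a match.
    """
    subject_index = index_words(subject, word_size)
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


# Hypothetical toy sequences, not taken from any database.
query = "MKVLAAGI"
subject = "TTMKVLGGAAGI"
print(find_hot_spots(query, subject, word_size=3))
# prints [(0, 2), (1, 3), (4, 8), (5, 9)]
```

A larger word size gives fewer, more specific hot spots; a smaller one gives more hits to extend, which is slower but more sensitive.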
74. only partially successful and typically work best for certain well-defined classes of proteins. Substitution matrix: A model of protein evolution at the sequence level, resulting in the development of a set of widely used substitution matrices. These are frequently called Dayhoff, MDM (Mutation Data Matrix), BLOSUM or PAM (Percent Accepted Mutation) matrices. They are derived from global alignments of closely related sequences. Matrices for greater evolutionary distances are extrapolated from those for lesser ones. U. Unidentified reading frame (URF): An open reading frame encoding a protein of undefined function. V. Variation (genetic): Variation in genetic sequences and the detection of DNA sequence variants genome-wide allow studies relating the distribution of sequence variation to a population history. This in turn allows one to determine the density of SNPs or other markers needed for gene mapping studies. Quantitation of these variations, together with analytical tools for studying sequence variation, also relates genetic variations to phenotype. W. Weight matrix: The density of binding sites in a gene or sequence can be used to derive a ratio of density for each element in a pattern of interest. The combined individual density ratios of all elements are then collectively used to build a scoring profile known as a weight matrix. This profile can be used to test the prediction of the identification of the selected pattern and the ab
75. [Screenshot: the ClustalW submission form on the EBI website.] ClustalW is a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen via viewing Cladograms or Phylograms. [Form fields shown: YOUR EMAIL, ALIGNMENT TITLE, RESULTS, ALIGNMENT, CPU MODE, KTUP (WORD SIZE), WINDOW LENGTH, SCORE TYPE, TOPDIAG, PAIRGAP, MATRIX, GAP OPEN, END GAPS, GAP EXTENSION, GAP DISTANCES, OUTPUT FORMAT, OUTPUT ORDER and, for the phylogenetic tree, TREE TYPE, CORRECT DIST. and IGNORE GAPS, followed by a box labelled "Enter or Paste a set of"]
76. ort to group nucleotide sequences (ESTs and mRNA) of selected organisms by the genes they are related to. 5 See Nucleotide Sequence Database Policies, Science 298 (5597): 1333, 15 Nov 2002. 6 The Entrez Help Document is a useful resource: http://web.ncbi.nih.gov/entrez/query/static/help/helpdoc.html. [Screenshot: Entrez Nucleotide results for the query "m157", with the display-format menu (Brief, ASN.1, XML, GenBank, GI list, Graphics, TinySeq XML, GBSeq XML, INSDSeq XML, LinkOut and related Links options) opened over the list of hits.] By default, the results of an Entrez search in Ge
77. ou are using an external molecular visualization tool, you may choose to download the PDB file from the Download Files sub-menu or under one of the ... However, our preferred mode of viewing would be with one of the web-based viewers (using Jmol, for instance) under Images and Visualization on the right-side menu. Part 2: Using Jmol to visualize PDB files. Step 1: getting the structure. From the Structure Explorer page, you have clicked on one of the links leading to Jmol. The applet will load and display the structure in its most basic view: in ribbons, without annotation. Each single chain is coloured differently, and non-protein compounds are shown in sticks and balls. Pressing the middle (wheel) button allows you to zoom in and out of the molecule. Pressing CTRL and dragging with the right button allows you to move the molecule around (translation) within the window. Double-clicking on an atom (easier to perform when the molecules are in spacefill mode) will display a meter that shows the distance between it and any other atom you are pointing to, and to subsequent atoms as well. Single-clicking on an atom makes the status bar of your browser (View > Status Bar), or the console if opened, display details on what you have just clicked. The right cl
78. r the website Google will just search for "clustalw" and you will get it as your first hit. If you don't want to look, ClustalW can be found here: http://www.ebi.ac.uk/clustalw. ClustalW is a program. Many interfaces exist for it, and we only show you the web version. You will also find ClustalW bundled with DS Gene, and another ClustalW called ClustalX, which can be downloaded and run on your home computer. Part 2: The ClustalW form and importing data to it. First, take your time and look around. The ClustalW form has many options to be toggled. For the purpose of our tutorial and upcoming exercise, we will keep all the default settings, minus trivial things such as the name we want to give to identify our query and an e-mail to send results to. For now, don't touch the output and phylogenetic tree sections. By default, ClustalW will output a multiple sequence alignment, or MSA, which takes several sequences (amino acid or nucleotide) and gives you the best alignment it can find between all the sequences. As part of the alignment process, and depending on the sequences given, gaps are inserted and similar letters aligned together. ClustalW is just one of many alignment programs. It has its strengths and its weaknesses, but the details are beyond the scope of this course.
79. results page comes with a java applet displaying a simple representation of the tree (the .ph file). So what actually is the tree? How do you represent a tree if not as something visual? Here's the tree data. Not that you need to understand the format of the .ph file, but it is insightful to know that so little is used to define the appearance of the tree. At the bottom is a representation of the tree by the java applet built into the ClustalW web version webpage. [Screenshot: the Phylip tree file (.ph) opened in the browser, giving a branch length for each of the nine spike sequences, and below it the applet's phylogram of the same sequences: Human coronavirus NL63 spike, Human coronavirus 229E spike, Porcine epidemic diarrhea viru, Transmissible gastroenteritis, Human coronavirus OC43 spike, Bovine coronavirus spike, Murine hepatitis virus spike, SARS coronavirus spike, Avian infectious bronchitis vi. Controls: show as Cladogram, Tree View, PH File.]
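To show how little text is needed to define a tree, here is a rough sketch of reading a parenthesised tree string of the kind stored in a .ph file (the Newick notation: nested parentheses, names, and branch lengths after colons). The parser is a simplified illustration that assumes unquoted names, and the three-sequence tree is hypothetical, not the coronavirus tree from the tutorial.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {'name': str, 'length': float or None, 'children': list}.
    Assumes unquoted names without spaces, commas, colons or parentheses.
    """
    text = "".join(text.split())  # drop line breaks and spaces
    if text.endswith(";"):
        text = text[:-1]
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # skip ")"
                break
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        node["name"] = text[start:pos]
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()


def leaf_names(node):
    """Collect the names of the tips of the tree, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


# Hypothetical tree with made-up names and branch lengths.
tree = parse_newick("((Virus_A:0.19,Virus_B:0.16):0.08,Virus_C:0.27);")
print(leaf_names(tree))
```

Everything a tree viewer draws (which tips group together, how long each branch is) comes from that single line of text.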
80. ructure Explorer. The first one is called Sequence Details, which shows the amino acid sequence of each chain of the structure file. From the top menu, select Sequence Details. The sequence of the structure can be downloaded in FASTA format from the button below the secondary structure overview. All sorts of data concerning the protein sequence can be found on this page. Occasionally, the structure will have a link going to the corresponding Swiss-Prot page, which would contain curated data on the protein. [Screenshot: the Sequence Details page for 1MHC, chain A (MHC class I antigen H2-M3, type polypeptide(L)), showing the domains (MHC alpha 1 and alpha 2 domains; MHC alpha 3 domain), the sequence annotated with secondary structure, and a button to download the chain in FASTA format for the sequence only.] Step 6: viewing the structure. If y
81. s in 1992. Molecular visualization tools back then had to be run on graphics workstations, but RasMol, being an extremely well-optimized program, could run on then moderately powerful computers. It is still being used nowadays, although technology has since evolved and several convenient web-based tools have been developed. One piece of software adopted by the PDB is Jmol, which does not need to be installed and can be run from the web on any computer equipped with a browser and Java. Jmol borrows its scripting language from RasMol, and users can interact with the program using a command-line interface, which allows the user to perform several tasks that would otherwise be too complex in a point-and-click fashion, such as selecting a range of amino acids and highlighting them. Another nice program to view and customize structures is DeepView (a.k.a. Swiss-PdbViewer), developed by the Swiss Institute of Bioinformatics and GlaxoSmithKline. Another fairly popular downloadable program is PyMOL. Both of these programs are used in a production environment and have advanced view and customization features you will not find in RasMol or web-based applications. 2.6 Other databases. There's bound to be a database suited to your needs. For instance, the McGill Center for Bioinformatics hosts a database called HERA, which compiles all human proteins known to reside in the endoplasmic reticulum. Amos Bairoch, the founder of the Swiss-Prot database, lists many dat
82. scientific studies. PIR website: http://pir.georgetown.edu. The PIR maintains the Protein Sequence Database (PSD), an annotated protein database similar to Swiss-Prot. The PSD grew out of the Atlas of Protein Sequence and Structure (1965-1978), edited by the late Margaret Dayhoff. 2.3.4 UniProt. Nucleotide sequence databases were united under the International Nucleotide Sequence Database (INSD) Collaboration, but curated protein databases didn't have their equivalent body until 2002, when the UniProt consortium was established between the developers of the main existing annotated protein databases: the EBI/SIB (Swiss-Prot & TrEMBL) and the PIR (Protein Sequence Database, PSD). UniProt is a very recent addition that aims to replicate the efforts of UniGene in the amino acid sequence world. The first version (1.0) of UniProt was officially launched on 15 Dec 2003, and its second version (2.0) on 5 Jul 2004. Both of these were in fact the most current versions of Swiss-Prot, TrEMBL and PSD merged together. Databases making up UniProt are: • The UniProt Archive (UniParc) is the most comprehensive publicly accessible non-redundant protein sequence database available. It includes sequences from databases hosted by the founding members of UniProt, but also sequences derived from other public databases such as PDB, RefSeq or EMBL. As its name implies, UniParc is an archive, so every time a change is made to an entry on the native database, UniParc tak
83. second sequence; you don't need to put a space between sequences and the next description line like we did here. [Screenshot: the "Enter or Paste a set of Sequences in any supported format" box of the ClustalW form, holding a FASTA entry headed ">SARS coronavirus spike" followed by its amino acid sequence, with the "Upload a file" field and the Browse and Run buttons below.] Note: FASTA files are written as plain text. Plain text, as opposed to formatted text, consists only of characters, without information pertaining to font, size, etc. Formatted text files like Word documents are encoded and can only be opened using specific programs like MS Word, and cannot be interpreted by programs such as ClustalW. You must therefore use a plain text editor like Windows Notepad to write your own FASTA files and save them with any Windows extension (.txt here). Remember this nuance and always use plain text for anything that doesn't require formatting, such as sequences or ... For such short sequences, we might be waiting for a few minutes to
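Because a FASTA file is only plain text, a few lines of code are enough to read one. The sketch below assumes the usual layout (a description line starting with ">" followed by one or more sequence lines); the function name and the two short records are hypothetical and only illustrate the format.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are allowed but not needed
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records; the sequences are made up and far shorter than real ones.
example = """>Example_virus_1_spike
MFIFLLFLTLTSG
SDLDRCTTFDDVQ

>Example_virus_2_spike
MFLILLISLPTAF
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

A Word document saved as .doc would fail here for the reason given in the note above: the file would contain formatting data rather than just the characters of the sequences.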
84. [Table of contents excerpt] Chapter 2: Molecular Biology Databases • 2.1 Introduction • 2.2 Nucleotide Sequence Databases • 2.2.1 The Big Three: GenBank, DDBJ and EMBL • 2.2.2 Entrez, NCBI's all-purpose search engine • 2.2.3 NCBI's UniGene • What does UniGene contain exactly? • How do you search UniGene? • 2.3 Protein Sequence Databases • 2.3.1 What can you find in a curated protein database?
85. ster can be found: tissue types in which the gene is expressed; protein similarities with clusters in model organisms (say, if you wanted to express a human gene in mice using the murine counterpart); the LocusLink report for the gene; and its location in the genome. How do you search UniGene? Each species has its own summary page with all kinds of nice statistics about the number and size of the clusters, as sequences get added to the system by the thousand every week. Typically, you might know the name of the gene, or could be looking for the cluster to which your nucleotide sequence belongs. UniGene is the database and Entrez is NCBI's all-purpose search engine, so Entrez UniGene is naturally the way to go. NCBI UniGene website: http://www.ncbi.nih.gov/UniGene. 2.3 Protein Sequence Databases. Protein databases are the natural extension of nucleotide sequence databases. They come in two varieties: Sequence Repositories and Universal Curated Databases. Sequence repositories are generally just places where protein sequences are compiled, with minimal attention given to providing non-redundant entries. GenBank is a well-known example of a sequence repository. In contrast, universal curated databases are manually organized and looked after by experts. Among protein databases, NCBI's GenPept is an example of a sequence repository. It contains the translations of the nucleotide sequences contained in the GenBank/EMBL/DDBJ triumvirate mentioned earlier. C
86. tes of the atoms and bonds of the molecule in question. You could read all of this code if you opened the files in a text editor like Notepad, but that's not generally what you'd want. Instead, visualization programs are used to convert the text information into molecules you can twist and turn in space. Searching PDB. This section will now cover methods for searching the PDB database for information that's relevant to us. 1. The simplest search tool is SearchLite, which is available directly on PDB's homepage as the text input box right in the middle of the webpage. You may enter a query as a PDB ID (a unique 4-character alphanumerical string; scientific papers sometimes use these), the authors of the structure, or the full text search, which is basically any text that's found associated with an entry. 12 http://www.whatislife.com/reader/techniques/techniques.html 2. Other search tools are linked from the homepage: • QuickSearch: Searches the entries like SearchLite, but also all of the support pages making up PDB. • SearchFields: Searches against specific fields of information, for example deposition date or author. • Status Search: Searches on the status of an entry (on hold or released). 3. Interactive Search: Among structures obtained through one of the types of search previously mentioned, you can choose a subset of structures to perform additional searches. Structures to search within can be selected through a pull-down men
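As a rough illustration of how a program turns the plain text of a PDB-format file into atoms, the sketch below reads atom records and measures the distance between two of them. It assumes the fixed-column layout of ATOM/HETATM lines (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x, y and z in 31-54); the helper that builds sample lines and the two atoms are hypothetical, not taken from a real entry.

```python
import math


def make_atom_line(serial, name, residue, chain, number, x, y, z):
    """Build one fixed-column ATOM line (hypothetical helper for sample data)."""
    return (f"ATOM  {serial:>5} {name:<4} {residue:>3} {chain}{number:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Extract name, residue, chain, residue number and coordinates per atom."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "residue": line[17:20].strip(),
                "chain": line[21],
                "number": int(line[22:26]),
                "xyz": (float(line[30:38]),
                        float(line[38:46]),
                        float(line[46:54])),
            })
    return atoms


# Made-up two-atom "structure"; other record types are simply skipped.
sample = "\n".join([
    "HEADER    HYPOTHETICAL SAMPLE",
    make_atom_line(1, " N", "GLY", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA", "GLY", "A", 1, 3.0, 4.0, 0.0),
])
atoms = parse_atoms(sample)
distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])  # needs Python 3.8+
print(atoms[0]["name"], atoms[1]["name"], distance)
# prints N CA 5.0
```

Visualization programs do the same kind of reading on a much larger scale before drawing the molecule.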
87. the add input box or box button with the corresponding boolean operator (and, or, not), which are used to concatenate the query terms. 2. BLAST, which is a sequence alignment program, to search a protein sequence against a database (UniProt in our case). 7 Dr. Dayhoff (1925-1983) was considered a pioneer of bioinformatics. She developed a number of algorithms for alignment and comparison, as well as protein and DNA databases. A footnote in her biography would probably be the single-letter code for amino acids she came up with. 8 The difference between database and knowledgebase is subtle. The web defines knowledgebase as "A collection of information used to answer questions", while a database is "A collection of data arranged for ease and speed of search and retrieval". But that's not it. There are various tools and analyses available from the individual UniProt consortium member web sites and other sites that complement the UniProt databases. These are categorized as Similarity Search, Multiple Sequence Alignment, Batch Retrieval, Proteomics and Bibliography. There is also a section for Comprehensive Tools Links Lists (http://www.uniprot.org/search/tools.shtml) on the UniProt website. Reading a UniProt entry. A UniProt entry is just text organized in a consistent format. Every entry contains information about the following items: • Entry Information (entry name, accession number, etc.) • Name and origin of the protein • The protein
88. the molecules, we would have to execute select a and spacefill. We can also decide to select by amino acid, using the select command and appending a number range to the chain letter. For instance, to select amino acids 300 to 400 in chain B, we would call the command select 300-400b. If successful (if the range delimiters exist; verify with the sequence details page), the console will display that it has just selected a certain number of molecules. The chain name is optional and, if omitted, molecules on all chains will be selected. This might not be important, because many structures have one single polypeptide chain or many subunits of the same polypeptide chain. To put emphasis on these selected molecules, the user can use various view customization commands such as color <colorname> and spacefill. To select co-crystallized compounds (peptides, single nucleic acids, chemicals, solvent, etc.), it is more convenient to use the context menu, because their selection name may not be standard. If these chemicals are listed, it will be under Chemical Component on the main page. Tutorial: How to use InterPro to find conserved protein domains. By Cedric Sam <cedric.sam@elf.mcgill.ca> and Abrar Khan <abrar.khan@mail.mcgill.ca>. Version 2, August 2005. InterPro is a database of protein families, domains and functional sites in which
89. tree and allows the user to view alignments & domain organisation by species. Glossary. Selected from the 2can glossary on the EBI website: http://www.ebi.ac.uk/2can/glossary. A. Accession number: An identifier supplied by the curators of the major biological databases upon submission of a novel entry that uniquely identifies that sequence (or other) entry. Algorithm: A series of steps defining a procedure or formula for solving a problem, which can be coded into a programming language and executed. Bioinformatics algorithms typically are used to process, store, analyze, visualize and make predictions from biological data. Analogy: Reasoning by which the function of a novel gene or protein sequence may be deduced from comparisons with other gene or protein sequences of known function. Identifying analogous or homologous genes via similarity searching and alignment is one of the chief uses of Bioinformatics. Annotation: A combination of comments, notations, references and citations, either in free format or utilizing a controlled vocabulary, that together describe all the experimental and inferred information about a gene or protein. Annotations can also be applied to the description of other biological systems. Batch, automated annotation of bulk biological sequence is one of the key uses of Bioinformatics tools. B. Bioinformatics: 1. The field of endeavor that relates to the collection, org
90. u on a query results page, or manually by checking the box corresponding to a target entry. [Screenshot: the RCSB Protein Data Bank homepage (http://www.pdb.org/pdb), "An Information Portal to Biological Macromolecular Structures". As of Tuesday, Mar 14, 2006, there are 35579 structures. The page shows the search box (PDB ID or keyword, Author, Advanced Search) and the site menu. Its welcome text reads: The RCSB PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function and disease. The RCSB is a member of the wwPDB, whose mission is to ensure that the PDB archive remains an international resource with uniform data. This site offers tools for browsing, searching and reporting that utilize the data resulting from ongoing efforts to create a more consistent and comprehensive archive. Information about compatible browsers can be found here. A narrated tutorial illustrates how to search, navigate, browse, generate reports and visualize structures using this new site. This requires]
91. ulations on local private computers, or more practically consulted on-line by molecular biologists at large using search tools such as BLAST. 2.1 Introduction. This section will cover nucleotide sequence, protein structure and protein sequence databases. Some of the main databases are found below: • Biomedical literature: PubMed • Species-specific: SGD, FlyBase, WormBase, MGI • Nucleotide sequences: GenBank, EMBL, DDBJ • Genome sequences: Entrez Genome, TIGR databases • Protein sequences: GenPept, Swiss-Prot, TrEMBL, PIR • Macromolecular 3-D structures: Protein Data Bank, MMDB • Protein and peptide mass spectroscopy: PROWL • Post-translational modifications: RESID • Biochemical and biophysical information: ENZYME, BIND • Biochemical pathways: PathDB, KEGG, WIT • Microarray (chips) data: ArrayExpress, SMD • 2D-PAGE: SWISS-2DPAGE • Protein families and domains: PROSITE, Pfam, InterPro, ProDom. 2.2 Nucleotide Sequence Databases. Nucleotide sequences (DNA and RNA) are essential pieces of information. Researchers might use protein-coding nucleotide sequences to produce large quantities of protein for various experiments in the wet lab. 4 This listing is vaguely based on the one found in Developing Bioinformatics Computer Skills by Cynthia Gibas and Per Jambeck, O'Reilly & Associates, 2001. 2.2.1 The Big Three: GenBank, DDBJ and EMBL. NCBI's GenBank (USA), EBI's EMBL Nucleotide Sequence Database (Europe
92. urated databases contain information validated by expert biologists and are thus considered highly reliable. Swiss-Prot, TrEMBL and PIR are examples of such databases, and we will look more closely at their history and modes of functioning before talking about UniProt, an effort by Apweiler, Bairoch and Wu's groups to establish networks for sharing protein information around the world. 2.3.1 What can you find in a curated protein database? Information on each protein is very specific and presumably highly reliable. Anything essential, such as the accession number, the source organism and sometimes very many references, can be found. Cross-references with other databases can be useful for researchers who might be interested to learn more about a given structural domain contained in the protein, or the related nucleotide sequences if one wishes to express the protein for various assays. 2.3.2 Swiss-Prot and TrEMBL. Swiss-Prot is a curated biological database of protein sequences created in 1986 by Amos Bairoch during his PhD and developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. It strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. As of July 5, 2004, Swiss-Prot release 44.0 contains 153,871 entries. (Wikipe
93. ustalW to perform multiple sequence alignments and build phylogenetic trees • Tutorial: How to use PDB and Rasmol to find and manipulate three-dimensional structures • Tutorial: How to use InterPro to find conserved protein domains • Glossary • Appendix: How to write the report. Chapter 1: Bioinformatics Institutes. This section will be an overview of the major actors in the field of bioinformatics: what services they offer and what sort of databases they each manage. These research institutes were all established in different countries, but their reach, their funding sources and their staff are now worldwide. 1.1 NCBI: The National Center for Biotechnology Information (USA). The NCBI is a unit of the National Library of Medicine (NLM), which is in turn a branch of the National Institutes of Health (NIH). The NCBI is located in Bethesda, MD, on the outskirts of Washington, DC. 1.1.1 Database resources at the NCBI. Here's an overview of a few of the databases hosted by NCBI and the services which come with them: • Database Retrieval Tools: Entrez is an integrated retrieval system for the databases hosted by NCBI. Taxonomy indexes over 150,000 organisms th
94. z flavor. Using the Limits link at the bottom of the Entrez search bar box, you specify many parameters, such as the fields you want to limit your search to. There are also limits specific to the type of Entrez you are using; for example, with Entrez Nucleotide (GenBank), you can decide the type of nucleotide (genomic DNA/RNA, mRNA or rRNA), date of modification, the subset of GenBank it belongs to, etc. In Entrez PubMed, you can conveniently specify a range for the date of publication or choose the publication type, among other things. To search GenBank, use Entrez: http://www.ncbi.nih.gov/Entrez (case-sensitive). 2.2.3 NCBI's UniGene. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. (NCBI website) UniGene is not a database per se; it is rather a system for automatically partitioning GenBank sequences, including expressed sequence tags (ESTs), into a non-redundant set of gene-oriented clusters. (Wheeler DL et al., Database Resources of the National Center for Biotechnology, Nucl Acids Res 31: 28-33, 2003) The importance of UniGene is its role in organizing the numerous sequences contained in public databases that can relate to a single gene. Means of organizing the information have been overwhelmed by the deluge of sequences coming from the various genome projects, and UniGene is an eff
