Home
Chapt 5
Contents
1. ACT1_FUGRU MEDEIAALVVDNGSGMCKAGFAGDDAPRAVEPSIVGR 37 ACT2_FUGRU MDDEIAALVVDNGSGMCKAGFAGDDAPRAVEPSIVGR 37 ACT3_FUGRU MEDEVASLVVDNGSGMCKAGFAGDDAPRAVEPSIVGR 37 5H1A_FUGRU MDLRATSSNDSNATSGYSDTAAVDWDEGENATGSGSLPDPELSYQIITSLFLGALILCSI 60 5H1B_FUGRU HFDSTSNRTSKSFDEEVKLSYQVVTSFLLGALILCSI 48 5H1D_FUGRU MELDNNSLDYFSSN FTDIPSNTTVAHWTEATLLGLQISVSVVLAIVTLATM 51 M ACT1_FUGRU PRHQGVMVGMGQK DSYVGDEAQS KRGILTLKYPIEHGIVTNWDDMEKIWHH 88 ACT2_FUGRU PRHQGVMVGMGQK DSYVGDEAQS KRGILTLKYPIEHGIVTNWDDMEKIWHH 88 ACT3_FUGRU PRHQGVMVGMGQK DSYVGDEAQS KRGILTLKYPIEHGIVTNWDDMEKIWHH 88 5H1A_FUGRU FGNSCVVAAIALERSLQNVANYLIGSLAVIDLMVSVLVLPMAALYQVLNKWTLGQDICDL 120 5H1B_FUGRU FGNACVVAAIALERSLQNVANYLIGSLAVIDLMVSVLVLPMAALYQVLNRWTLGQIPCDI 108 5H1D_FUGRU LSNAFVIATIFLTRKLHTPANFLIGSLAVIDMLVSILVMPISIVYTVSKTWSLGQIVCDI 111 wia SS Ee aS JIE A E E ACT1_FUGRU PSIVHRKCF 375 ACT2_FUGRU PSIVHRKCF 375 ACT3_FUGRU PSIVHRKCF 375 5H1A_FUGRU KKILRCKFHRH 423 5H1B_FUGRU KKIIKCHFCRA 416 5H1D_FUGRU QKLIK FRR 379 and here is the corresponding alignment in the phylip format 6 431 ACTICPUGRY ean seeaNas samen sare ane MEDEIAA LVVDNGSGMC KAGFAGDDAP ACT2 PUGRU q s seraes Aaa sss an MDDEIAA LVVDNGSGMC KAGFAGDDAP ACTS FUGR szamitana prs sees san See MEDEVAS LVVDNGSGMC KAGFAGDDAP 5H1A_FUGRU MDLRATSSND SNATSGYSDT AAVDWDEGEN ATGSGSLPDP ELSYQIITSL 5HiB_FUGRU MEG TNNTTG
2. 34 36 37 38 39 40 o om an ol e Q for a om a o0 5 5 BLAST PARSING HHHHRHHHHHRHHHHHHHHHHHHHHHHHHHHHHHH use strict The main program Part 1 Read the input file and check that is non zero undef my Blast lt gt unless length Blast gt 0 die Zero length input file n Part 2 Some hardcoded constants my Icutlow 80 my Icuthigh 85 my PrintAlign 1 Part 3 The rest my subjects GetSubjects Blast my Idents Alns foreach my subj subjects my sID rHSPs GetHSPs subj All the high scoring pairs undef Idents undef Alns for my i 0 i lt rHSPs i my rHPSres AnalyzeHSP rHSPs gt i my id rHPSres gt Pid Check the criteria if id gt Icutlow amp amp id lt Icuthigh push Idents id push Alns rHPSres gt Aln if Idents print ID sID n foreach my i 0 i lt Idents i print gt Identities Idents i n if PrintAlign 1 print Alns i n sub GetSubjects The blast file my blast _ Note this is a copy of the supplied blast file Get rid of everything before the ALIGNMENTS line blast s ALIGNMENTS n s 107 108 CHAPTER 5 APPLYING PERL 89 90 and everthing after the database statement 91 blast s s sDatabase s 92
3. Collect the sequence if read seq line elsif line SQ read 1 last close sFH seq s n g seq s s g return os seq search pl 5 4 4 Follow up tasks 5 3 The following files are needed for the below exercises search pl PDB table txt pdb seqres txt mkindex pl pdb2sprot txt and sprot subset2 dat Make sure that you understand this program and then write a program header for search pl Modify the program so that all of the optional RX lines are returned and printed instead of the organism The file pdb_seqres txtf contain all PDB entries in Fasta format Use this file to extract the protein sequences for all PDB codes in the PDB table txt file Compare each sequence with the corresponding sequence from the Swiss Prot database Print the difference in length together with the other information Optimization of the code The subroutine searchSwiss always starts from the begin ning of the Swiss Prot flat file For relatively small files this is an acceptable solution but for really large files this approach can take very long time The Perl functions seek and tell can be used to move around in a file handle Using the tell function we can create an index that relates a certain ID to a position ftp ftp wwpdb org pub pdb derived_data pdb_seqres txt gz 5 5 BLAST PARSING 105 in the file With the use of this index
4. CGA GCC GCG GCT C gt TGC TGT D gt GAC GAT We can summarize the with the following table Reference to Named Anonymous Scalar scalar Array array LIST Hash hash LIST Code amp function CODE 5 3 2 Using references There are many ways of creating references as there are many way of using or dereference references Before we look at small examples were we use references we must learn how to dereference them Variable names When you come across a scalar like amino you should be thinking the scalar value of amino This means that there is a amino entry in the symbol table and the character is a way of obtaining the scalar value behind the name If what is inside is a reference you can look inside that dereferencing amino by prepending another character Formulating 96 CHAPTER 5 APPLYING PERL it in another way you can replace the literal amino in amino with a scalar variable that points to the actual referent Here is an example amino PERLISFUN scalarref amino scalarref is now a reference to amino deref scalarref deref is now PERLISFUN We can of course use this way of dereferencing on both arrays and hashes More examples arrayref aminos Make a reference to the array aminos arrayref 0 Gly Set the first element of O arrayref push arrayref
5. while my line shift data next if line 7 s TO BE COMPLETED Seq Species np line np 1 if np nprot np 0 convert missing pl sub ReadPhylip To store the alignments a hash is used i e the global variable Seq where a sequence is index by its name i e the species These names are also stored in the correct order in the array Species Some notations e The line my data _ simply makes a copy of the supplied array and stores it in the local array data e Throughout this routine the contents in data is accessed using the shift function onan nua e N e e N e oO that chops off the first item of the array and returns it e The split function is used here to split on whitespace e The TO BE COMPLETED will be left as an follow up exercise Next the corresponding routine that reads a clustalw file convert missing pl sub ReadClustalW sub ReadClustalW Copy the supplied data to a local data array my data _ Skip starting blank lines while data 0 s shift data The first non blank line should contain the word CLUSTAL unless data 0 CLUSTAL die The input file does not contain the word CLUSTAL n 13 14 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 OO oan nu e N e e BP H e N O pi Si 5 2 CHANGE OF FILE FORMATS 91 shi
6. STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW Indicates that one of the following weaker groups is fully conserved CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY Summary You should write a Perl program that converts clustalw files to phylip files and vice versa When you run your program on clustalwi1 align clustalw2 align you should get exactly the phylip1 align phylip2 align and the other way around 2 Write a parser for the result files produced by the FASTA sequence alignment program This parser should for each alignment print the percentage of identities and print the length of the longest sub alignment with only identities To identify each aligned se quence extract the first group characters after gt gt e g TR E7EVS8_HUMAN The FASTA result file that you should work with is called 1TEN_fasta res and is available at the course web page Summary Write a parser for the 1TEN fasta res file Your output should look like the sample output file sample 1TEN_fasta txt
7. 31 5 3 REFERENCES OBJECTS AND METHODS 97 hashref GCG A The standard way hashref GCG A The shortcut way hashref gt GCG A This preferred arrow way We can even use many of these arrow operators print array 3 gt Perl gt 0 You can see from this expression that the fourth element of array is a hash reference and the value of the Perl entry in that hash is an array reference 5 3 3 Examples of using references Here we will show two examples of how to use references The first one deals with subroutines and in the other one we create a hash of arrays Passing arrays to subroutines ref2 pl usr bin perl w use strict Main part of the program my fasta lt gt my hashref SearchFasta fasta TGG my res hashref foreach my key sort keys res print key res key n sub SearchFasta Get the references from the argument array my f1ref subsref _ Dereference my f1 f1ref This makes a copy of the referenced array Make a long string of the fasta sequence my seq join f1 Remove all comment lines seq s gt n g Remove all new lines seq s n g Loop over all substrings to search for 32 33 34 35 36 37 38 39 OO N Q A e N Y N N N NNN NY NB Bee Be Be ee eB ee oI O2aSk NYHSO KF FCO wAANDAA BwWHNH HO 98 CHAP
8. N ao va G u Te a a a a N N NouyvuooadoaooqQo9qaoqaoaangenan na o om oo o ao ao a w N jn oO N QA a A Ww N y O N Dm wo ke Ww N O 00 e O r e 5 4 SEARCHING IN LARGE TEXT FILES while my line lt tableFH gt next if line PDB chomp line my pdbID line pdbID s s my swissID keyref gt pdbID my 0s seq searchSwiss swissID ARGV 2 print pdbID t os n close tableFH End of main program sub readKey my keyFile _ open my kFH lt keyFile or die Cannot open file keyFile my Aptos while my line lt kFH gt next if line code Each line contains 3 items we are only interested in the first two my pdb sid split line Store in a hash ptos pdb sid close kFH return ptos End of readKey sub searchSwiss my swissID swissFile _ open my sFH lt swissFile or die Cannot open file swissFile my os my seq while my line lt sFH gt Check for the correct ID line next unless line ID s swissID st Now scan for OS and SQ my read 0 while my line lt sFH gt Stop if we hit the last of entry code last if line ml7 103 104 End of searchSwiss CHAPTER 5 APPLYING PERL Collect the OS line if line OS s os 1 7
9. 182 4e 57 dbj BAG64930 1 unnamed protein product Homo sapiens 191 2e 54 ref XP_005498448 1 PREDICTED tenascin isoform X3 Columba 1 158 3e 42 gb KFQ87172 1 Tenascin R Phoenicopterus ruber ruber 158 3e 42 ALIGNMENTS gt pdb 1TEN A Chain A Structure Of A Fibronectin Type Iii Domain From Tenascin Phased By Mad Analysis Of The Selenomethionyl Protein Length 90 Score 182 bits 461 Expect 4e 57 Method Compositional matrix adjust Identities 90 90 100 Positives 90 90 100 Gaps 0 90 0 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRITIDLTEDENQYSIG 60 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG Sbjct 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRITIDLTEDENQYSIG 60 ou e N e O O N Q 10 11 12 13 14 16 17 18 19 20 21 22 23 24 26 27 28 29 30 31 32 106 CHAPTER 5 APPLYING PERL Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NLKPDTEYEVSLISRRGDMSSNPAKETFTT Sbjct 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 gt dbj BAG64930 1 unnamed protein product Homo sapiens Length 1080 Score 191 bits 486 Expect 2e 54 Method Composition based stats Identities 90 90 100 Positives 90 90 100 Gaps 0 90 0 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG 60 There are of course many ways we can try to summarize the information found in this file The program presented below parses a blast result file like 1TEN_blastp res in the fo
10. RA RT RL CC CC CC CC DR DR DR DR DR DR DR PE KW FT FT SQ Y491_PASMU Reviewed 127 AA Q9CNE1 22 AUG 2003 integrated into UniProtKB Swiss Prot 01 JUN 2001 sequence version 1 10 AUG 2010 entry version 25 RecName Full Uncharacterized protein PM0491 OrderedLocusNames PM0491 Pasteurella multocida Bacteria Proteobacteria Gammaproteobacteria Pasteurellales Pasteurellaceae Pasteurella NCBI_TaxID 747 1 NUCLEOTIDE SEQUENCE LARGE SCALE GENOMIC DNA STRAIN Pm70 MEDLINE 21145866 PubMed 11248100 DOI 10 1073 pnas 051634598 May B J Zhang Q Li L L Paustian M L Whittam T S Kapur V Complete genomic sequence of Pasteurella multocida Pm70 Proc Natl Acad Sci U S A 98 3460 3465 2001 Copyrighted by the UniProt Consortium see http www uniprot org terms Distributed under the Creative Commons Attribution NoDerivs License EMBL AE004439 AAK02575 1 Genomic_DNA RefSeq NP_245428 1 GeneID 1243838 GenomeReviews AEQ04439_GR PM0491 KEGG pmu PM0491 NMPDR fig 272843 1 peg 491 BioCyc PMUL272843 PM0491 MONOMER 4 Predicted Complete proteome CHAIN 1 127 Uncharacterized protein PM0491 FTId PRO_0000216293 SEQUENCE 127 AA 14589 MW A85EFFC5579E4184 CRC64 MQLVFSYIEH KSQVIPVCFW KENHQLHPLT GYLNDPMGGL NYFAFLDKVL SMLRDEDIQQ GDISSNSWGV EIHGDQVYFC FLFAQEDTSL HFALSRAVLI DILVLWLAFR SQKPVAGYQE VLSFAEA In this application we will use the I
11. and the seek function we can jump directly to the position of a certain ID The Perl script mkindex p1 contains a subroutine that makes such an index Use this index and modify the searchSwiss routine to make use of the seek function This will significantly speed up the code 5 5 Blast parsing In this section we will look at a Perl program that parses the large amount of information that a Blast run produces Below is part of such a result file 1TEN_blastp res produced by the blastp program when blasting the protein 1TEN PDB id BLASTP 2 2 30 Reference Stephen F Altschul Thomas L Madden Alejandro A Schaffer Jinghui Zhang Zheng Zhang Webb Miller and David J Lipman 1997 Gapped BLAST and PSI BLAST a new generation of protein database search programs Nucleic Acids Res 25 3389 3402 Reference for compositional score matrix adjustment Stephen F Altschul John C Wootton E Michael Gertz Richa Agarwala Aleksandr Morgulis Alejandro A Schaffer and Yi Kuo Yu 2005 Protein database searches using compositionally adjusted substitution matrices FEBS J 272 5101 5109 RID 3CDJ5NH701R Database All non redundant GenBank CDS translations PDB SwissProt PIR PRF excluding environmental samples from WGS projects 49 886 901 sequences 17 905 752 166 total letters Query Length 90 Score E Sequences producing significant alignments Bits Value pdb 1TEN A Chain A Structure Of A Fibronectin Type Iii Domai
12. 1 32 33 5 2 CHANGE OF FILE FORMATS 93 TO BE COMPLETED print aas n print n PrintName 0 convert missing pl sub WritePhylip 5 2 2 Follow up tasks 5 1 Download the files for this exercise convert missing pl clustalwl aln clustalw2 aln phylip1 aln and phylip2 aln and complete the following tasks 1 Complete the first TO BE COMPLETED The line scalar contains one line of the phylip file without the names e g Use a regular expression to get rid of the spaces and newline characters in line 2 Complete the second TO BE COMPLETED This is the other way around You have aas_tmp which is a subsequence of length 50 and you should create aas which is the same as aas_tmp but space delimited after every 10 character E g K LCYVALDFEQEMGTAASSSSLEKSYELPD Gq should be K LCY VALDFEQEMG TAASSSSLEK SYELPD GQ 3 Test the program using the files clustalw1 align phylipt align or clustalw2 align phylip2 align Note this program will not handle the conservation line appearing in the clustalw format 4 Change the program so that one does not have to specify the conversion to make i e the program should recognize that it reads a clustalw file and then convert it to a phylip file and vice versa 5 Add a subroutine to the program that finds the number of full matches in the multiple alignment This number should be written at the end of the
13. 93 Now split on the gt xxx pattern but keep the gt xxx itself 94 my subjects split gt w 2 3 blast 95 96 return subjects 97 98 End of GetSubjects 99 100 101 sub GetHSPs 102 One subject 103 my subject _ Note this is a copy of the supplied subject 104 105 Extract an ID 106 my id 107 if subject 7 gt w 2 3 1 108 id 1 109 else 110 print Warning Subject does not start with a known format n 111 print subject 112 exit 113 114 115 Now get HSPs by dividing at the Score line 116 subject s Score s 117 my hsps split sScore subject 118 119 return id hsps 120 121 End of GetHSPs 122 123 124 sub AnalyzeHSP 125 The HSP 126 my hsp _ Note this is a copy of the supplied hsp 127 128 This will store the results 129 my HSPres 130 Identities 90 90 100 Positives 90 90 100 Gaps 0 90 0 131 Find the score 132 if hsp Identities st st d d st d Z 133 HSPres Pid 1 134 135 136 Get the Alignment 137 hsp s Query s 138 hsp s n n g 139 HSPres Aln hsp 140 141 Now return a reference to the result hash 142 return HSPres 143 5 5 BLAST PARSING 109 144 End of AnalyzeHSP parse pl A few details in this program is worth noting e The program treats the input file
14. Ala Add one element to O arrayref arrayref 2 4 qw Val Leu Ile Set several elements of arrayref hashref ahash Make a reference to the hash ahash hashref GCA gt A GCC gt A Initialize whole hash hashref GCG A Set one key value pair Is is important to understand that dereferencing happens before any array or hash lookups This is why it is important at least in the beginning to use curly braces We can of course use shortcuts deref scalarref is the same as deref scalarref arrayref is the same as arrayref thashref is the same as hashref The use of braces is recommended To help you understand this note the important dif ference between arrayref 0 and arrayref 0 where the former means the first element of the array referred to by arrayref and the latter which is dereferencing the first element of the array named arrayref It is important to understand this The arrow operator For references to arrays hashes or even subroutines a third method of dereferencing in volves the use of the gt operator Look at the following equivalent ways of dereferencing arrayref 2 Val The standard way arrayref 2 Val The shortcut way arrayref gt 2 Val This preferred arrow way O N1 Q A e N e e BP H e N e O 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
15. Chapter 5 Applying Perl 5 1 Introduction to chapter In this chapter we will use the knowledge gathered during the first two weeks in order to solve some problems that may occur in the field of Bioinformatics There will also be a lecture on how to use references in Perl The following list is a summary of this chapter e Change of file formats e References objects and methods e Searching in large text files e Parsing Blast result files During the lectures we will study programs or part of programs that solves the different tasks We will look at both the structure of the program i e how the author decided to logically solve the problem and details in the Perl code 5 2 Change of file formats Pretend that you are performing a multiple sequence alignment The result of such an alignment using e g clustalw can be a text file showing the alignment of the sequences There are different formats for this text file At a later step in your analysis of the alignments you may be faced with the problem of converting one text format to another We will now look at the task of converting clustalw files aln format to phylip files phylip format and vise versa 5 2 1 clustalw to phylip and phylip to clustalw Here is an alignment of the 6 proteins ACT1_FUGRU ACT2_FUGRU ACT3_FUGRU 5H1A_FUGRU 5H1B_FUGRU 5H1D_FUGRU shown in the clustalw format 87 88 CHAPTER 5 APPLYING PERL CLUSTAL W 1 82 multiple sequence alignment
16. D OS and SQ lines e N e OO N nw 10 12 13 14 15 16 iz 18 19 20 21 22 23 24 25 26 27 28 102 CHAPTER 5 APPLYING PERL 5 4 2 The pdb2sprot txt file The pdb2sprot txt file is a cross reference between PDB ID s and if possible corresponding Swiss Prot ID s The first few lines of the file looks like this code Swiss Prot entry name s 101M MYG_PHYCA P02185 102L LYS_BPT4 P00720 102M MYG_PHYCA P02185 103L LYS_BPT4 P00720 103M MYG_PHYCA P02185 104L LYS_BPT4 P00720 104M MYG_PHYCA P02185 105M MYG_PHYCA P02185 106M MYG_PHYCA P02185 5 4 3 search pl Here is a program that will accomplish the task oulined obove Use sprot subset2 dat whick is a subset of the full Swissprot database sprot subset2 dat is available at the course homepage search pl usr bin perl w HHH Program description Title search pl Author s Mattias Ohlsson Description List of subroutines Overall procedure Usage search pl table file key file swissprot file HHH HH H HHH HOH OH OF HEHHEHHHHHHHHHHHHEHHHHEHHHHEH HEHE use strict The main program unless ARGV 3 die You must have three files as argument Read the key file my keyref readKey ARGV 1 Open the table that we should complete open my tableFH lt ARGV O or die Cannot open table file ARGV O 29 30 31 32 33 34 36 37 38 39 40
17. TER 5 APPLYING PERL my cnt foreach my sub subsref cnt sub seq s sub sub g Return the hash of matches return cnt ref2 pl This small program contains one subroutine SearchFasta that takes two arguments both references to arrays The subroutine returns a reference to a hash Can you figure out the purpose of the program If we run the program on the first entry in the fasta file ecoli fasta we get the following output TGG 2044 Hash of arrays In this little example we create a hash and where each value in the hash is an array This can only be accomplished using references ref3 plt usr bin perl w use strict The translation table my codon codon A GCA GCC GCG GCT codon C TGC TGT codon D GAC GAT codon E GAA GAG codon F TTC TIT codon G GGA GGC GGG GGT codon H CAC CAT codon I ATA ATC ATT codon K AAA AAG codon L CTA CTC CTG CTIT TTA TTG codon M ATG codon N AAC AAT codon P CCA CCC CCG CCT codon Q CAA CAG co
18. WT HFDSTSN RTSKSFDEEV KLSYQVVTSF 5H1D_FUGRU MEL DNNSLDYFSS N FTDIPSN TTVAHWTEAT LLGLQISVSV RAVFPSIVGR PRHQGVMVGM GQK DSYVGDEAQS KRGILTLK RAVFPSIVGR PRHQGVMVGM GQK DSYVGDEAQS KRGILTLK RAVFPSIVGR PRHQGVMVGM GQK DSYVGDEAQS KRGILTLK FLGALILCSI FGNSCVVAAI ALERSLQNVA NYLIGSLAVT DLMVSVLVLP LLGALILCSI FGNACVVAAI ALERSLQNVA NYLIGSLAVT DLMVSVLVLP VLAIVTLATM LSNAFVIATI FLTRKLHTPA NFLIGSLAVT DMLVSILVMP ASLSTFQQMW ISKQEYDESG PSIVHRKCF ASLSTFQQMW ISKQEYDESG PSIVHRKCF ASLSTFQQMW ISKQEYDESG PSIVHRKCF SLLNPIIYAY FN KDFQSAF KKILRCKFHR SLLNPIIYAY FN KDFQSAF KKIIKCHFCR SLINPVIYTV FN DEFKQAF QKLIK FRR Po The task is to write a program that can read a text file containing an alignment in one format and print on the screen the alignment using the other format And vice versa As usual when we are programming in Perl there many ways of solving this problem I will present one of many possible solutions ona N QOQ A e N N N N e eB Be Be Be Be H HKH H N e e O OO NOAUA RA OO N F OO O N Q A e Ww N 5 2 CHANGE OF FILE FORMATS 89 The program called convert missing pl is divided into the following parts e Main program i e read command line arguments and load the alignment file to convert Subroutine parsing the Phylip format Subroutine parsing the ClustalW format Subroutine writing to screen a Phylip format Subroutine writing to screen a ClustalW format Here is the main prog
19. YGIKDVPGDRTTIDLTEDENQYSIG LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 772 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 832 NLRPHTEYEVTLISRRGDMESDPVKEVFVT 861 D ref XP_005022416 1 gt Identities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 772 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 832 NLRPHTEYEVTLISRRGDMESDPVKEVFVT 861 ID gb E0A99266 1 gt Identities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 772 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 832 NLRPHTEYEVTLISRRGDMESDPVKEVFVT 861 D ref XP_005022414 1 gt Identities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 772 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 832 NLRPHTEYEVTLISRRGDMESDPVKEVFVT 861 ID gb KFQ40359 1 gt Id
20. as single scalar Blast including all line breaks A convenient way to read an file into a single scalar is to modify the special variable called the input record separator This variable is usually set to n the new line character By undefining this variable we get the whole input file as a single scalar see line 41 42 e Pattern matching Normally in a regular expression the dot matches any character except the new line one n However if we use the s modifier it matches the new line character also This is used in the program several times e g line 88 e Positive lookahead assertion Sometimes it useful to have regular expression that matches in a hypothetical way Perl have four such constructs The positive looka head assertion is used in this program and it looks like 7 PATTERN You can see an example of it at line 138 The regexp s n n g looks for two consecutives n n characters but when it finds two it will on only replace the first one with nothing since the second is a lookahead assertion only As they put in the Learning Perl text book The Engine works it all out for us by actually trying to match the hypothetical pattern and then pretending that it didn t match if it did This program run on the 1TEN_blastp res file e g gt gt parse pl 1TEN_blastp res results in the following output Result of parse pl 1TEN_blastp res ID ref XP_005022418 1 gt Identities 84 Qu
21. conversion By a full match I mean a residue that is fully conserved among the proteins Note For the clustall aln alignment this number is 21 Item 5 can be considered as things to do if you have time 94 CHAPTER 5 APPLYING PERL 5 3 References objects and methods 5 3 1 Creating references What is a reference This leads us to the concept of how Perl is storing values in variables Each variable has a name and the address that corresponds to a piece of memory associated with it Storing addresses is fundamental to references because a reference is a value that contains the location of another value We call the scalar value that contains the memory address a reference Lets look at the reference of a simple scalar variable var Hello world We can create a reference to this variable using the backslash operator varref var Both var and varref are scalars Let s print them print var n print varref n results in the following output Hello world SCALAR 0x815d7e4 Here we can see that the reference is just a memory address that holds the value of var Why use references In some circumstances it is very convenient and sometimes it is necessary For example if you want to pass two arrays to a sub routine then we need references We recall that everything passed to a sub routine is stored in the _ array which makes it difficult to pass two arrays Another example is if we want to crea
22. don R AGA AGG CGA CGC CGG CGT codon S AGC AGT TCA TCC TCG TCT codon T ACA ACC ACG ACT codon V GTA GTC GTG GTT codon W TGG codon Y TAC TAT Print the translation table foreach my aa sort keys codon 29 30 31 32 33 34 5 3 REFERENCES OBJECTS AND METHODS 99 print aa foreach my nuc codon aa print nuc print n ref3 plt Note the square brackets when defining the hash codon The expression codon aa should be read as The array referred to by the key aa in the hash codon 5 3 4 Objects and methods This very short section about objects and methods is meant as a very brief introduction to prepare you for the coming lectures about CPI pm and the modules in the BioPerl project From Programming Perl 3rd edition An object is a data structure with a collection of behaviors Every object gets its behaviors by virtue of being an instance of a class The class defines methods behaviors that apply to the class and its instances When the distinction matters we refer to methods that apply only to a particular object as instance methods and those that apply to the entire class as class methods But this is only a convention to Per
23. e O O N DOD a A A OO NBEO N o 92 CHAPTER 5 APPLYING PERL Count the number of non in the aas my Naa 0 while aas w g Naa 1 aacount i Naa if aacount i gt 0 print aacount i n print n n convert missing pl sub WriteClustalW The task is here to write the names on each line followed by 60 amino acid symbols including gap characters The line should end with a count of the number of amino acid letters written sofar e The line Seq Species i s 1 60 may need some explanation Here the task is chop off if possible the first 60 characters of the sequence This is accomplished with the above regexp s q1 60 means up to 60 characters are replaced with nothing using the s function However the match is collected in 1 using in the regexp Finally the routine that writes a Phylip format convert missing pl sub WritePhylip sub WritePhylip Print the header line printf d t d n scalar Species length Seq Species 0 Now the rest of the sequences my PrintName 1 my nprot scalar Species while length Seq Species 0 gt 0 The names foreach my i 0 nprot 1 my name Species i if PrintName 1 printf 10s t name else printf 10s t The sequences Seq name s 1 50 my aas_tmp 1 26 27 28 29 30 3
24. eHSP to also extract the number of gaps if any in the alignment and print this on the screen 2 For each HSP that you select find the longest sub alignment that contains only iden tities and print it on the screen As an example the following alignment Query 41 FAGKDLESIKGTAPFETHANRIVGFFSKIIGELPN 75 FAGKDL S K TA F THA RIVGF S I N Sbjct 1 FAGKDLDSLKNTASFATHAGRIVGFVSEIVALMGN 35 has FAGKDL as the longest sub alignment with identities 3 For each alignment that you select count the number of sub alignments with only identities larger than a given number Task 3 is of the type to do if you have time 5 6 Hand in exercise 3 Select one of the following problems as the hand in exercise for this week 1 The program for converting between clustalw and phylip file formats first section of this chapter does not handle the conservation line used in the clustalw files Write a Perl program or add more functionality to an existing one so that when you convert from phylip to clustalw format the conservation line is also created This program should also recognize the input format i e one should not have to specify the direction of conversion Below is a definition of the conservation line 112 CHAPTER 5 APPLYING PERL Three characters are now used in the conservation line Indicates positions which have a single fully conserved residue Indicates that one of the following strong groups is fully conserved
25. entities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 454 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 514 NLRPHTEYEVTLISRRGDMESDPMKEVFVT 543 s ID ref XP_005498448 1 lt gt Identities 83 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG LDAPSQIE KDVTDTTALITW KPLA I GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 741 KLDAPSQIEAKDVIDTTALITWSKPLADIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T 60 831 60 831 60 831 60 831 60 513 60 800 80 81 82 83 84 85 86 87 88 89 90 5 6 HAND IN EXERCISE 3 111 Sbjct 801 NLRPHTEYEVTLISRRGDMESDPMKEVFVT 830 ID gb KFQ87172 1 gt Identities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG 60 LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 456 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG 515 Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 516 NLRPHTEYEVTLISRRGDMESDPMKEVFVT 545 Result of parse pl 1TEN_blastp res 5 5 1 Follow up tasks 5 4 Download the files parse pl and 1TEN_blastp res and complete the following tasks 1 Add code the subroutine Analyz
26. ery 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG 60 LDAPSQIE KDVIDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG Sbjct 772 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRTTIDLSEDENQYSIG 831 Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTIT 90 NL P TEYEV LISRRGDM S P KE F T Sbjct 832 NLRPHTEYEVTLISRRGDMESDPVKEVFVT 861 oO wa ND Hw ge Ww N JD ref XP_006137248 1 gt Identities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG 60 LDAPSQIEV DVTDTTALITWFKPLAEID EL YG KDVPGDRTTIDL EDE QYSIG Sbjct 804 KLDAPSQIEVRDVTDTTALITWFKPLAEIDDMELSYGPKDVPGDRTTIDLSEDESQYSIG 863 15 Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 16 NLKP TEYEV LISRRGDM S P KETF T 17 Sbjct 864 NLKPHTEYEVTLISRRGDMTSDPVKETFVT 893 18 e e RP H e N O 19 D ref XP_005022417 1 20 gt Identities 84 21 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLTEDENQYSIG 60 22 LDAPSQIE KDVTDTTALITW KPLAEI GIELTYG KDVPGDRTTIDL EDENQYSIG 23 Sbjct 772 KLDAPSQIEAKDVIDTTALITWSKPLAEIEGIELTYGPKDVPGDRITIDLSEDENQYSIG 831 24 Query 61 NLKPDTEYEVSLISRRGDMSSNPAKETFTT 90 26 27 28 29 30 31 32 33 34 36 37 38 39 40 65 66 67 68 69 70 71 72 73 74 76 77 78 79 110 CHAPTER 5 APPLYING PERL NL P TEYEV LISRRGDM S P KE F T Sbjct 832 NLRPHTEYEVTLISRRGDMESDPVKEVFVT 861 D ref XP_005022415 1 gt Identities 84 Query 1 RLDAPSQIEVKDVTDTTALITWFKPLAEIDGIELT
27. ft data Store the sequenses in a hash my aux while my line shift data next unless line w my name seqtmp lentmp split line Seq name seqtmp Find the name in correct order unless defined aux name push Species name aux name 1 convert missing pl sub ReadClustalW Since the protein names are printed on every line of the alignment it is easy to store the sequences in the 4Seq hash More specific comments e The line while data 0 s shift data is a short way of skipping blank lines in the beginning e There is a regexp for detecting lines that contains the alignments namely w which matches the character or alphanumeric characters e The aux hash is only used to get the correct order of the alignment when storing the names in the Species array Next the routine that writes a ClustalW format convert missing pl sub WriteClustalW sub WriteClustalW Write the ClustalW header print CLUSTAL W 1 82 multiple sequence alignment n n n my nprot scalar Species my aacount while length Seq Species 0 gt 0 foreach my i 0 nprot 1 printf 10s t Species i Seq Species i s 1 60 my aas 1 print aas 16 17 18 19 20 21 22 23 24 25 26 27 28 Oo N DO TO P WwW N N N N N NY Ee Be Be Be Be Be HB BK H rk O N
28. l a method is just a method distinguished only by the type of its first argument You can think of an instance method as some action performed by a particular object such as printing itself out copying itself or altering one or more of its properties Class methods might perform operations on many objects collectively or provide other operations that aren t dependent on any particular object Method invocation This is how we invoke a method invocant gt method list invocant gt method E g seq1 gb gt get_Seq_by_acc ans 5 3 5 Follow up tasks 5 2 1 Make sure you understand the programs ref2 pl1 and ref3 pl 2 Write a subroutine that takes two arrays as arguments and returns a hash The hash is created from the two arrays such that the first element of the first array is used as a key and the first element of the second array is the corresponding value and so on Test your subroutine on some arrays A Koj ioe N O ot 11 12 13 14 15 16 100 CHAPTER 5 APPLYING PERL 3 Create an array and where each element of the array is a reference to a hash Write a subroutine that takes such an array of hashes and display the values and keys of all hashes defined Test your subroutine 5 4 Searching in large text files In this application and the exercises that follows we will use Perl to scan through large text files The example below will utilize the Swiss Prot database used in pre
29. llowing way Alignments with the number of identities between a user defined interval are extracted and optionally printed on the screen together with the first identification line and the actual number of identities in parse pl usr bin perl w HHHHHHH Program description 4 Title parse pl Author s Mattias Ohlsson Description This is a parser for result files produced by the Blastp program for sequence alignment List of subroutines GetSubjects GetHSPs AnalyzeHSP Overall procedure This program loads a result file from a blastp run The program does not check that the loaded file is a valid blastp file only that it is a non empty file The file is stored as a single scalar All the analysis is performed on scalar and not on an array containing lines of the result file The first step is to split the result into subjects this is performed in the GetSubjects routine The program then makes a loop over all found subjects and for each subject all the high scoring pairs are extracted GetHSP A second loop is over all HSPs found and where each HSP is analyzed in the AnalyzeHSP routine This routine extracts all the information needed in order to select the ones with a percentage of identities within a specified range The selected ones are printed together with the alignment optional Usage parse pl blast result file HH HH HHH HH HH H HOH OF
30. ram convert missing pl Main program use strict our Seq Species Global variables The main program open my FH lt ARGV 1 or die Cannot open file ARGV 1 n my indata lt FH gt close FH if uc ARGV 0 eq P2C ReadPhylip indata WriteClustalW elsif uc ARGV 0 eq C2P ReadClustalW indata WritePhylip else die Error Unknown conversion mode n convert missing pl Main program This program uses two global variables declared using the our Seq Species state ment The file is read using the statement my indata lt INFILE gt which reads every thing into the array indata The uc function converts characters to upper case Here is the subroutine that reads a phylip file convert missing pl sub ReadPhylip sub ReadPhylip Copy the supplied data to a local data array my data _ Use the first line to find the number of proteins my nprot plen split shift data 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 90 CHAPTER 5 APPLYING PERL Read the first nprot lines to get the species and the start of the seq my tmp for my i 0 i lt nprot i Species i tmp split shift Q data chomp tmp Seq Species i join tmp Continue to read lines of sequences my np 0
31. te hashes of arrays then we also need references Before we look at examples we need to know how to create references The backslash operator You can create a reference to any named variable or subroutine with a backslash Here are some examples scalarref foo Reference to a scalar arrayref ARGV Reference to an array hashref ZENV Reference to a hash constref 12341 42 Reference to a constant coderef amp myfunction Reference to a sub routine globref STDOUT Reference to a glob Inspiration of how to write this section is coming from Programming Perl third edition 5 3 REFERENCES OBJECTS AND METHODS 95 Anonymous data In the examples just shown the backslash operator makes a copy of a reference that is already in a variable name with one exception The 12341 42 is not referenced by a named variable it is just a value We can also create such anonymous arrays hashes and subroutines For a reference to an anonymous array we use square brackets anonarri 1 2 4 5 6 anonarr2 1 2 One Two Three Four The last one is a reference to an array where the third element itself is a reference to another array For hashes we use curly brackets to create the anonymous reference anonhashi Reference to a hash gt A gt CGA C gt TGC D gt GAC anonhash2 Reference to a hash of arrays ze gt
32. vious exercises and a cross reference file pdb2sprot txt linking PDB id s with Swiss Prot ID AC strings The task is to complete the missing information in the second column of the following table PDB Organism 2YTE 1K46 1RJJ 113L 151J 2QKD 9LDB 3BW6 3MEJ 1515 7ZNF 1DU5 1c4Y 1FAD 1HY5 The organism information can be found in the Swiss Prot database file and the link be tween an PDB ID and the Swiss Prot ID can be found in the pdb2sprot txt Let s look at the two files 5 4 1 The Swiss Prot flat file From the user manual of the Swiss Prot file on can read Swiss Prot is an annotated protein sequence database It was established in 1986 and maintained collaboratively since 1987 by the group of Amos Bairoch first at the Department of Medical Biochemistry of the University of Geneva and now at the Swiss Institute of Bioinformatics SIB and the EMBL Data Library now the EMBL Outstation The European Bioinformatics Institute EBI The Swiss Prot Protein Knowledgebase consists of sequence http www expasy ch sprot userman html 5 4 SEARCHING IN LARGE TEXT FILES 101 entries Sequence entries are composed of different line types each with their own format For standardization purposes the format of Swiss Prot follows as closely as possible that of the EMBL Nucleotide Sequence Database Here is one entry from the Swiss Prot file ID AC DT DT DT DE GN 0S Oc oc OX RN RP RC RX
Download Pdf Manuals
Related Search
Related Contents
Samsung HT-C330 home cinema system Southbend G-72-M User's Manual citizenwatch.jp gemm™ 製品情報のユーザビリティ 専門家育成に関する調査・研究 FY-CL08PS8D の取付工事説明書 Especificaciones Técnicas Medidores de Pinza de 200 A AC Accu-Chek Performa® Glucómetro Document Control and Copyright © All rights reserved.
Failed to retrieve file