
Exercise 1

1. [Screenshot: file listing of plasmid sequences with their lengths, from Acetobacter pasteurianus, Acidithiobacillus ferrooxidans, Acinetobacter sp., Actinobacillus pleuropneumoniae, Aeromonas salmonicida, Agrobacterium rhizogenes and Agrobacterium tumefaciens.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005. Module 7: ACT comparison files.
2. [Screenshot: FASTA results window for a Gene_finder gene model (undefined product, 2570..5365, reverse strand).] FASTA searches a protein or DNA sequence data bank, version 3.3t08, Jan. 17, 2001. Please cite: W.R. Pearson & D.J. Lipman, PNAS (1988) 85:2444-2448. The query is searched against the SWALL library; the window then reports the search statistics and parameters (ktup 2, join 38, gap penalties 12/2, width 16) followed by the list of best scores, headed by Q08162, a yeast exosome complex exonuclease (E-value 3.5e-67).
3. [Screenshot: Artemis main window at the start of the S. typhi sequence, showing the six-frame translation, the DNA sequence around bases 2760 to 2840 and the feature list.] Entries legible in the feature list: CDS, orthologue of E. coli thrL (LPT_ECOLI); CDS, orthologue of E. coli thrA (AK1H_ECOLI); misc_feature 343..369, PS00324 Aspartokinase signature; misc_feature 2314..2382, PS01042 Homoserine dehydrogenase signature; CDS, orthologue of E. coli thrB (KHSE_ECOLI); misc_feature, PS00627 GHMP kinases putative ATP-binding domain; CDS 3734..5020, orthologue of E. coli thrC (THRC_ECOLI); misc_feature 4022..4066, PS00165 Serine/threonine dehydratases pyridoxal-phosphate attachment site; CDS 5114..5887 c, orthologue of E. coli yaaA (YAAA_ECOLI); CDS 5966..7396 c, similar to Bacillus subtilis amino acid carrier protein alsT; misc_feature 7091..7138 c, PS00873 Sodium:alanine symporter family signature; CDS 7665..8618, Fasta hit to TALA_ECOLI (65% identity in 311 aa overlap).
4. [Screenshot: Artemis sequence view with its display options (Forward Frame Lines, Reverse Frame Lines, Start Codons, Stop Codons, Feature Arrows, Feature Borders, All Features On Frame Lines, Show Source Features).] Delete genes that you don't think are real from the ORFS_100 entry. Module 4: Gene Prediction. Exercise 3: Using homology data. The tools you have used so far have only looked at sequence properties to predict if a region is coding. However, you may also use homology evidence to identify if it is coding. Evidence from Blast searches can be read into Artemis for this purpose. To do this: • Go to 'Read an entry' in the File menu and select TB_v_swall.blastx. This file contains BLASTX hits from the M. tuberculosis DNA sequence searched against the SWALL
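The ORFS_100 entry mentioned on this slide presumably holds open reading frames above a minimum length; the 100-codon threshold, the stop-to-stop definition of an ORF and the function name below are assumptions made for illustration, not a description of how Artemis builds that entry. A minimal Python sketch that scans the three forward frames for stop-free stretches:

```python
STOPS = {"TAA", "TAG", "TGA"}

def forward_orfs(seq, min_codons=100):
    """Return (start, end, frame) for stop-free stretches in the forward frames.

    Coordinates are 1-based and inclusive; the stop codon is not included.
    Hypothetical helper: an ORF here is simply a run of codons without a stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame  # 0-based index where the current stretch begins
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if (pos - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start + 1, pos, frame + 1))
                orf_start = pos + 3
        # Stretch that runs to the end of the sequence without meeting a stop
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - orf_start) // 3 >= min_codons:
            orfs.append((orf_start + 1, last, frame + 1))
    return orfs
```

The reverse strand can be handled by running the same function on the reverse complement and converting the coordinates back.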
5. [Screenshot: ACT score cutoff window. Scroll along the bar to screen out low-scoring hits; the example shows a cutoff of 40 and a maximum cutoff of 287, with Reset and Close buttons.] Module 6: Comparative Genomics. [Screenshot: ACT four-way comparison of N_crassa_qut.embl vs A_fum_qut.embl vs A_nid_qut.embl vs P_anserina_qut.embl vs N_crassa_qut.embl, showing the qut gene cluster in each organism.]
6. Module 4: Gene Prediction.
• Choose the file TB cu in the current directory.
• Two graphs will appear (see below); use the vertical slider to scroll the graphs.
• Decide which CDSs agree with the codon usage and delete those that definitely do not.
[Table: codon usage table giving, for each codon, its frequency per thousand and its total number, e.g. CUU 25.3 (38015), CCU 21.9 (32964), ACU 22.3 (34413), AAU 33.9 (51009), AUG 20.3 (31376), GAU 38.1 (57240). Most of the remaining values are illegible in the source.]
[Screenshot: Artemis window for entry MTB fasta showing 'Codon Usage Scores from TB cu', with a forward and a reverse codon usage plot.]
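A codon usage table of the kind shown on this slide (frequency per thousand codons plus a total count) can be tallied from a set of trusted coding sequences. The sketch below is only an illustration: the function name and the plain-string input are assumptions, and it does not reproduce how Artemis scores the codon usage plots.

```python
from collections import Counter

def codon_usage(cds_list):
    """Return {codon: (frequency per thousand, count)} for in-frame CDS strings.

    Hypothetical helper: sequences are assumed to start in frame; codons are
    reported in RNA letters (U instead of T), as in the table on the slide.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("T", "U")
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: (1000.0 * n / total, n) for codon, n in counts.items()}
```

For example, `codon_usage(["ATGGATGATTAA"])` counts four codons, two of which are GAU, so GAU is reported at 500 per thousand.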
7. [Screenshot: Notepad 'Save As' dialog, with 'Save as type: Text Documents (*.txt)' and 'Encoding: ANSI'.] Running BLAST. The BLAST software does not run in Windows but in DOS, an operating system that Windows runs in. When you want to run BLAST you will need a DOS window (a.k.a. Command Prompt). [Screenshot: Windows Start menu with Command Prompt highlighted.] Appendices. Type cd blast and press Return. [Screenshot: Command Prompt showing 'Microsoft Windows XP [Version 5.1.2600]' and the prompt 'C:\Documents and Settings\Team 1>'.] This changes the directory to the blast folder which you have just downloaded and unpacked.
8. and GC frame plot. Compare this to the overall GC content by going to Graph and choosing GC Content (%). Try to optimise the graphs by moving the slide bar on the right. [Screenshot: Artemis window for entries TB fasta and ORFS_100 showing the GC frame plot (window size 297) above the CDS features in the six reading frames; callouts read 'Move slide bar to change window size' and 'GC frame plot'.]
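The two graphs on this slide can be thought of as sliding-window calculations: overall GC content uses every base in the window, while a GC frame plot looks at every third base separately, because in GC-rich genomes the third codon position of real genes tends to carry the highest GC content. The following Python sketch illustrates the idea under those assumptions; the function names, window handling and step size are hypothetical and are not Artemis's own implementation.

```python
def gc_percent(seq):
    """Percentage of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

def gc_windows(seq, window, step):
    """Overall GC content per window: list of (1-based window start, GC%)."""
    return [(i + 1, gc_percent(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

def gc_frame(seq, window_start, window):
    """GC% of every third base for the three offsets within one window.

    window_start is a 0-based index; the result is (offset 0, offset 1, offset 2).
    """
    chunk = seq[window_start:window_start + window]
    return tuple(gc_percent(chunk[offset::3]) for offset in range(3))
```

A larger window smooths the curves at the cost of resolution, which is the trade-off the slide bar controls.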
9. You may have noticed, after you performed the merge function, that one of the exons has subsequently jumped into another reading frame. Artemis automatically splices the CDS, so if the exon boundaries include an additional partial codon then any following exon will be pushed into another reading frame to account for this. To correct this you can edit the exon boundaries directly by turning on manual editing in the Options menu of the Artemis start-up window, as shown above. This will now allow you to edit the start and end positions of the feature boxes by using the left mouse button. Click and hold down the cursor over the first or last base of any feature and then drag the mouse. The feature box should move as you drag it (see below). This can be a little tricky, so please ask. [Screenshot: Artemis window for the malaria sequence; callouts read '1. Double click to edit' and '2. Click and drag with the cursor here to manually edit'.]
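The frame jump described on this slide follows from simple arithmetic: once the exons are spliced together, any bases left over at the end of one exon are completed by the first bases of the next. A minimal Python sketch, assuming forward-strand exons given as 1-based inclusive (start, end) pairs; the function names and the example coordinates are made up for illustration.

```python
def spliced_length(exons):
    """Total number of bases in the spliced CDS."""
    return sum(end - start + 1 for start, end in exons)

def leftover_after_each_exon(exons):
    """Bases of an unfinished codon carried over after each exon (0, 1 or 2)."""
    leftovers, total = [], 0
    for start, end in exons:
        total += end - start + 1
        leftovers.append(total % 3)
    return leftovers

# Three exons of 10 bases each: the spliced CDS is 30 bases, a whole number of codons.
original = [(1, 10), (21, 30), (41, 50)]
# Extending the first exon by a single base shifts the frame of every later exon.
edited = [(1, 11), (21, 30), (41, 50)]
```

Here `leftover_after_each_exon(original)` gives `[1, 2, 0]`, whereas the edited model gives `[2, 0, 1]` and a spliced length of 31, which is no longer a multiple of three.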
10. to bring up an Artemis window.
• Within the Artemis window, use the Graph menu and switch on the GC Content (%) window.
• Use the Goto menu to select the Navigator window and, within the Navigator window, select 'Goto Feature With This Text' and type Phat4.
• Go to the first ACT window and use the Options menu to select Enable Direct Editing.
• Go through the gene model of Phat4 and have a glance through the exon-intron boundaries. Can you suggest any alternative gene model after consulting the table provided in Appendix IX, which contains several examples of experimentally verified splice site sequences for P. falciparum?
• Example modifications: Have a look at the misc_feature coloured in red (location 15618..20618). Can you spot any difference in the red gene model of Phat4 at the exon-intron boundaries? Select the red feature, click on the Edit menu and select Edit Selected Features; in the new window that pops up, change the Key from misc_feature to CDS and click on the OK button to close the window. Now you can compare the automatically created blue gene model and the curated red gene model at protein level and predict any alternative splicing pattern.
[Screenshot: Artemis window with the GC content graph displayed.]
11. [Screenshot: ACT view of the P. knowlesi contig (Pknowlesi_contig.embl) with the GC content graph displayed.] Prediction of gene models. Exercise 2, Part III. There are several computer algorithms (covered earlier in Module 3) that predict gene models after being trained on previously known gene sets with experimentally verified exon-intron structures in eukaryotes. However, no single programme can predict gene structure with 100% accuracy, and one needs to curate and refine the gene models generated by automated predictions. We have generated automated gene models for the P. knowlesi contig using PHAT (Pretty Handy Annotation Tool), a gene finding algorithm (see Mol Biochem Parasitol 2001 Dec; 118(2):167-74), and the automated annotation is saved in Pknowlesi_contig.embl. Zoom into the P fa
12. The result of the BlastN comparison shows that there are regions of DNA shared between the plasmids. pHCM1 shares 169 kb of DNA at greater than 99% sequence identity with R27. Much of the additional DNA in the pHCM1 plasmid appears to have been inserted relative to R27 and encodes functions associated with drug resistance. What antibiotic resistance genes can you find in the pHCM1 plasmid that are not found in R27? The two plasmids were isolated more than 20 years apart. The comparison suggests that there have been several independent acquisition events that are responsible for the multiple drug resistance seen in the more modern S. typhi plasmid. Module 7: ACT comparison files. Exercise 2. In the previous exercise you used BlastN to generate a comparison file for two relatively small sequences (under 500 kb). In the next exercise we are going to use another program from the NCBI Blast distribution, megablast, that can be used for nucleotide sequence alignment searches, i.e. DNA-DNA comparisons. If you are comparing large sequences such as whole genomes of several Mb, the blastall program is not suitable. The Blast algorithms will struggle with large DNA sequences and therefore the processing time to generate a comparison file will increase dramatically. Megablast uses a different algorithm to Blast which is not as stringent, which therefore makes the pro
13. Click on 'yes' if any small dialogue window appears while reading or opening the files. Module 6: Comparative Genomics. Can you see any conserved gene order between A. fumigatus and A. nidulans in the qut gene cluster? Can you obtain a clearer picture of the ACT 4-way comparison figure by filtering out the low-scoring segments using the blast score cut-off feature which you have used previously? Zoom in and look at some of the genes encoded within these regions. View the details by selecting the appropriate CDS feature and then choosing 'Edit selected feature' from the Edit menu. By comparing the blast similarity matches, assign your own annotation (gene product) to the predicted gene models (the blue genes) in the P. anserina gene model file. Can you identify any gene NOT present in the qut cluster of ALL four fungi? Note down the gene order and direction of transcription in each after you have completed annotation of the P. anserina genes in the qut cluster. [Screenshot: ACT four-way comparison of N_crassa_qut.embl vs A_fum_qut.embl vs A_nid_qut.embl vs P_anserina_qut.embl vs N_crassa_qut.embl; a callout reads 'Use the right click on your mouse and select score cutoff window'.]
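The effect of the score cut-off can also be reproduced outside ACT by filtering the comparison file itself. The sketch below assumes, and this is an assumption rather than something stated in the workshop text, that the comparison file is tab-separated BLAST tabular output with twelve columns and the bit score in the last one (the layout produced by blastall with the -m 8 option). The function name is hypothetical.

```python
def filter_hits(lines, min_score):
    """Keep only tabular BLAST lines whose bit score (column 12) is >= min_score.

    Blank lines and comment lines starting with '#' are dropped.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[11]) >= min_score:
            kept.append(line)
    return kept
```

Raising `min_score` removes short, weak matches first, which is what makes the conserved gene order easier to see in a crowded comparison.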
14. ESTs. ESTs (Expressed Sequence Tags) are sequences of cDNA derived from mature transcripts; hence they give useful information about splice site boundaries. Remember that they will contain UTRs and hence will not help find start and stop sites. You can see the Plasmodium ESTs by reading in the entry EST_blastn.tab. This file contains the BlastN results of the Plasmodium sequence searched against a DNA database of all Plasmodium EST sequences. Try to use this information to help refine your gene models. Remember, ESTs are clear evidence that a region is transcribed and are useful for finding missing exons. [Screenshot: Artemis window showing EST BlastN hits (EST UTokyo, Pfa3D7) around bases 37600 to 44800; a callout reads 'View BlastN hits from Plasmodium ESTs by reading the entry EST_blastn.tab'.]
15. For simplicity it is a good idea to open a new start-up window for each Artemis session and to close down any sessions once you have finished an exercise. In the Options menu you can switch between prokaryotic and eukaryotic mode. [Screenshot: 'Select an EMBL file' dialog.] 3. Single click to select the DNA file. DNA sequence files will have the suffix dna; annotation files end with tab. 4. Single click to open the file in Artemis, then wait. Module 1: Artemis Prokaryotic. 2. Loading annotation files (entries) into Artemis. Hopefully you will now have an Artemis window like this. If not, ask a demonstrator for assistance. [Screenshot: Artemis window for entry S typhi dna with nothing selected.]
16. but can be viewed by loading the appropriate file. Click on File, then Read an Entry, and select the file PF tab. Each Pfam match will appear as a blue feature in the main display panel on the grey DNA lines. To see the details, click the feature, then click View and View Selection, or click Edit and Edit Selected Features. Please ask if you are unsure about Prosite and Pfam. Viewing the results of database searches: Click the View menu, then select Search Results and then Fasta results. The results of the database search will appear in a scrollable window. If you click on the button at the bottom of this window labelled 'view in browser', the results will be posted into an internet browser window. Within this window there are many active links (coloured blue) to external sources of information, such as the original database entries for all those aligning to your sequence, as well as information stored in PubMed, PFAM and many others. This is your opportunity to explore some of the other features of Artemis whilst we are here to help. Further information on specific Prosite or Pfam entries can be found on the web at http://www.expasy.ch/prosite and http://www.sanger.ac.uk/software/Pfam/tsearch.shtml. Module 2: Artemis Eukaryotic. Introduction. Following a similar format to Module 1, this M
17. Module 4: Gene Prediction. When manually editing your exons you should look out for appropriate splice donor and acceptor sites. Once you are happy with your newly created exon, re-run the fasta search and see how this compares with the other hits in the public databases. If there are more exons to mark up, try to complete the gene model. To compare the output of different algorithms alongside each other it is necessary to use a different view in Artemis: One Line Per Entry. [Screenshot: Artemis window for the malaria sequence; a callout reads 'Right click on the feature view panel and select One Line Per Entry'; the Phat and glimmer predictions are shown on separate lines.] Now feature coordinates can be directly compared against each other. After running fasta you can copy a feature that you are happy with to the malaria annotation file. [Screenshot: Edit menu with 'Copy Selected Features To' selected.]
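When checking splice donor and acceptor sites by eye, the usual first test is whether the intron begins with GT and ends with AG. A small Python sketch of that check, assuming a forward-strand sequence and 1-based inclusive exon coordinates; the function name and the toy sequence are made up for illustration.

```python
def intron_has_canonical_sites(seq, exon_end, next_exon_start):
    """True if the intron between two exons starts with GT and ends with AG.

    exon_end is the last base of the upstream exon and next_exon_start the
    first base of the downstream exon (both 1-based, forward strand).
    """
    intron = seq[exon_end:next_exon_start - 1].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

# Toy example: a 6-base exon, a 15-base intron and a second 6-base exon.
toy = "ATGACT" + "GTAATATTTTCCTAG" + "ATGAAA"
```

With this sequence, `intron_has_canonical_sites(toy, 6, 22)` is true, while moving the exon end by one base to 7 makes the check fail. Passing the check is only a minimum requirement; it does not show that a site is actually used.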
18. [Screenshot: end of a FASTA results window (ktup 2, join 37, opt 25, gap penalties 12/2) with Close and 'Send to browser' buttons.] Module 4: Gene Prediction. [Screenshot: Artemis Feature Edit window for feature 124: Key CDS, location complement(153060..154385), with the qualifiers /colour=7 and /product="nucleoside transporter" and a /fasta_file entry.] For each gene add a product description, a gene name if you know it, and a note if there is anything unusual that you wish to record. You may also wish to add a colour to help navigate around the sequence. Using View Selection from the View menu you will see your annotation for a given feature in EMBL format. This is the information that Artemis actually records. For example: [EMBL-format feature lines, illegible in the source apart from the qualifier /colour=4]. Exercise 7: Gene finding for spliced genes (Theileria annulata). You will need to start a new Artemis session and open the file called T ann
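To make the EMBL-format view of a feature more concrete, the sketch below writes a feature table entry from the values visible in the Feature Edit window on this slide (key CDS, complement(153060..154385), colour 7, product nucleoside transporter). The helper name is hypothetical; the column layout (key starting in column 6, location and qualifiers in column 22) follows the standard EMBL feature table convention, and the exact text Artemis would show may differ in detail.

```python
def embl_feature(key, start, end, complement=False, **qualifiers):
    """Format one feature as EMBL feature-table (FT) lines.

    Integer qualifier values are written bare, all others in double quotes.
    """
    location = f"{start}..{end}"
    if complement:
        location = f"complement({location})"
    lines = [f"FT   {key:<16}{location}"]
    for name, value in qualifiers.items():
        text = str(value) if isinstance(value, int) else f'"{value}"'
        lines.append(f"FT{'':19}/{name}={text}")
    return "\n".join(lines)

example = embl_feature("CDS", 153060, 154385, complement=True,
                       colour=7, product="nucleoside transporter")
```

The location spans 154385 - 153060 + 1 = 1326 bases, that is 442 codons, so a quick length check of this kind is a useful sanity test before saving an edited gene model.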
19. Module 4: Gene Prediction. [Screenshot: fasta results for feature 124.] FASTA searches a protein or DNA sequence data bank, version 3.3t08, Jan. 17, 2001. The best scores are: Q9U763 Nucleoside transporter 2; Q9Y0I0 Adenosine transporter 1; Q95203 Adenosine transporter; Q95Z04 Adenosine transporter; Q9Y0H9 Adenosine transporter 1; Q9NBV4 Inosine guanosine nucleoside transp; Q8T6M2 Guanosine permease; Q9GTP4 Nucleoside transporter 2; O76269 Nucleoside transporter 1.2 (fragment); Q8T6M3 Adenosine permease; O76343 Nucleoside transporter 1.1; Q9GTP5 Nucleoside transporter 1; Q86MB6 Nucleobase transporter; Q9N9R1 Probable nucleoside transporter; Q8MUN2 Nucleobase nucleoside transporter 8; Q8R139 Hypothetical 58.1 kDa protein; QSNBM2 Hypothetical protein NT2RP2006435; Q8NAR3 Hypothetical protein FLJ34923; Q9FWY1 T14P 9 protein; Q86WY8 Similar to equilibrative nucleoside. [Screenshot: Artemis window for Tbrucei dna with the entries Tbrucei glimmer tab and Tbrucei embl and the GC content graph; the selected feature 124 has 1326 bases and 442 amino acids, with the qualifiers /label=124 and /colour=4.]
20. [Screenshot: ACT comparison of the two whole genomes, zoomed out.] 4. Things to try out in ACT. Load into the top sequence (S. typhi) a tab file called laterally tab. You will need to use the File menu and select the correct genome sequence (S typhi dna) before you can read in an entry. If you are zoomed out and looking at the whole of both genomes you should see the above. The small white boxes are the regions of atypical DNA, covering regions that we looked at in the first Artemis exercise. It is apparent that there is a backbone sequence shared with E. coli K12. Into this, various chunks of DNA specific to S. typhi with respect to E. coli K12 have been inserted. 5. More things to try out in ACT. Double click red boxes to centralise them. Zoom right in to view the base pairs and amino acids of each sequence. Load annotation files into the sequence view panels. You could load in the appropriate tab files for each genome (S typhi tab and EcK 12 tab) and view the annotation of a particular region. Also try using some of the other Artemis features, e.g. graphs. Find an inversion in one genome relative to the other, then flip one of the sequences. Once you have finished this exercise, remember to close this ACT session down completely before starting the next exer
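Flipping a sequence, as suggested for viewing an inversion, amounts to taking its reverse complement and mirroring every coordinate. The following Python sketch shows both steps; the function names are hypothetical and this is not how ACT itself is implemented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def flip_coordinates(start, end, seq_length):
    """Map a 1-based inclusive range onto the flipped (reverse-complemented) sequence."""
    return seq_length - end + 1, seq_length - start + 1
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`, and a feature at bases 1 to 3 of a 10-base sequence sits at bases 8 to 10 after flipping. A match that appears inverted in the comparison lines up directly once one of the two sequences has been flipped.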
21. Appendices. Splice site table. Each row gives the last exon codon, the start of the intron, the end of the intron (acceptor motif) and the first codon of the next exon; "|" marks an exon-intron boundary and "?" an illegible base.
ACT | GTAATAT ... TTTTTTT?TTTATTTCCTAG | ATG
CAG | GTAAATA ... ATAATGACATTTTGATACAG | ATT
AAT | GTACATT ... TTATTTTTATTTATTTATAG | AAA
TAG | GTATTTG ... ATATTTTTTACTTATGATAG | TTA
AGG | GTAATAT ... TTTATTTTATTTTTTTTTTA | TTT
GGA | GTAAGAG ... TTTTTATTATTTTATTGTAG | TCC
GGA | GTAAGAG ... TTTTTATTATTTTATTGTAG | TCC
CAG | GTAYGCT ... TTTAATTTTTTTTTCCTTCA | TCA
AAA | GTAAGAA ... TATTTTTTTACAATTTTTAG | TTC
AAG | GTAAAAG ... TTTTTTTTTTTTTGTTTCAG | TTT
CAG | GTACATA ... TTTTTTTTTTTTTTTTTTAG | GTG
CAA | GTAATTA ... TATATTTTATTTTTTCTTAG | GTT
TAC | GTTAGTT ... TTTTTTTTTTTTTTTTTTAG | TGG
ATT | GTAAGTT ... TATTTTTTTTTTTTTTTTAG | TGA
TGT | GTAAGAA ... TTGTCATTATTTTTTTTTAG | GTG
AAA | GTATAAA ... TTTATTTATTTTTTTTTTAG | ATA
CAG | GTAAATA ... TTTTAATTTTTTTGTTTTAG | AAA
CTG | GTTTGTC ... CATATATTTCTTTATTTTAG | ATA
AGA | GTAAAAA ... TTTCTTATATTTTCTTTTAG | GTG
CTG | GTTTGTC ... CATATATTTCTTTATTTTAG | ATA
ATG | GTAAGAG ... TATTTTTGATACCTTTATAG | AGT
AAA | GTAATTA ... CAATCATATTAACACAAAAG | ATG
GAG | GTATACA ... TTATTATTCCCTTGCTTTAG | ATC
TCG | GTTAGTA ... TATTTATCATTTTTTTCCAG | ATG
GAA | GTAAATC ... TTTTTTATTTTTCTCATTAG | CTA
TAG | GTGTGTT ... TCATTACATTTTTACCTTAG | GAT
TTA | GTAAGTT ... CGTAATATATTTTTTTTTAG | GAT
ATG | GTAAGAA ... TATTTTTATATTTTTTTTAG | GCT
AAC | GTAAGTT ... TTATTTTTTTTTTCATATAG | TGC
TTG | GTATGCC ... TTTGTATTATTTAATTTTAG | AAT
TTG | GTATG ... TGTGTATTGTTTATTTTTAG | AAT
TGT | GTAAGGA ... TTTTTATATTTTTTCTTTAG | CGA
AAG | GTAACAA ... TATATGTATTTTTTTTTTAG | TGC
Final column (header illegible, values in the order listed, list truncated in the source): 152, 208, 86, 152, 112, 120, 81, 96, 150, 442, 199, 160, 206, 142, 158, 113, 169, 112, 158, 175, 129, 345, 92, 116, 214, 280, 208, 168, 480, 101, 122
22. [Screenshot: Artemis window showing the start of the sequence (bases 10 to 70) with its six-frame translation.] Module 2: Artemis Eukaryotic. To compare the three CDSs with others currently in the public databases, run a fasta search. Left click the CDS, click on the Run menu and then 'Run fasta on selected features'. When the search is finished a banner will appear saying 'fasta process completed' (see above). The search may take a couple of minutes to run. To view the search results click View, then Search Results, then fasta results. The results will appear in a scrollable window. You could also view these results in your Netscape browser window, as in the previous exercise. How does your predicted gene model for this CDS compare with proteins pulled out of the public databases? Is it possible that there are additional exons not featured in the c
23. [Screenshot: Navigator window with the options Search Backward, Ignore Case and Allow Partial Match and the buttons Goto, Clear and Close, above the feature list for the thrL to talA region.] Suggestions of where to go: Think of a number between 1 and 4809037 and go to that base (notice how the cursors on the horizontal sliders move with you). Your favourite gene name (it may not be there, so you could try fts). Use 'Goto Feature With This Qualifier value' to search the contents of all qualifiers for a particular term. For example, using the word pseudogene will take you to the next feature with the word pseudogene in any of its qualifiers. Note how repeated clicking of the Goto button takes
24. [Screenshot: Artemis window showing EST Blast hits around bases 42830 to 42880 and the feature list; legible entries include BLASTN_HIT 17630..18053 c (EST UTokyo, blast score 733, percent identity 96), BLASTN_HIT 29584..29641 c (blast score 95.6, percent identity 95) and a hit at 42643..43305 (EST UTokyo, blast score 283, percent identity 100); a callout marks an 'EST Blast Hit'.] Once you have finished, you may check how your gene prediction for this region compares to the final Sanger annotation by reading in the file Pfal_subseq.tab. Module 4: Gene Prediction. Exercise 5: Gene finding for spliced genes (malaria). You will need to start a new Artemis session as before, using the files malaria sequence, malaria annotation, malaria glimmer and malaria phat in the current directory. The sequence you are going to look at is a small region of contrived sequence (24 kb) taken from Plasmodium falciparum. The file malaria annotation contains two annotated CDSs with multiple exons. See if you agree with them; one has only been partially characteris
25. Module 1: Artemis Prokaryotic. Once you have found this region, have a look at some of the information that is available to you. Information to view: Annotation. If you click on a particular feature you can view the annotation attached to it: select a CDS feature (or any other feature), click on the Edit menu and select Edit Selected Feature. A window will appear containing all the annotation that is associated with that CDS. The format for this information is constrained by that which can be submitted to the EMBL database, as seen in Module 1. Viewing amino acid or protein sequence: Click on the View menu and you will see various options for viewing the bases or amino acids of the feature you have selected, in two formats, i.e. EMBL or FASTA. This can be very useful when using other programs that are not integrated into Artemis, e.g. those available on the Web that require you to cut and paste sequence into them. Plots/Graphs: Feature plots can be displayed by selecting a CDS feature, then clicking View and Show Feature Plots. The window which appears shows plots predicting hydrophobicity, hydrophilicity and coiled-coil regions for the protein product of the selected CDS. Load additional files: The results from Prosite searches run on the translation of each CDS should already be on display as pale green boxes on the grey DNA lines. The results from the Pfam protein motif searches are not shown
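For pasting a sequence into a web tool, FASTA is usually the simpler of the two formats mentioned on this slide: one header line starting with '>' followed by the sequence wrapped to a fixed width. A minimal Python sketch; the function name and the 60-character line width are assumptions for illustration, and Artemis's own output may be laid out differently.

```python
def to_fasta(name, sequence, width=60):
    """Return a FASTA record: a '>' header line, then the sequence wrapped at width."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"
```

Calling `to_fasta("my_cds", protein_string)` on an amino acid string copied from the View menu gives text that most web-based search forms accept directly.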
26. [Figure: Artemis feature list entry for a CDS similar to an Escherichia coli protein, with FASTA scores.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 3: Artemis Advanced. To see how well you have done, turn the spi7.tab entry back on and have a look at the genes located at either side of your selection. Go to and look at the CDS samA. In reality this gene was disrupted by the insertion of this bacteriophage. If you look at the FASTA results for this CDS you may be able to track the bases between which this phage inserted. Your final task is to write out these files in EMBL format and create a merged annotation and sequence file in EMBL format. Click File, then Save An Entry As. [Figure: Artemis File menu (Read An Entry, Read An Entry From Database, Read Entry Into, Save Default Entry, Save An Entry, Save An Entry As, Save All Entries, Clone This Window) with the Save An Entry As submenu listing EMBL Format, GENBANK Format, GFF Format and EMBL Submission Format.]
27. as shown above. A small window will appear asking you whether you are sure you want to merge these features. Another window will then ask you if you want to delete old features. If you click yes, the CDS features you have just merged will disappear, leaving the single merged CDS. If you select no, all three CDS features (the two CDSs that you started with plus the merged feature) will be retained. [Figure: Options menu of the Artemis start-up window (Re-read Options, Enable Direct Editing, Eukaryotic Mode, Highlight Active Entry, Black Belt Mode, Show Log Window, Hide Log Window). Callout: Click here to enable direct editing.] You may have noticed, after you performed the merge function, that one of the exons has jumped into another reading frame. Artemis automatically splices the CDS, so if the exon boundaries include an additional partial codon then any following exon will be pushed into another reading frame to account for this. To correct this you can edit the exon boundaries directly by turning on manual editing in the Options menu of the Artemis start-up window, as shown above. This will allow you to edit the start and end positions of the feature boxes using the left mouse button. Click and hold down the cursor over the first or last base of any feature and then drag the mouse. The feature box should move as you drag it (see below). This can be a little tricky so plea
28. called TB_orpheus.tab and TB_glimmer.tab. You can show all of the evidence (ORFS_100, orpheus and glimmer) on separate lines by right clicking on the frame lines and selecting One Line Per Entry from the menu that appears (see below). Compare the different predictions, using the plot information. Remove any genes that you think are not real from the ORFS_100 entry you created earlier: left click on them and press delete. You can be conservative at this stage. [Figure: Artemis window with the entries TB.fasta, ORFS_100, TB_orpheus.tab and TB_glimmer.tab loaded, forward and reverse Codon Usage Scores plots, and the Orpheus and Glimmer predictions on separate lines. The feature viewer pop-up menu is open with One Line Per Entry selected.]
29. [Figure: Artemis window zoomed in on the start of the region, with the feature list below. Listed features include: tRNA 620 to 683 (complement), possible truncated tRNA-Phe; misc_feature 620 to 134181 (complement), the major Vi antigen pathogenicity island SPI-7; misc_feature 2022 to 2445, PS00017 ATP/GTP-binding site motif A (P-loop); and CDSs annotated as similar to the bacteriophage P1 Ban helicase, weakly similar to a Yersinia pestis orf, doubtful CDS, or no significant database hits.] We are going to look at this region in more detail and attempt to define the limits of the bacteriophage that lies within it. Luckily for us, all the phage-related genes within this region have been
30. [Figure: Windows Explorer listing of the unpacked BLAST directory, showing the program files (for example megablast) and the README files.] Included in the directory that has now been unpacked are several README files that describe the various programs in the BLAST software package. These files also provide descriptions of the command line options that you can set when you run the programs. To read these files, double click on the icon or view them in Notepad. The README.BLS file contains details of the main BLAST program and how to format DNA sequences prior to running BLAST. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Appendices
31. integrated bacteriophage. Have a look at the CDSs within this region. As before, notice any stable RNAs that may have acted as the phage integration site. Artemis Exercise 1, Part IV. Continuing on from the analysis of Region 3, or SPI-7 (the major Vi antigen pathogenicity island), we are going to extract this region from the whole genome sequence and perform some more detailed analysis on it. We will aim to write and save new EMBL format files which will include just the annotations and DNA for this region. [Figure: Artemis window for S_typhi.dna with the Edit menu open (Edit Selected Features, Edit Subsequence and Features, Edit Header Of Default Entry, Change Qualifiers Of Selected, Remove Qualifier Of Selected, Duplicate Selected Features, Merge Selected Features, Unmerge Selected Feature, Delete Selected Features).]
32. line are now denoted no name. They represent the same information, in the same order, as the original Artemis window but simply have no assigned name. Because the subsequence is now viewed in a new Artemis session, this prevents the original files (i.e. S_typhi.dna and S_typhi.tab) from being overwritten. We will now save them as new files to avoid confusion. Click on the File menu, then Save An Entry As, and then New File. Another menu will ask you to choose one of the entries listed. At this point they will both be called no name. Left click on the top entry in the list. A window will appear asking you to give this file a name. Save this file as spi7.dna. Do the same again for the other unnamed entry and save it as spi7.tab. [Figure: Artemis File menu with Save An Entry As selected and its submenu (New File, EMBL Format, GENBANK Format, GFF Format, EMBL Submission Format).]
33. little time, so be patient. To view the graphs, click on the Graph menu to see all those available. Perhaps some of the most useful plots are the GC Content (1), GC Deviation (2) and Karlin Signature (3) plots, as shown below. To adjust the smoothing of a graph, change the window size over which the points on the graph are calculated, using the sliders shown below. If you are not familiar with any of these, please ask. [Figure: Artemis window showing the three DNA plots (1 GC Content, 2 GC Deviation, 3 Karlin Signature). Callout: Sliders for smoothing the DNA plots by changing the window size.]
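To see why a larger window gives a smoother plot, the calculation behind the GC Content and GC Deviation graphs can be sketched in a few lines of Python. This is a minimal illustration and not the code Artemis itself uses: the function name windowed_gc is hypothetical, windows are non-overlapping unless a step is given, and GC deviation is assumed to be (G - C) / (G + C).

```python
def windowed_gc(seq, window, step=None):
    """Return (start, gc_percent, gc_deviation) for each window of the sequence.

    start is the 1-based position of the first base in the window.
    gc_deviation is assumed here to be (G - C) / (G + C).
    """
    seq = seq.upper()
    step = step or window  # default: non-overlapping windows
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_percent = 100.0 * (g + c) / window
        # Avoid dividing by zero when a window contains no G or C at all
        gc_deviation = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start + 1, gc_percent, gc_deviation))
    return results


if __name__ == "__main__":
    for start, gc, dev in windowed_gc("GGGCAATTGGCCATAT", 4):
        print(start, gc, dev)
```

Each point on the plot summarises one window, so increasing the window averages over more bases per point and flattens local fluctuations, which is roughly what moving the slider does.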
34. on the grey entry line of the Artemis window shown above. Your Artemis window should now look similar to the one shown below. [Figure: Artemis window with DNA plots. Callouts: Graph scaling menu (Plot Options: Scaling, Maximum Window size); Slider for zooming out.] Click with the left mouse button in a graph window. A line and a number will appear. The number is the relative position within the genome (bps). Click and drag to highlight a region on the main DNA line. Notice that the boundaries of this region sho
35. -p blastn -m 8 -d pHCM1.dna -i R27.dna -o pHCM1_vs_R27 and press Return. tblastx could be substituted here if a translated DNA versus translated DNA comparison was required. In this command: blastall is the Blast program; -p designates the flavour of Blast (blastn in this instance, a DNA-DNA comparison); -m 8 designates the ACT-readable output; -d designates the database sequence (pHCM1.dna); -i designates the query sequence (R27.dna); -o designates the output file (pHCM1_vs_R27). The pHCM1_vs_R27 comparison file can now be read into ACT along with the pHCM1.embl and R27.embl (or pHCM1.dna and R27.dna) sequence files. [Figure: ACT window showing the pHCM1 versus R27 comparison.]
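When many comparison files are needed, the formatdb and blastall command lines used in this module can be assembled by a small script instead of being typed each time. The sketch below only builds and prints the commands; the helper names formatdb_command and blastall_command are hypothetical, and it assumes the legacy NCBI BLAST programs described in this module are the ones installed.

```python
import shlex


def formatdb_command(database, is_protein=False):
    """Build the formatdb command: -i is the input sequence, -p is T (protein) or F (DNA)."""
    return ["formatdb", "-i", database, "-p", "T" if is_protein else "F"]


def blastall_command(database, query, output, program="blastn"):
    """Build the blastall command that writes ACT-readable (-m 8) output."""
    return [
        "blastall",
        "-p", program,   # flavour of Blast, e.g. blastn or tblastx
        "-m", "8",       # one-line-per-match output that ACT can read
        "-d", database,  # database sequence
        "-i", query,     # query sequence
        "-o", output,    # output (comparison) file
    ]


if __name__ == "__main__":
    commands = [
        formatdb_command("pHCM1.dna"),
        blastall_command("pHCM1.dna", "R27.dna", "pHCM1_vs_R27"),
    ]
    for command in commands:
        print(shlex.join(command))
```

Keeping each command as a list of arguments means it could be passed unchanged to subprocess.run on a machine where the programs are installed, without any shell quoting problems.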
36. subsequence.dna. The sequence you are going to look at is a 17 kb region of T. annulata genomic DNA. Add to the sequence a graph of G+C content as before and open up the following files one by one: T ann PHAT gene model, which contains PHAT gene predictions for this region; T ann genefinder gene model, which contains genefinder gene predictions for this region; and T ann blastsearch SWALL, which contains the blastx results against the SWALL database for this region. [Figure: Artemis window for T_ann_subsequence.dna with the PHAT and genefinder gene models loaded and a GC Content plot (window size 120). Callouts: Add the G+C plot to the window as you did before; use this slider bar to adjust the window size to smooth the G+C plots.]
37. such as SRS (http://srs.ebi.ac.uk) it is possible to specify that the sequences are in FASTA format. Make sure you are in the Module_7 directory. You should now see both the new FASTA files for the pHCM1 and R27 sequences in the Module_7 directory, as well as their respective EMBL format files. Hint: you can use the pwd command to check the present working directory, the cd command to change directories, and the ls command to list the contents of the present working directory. You will treat pHCM1.dna as the database sequence and R27.dna as the query sequence. At the Command Prompt type: formatdb -i pHCM1.dna -p F and press Return. Here formatdb is the database format program; -i designates the input sequence (pHCM1.dna); -p designates the sequence type (DNA is F, protein would be T). Now you can run the Blast on the two plasmid sequences. The program that you are going to use is blastall. In addition to the standard command line inputs we have to add an additional flag, -m 8, to the command line so that the Blast output can be read by ACT. This specifies that the output of Blast is in one-line-per-entry format (see Appendix II). At the Command Prompt type: blastall
38. that a new entry will appear on the entry line called ORFS_100. This will contain all the ORFs you have just created. Turn it on and off to check. Then go to Select and All, then Edit and Trim Selected Features To Any; this will give you all the possible open reading frames with a bacterial start codon. [Figure: Artemis window for TB.fasta with the Create menu open (Create Feature From Base Range, Mark Open Reading Frames, Mark Empty ORFs, Mark ORFs In Range, Mark From Pattern). Callout: Go to Mark Open Reading Frames.]
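The two steps described here, marking open reading frames and then trimming them to a start codon, can be pictured with a small sketch. It is an illustration only and does not reproduce how Artemis works internally: the function names are hypothetical, only the forward strand is scanned, the start codons ATG, GTG and TTG are an assumption about what counts as a bacterial start, and the default of 100 codons is assumed from the entry name ORFS_100.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons


def mark_orfs(seq, min_codons=100):
    """Return (start, end) spans of stop-free stretches in the three forward frames.

    Spans are 0-based and half-open, and exclude the stop codon itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                if (pos - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, pos))
                orf_start = pos + 3
        # Stretch that runs to the end of the sequence without meeting a stop
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - orf_start) // 3 >= min_codons:
            orfs.append((orf_start, end))
    return orfs


def trim_to_start(seq, orf):
    """Trim an ORF to its first in-frame start codon; return None if it has none."""
    seq = seq.upper()
    start, end = orf
    for pos in range(start, end, 3):
        if seq[pos:pos + 3] in STARTS:
            return (pos, end)
    return None


if __name__ == "__main__":
    example = "CCCATGAAACCCTAAGG"
    for orf in mark_orfs(example, min_codons=2):
        print(orf, trim_to_start(example, orf))
```

Marking first and trimming second mirrors the order of the menu steps: the stop-to-stop stretch is the longest the gene could possibly be, and trimming to a start codon gives the longest version that could actually be translated.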
39. the exercise directory: 1. Yersinia_2004.embl, the Yersinia DNA with gene models. As mentioned in the introduction to this Module, for larger scale analysis we cannot use the cut and paste approach and need to install and run these search programs locally. The output can then be converted directly into a format that Artemis can read. There are additional files in the current directory which contain this type of search, so have a look after you have had a bash at cut and paste. Pre-run search files: Yersinia_2004.embl, the Yersinia DNA with gene models; SignalP_2004.tab, the output of a SignalP search (signal sequences); Prosite_2004.tab, the output of a Prosite search (protein motifs); TMHMM_2004.tab, the output of a TMHMM search (membrane domains); PF_2004.tab, the output of a Pfam search (protein motifs); Yersinia.cod, the Yersinia codon usage table; Blastx_2004.tab, the output of a blastX search of Yersinia against SWALL. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 6: Comparative Genomics. Module 6: Comparative Genomics. Introduction. The Artemis Comparison Tool (ACT), also written by Kim Rutherford, was designed to extract the additional information that can only be gained by comparing the growing number of genomes from closely related organisms. ACT is based on Artemis, so you will already be familiar with many of its core functions. ACT is essentially composed of three layers or windows. The t
40. view panel. 4. The slider allows you to filter the regions of similarity based on the length of sequence over which the similarity occurs, sometimes described as the footprint. [Figure: ACT window comparing St.dna with EcK12.dna, with the Comparison View pop-up menu open (View Selected Matches, Flip Subject Sequence, Flip Query Sequence, Lock Sequences, Unlock Sequences, Set Score Cutoffs, Set Percent ID Cutoffs) and the Score Cutoffs window showing Minimum Cutoff and Maximum Cutoff sliders. Callouts: Right button click in the Comparison View panel; Select either Set Score Cutoffs or Set Percent ID Cutoffs; Move the sliders to manipulate the comparison view image.]
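The footprint slider and the score and percent ID cutoffs all act as filters over the list of matches between the two sequences. The sketch below shows that idea on a list of dictionaries. It is not ACT's own code: the dictionary keys and function names are hypothetical, only minimum cutoffs are modelled, and the second match in the example is made up for illustration.

```python
def footprint(match):
    """Length in bases of the match on the query sequence."""
    return abs(match["query_end"] - match["query_start"]) + 1


def filter_matches(matches, min_score=0.0, min_percent_id=0.0, min_footprint=0):
    """Keep only the matches that pass all three minimum cutoffs."""
    return [
        m for m in matches
        if m["score"] >= min_score
        and m["percent_id"] >= min_percent_id
        and footprint(m) >= min_footprint
    ]


if __name__ == "__main__":
    matches = [
        {"score": 773.0, "percent_id": 94.0, "query_start": 2837, "query_end": 3841},
        # Hypothetical short, weak match added for illustration
        {"score": 48.0, "percent_id": 81.5, "query_start": 100, "query_end": 160},
    ]
    print(filter_matches(matches, min_footprint=500))
    print(filter_matches(matches, min_percent_id=80.0))
```

Raising any one cutoff removes the short or weak matches first, which is why the comparison view becomes less cluttered as the sliders are moved.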
41. [Figure: Artemis window with a selected gene prediction and the FASTA search result for it: query Tbrucei glimmer seq 00407, 442 aa, undefined product, 153060 to 154385 forward, searched against the SWALL library with FASTA 3.36 (BL50 matrix). Callout: view the search results for genes that you think are real.]
42. [Figure: Artemis Edit menu (Trim Selected Features To Next Any, Extend to Previous Stop Codon, Extend to Next Stop Codon, Fix Stop Codons, Delete Selected Bases, Add Bases At Selection, Add Bases From File).] Select both the original gene model and the new CDS feature which is to be merged with it to form a new exon. Tip: to select more than one feature of any type you must hold the shift key down. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 2: Artemis Eukaryotic. The new CDS feature can then be merged with the original gene model
43. [Figure: whole-genome Artemis view with DNA plots. Callouts: First region to investigate; Second region to investigate.] There are many examples where these anomalous regions of DNA within a genome have been shown to carry laterally acquired DNA. In this part of the exercise we are going to look at several of these regions in more detail. Starting with the whole genome view, note down the approximate positions and characteristics of the three regions shown above. Remember, the locations of the peaks are given in the graph window if you click with the left mouse but
44. [Figure: Artemis feature list, including CDSs annotated as orthologues of E. coli thrC (THRC_ECOLI) and YAAA_ECOLI, a CDS similar to a Bacillus subtilis amino acid carrier protein, and the Prosite misc_features PS00165 and PS00873.] Drop-down menus. There's lots in there, so don't worry about them right now. Shows what entries are currently loaded (bottom line) and gives details regarding the feature selected in the window below, in this case gene STY0003 (top line). This is the main sequence view panel. The central 2 grey lines represent the forward (top) and reverse (bottom) DNA strands. Above and below those are the 3 forward and 3 reverse reading frames. Stop codons are marked as black vertical bars. Genes and other features (e.g. Pfam and Prosite matches) are displayed as coloured boxes. We will refer to genes as coding sequences or CDSs from now on. This panel has a similar layout to the main panel but is zoomed in to show nucleotides and amino acids. Double click on a gene in the main view to see the zoomed view of the start of that gene. Note that both this and the main panel can be scrolled left and right (7 below) and zoomed in and out (6 below). This panel lists the
45. [Figure: Artemis window showing the annotated features around positions 4270500 to 4296500 of the S. typhi genome.] Use one of the methods you have already used to take you to the second region of interest that you noted down. Region two acts as a cautionary note when looking at anomalous regions within a genome. Have a look at the CDSs within this region. Does this region have any of the characteristics of a pathogenicity island? Are the genes within this region essential
46. 52 430 157 179 175 214 Coppel and Black 1998. In Malaria: Parasite Biology, Pathogenesis and Protection, I. W. Sherman (ed.), ASM Press, Washington DC, pp 185-202. Appendix X: Downloading and installing BLAST on a Windows PC. The following pages describe downloading BLAST onto a computer running Windows XP. Downloading onto computers with other versions of Windows should be essentially the same, but the windows will look different to the screen shots used here. [Figure: web browser showing the NCBI FTP site page (http://www.ncbi.nlm.nih.gov/Ftp/index.html), which lists the major resources available by ftp from ftp.ncbi.nih.gov, including BLAST (Basic Local Alignment Search Tool) for stand-alone sequence comparison software. Callout: The NCBI ftp site.]
47. [Figure: Region 2. Artemis window for S_typhi.dna with S_typhi.tab loaded, showing the GC Content, GC Deviation ((G-C)/(G+C)) and Karlin Signature Difference plots above the annotated features of the region.]
48. 6823 7890 sequence1.dna 7211 8276 AF140550.seq; 773 94.00 2837 3841 sequence1.dna 2338 3342 AF140550.seq. The columns have the following meanings (in order): score, percent identity, match start in the query sequence, match end in the query sequence, query sequence name, subject sequence start, subject sequence end, subject sequence name. The columns should be separated by single spaces. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Appendices. Appendix III: Feature Keys and Qualifiers, a brief explanation of what they are and a sample of the ones we use. 1. Feature Keys. They describe features with DNA coordinates and, once marked, they all appear in the Artemis main window. The ones we use are: CDS (marks the extent of the coding sequence); RBS (ribosomal binding site); misc_feature (miscellaneous feature in the DNA); rRNA (ribosomal RNA); repeat_region; repeat_unit; stem_loop; tRNA (transfer RNA). 2. Qualifiers. They describe features with protein coordinates. Once marked they appear in the lower part of the Artemis window. They describe the gene whose coordinates appear in the location part of the editing window. The ones we commonly use for annotation at the Sanger Institute are: Class (classification scheme we use in house, developed from Monica Riley's MultiFun assignments; see Appendix VI); Colour (also used in house in order to differentiate betwe
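The eight-column comparison file format described earlier on this page (score, percent identity, query start, query end, query name, subject start, subject end, subject name) is simple enough to read or check with a short script. The following is a minimal sketch under that description only; the class name Match and the function parse_comparison_line are hypothetical, and the parser accepts any run of whitespace although the format itself asks for single spaces.

```python
from typing import NamedTuple


class Match(NamedTuple):
    score: float
    percent_identity: float
    query_start: int
    query_end: int
    query_name: str
    subject_start: int
    subject_end: int
    subject_name: str


def parse_comparison_line(line: str) -> Match:
    """Parse one line of the eight-column comparison format into a Match."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}: {line!r}")
    return Match(
        float(fields[0]),
        float(fields[1]),
        int(fields[2]),
        int(fields[3]),
        fields[4],
        int(fields[5]),
        int(fields[6]),
        fields[7],
    )


if __name__ == "__main__":
    example = "773 94.00 2837 3841 sequence1.dna 2338 3342 AF140550.seq"
    print(parse_comparison_line(example))
```

Rejecting lines that do not have exactly eight columns is a cheap way to catch a truncated or hand-edited comparison file before it is loaded into ACT.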
49. [Figure: FASTA search results listing the best-scoring SWALL hits, which include exosome complex exonucleases, mitotic control protein dis3 homologues and ribonuclease II family proteins.] [Figure: Artemis window for T_ann_subsequence.dna.] For each gene, add a product description, a gene name if you know it, and a note if there is anything unusual that you wish to record. You may also wish to add a colour to help navigate aro
50. [Figure: zoomed-in Artemis sequence view.] Now follow the numbers to load up the annotation file for the Salmonella typhi chromosome. [Figure: Artemis window for S_typhi.dna with the File menu open and the file selection dialog (Select an EMBL file; Enter path or folder name; Enter file name).] What's an Entry? It's a file of DNA and/or amino acid features which can be overlaid onto the sequence information displayed in the main Artemis view panel. 3. The basics of Artemis. Now you have an Artemis window open, let's look at what's in there. [Figure: Artemis window for S_typhi.dna with feature STY0003 selected (930 bases, 309 amino acids).]
51. ACT using the megablast comparison reveals a high level of synteny (conserved gene order). This is perhaps not surprising, as both genomes belong to strains of the same species. Using the results of comparisons like these, it is possible to identify genomic differences that may contribute to the biology of the bacteria and also to investigate mechanisms of evolution. Both N315 and MW2 are MRSA; however, N315 is associated with disease in hospitals, whereas MW2 causes disease in the community and is more invasive. Scroll rightward in both genomes to find the first large region of difference. Examine the annotation for the genes in these regions. What are the encoded functions associated with these regions? What significance does this have for the evolution of methicillin resistance in these two S. aureus strains from clinically distinct origins?
References
Abbott J.C. et al. (2005) WebACT: an online companion for the Artemis Comparison Tool. Bioinformatics 21(18): 3665-3666.
Carver et al. (2005).
Hacker J., Blum-Oehler G., Mühldorfer I. and Tschäpe H. (1997) Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution. Mol Microbiol 23: 1089-97.
Berriman M. and Rutherford K. (2003) Viewing and annotating sequence data with Artemis. Brief Bioinform 4(2): 124-132.
52. [Screenshot: Artemis sequence view with six-frame translation and CDS features.] Module 4 Gene Prediction. Select the One Line Per Entry view as you did in previous exercises. Then select all CDS features from the Select drop-down menu and, with all CDS features selected, choose Run fasta on selected features under the Run drop-down menu. Similarly, run blastp against SWALL on all of the selected CDS features. The fasta and blastp searches will take some time to run, and Artemis will report when they are completed. Create a new blank entry to store your own annotation and save your entry as my_annotation2.tab. Now check the fasta and blastp results for each CDS feature and decide on the correct gene model based on the results of your searches. You may need to combine the results from both the automated gene prediction algorithms, such as PHAT and genefinder, in this particular cas
53. Notice how several of the plots show a marked deviation around the region you are currently looking at. To fully appreciate how anomalous this region is, move the genome view by scrolling to the left and right of this region. The apparently unusual nucleotide content of this region is indicative of laterally acquired DNA that has inserted into the genome. Module 3 Artemis Advanced. As well as looking at the characteristics of small regions of the genome, it is possible to zoom out and look at the characteristics of the genome as a whole. To view the entire genome, use the sliders indicated below. However, be careful when zooming out quickly with all the features displayed, as this may temporarily lock up the computer. To make this process faster and clearer, switch off stop codons by clicking with the right mouse button in the main view panel. A menu will appear with an option to de-select stop codons (see below). If you have any problems, ask a demonstrator. [Screenshot: Artemis window with the S. typhi entries loaded and a GC Content plot; a callout reads "To de-select the annotation click here".]
54. [Screenshot: Artemis sequence view with CDS features and a codon usage plot.] Module 4 Gene Prediction. Introduction: Automated Gene Prediction. Gene-finding software (Glimmer and Orpheus) that has been trained for M. tuberculosis has been pre-run for you. To see the gene-finding predictions: • Go to the File menu and choose Read an entry. • Read in both files
55. [Screenshot: Artemis feature list of CDS features in Pfal_chr13.embl between 789034 and 861052, mostly hypothetical and conserved hypothetical proteins, including a putative erythrocyte binding protein Ebl-1 (815823..824387) and a putative cyclophilin (860389..861052, complement).] Example location: 789034..793351 in Pfal_chr13.embl. Module 6 Comparative Genomics. Exercise 2 Part V: Curation of gene models in P. knowlesi. We are now going to edit the gene model for P. knowlesi. Use the File menu from the ACT window displaying P. falciparum and P. knowlesi to select the entry Pknowlesi_contig.embl and select Edit In Artemis
56. [Screenshot: Artemis DNA view with a callout reading "Click and drag the last/first base of exon".] To change a gene model: • In the start-up window, make sure you have Enable direct editing selected. • In the DNA window, click the base at the beginning or end of an exon and drag it to where you want it to go. • If you want to merge two exons, select them, go to the Edit menu and select Merge selected features. • You may want to refer back to Artemis Module 2 Exercise 2 Part I for help.
57. For every gene prediction, check that the splice boundaries are correctly predicted following the GT-AG rule, and also check that every gene you predict in your my_annotation2.tab file starts at a start codon and ends at a stop codon. [Screenshot: Artemis Entry Edit window for the T_ann_subsequence DNA file with the PHAT and genefinder gene model entries loaded, a GC Content plot (window size 120) and the Phat and Gene finder predictions; a callout reads "view the search result for genes that you have selected".]
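The checks described above (GT-AG splice boundaries, start codon, stop codon) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of Artemis: the function name `check_gene_model` is hypothetical, exons are assumed to be 1-based inclusive coordinates on the forward strand, and ATG is assumed to be the only start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(genome, exons):
    """Return a list of problems found in a forward-strand gene model.

    genome: DNA sequence as a string.
    exons:  list of (start, end) tuples, 1-based inclusive, sorted by start.
    """
    problems = []
    # Spliced coding sequence: concatenate the exon sequences.
    cds = "".join(genome[start - 1:end] for start, end in exons).upper()

    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")

    # Each intron lies between the end of one exon and the start of the next.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = genome[end1:start2 - 1].upper()
        if not intron.startswith("GT"):
            problems.append(f"intron after base {end1} does not start with GT")
        if not intron.endswith("AG"):
            problems.append(f"intron before base {start2} does not end with AG")
    return problems


# Toy sequence: exon 3..8, intron 9..16 (GT...AG), exon 17..22.
toy = "CC" + "ATGAAA" + "GTCCCCAG" + "GGGTAA" + "TT"
print(check_gene_model(toy, [(3, 8), (17, 22)]))
```

An empty list means the model passed all four checks. Genes on the reverse strand would need the sequence reverse-complemented first, which this sketch leaves out.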
58. If you go to the Graph menu you will see that Artemis can display many different graphs of different DNA properties. Try as many as you can and decide which ones may be useful for predicting genes. A description of each of these can be found at www.sanger.ac.uk/software/artemis/v4/manual. [Screenshot: Artemis Graph menu, listing Hide All Graphs, Add Usage Plots, Add User Plot, GC Content (%), GC Content (%) With 2.5 SD Cutoff, AG Content (%), GC Frame Plot, Reverse GC Frame Plot, Reverse Correlation Scores, GC Deviation (G-C)/(G+C), AT Deviation (A-T)/(A+T) and Karlin Signature Difference.] Codon usage: Most organisms have a preferred set of codons for protein-coding DNA. These are represented by codon usage tables (see below) and can be used to predict which regions of a genome are coding, using a codon usage plot. To do this, go to Graph and Add usage plots
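To make two of the graphs named in the Graph menu concrete, here is a minimal Python sketch of a sliding-window GC content (%) and GC deviation, (G-C)/(G+C). The function names, the window and the step size are assumptions made for illustration; Artemis computes and smooths its own plots internally and may differ in detail.

```python
def sliding_windows(seq, window, step):
    """Yield (0-based start, subsequence) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def gc_content(window_seq):
    """Percentage of G and C bases in the window."""
    s = window_seq.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def gc_deviation(window_seq):
    """(G - C) / (G + C); defined here as 0.0 when the window has no G or C."""
    s = window_seq.upper()
    g, c = s.count("G"), s.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


# Example: window of 8 bases moved 4 bases at a time over a toy sequence.
toy = "ATATATATGGGCGCGGATATTTAA"
for start, sub in sliding_windows(toy, window=8, step=4):
    print(start + 1, round(gc_content(sub), 1), round(gc_deviation(sub), 2))
```

A region whose values sit well away from those of its neighbours is the kind of deviation the plots are meant to reveal.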
59. [Screenshot: Artemis window with the "Select a file to save" dialog open over the feature list of the extracted region, which begins with a possible truncated tRNA-Phe and the major Vi antigen pathogenicity island feature.] This will create two files, one with the sequence and the other with the annotation, in the directory within which you started Artemis. To create a complete EMBL file, use the UNIX commands you covered earlier and cat the files together. Module 4 Gene Prediction. Introduction: There are many automated gene prediction programs commonly used for both manual and automated annotation protocols. Most of these progr
60. [Screenshot: Artemis view of marked open reading frames and CDS features with the six-frame translation.] Module 4 Gene Prediction. Exercise 2: Using DNA content plots. The G+C content in each reading frame of a piece of DNA is often different in coding regions compared to non-coding regions, due to the limitations imposed by codon usage. You can visualise the codon usage for a DNA sequence in Artemis. Go to Graph
61. [Screenshot: Artemis window for pHCM1.embl with the Write menu open, showing the options Amino Acids Of Selected Features, PIR Database Of Selected Features, Bases Of Selection, Upstream Bases Of Selected Features, Downstream Bases Of Selected Features, Write All Bases (Raw, FASTA, EMBL or Genbank format) and Write Codon Usage of Selected Features. The feature list below shows the first pHCM1 source, CDS and RBS features.]
62. Module 4 Gene Prediction. To compare CDSs with others currently in the public databases, run a fasta search. Left-click the CDS, click on the Run menu and then Run fasta on selected features. When the search is finished, a banner will appear saying "fasta process completed" (see above). The search may take a couple of minutes to run. To view the search results, click View, then Search Results, then fasta results. The results will appear in a scrollable window. You could also view these results in your Netscape browser window, as in the previous exercise. How does your predicted gene model for this CDS compare with proteins pulled out of the public databases? Is it possible that there are additional exons not featured in the current model? If you think that there are additional exons that should have been included in the gene model, you should add them to it. Using G+C content and the results from your database search as guides, roughly draw in where you think the additional exon(s) lie. To create additional exons: select the region you think represents the exon by holding down the left mouse button and dragging the cursor over the region of interest. Then click the Create menu and select Create feature from base
63. Module 4 Gene Prediction. [Screenshot: Artemis Entry Edit window with the T. brucei sequence, its Glimmer predictions and my_annotation.tab loaded, plus a GC Content plot. Callouts read "a new entry can be added using the create menu" and "this entry contains work in progress that is saved in my_annotation.tab".] Glimmer is designed for prokaryotic gene prediction, so you will need to check that each gene starts with a methionine codon. If it does not, trim it to the nearest methionine. This can be done easily from the Edit menu.
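The trimming step can be illustrated with a short Python sketch. It assumes that "nearest methionine" means the first in-frame ATG downstream of the predicted start, that coordinates are 1-based inclusive, and that the gene is on the forward strand; the function name `trim_to_first_met` is hypothetical and is not an Artemis API.

```python
def trim_to_first_met(genome, start, end):
    """Return the position of the first in-frame ATG within start..end.

    Coordinates are 1-based inclusive on the forward strand.
    Returns None if the reading frame contains no ATG.
    """
    # Step through the frame one codon at a time; the last codon starts at end - 2.
    for pos in range(start, end - 1, 3):
        if genome[pos - 1:pos + 2].upper() == "ATG":
            return pos
    return None


# Toy prediction 4..18 that starts on TTG rather than ATG.
toy = "CCC" + "TTG" + "AAA" + "ATG" + "GGG" + "TAA"
new_start = trim_to_first_met(toy, 4, 18)
print(new_start)
```

If the function returns None, the prediction has no in-frame methionine at all and probably deserves a closer look rather than a trim.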
64. Select both the original gene model and the new CDS feature which is to be merged with it to form a new exon. Tip: to select more than one feature of any type, you must hold the Shift key down. Module 4 Gene Prediction. The new CDS feature can then be merged with the original gene model, as shown above. A small window will appear asking whether you are sure you want to merge these features. Another window will then ask whether you want to delete the old features. If you click yes, the CDS features you have just merged will disappear, leaving the single merged CDS. If you select no, all three CDS features (the two CDSs that you started with plus the merged feature) will be retained. [Screenshot: Artemis start-up window with the Options menu open, showing Enable Direct Editing, Eukaryotic Mode, Highlight Active Entry, Black Belt Mode, Show Log Window and Hide Log Window; a callout reads "Click here to enable direct editing".]
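As a rough picture of what merging does to a feature's coordinates, the sketch below combines the exon ranges of two features and writes the result as an EMBL-style join(...) location. It is an illustration under simple assumptions (forward strand, non-overlapping exons); `merge_exons` and `embl_location` are hypothetical helper names, not Artemis functions.

```python
def merge_exons(*features):
    """Combine the exon ranges of several features into one sorted list.

    Each feature is a list of (start, end) tuples, 1-based inclusive.
    Raises ValueError if any two exons overlap.
    """
    exons = sorted(exon for feature in features for exon in feature)
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 <= end1:
            raise ValueError(f"exons overlap around base {start2}")
    return exons


def embl_location(exons):
    """Format exon ranges as an EMBL feature location string."""
    parts = [f"{start}..{end}" for start, end in exons]
    return parts[0] if len(parts) == 1 else "join(" + ",".join(parts) + ")"


# Original two-exon gene model plus a newly created single-exon CDS.
original = [(100, 200), (300, 400)]
new_cds = [(450, 500)]
print(embl_location(merge_exons(original, new_cds)))
```

Answering "no" to deleting the old features corresponds to keeping `original` and `new_cds` alongside the merged list.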
65. [Screenshot: Artemis window on the extracted region; a callout reads "Note the bases have been renumbered from the first base you selected".] Feature list shown in the screenshot ("c" marks the complementary strand; descriptions are cut off where marked "…"):
tRNA 620..683 c possible truncated tRNA-Phe
misc_feature 620..134181 c The major Vi antigen pathogenicity island SPI-7
CDS 761..1795 Weakly similar to the C-terminus of several polysaccharide bi…
CDS 1792..3156 Similar to Bacteriophage P1 Ban helicase …
misc_feature 2422..2445 PS00017 ATP/GTP-binding site motif A (P-loop)
CDS 3149..4948 no significant database hits
CDS 5117..5422 Doubtful CDS
CDS 5550..6131 no significant database hits
CDS 6216..6773 Weakly similar to Yersinia pestis orf 77 …
CDS 7018..8361 no significant database hits
Module 3 Artemis Advanced. Note that the two entries on the grey Entry
66. [Screenshot: zoomed-out Artemis view of the S. typhi genome around the region of interest, showing gene labels and the six-frame translation.] Go to region 3 as before. Like region 1, this region is also referred to as a Salmonella pathogenicity island (SPI). SPI-7, or the major Vi pathogenicity island, is 134 kb in length and contains 30 kb of
67. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005: Index
[first entry illegible]
Module 1 Artemis Prokaryotic: Exercise 1
Module 2 Artemis Eukaryotic: Exercise 1
Module 3 Artemis Advanced: Exercise 1
Module 4 Gene Prediction: Exercises 1 to 7
Module 5 Small Scale Annotation: Exercises 1 and 2
Module 6 Comparative Genomics: Exercises 1 to 4
Module 7 Generating ACT comparison files using BLAST: Exercises 1 and 2
References
Appendices
Module 1 Artemis Prokaryotic. Introduction: Artemis (Rutherford et al., 2000) is a DNA viewer program written by Kim Rutherford and used for both prokaryotic and eukaryotic annotation. It allows the user to get away from the relatively faceless EMBL and Genbank style database files and view the sequence in a graphical and highly interactive fo
68. [Screenshot: ACT view of Pknowlesi_contig.embl with a content plot and automated gene predictions; callouts label the automated gene prediction for gene phat4 and a hypothetical alternative model.] Can you curate the Phat4 gene model and suggest any alternative splicing pattern, such as the red model? Example location: 15618..20618 in Pknowlesi_contig.embl. Module 6 Comparative Genomics. Exercise 3. Introduction: Having familiarised yourselves with the basics of ACT, we are now going to use it to look at a region of synteny between T. brucei and Leishmania. Aim: By looking at a comparison of the annotated sequences of T. brucei and L. major, you will be able to analyse in detail
69. o not open or save this file. [Screenshots: the Windows File Download dialog lists the file as blast-20040725-ia32-win32.exe, type Application, from ftp.ncbi.nih.gov, warns that this type of file could harm your computer if it contains malicious code, and asks whether to open the file or save it to your computer; the Save As dialog then shows the file being saved on Local Disk (C:).] Appendices. Once downloaded, view the contents of the blast directory by clicking on the open folder button. blast-2.2.6-ia32-win32.exe is a compressed file that contains a host of other files. [Screenshot: Windows Explorer view of the blast directory.]
70. ams use different algorithms, data sets and criteria for gene calling. Consequently, if you ran all of these different gene prediction programs on the same piece of DNA, they would all come up with different solutions, sometimes markedly different, describing the coding capacity of that section of DNA. The importance of this should not be underestimated when you consider that many of these automatically assigned genes may find their way into the public databases and subsequently influence experimental design. Aims: The aim of this module is to compare the results generated by several gene prediction programs. We will also use several other metrics with which to validate the output of these programs, and finally generate a gene model for a given region of DNA. We will cover both prokaryotic and eukaryotic worked examples. Module 4 Gene Prediction. Gene Identification. Exercise 1: Finding the open reading frames. This exercise is designed to introduce the different methods used to identify genes in genomic sequence. To start, open the file TB.fasta (Mycobacterium tuberculosis genome) in Artemis. A quick method to identify all possible genes is to identify all possible open reading frames. To do this, select Create and then Mark open reading frames (see below). You can choose a minimum size of open reading frame that you want to create. Try typing 100. Notice
71. When manually editing your exons, you should look out for appropriate splice donor and acceptor sites. See below for a small list, and Appendix IX for details of known acceptor and donor motifs for malaria splice sites. Once you are happy with your newly created exon, re-run the fasta search and see how it compares with the other hits in the public databases. If there are more exons to mark up, try to complete the gene model. The three example CDSs to analyse were selected because they have very good database hits. This obviously makes the task of building the gene model far easier. However, several of the other CDSs in this region have no significant database hits. If you have time, you may want to have a look at these too. Module 3 Artemis Advanced. Introduction: This module builds on the prokaryotic exercise we completed in Module 1. As in Module 1, you will be looking at the Salmonella typhi genome sequence. Salmonella typhi is the causative agent of typhoid fever. It has been known for some time that S. typhi has evolved into a potent pathogen by acquiring large regions of DNA from other bacteria by a process called lateral gene transfer. Many of these laterally acquired DNA regions encode genes that are important for virulence, and consequently some of these regions have been called Salmonella
72. [Screenshot: the EBI bacterial genomes page (www.ebi.ac.uk/genomes/bacteria.html) in a Mozilla browser, listing complete genomes such as Agrobacterium tumefaciens str. C58 (circular and linear chromosomes), Bacillus cereus ATCC 14579, Bacillus halodurans and Bordetella bronchiseptica with their lengths and accession numbers.] [Screenshot: Save As dialog looking in the Module_7 folder, which contains pHCM1.dna, pHCM1.dna.nhr, pHCM1.dna.nin, pHCM1.dna.nsq, pHCM1.embl, pHCM1_vs_pR27, pR27.dna and pR27.embl; the file name entered is N315.embl.]
73. [Screenshot: FTP directory listing of the NCBI BLAST 20040725 release, with blast, netblast and wwwblast archives for Linux, IRIX, Mac OS X, Solaris, FreeBSD, Tru64 and Windows, plus ChangeLog.txt, MD5SUM.txt and ReleaseNotes.txt.] blast-20040725-ia32-win32.exe is the blast .exe file for Windows. Appendices. You now need to save the blast-20040725-ia32-win32.exe file in a new directory, blast, on the hard drive of your PC. [Screenshot: Windows File Download dialog.] File Download: Some files can harm your computer. If the file information below looks suspicious, or you do not fully trust the source, d
74. cise. Module 6 Comparative Genomics. Exercise 2 Part I: Plasmodium falciparum and Plasmodium knowlesi Genome Comparison. Introduction: The parasite P. falciparum is responsible for hundreds of millions of cases of malaria and causes over a million deaths every year. Treatment and control have become difficult with the spread of drug-resistant malaria strains across the endemic countries of the world, and there has been a major emphasis on research as part of the search for new drugs and vaccine candidates to fight malaria. The analysis of the whole genome of P. falciparum has been completed and made publicly available by the Malaria Genome Sequencing Consortium. Several animal models of malaria have also been used by researchers to study aspects of malaria biology and host-parasite interactions. Sequences representing partial genomes of some of these model malaria parasites are now also available. This allows us to perform comparative analysis of the genomes of malaria parasites and to understand the basic biology of their parasitism, based on the similarities and dissimilarities between the parasites at the DNA and predicted protein level. Aim: You will be looking at the comparison between a genomic DNA fragment of the primate malaria parasite P. knowlesi and the previously annotated chromosome 13 of P. falciparum. By comparing the two genomic fragments you will be able to
75. [Screenshot: Feature Selector results window (window 6) listing the selected CDS features, several of them similar to bacteriophage P2 baseplate and tail proteins and to Serratia marcescens proteins.] The genes listed in window 6 are only those fitting your selection criterion. They can be copied or moved into a new entry so that we can view them in isolation from the rest of the information within spi7.tab. Firstly, in window 6, select all of the CDSs shown by clicking on the Select menu and then selecting All. All the features listed in window 6 should now be highlighted. To copy them to another entry (file), click Edit, then Move Selected Features To, then no name. Close the two smaller Feature Selector windows and return to the SPI-7 Artemis window. You could rename the "no name" entry as you did before. Temporarily remove the features contained in the spi7.tab file by left-clicking on the entry button on the grey entry line. Only the phage genes should remain. Module 3: Artemis Advanced. Additional methods of selecting/extracting features using the Feature Selector: It is worth noting that the f
76. e to reach a consensus gene model, the one you think is the most likely gene model for a particular gene, and copy the model to your own entry, my_annotation2.tab, adding your annotation, such as the gene product and a specific colour based on the colour scheme mentioned later in the exercise. You will need to use both the 6 reading frames and the One Line Per Entry views as and when required, and also check the blastx hits (if any) for a given gene prediction from the file T_ann_blast_search_SWALL. You can copy whichever genes you believe are real from T_ann_PHAT.gene_model and T_ann_genefinder.gene_model to your own annotation file. [Screenshot: Artemis Entry Edit window for T_ann_subsequence.dna with the Select menu open (All, None, By Key, CDS Features, Same Key, Open Reading Frame, Features Overlapping Selection, Base Range, Feature AA Range, Toggle Selection) and the Phat, Gene finder and CRUNCH predictions displayed. Callout: a new entry can be added using the Create menu.]
77. e Sequencing Consortium has been submitting human draft sequence data to the International Nucleotide Sequence Databases (DDBJ/EMBL/GenBank). High-throughput human sequences have been made available to the public immediately via the EMBL Database high-throughput genome division (HTG), while finished sequences have been included in the Human division (HUM). EBI's Genome Monitoring Table (Genome MOT) provides unfinished and finished human genome data sorted by chromosome. Additionally, the Genome MOT presents the status of a number of large eukaryotic genome sequencing projects on the World Wide Web. The tables are updated daily and also provide access to EMBL database entries. Genome Annotation and Proteome Analysis: The Ensembl Genome Browser provides the best possible automatic annotation, graphical view and web-searchable datasets for a number of eukaryotic genomes, including human, mouse, Drosophila, Anopheles and zebrafish, with others to follow. Information on a large number of organisms is available from the Swiss-Prot Genomes Pages. [Screenshot: Mozilla browser window showing the EBI Genomes plasmid page.]
78. e and the other the query sequence. Therefore one of the sequences has to be formatted so that Blast recognises it as a database sequence. This can be done, as before, using formatdb. Module 7: ACT comparison files. We will treat N315.dna as the database sequence and MW2.dna as the query sequence.
    formatdb -i N315.dna -p F
Now we can run megablast on the two MRSA genome sequences. The default output format is one line per entry, which ACT can read, so there is no need to add an additional flag to the command line (see Appendix II).
    megablast -d N315.dna -i MW2.dna -o N315_vs_MW2
The N315_vs_MW2 comparison file can now be read into ACT along with the N315.embl and MW2.embl (or N315.dna and MW2.dna) sequence files. [Screenshot: ACT window, N315.embl vs MW2.embl.] A comparison of the N315 and MW2 genomes in
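The formatdb and megablast steps described above can also be assembled from a script. The following Python sketch only builds and prints the two command lines; it does not run BLAST. The flags are the ones shown in the exercise. The file names with restored dots, the output name N315_vs_MW2 and the helper function names are assumptions made for illustration.

```python
def formatdb_command(database_fasta):
    """Build the formatdb command for a nucleotide FASTA file."""
    # -i names the input FASTA file; -p F marks it as nucleotide, not protein.
    return ["formatdb", "-i", database_fasta, "-p", "F"]


def megablast_command(database_fasta, query_fasta, output_file):
    """Build the megablast command comparing a query against a formatted database."""
    # -d is the formatted database, -i the query sequence, -o the comparison file.
    return ["megablast", "-d", database_fasta, "-i", query_fasta, "-o", output_file]


if __name__ == "__main__":
    # File names are assumed; adjust them to the files written out from Artemis.
    commands = [
        formatdb_command("N315.dna"),
        megablast_command("N315.dna", "MW2.dna", "N315_vs_MW2"),
    ]
    for command in commands:
        print(" ".join(command))
```

Keeping each command as a list of arguments means it could later be handed to Python's subprocess module unchanged, if BLAST is installed locally.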
79. e annotation Yersinia; Module 6 Comparative Genomics; Module 7 Generating ACT Comparison files. Appendix V: Useful Web addresses.
Major Public Sequence Repositories
DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp
EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk/embl.html
Genomes at the EBI: http://www.ebi.ac.uk/genomes
GenBank: http://www.ncbi.nlm.nih.gov
Microbial Genome Databases/Resources
Sanger Microbial Genomes: http://www.sanger.ac.uk/Projects/Microbes
TIGR Microbial Database: http://www.tigr.org/tdb/mdb/mdbcomplete.html
Institut Pasteur GenoList databases (including SubtiList, Colibri, TubercuList, Leproma, PyloriGene, MypuList, ListiList, CandidaDB): http://genolist.pasteur.fr
Pseudomonas Genome Database: http://www.pseudomonas.com
Clusters of Orthologous Groups of proteins (COGs): http://www.ncbi.nlm.nih.gov/COG
ScoDB II, S. coelicolor database: http www j11016 jic bbsrc ac uk S coelicolor
Protein Motif Databases
Prosite: http://www.expasy.ch/prosite
Pfam: http://www.sanger.ac.uk/Software/Pfam/index.shtml
BLOCKS: http://blocks.fhcrc.org
InterPro: http://www.ebi.ac.uk/inter
PRINTS; SMART
Protein feature prediction tools
TMHMM (prediction of transmembrane helices in proteins); SignalP Prediction Server; PSORT protein prediction
Metabolic Pathways and Cellular Regulation
EcoCyc; ENZYME
80. eature selector can be used in many other ways to select and extract subsets of features from the genome. If you have a closer look at the Feature Selector you will also see that you can use search terms to select a class of features, or all those features with a particular amino acid motif. [Screenshot: Feature Selector window, with Select by Key (CDS) and Qualifier (note) options, a space for a search term or amino acid motif, and the Ignore Case, Allow Partial Match, Forward Strand Features and Reverse Strand Features options.] Defining the extent of the prophage: Even from this very cursory analysis it is clear from the selection that the prophage occupies a fairly discrete region within SPI-7 (see below). It is often useful to create a DNA feature to define the limits of this type of genome landmark. To do this, use the left mouse button to click and drag over the region that you think defines the prophage. Click on the Create menu and select Create feature from base range. A feature edit window will appear. The default key value given by Artemis when creating a new feature is CDS. With this key the newly created feature would automatically be put on the translation line. However, if we change it to misc_feature (an option in the Key menu in the top left-hand corner of the edit window), Artemis will place this feature on the DNA line. This is perhaps more appropriate and is easier to visuali
81. ed and may in fact be missing exons. You will also see predicted CDSs from the algorithms Glimmer and Phat. Make your own gene models based on the predictions in the tab file called malaria annotation. Use the strong G+C bias of malaria to guide your decisions: G+C content is a very good indicator of coding capacity in malaria. On average the coding regions are 23% G+C and the non-coding regions are 19%. Have a look at the G+C content for this region by selecting the appropriate graph. Left-click within the graph window and then select exons by clicking on them, to see how they relate to the G+C peaks on the graph. [Screenshot: Artemis window with the entries malaria sequence, malaria annotation, malaria glimmer and malaria phat loaded, showing the G+C content graph above the gene models and DNA sequence.]
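To make the idea of a G+C content graph concrete, here is a minimal Python sketch that computes percent G+C in sliding windows along a sequence. It is an illustration only and is not the algorithm Artemis itself uses; the function names, the window size and the step size are assumptions.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


def gc_windows(sequence, window=120, step=1):
    """Return (start, percent G+C) for each full window along the sequence.

    Start positions are 0-based. A larger window gives a smoother plot.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(sequence) - window
    return [
        (start, gc_percent(sequence[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
    toy = "ATATATATAT" + "GCGCATGCGC"
    for start, percent in gc_windows(toy, window=10, step=5):
        print(start, round(percent, 1))
```

Peaks in the resulting values would correspond to candidate coding regions in an AT-rich genome, which is the reasoning the exercise asks you to apply by eye.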
82. en different types of genes and other features.
> Gene: This qualifier either gives the gene a name or a systematic gene number.
> Label: Allows you to label a gene/feature in the main view panel.
> Note: This qualifier allows for the inclusion of free text. This could be a description of the evidence supporting the functional prediction, or other notable features/information which cannot be described using other qualifiers.
> Partial: When a region in the DNA hits a protein in the database but lacks start and/or stop codons, and the match does not include the whole length of the protein, it can be considered as a partial gene.
> Product: The assigned possible function for the protein goes here.
> Pseudo: Matches in different frames to consecutive segments of the same protein in the databases can be linked or joined as one and edited in one window. They are marked as pseudogenes. They are normally not functional and are considered to have been mutated.
The keys and qualifiers accepted by EMBL in sequence annotation submission files are listed at the following web page: http://www3.ebi.ac.uk/Services/WebFeat
Appendix IV: Schematic of workshop files and directories. Key: directories and subdirectories. Home directory (position at login): Module_1 2 Artemis; Module 3 Artemis; Module 4 Gene Prediction; Aspergillus; Module 5 Small scal
83. erine; 3.1.04 Aspartate; 3.1.05 Chorismate; 3.1.06 Cysteine; 3.1.07 Glutamate; 3.1.11 Isoleucine; 3.1.12 Leucine; 3.1.13 Lysine; 3.1.14 Methionine; 3.1.18 Threonine; 3.1.19 Tryptophan; 3.1.20 Tyrosine; 3.1.21 Valine.
3.2.0 Biosynthesis of cofactors, carriers: 3.2.01 Acyl carrier protein (ACP); 3.2.02 Biotin; 3.2.03 Cobalamin; 3.2.04 Enterochelin; 3.2.05 Folic acid; 3.2.06 Heme, porphyrin; 3.2.07 Lipoate; 3.2.08 Menaquinone, ubiquinone; 3.2.09 Molybdopterin; 3.2.10 Pantothenate; 3.2.11 Pyridine nucleotide; 3.2.12 Pyridoxine; 3.2.13 Riboflavin; 3.2.14 Thiamin; 3.2.15 Thioredoxin, glutaredoxin, glutathione; 3.2.16 Biotin carboxyl carrier protein (BCCP).
Appendix VI (cont.)
3.3.0 Central intermediary metabolism: 3.3.01 2'-Deoxyribonucleotide metabolism; 3.3.02 Amino sugars; 3.3.03 Entner-Doudoroff; 3.3.04 Gluconeogenesis; 3.3.05 Glyoxylate bypass; 3.3.06 Incorporation metal ions; 3.3.07 Misc. glucose metabolism; 3.3.08 Misc. glycerol metabolism; 3.3.09 Non-oxidative branch, pentose pathway; 3.3.10 Nucleotide hydrolysis; 3.3.00 Other.
3.4.0 Degradation of small molecules: 3.4.1 Amines; 3.4.2 Amino acids; 3.4.3 Carbon compounds.
3.5.0 Energy metabolism, carbon: 3.5.1 Aerobic respiration; 3.5.2 Anaerobic respiration; 3.5.3 Electron transport; 3.5.4 Fermentation.
3.6.0 Fatty acid biosynthesis.
3.3.0 Central intermediary metabolism (second column): 3.3.11 Nucleotide interconversions; 3.3.12 Oligosaccharides; 3.3.13 Phosphorus compounds; 3.3.14 Polyamine biosy
84. [Screenshot: Mozilla browser window showing the Genomes Pages at the EBI.] At the EBI: Access to Completed Genomes. Please note that the genomes pages have been recently changed. The old pages are still available if you still need to check them. The first completed genomes from viruses, phages and organelles were deposited into the EMBL Database in the early 1990s. Since then, molecular biology's shift to obtain the complete sequences of as many genomes as possible, combined with major developments in sequencing technology, has resulted in hundreds of complete genome sequences being added to the database, including Archaea, Bacteria and Eukaryota. These web pages give access to a large number of complete genomes; help is available to describe the layout. Whole Genome Shotgun Sequences (WGS): Methods using whole genome shotgun data are used to gain a large amount of genome coverage for an organism. WGS data for a number of organisms are made available via EBI's Sequence Retrieval System (SRS) and the EBI FTP server. Human Draft Genome: The completion of the human draft genome sequence was announced and published in February 2001 in Nature and Science. Since the beginning of the Human Genome Project the International Human Genom
85. [Screenshot: Artemis window with the Edit menu open, showing items including Delete Selected Exons, Move Selected Features To, Copy Selected Features To, Trim Selected Features To Met, Trim Selected Features To Any, Trim Selected Features To Next Met (Ctrl-T), Trim Selected Features To Next Any (Ctrl-Y), Extend to Previous Stop Codon (Ctrl-Q), Extend to Next Stop Codon, Fix Stop Codons, Automatically Create Gene Names, Fix Gene Names, Reverse And Complement, Delete Selected Bases, Add Bases at Selection and Add Bases From File.] Note the entry names have changed. [Screenshot: Artemis Entry Edit window with the entry "no name" shown on the entry line.]
86. files at the San Diego SuperComputer [...] and at Indiana [...]. [Screenshot: Internet Explorer at ftp://ftp.ncbi.nih.gov/blast.] This page may appear slightly different if you are using Netscape. [Screenshots: Internet Explorer at ftp://ftp.ncbi.nih.gov/blast/executables, showing the LATEST, WWWBLAST, release, snapshot and special folders, and then at ftp://ftp.ncbi.nih.gov/blast/executables/snapshot/2004-07-25, whose listing includes the blast-20040725 archives for amd64 Linux, axp64 Tru64, ia32 FreeBSD, ia32 Linux and ia32 Solaris 9, and blast-20040725-ia32-win32.exe.]
87. given a colour code number 12 (pink). We are going to use this information to select all the relevant phage genes using the Feature Selector, as shown below, and then to define the limits of the bacteriophage. First we need to create a new entry: click Create, then New Entry. Another entry will appear on the entry line called, you guessed it, no name. We will eventually copy all our phage-related genes into here. Module 3: Artemis Advanced. Click Select, then Feature Selector. [Screenshot: Artemis Select menu (Feature Selector, All, All Bases, None, CDS Features, Same Key, Open Reading Frame, Features Overlapping Selection, Base Range, Feature AA Range) and the Feature Selector window. Callouts: make sure the buttons are down; set Key to CDS and Qualifier to colour; type the search term in the "Containing this text" box.]
88. gram faster. This means that it is possible to generate comparison files for genome sequences in a matter of seconds rather than minutes or hours. There are some drawbacks to using this program. Firstly, only DNA-DNA alignments (BlastN) can be performed using megablast, rather than the translated DNA-DNA alignments (TBlastX) that are possible using blastall. Secondly, as the algorithm used is not as stringent, megablast is suited to comparing sequences with high levels of similarity, such as genomes from the same or very closely related species. In this exercise you are going to download two Staphylococcus aureus genome sequences from the EBI genomes web page and use Artemis to write out the FASTA format DNA sequences for both, as before in Exercise 1. These two FASTA format sequences will then be compared using megablast to identify regions of DNA-DNA similarity and write out an ACT-readable comparison file. The genomes that have been chosen for this comparison are from a hospital-acquired methicillin-resistant S. aureus (MRSA) strain, N315 (BA000018), and a community-acquired MRSA strain, MW2 (BA000033). Module 7: ACT comparison files. Downloading the S. aureus genomic sequences. [Screenshot: Mozilla browser window showing the EBI Genomes Pages (Bacteria).]
89. Appendices. Appendix I: Artemis minimum hardware and software requirements. Artemis and ACT will, in general, work well on any standard modern machine and with most common operating systems. They are currently used on many different varieties of UNIX and Linux systems, as well as Apple Macintosh and Microsoft Windows systems. Note that the ability to run external programs such as BLAST and FASTA from within Artemis and ACT is available only on UNIX and Linux systems. Minimum memory requirements for people working on whole genomes are approximately 128 megabytes for Artemis and 128 megabytes per genome for ACT. Analysis of cosmid-sized sequences can comfortably be achieved with less memory. Appendix II: ACT comparison files. ACT supports three different comparison file formats:
1. BLAST version 2.2.2 output: The blastall command must be run with the -m 8 flag, which generates one line of information per HSP.
2. MEGABLAST output: ACT can also read the output of MEGABLAST, which is part of the NCBI blast distribution.
3. MSPcrunch output: MSPcrunch is a program for UNIX and GNU/Linux systems which can post-process BLAST version 1 output into an easier to read format. ACT can only read MSPcrunch output with the -d flag.
Here is an example of an ACT-readable comparison file generated by MSPcrunch -d:
1399 97.00 940 2539 sequence1.dna 1 1596 AF140550.seq
1033 93.00 9041 10501 sequence1.dna 9420 10880 AF140550.seq
828 95.00
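The MSPcrunch -d example above has eight whitespace-separated columns per line. The Python sketch below parses one such line into named fields. The column meanings (score, percent identity, query start, query end, query name, subject start, subject end, subject name) are an assumption based on the ACT documentation rather than on this appendix, the decimal points in the example values are restored by assumption, and the names Match and parse_crunch_line are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    """One line of an MSPcrunch -d style comparison file (assumed column order)."""
    score: float
    percent_id: float
    query_start: int
    query_end: int
    query_name: str
    subject_start: int
    subject_end: int
    subject_name: str


def parse_crunch_line(line):
    """Parse one whitespace-separated comparison line into a Match."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    return Match(
        score=float(fields[0]),
        percent_id=float(fields[1]),
        query_start=int(fields[2]),
        query_end=int(fields[3]),
        query_name=fields[4],
        subject_start=int(fields[5]),
        subject_end=int(fields[6]),
        subject_name=fields[7],
    )


if __name__ == "__main__":
    example = "1399 97.00 940 2539 sequence1.dna 1 1596 AF140550.seq"
    print(parse_crunch_line(example))
```

A truncated line, such as the last one in the example, raises ValueError instead of being silently misread.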
90. [Screenshot: Artemis window showing a region of the genome with its six-frame translation and DNA sequence. The feature list below it includes a misc_feature at 2188349..2199512 (complement; base composition 37.8% G+C), a possible RBS at 2190771..2190775, and a series of CDS features between 2188394 and 2193059, most of unknown function, several containing a possible or probable N-terminal signal sequence, and one similar to a Neisseria meningitidis hypothetical protein.]
91. ilarity is broken up. Zoom in and look at some of the genes encoded within these regions. What are the predicted products of the genes assigned to these locations? View the details by clicking on the appropriate CDS feature and then selecting Edit selected feature from the Edit menu. Can you identify any genes in one organism that don't appear to be predicted in the other? If so, add these to your annotation. [Screenshot: ACT window, Tbrucei.dna vs Leish.dna, with a selected match shown in the comparison view.] Module 6: Comparative Genomics. Exercise 4. Introduction: The quinic acid gene cluster (the qut cluster) is present among many filamentous fungi, including Aspergillus fumigatus, Neurospora crassa, Aspergillus nidulans and Podospora anserina. Although these fungi belong to the same fungal taxonomic family (Ascomycetes), they vary greatly in their biological characteristics. In this exercise you will be studying and comparing the organisatio
92. [Screenshot: Artemis feature list for pHCM1, including a possible RBS, IS1 repeat units and the CDS HCM1.04c (insB, possible IS1 transposase), with the output file dialog open. Callouts: save the DNA sequence in the Module 7 directory; save as pHCM1.dna.] Also do this for R27.embl. Module 7: ACT comparison files. Running Blast: There are several programs in the Blast package that can be used for generating sequence comparison files. For a detailed description of the uses and options, see the appropriate README file in the Blast software directory (see Appendix X). In order to generate comparison files that can be read into ACT you can use the Blastall program, running either the BlastN (DNA-DNA comparison) or TBlastX (translated DNA-translated DNA comparison) protocol. As an example you will run a BlastN comparison on two relatively small sequences, the pHCM1 and R27 plasmids from S. typhi. In principle any DNA sequences in FASTA format can be used, although size becomes an issue when dealing with sequences such as whole genomes of several Mb (see Exercise 2 in this module). When obtaining nucleotide sequences from databases such as EMBL using a server
93. Module 4: Gene Prediction. Exercise 6: Gene finding in the kinetoplastid parasite Trypanosoma brucei. You will need to start a new Artemis session and open the file called Tbrucei.dna. The sequence you are going to look at is a large region (242 kb) from chromosome 9 of T. brucei. Add to the sequence a graph of G+C content, as before, and open up the file Tbrucei.glimmer.tab, which contains the Glimmer predictions for this region. What can you already see about the sequence that will help you decide which genes are real? [Screenshot: Artemis Entry Edit window for Tbrucei.dna with the Tbrucei.glimmer.tab predictions and the GC Content graph displayed (window size 400). Callout: use this slider bar to adjust the window size to smooth the G+C plots.]
94. is run in the DOS Command Prompt window, whereas in Linux it is run from an Xterminal window. Aims: The aim of this module is to demonstrate how you can generate your own comparison files for ACT from a stand-alone version of the Blast software. In this module you will use Blast to generate comparison files for sequences that you have downloaded from the EBI genomes web resource. A copy of the Blast software has been installed locally. You will run Blast from the command line, using two different programs from the NCBI Blast distribution, to generate ACT-readable comparison files for two small sequences (plasmids) and for two large sequences (whole genomes). Exercise 1: In this exercise you are going to download two plasmid sequences in EMBL format from the EBI genomes web page. You are then going to use Artemis to write out the DNA sequences of both plasmids in FASTA format. These two FASTA format sequences will then be compared using BlastN to identify regions of DNA-DNA similarity and write out an ACT-readable comparison file. The plasmids chosen for this comparison are the multiple drug resistance incH1 plasmid pHCM1, from the sequenced strain of Salmonella typhi, CT18, originally isolated in 1993, and R27, another incH1 plasmid, first isolated from S. typhi in the 1960s. Module 7: ACT comparison files. Downloading the S. typhi plasmid sequences. [Screenshot: Mozilla browser window showing the Genomes Pages at the EBI.]
95. [Screenshot: ACT main window with numbered callouts, showing two sequence view panels with the comparison view between them, and the comparison view pop-up menu (View Selected Matches, Flip Subject Sequence, Flip Query Sequence, Lock Sequences, Unlock Sequences, Set Score Cutoffs, Set Percent ID Cutoffs, Offer To RevComp).] Drop-down menus: These are mostly the same as in Artemis. The major difference you'll find is that after clicking on a menu header you will then need to select a DNA sequence before going to the full drop-down menu. Sequence view panel for sequence file 1 (the Subject Sequence you selected earlier): It is a slightly compressed version of the Artemis main view panel. The panel retains the sliders for scrolling along the genome and for zooming in and out. The Comparison View: This panel displays the regions of similarity between the two sequences. Red blocks link similar regions of DNA, with the intensity of the red colour directly proportional to the level of similarity. Double-clicking on a red block will centralise it. Artemis-style Sequence View panel for sequence file 2 (the Query Sequence). Right-button click in the Compariso
96. lciparum gene labelled PFM1010w, shown below. Can you compare the two gene models and identify the conserved exon(s) between the two species? Use the slider on the comparison view panel to include some shorter similarity hits. Can you now identify all the conserved exons of the PFM1010w orthologue in the P. knowlesi contig? For the time being, disregard the misc_feature for Phat4, coloured in red in the Pknowlesi_contig.embl file. Open the GC Content window from the Graph menu for both entries. Can you relate the exon-intron boundaries to GC content for the P. falciparum gene labelled PFM1010w? Is it also applicable to the gene model Phat4 in the P. knowlesi contig? Example regions: Pfal_chr13.embl 789034..793351; Pknowlesi_contig.embl 15618..20618. [Screenshot: ACT window with P. falciparum (Pfal_chr13.embl) above and P. knowlesi (Pknowlesi_contig.embl) below. Caption: Comparison between orthologous genes in P. falciparum and P. knowlesi.]
97. level of zoom to get the whole genomes shown in the same screen, as shown below. [Screenshot: ACT window with both whole genomes in view.] Module 6: Comparative Genomics. Notice that when you scroll along with either slider both genomes move together. This is because they are locked together. Right-click over the middle comparison view panel. A small menu will appear; select Unlock Sequences and then scroll one of the horizontal sliders. Notice that LOCKED has disappeared from the comparison view panel and the genomes will now move independently. [Screenshot: ACT window with the comparison view pop-up menu open (View Selected Matches, Flip Subject Sequence, Flip Query Sequence, Lock Sequences, Unlock Sequences, Set Score Cutoffs, Set Percent ID Cutoffs, Offer To RevComp).] You can optimise your image either by removing low-scoring or low percentage ID hits from view, as shown below (steps 1 to 3), or by using the slider on the comparison
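The effect of the Set Score Cutoffs and Set Percent ID Cutoffs options can be thought of as a simple filter over the list of matches. The Python sketch below illustrates that idea on (score, percent ID) pairs; it is not ACT's own code. Treating a match exactly at the cutoff as still visible is an assumption, and the function name and example values are for illustration only.

```python
def visible_matches(matches, min_score=0.0, min_percent_id=0.0):
    """Keep only the matches at or above both cutoffs.

    Each match is a (score, percent_id) pair. Whether a value exactly
    equal to the cutoff is shown is an assumption of this sketch.
    """
    return [
        (score, percent_id)
        for score, percent_id in matches
        if score >= min_score and percent_id >= min_percent_id
    ]


if __name__ == "__main__":
    # Example (score, percent ID) pairs, chosen for illustration.
    hits = [(1399.0, 97.0), (1033.0, 93.0), (828.0, 95.0)]
    print(visible_matches(hits, min_score=1000))
    print(visible_matches(hits, min_percent_id=95))
```

Raising either cutoff removes weaker matches from the list, which is why the comparison view becomes easier to read.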
98. ll and run these search programs locally. The output can then be converted directly into a format that Artemis can read. There are additional files in the current directory which contain this type of search, so have a look after you have had a bash at cut and paste. Pre-run search files:
Af_2004_blastx_swall.crunch: Blastx comparison file against all proteins in the public database
Af_2004_blastx_nidulans.crunch: Blastx comparison file against all A. nidulans proteins
Af_2004_signalp.tab: the SignalP output file
Af_2004_tmhmm.tab: the TMHMM output file
Af_2004_pfam.tab: the Pfam output file
Af_2004_tigr.tab: the TIGRfam output file
Module 5: Small Scale Annotation. Exercise 2: This exercise is centred on a segment of bacterial DNA taken from the genome sequence of Yersinia sp. X. The file Yersinia_2004.embl contains the sequence and predicted CDSs for this region. All you have to do is open the main DNA file, containing the DNA sequence and the curated gene models, and annotate at least one gene. To extract the sequence, click on the CDS feature you wish to annotate, then click on View and either view bases of selection or view amino acids of selection. Note that you can view the sequence in different formats. Then cut and paste the sequence into the web tools you were introduced to in one of the previous modules. The file you will need is within
99. logs
0.0.1 Conserved in Escherichia coli
0.0.2 Conserved in organism other than Escherichia coli
1.0.0 Cell processes
1.1.1 Chemotaxis and mobility
1.2.1 Chromosome replication
1.3.1 Chaperones
1.4.0 Protection responses
1.4.1 Cell killing
1.4.2 Detoxification
1.4.3 Drug/analog sensitivity
1.4.4 Radiation sensitivity
1.5.0 Transport/binding proteins
1.5.1 Amino acids and amines
1.5.2 Cations
1.5.3 Carbohydrates, organic acids and alcohols
1.5.4 Anions
1.5.5 Other
1.6.0 Adaptation
1.6.1 Adaptations, atypical conditions
1.6.2 Osmotic adaptation
1.6.3 Fe storage
1.7.1 Cell division
2.0.0 Macromolecule metabolism
2.1.0 Macromolecule degradation
2.1.1 Degradation of DNA
2.1.2 Degradation of RNA
2.1.3 Degradation of polysaccharides
2.1.4 Degradation of proteins, peptides, glycoproteins
2.2.0 Macromolecule synthesis, modification
2.2.01 Amino acyl tRNA synthesis; tRNA modification
2.2.02 Basic proteins: synthesis, modification
2.2.03 DNA: replication, repair, restriction/modification
2.2.04 Glycoprotein
2.2.05 Lipopolysaccharide
2.2.06 Lipoprotein
2.2.07 Phospholipids
2.2.08 Polysaccharides (cytoplasmic)
2.2.09 Protein modification
2.2.10 Proteins: translation and modification
2.2.11 RNA synthesis, modif., DNA transcrip.
2.2.12 tRNA
3.0.0 Metabolism of small molecules
3.1.0 Amino acid biosynthesis
3.1.01 Alanine
3.1.02 Arginine
3.1.03 Asparagine
3.1.08 Glutamine
3.1.09 Glycine
3.1.10 Histidine
3.1.15 Phenylalanine
3.1.16 Proline
3.1.17 S
100. Module 6 Comparative Genomics
Exercise 2 Part IV: Gene models for multi-exon genes in P. falciparum
Use the File menu to select the entry Pfal_chr13.embl and select Edit In Artemis to bring up an Artemis window. In the Artemis window, use the Graph menu to switch on the GC Content (%) window. Use the Goto menu to select the Navigator window and, within the Navigator window, select Goto Feature With This Qualifier Value, type PFM1010w and click; then close the dialogue box. Go through the annotated gene model for PFM1010w, look at the exon-intron boundaries, and compare them with the splice site sequences from P. falciparum given in Appendix IX. Also glance through a few other gene models for multi-exon genes and look at the intron sequences as well. Can you find any common pattern in the putative intron sequences? (Hint: look at the complexity of the sequence.)
You can delete one or more exons of any gene by selecting the exon(s) and then choosing Delete Selected Exons from the Edit menu. Similarly, you can add an exon to a particular gene by co-selecting the exon and the gene CDS features and then selecting Merge Selected Features from the Edit menu.
Example regions (Pfal_chr13.embl): 789034 to 793351, 657638 to 660023, 672361 to 673753
[Figure: Artemis window for the selected region]
101. n View panel brings up this important ACT-specific menu, which we will use later.
Module 6 Comparative Genomics
[Figure: ACT window with the Feature Viewer menu opened by a right-button click; entries include Smallest Features In Front, Select Visible Range, Select Visible Features, Set Score Cutoffs, Feature Labels, One Line Per Entry, Forward Frame Lines, Reverse Frame Lines, Start Codons, Stop Codons, All Features On Frame Lines, Show Source Features, Flip Display and Colourise Bases]
Exercise 1
Introduction & Aims
In this first exercise we are going to explore the basic features of ACT. Using the ACT session you have just opened, we are first going to zoom outwards until we can see the en
102. n of qut gene cluster among these 4 fungi using ACT
Aim: By looking at a comparison of the annotated sequences of N. crassa, A. fumigatus and A. nidulans, you will be able first to add annotations to the qut cluster genes in the P. anserina sequence, and second to compare those genes that are found in all 4 organisms, as well as spot the differences and study the synteny.
The files that you are going to need are:
1. N crassa qut embl: sequence & annotation file for N. crassa
2. A fum qut embl: sequence & annotation file for A. fumigatus
3. A nid qut embl: sequence & annotation file for A. nidulans (artificially joined contig)
4. P anserina qut embl: sequence & gene model file for P. anserina (without annotation)
5. A fum N crassa comp: tblastx comparison file of A. fumigatus & N. crassa
6. A fum A nid comp: tblastx comparison file of A. fumigatus & A. nidulans
7. A nid P anserina comp: tblastx comparison file of A. nidulans & P. anserina
8. P anserina N crassa comp: tblastx comparison file of P. anserina & N. crassa
First open an ACT window and then open the annotation files and the appropriate comparison files in the order 1, 5, 2, 6, 3, 7, 4, 8, 1 (the numbers are as designated above). You will need to click on "more files" to load more than 2 sequences and their comparison files. Click on Apply after you have loaded all the files.
103. nalysis these programs must be installed and run locally on your own computer. This has the added advantage of allowing you to feed the input to these programs in batch, i.e. sending off hundreds of CDS proteins in one operation. This also makes it possible to convert the output of these searches into a form that can be read directly into Artemis, examples of which will be included. Unfortunately, local installation of this software falls outside the scope of this workshop. To do this you need to have systems administrator clearance at your home institute and a detailed knowledge of your computer operating environment. However, so as not to dodge this important issue, we can give you details of how to approach doing this, and so I urge you to speak to the demonstrators about this during the course.
Module 5 Small Scale Annotation
Exercises
The choices are:
Own Sequence (please ask a demonstrator)
Exercise 1: Aspergillus (eukaryotic filamentous fungi)
or
Exercise 2: Yersinia (Gram-negative prokaryote)
Figure 1 [Artemis Feature Edit window; callout: "To add more qualifiers look here"]
Qualifiers (product etc.): For each gene, add a product description, a gene name if you know it, and a note if there is anything unusual that you wish to record. You may also wi
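Converting search output into a form that Artemis can read usually means writing the hits as feature-table lines. The sketch below is a minimal illustration under stated assumptions: it takes hits that have already been parsed into dictionaries (the keys `key`, `start`, `end`, `strand` and `note`, and the function name `to_feature_table`, are hypothetical) and writes them in the EMBL feature-table layout, with the feature key starting in column 6 and the location or qualifier starting in column 22. It does not parse the output of any particular search program.

```python
def to_feature_table(hits):
    """Format parsed hits as EMBL-style feature table lines.

    Each hit is a dict with 'key', 'start', 'end' and, optionally,
    'strand' (negative for the reverse strand) and 'note'.
    """
    lines = []
    for hit in hits:
        location = f"{hit['start']}..{hit['end']}"
        if hit.get("strand", 1) < 0:
            location = f"complement({location})"
        # "FT" + 3 spaces, key padded to 16 characters, then the location.
        lines.append(f"FT   {hit['key']:<16}{location}")
        note = hit.get("note")
        if note:
            # Qualifiers are indented so that they start in column 22.
            lines.append("FT" + " " * 19 + f'/note="{note}"')
    return lines


example_hits = [
    {"key": "misc_feature", "start": 343, "end": 369,
     "note": "PS00324 Aspartokinase signature"},
    {"key": "CDS", "start": 100, "end": 400, "strand": -1},
]
table = to_feature_table(example_hits)
```

Writing `"\n".join(table)` to a .tab file gives an entry that could then be read in alongside the sequence, in the same way as the pre-run .tab files used in these exercises.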
104. It may seem that Goto Start of Selection and Goto Feature Start do the same thing. Well, they do if you have a feature selected, but Goto Start of Selection will also work for a region which you have highlighted by click-dragging in the main window. So yes, give it a try.
Module 1 Artemis Prokaryotic
4.2 Navigator
The Navigator panel is fairly intuitive, so open it up and give it a try.
[Figure: Artemis window for S_typhi.dna with the Goto menu open and the Navigator window beside it. Navigator options: Goto Feature With Gene Name, Goto Feature With This Qualifier Value, Goto Feature With This Key, Find Base Pattern, Find Amino Acid String, and a choice between starting the search at the beginning (or end) or at the selection. Callouts: "Click Goto then Navigator" and "Check that the search button is on"]
105. References
Hacker J, Blum-Oehler G, Muhldorfer I and Tschape H (1997) Mol Microbiol 23: 1089-97. Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution.
Berriman M and Rutherford K (2003) Brief Bioinform 4(2): 124-132. Viewing and annotating sequence data with Artemis.
Majoros et al. (2003) Nucleic Acids Research 31(13): 3601-3604. GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders.
Parkhill J (2002) Methods in Microbiology 33: 1-26. Annotation of Microbial Genomes.
Rutherford et al. (2000) Bioinformatics 16(10): 944-945. Artemis: sequence visualization and annotation.
Abbott J C et al. (2005) Bioinformatics 21(18): 3665-3666. WebACT: an online companion for the Artemis Comparison Tool.
Carver T J et al. (2005) Bioinformatics 21(16): 3422-3423. ACT: the Artemis Comparison Tool.
106. [Figure: EBI genomes plasmid page in a Mozilla browser window, with a Save As dialog saving the file as pHCM1.embl in the Module 7 directory]
Module 7 ACT comparison files
In order to run BlastN you require two DNA sequences in FASTA format. The pHCM1 and R27 sequences previously downloaded from the EBI are EMBL format files, i.e. they contain protein coding information and the DNA sequence. To generate the DNA files in FASTA format, Artemis can be used as follows: load the plasmid EMBL files in Artemis (each plasmid requires a separate Artemis window) and select Write, Write All Bases, FASTA format.
[Figure: Artemis window for pHCM1.embl with the Write menu open]
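What Write All Bases does can be pictured as pulling the sequence block out of the EMBL entry and rewriting it under a FASTA header. The sketch below is an illustration of that idea, not Artemis code. It assumes a standard EMBL flat file in which the sequence follows the line starting with SQ and ends at the // terminator; the function name `embl_to_fasta` and the tiny example entry are hypothetical.

```python
def embl_to_fasta(embl_text, name, width=60):
    """Extract the DNA sequence from EMBL flat-file text as a FASTA string."""
    parts = []
    in_sequence = False
    for line in embl_text.splitlines():
        if line.startswith("SQ"):
            in_sequence = True  # sequence lines start after the SQ header
            continue
        if line.startswith("//"):
            break  # end of entry
        if in_sequence:
            # Drop the spaces and the trailing base count on each line.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(parts)
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + rows) + "\n"


# A made-up 24 bp entry, for illustration only.
example_entry = """ID   EXAMPLE; SV 1; linear; genomic DNA; STD; PRO; 24 BP.
FT   CDS             1..24
SQ   Sequence 24 BP; 6 A; 6 C; 6 G; 6 T; 0 other;
     acgtacgtac gtacgtacgt acgt                                              24
//
"""
fasta = embl_to_fasta(example_entry, "example")
```

The annotation lines (FT) are ignored, which matches the point made above: the FASTA file carries only the DNA sequence, not the protein coding information.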
107. non-redundant protein database.
• From this evidence you will be able to remove more genes that are incorrect from your ORFS_100 file.
• At this point you can run FASTA searches of the remaining ORF sequences using the Run menu. Use this evidence to help you predict which genes are real and remove any others. Also remember that bacterial CDSs rarely overlap by more than 3-5 codons.
[Figure: Artemis window showing the loaded entries, including ORFS_100 and the Orpheus, Glimmer and BlastX results, with a codon usage plot]
Check your predictions against the Sanger annotations by reading in the entry TB.tab.
Gene prediction for M. tuberculosis was a relatively simple, although time-consuming, task. Once you have predicted several CDSs for this bacterium, repeat the same steps for M. leprae. All the files that you will need are in the current directory and are named using the same conventions as the M. tuberculosis files, e.g. LEPRAE.fasta and LEPRAE glimmer tab. The exception is the BlastX file LEPRAE v TB blastx, which contains the results of a search of the M. leprae proteins against those of M. tuberculosis. The reason for this is that the M. leprae genome has undergone reductive evolution, leaving many pseudogenes and gene fragments
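The rule of thumb that bacterial CDSs rarely overlap by more than a few codons can be applied mechanically to a list of predicted CDS coordinates. The sketch below is one hedged way to do that: it treats each CDS as a (start, end) pair of base positions, ignores strand, and flags pairs that overlap by more than `max_codons` codons. The function names and the default cutoff of 5 codons are assumptions chosen for illustration.

```python
def overlap_in_codons(cds_a, cds_b):
    """Return the overlap between two (start, end) CDS ranges, in codons."""
    start = max(cds_a[0], cds_b[0])
    end = min(cds_a[1], cds_b[1])
    overlap_bases = max(0, end - start + 1)
    return overlap_bases / 3


def suspicious_overlaps(cds_list, max_codons=5):
    """List pairs of CDSs that overlap by more than max_codons codons."""
    ordered = sorted(cds_list)
    flagged = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] > first[1]:
                break  # sorted by start, so no later CDS can overlap 'first'
            if overlap_in_codons(first, second) > max_codons:
                flagged.append((first, second))
    return flagged


# Hypothetical predictions: the first pair overlaps by 2 codons,
# the second pair by 17 codons.
predicted = [(100, 400), (395, 700), (650, 900)]
flagged = suspicious_overlaps(predicted)
```

A flagged pair is only a prompt to look again at the evidence for both predictions; it does not say which of the two is wrong.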
108. nthesis
3.3.15 Pool, multipurpose conversions of intermed. metabolism
3.3.16 S-adenosyl methionine
3.3.17 Salvage of nucleosides and nucleotides
3.3.18 Sugar-nucleotide biosynthesis, conversions
3.3.19 Sulfur metabolism
3.3.20 Amino acids
3.4.4 Fatty acids
3.4.5 Other
3.4.0 ATP-proton motive force
3.5.5 Glycolysis
3.5.6 Oxidative branch, pentose pathway
3.5.7 Pyruvate dehydrogenase
3.5.8 TCA cycle
3.6.1 Fatty acid and phosphatidic acid biosynthesis
3.7.0 Nucleotide biosynthesis
3.7.1 Purine ribonucleotide biosynthesis
3.7.2 Pyrimidine ribonucleotide biosynthesis
4.0.0 Cell envelope
4.1.0 Periplasmic/exported/lipoproteins
4.1.1 Inner membrane
4.1.2 Murein sacculus, peptidoglycan
4.1.3 Outer membrane constituents
4.1.4 Surface polysaccharides & antigens
4.1.5 Surface structures
4.2.0 Ribosome constituents
4.2.1 Ribosomal and stable RNAs
4.2.2 Ribosomal proteins: synthesis, modification
4.2.3 Ribosomes: maturation and modification
5.0.0 Extrachromosomal
5.1.0 Laterally acquired elements
5.1.1 Colicin-related functions
5.1.2 Phage-related functions and prophages
5.1.3 Plasmid-related functions
5.1.4 Transposon-related functions
6.0.0 Global functions
6.1.1 Global regulatory functions
7.0.0 Not classified (included putative assignments)
7.1.1 DNA sites, no gene product
7.2 Cryptic genes
Appendices
Appendix VII List of colou
109. odule will introduce Eukaryotic sequence analysis using Artemis. This exercise will look at a section of the Malaria genome. Your task is to assess the gene models that we have given you and to decide whether they are acceptable or in need of modification. To do this you will use G+C content to identify possible missing exons, and then run database searches to see whether there are similar CDSs in the public databases. Note that there is not always a perfect answer when creating gene models, and a certain amount of subjectivity can be involved.
Aims
The aim of this Module is for you to become familiar with creating CDS features and merging them to create multi-exon gene models for this region of sequence. You will also find out how to run database searches against a locally installed public sequence database.
Module 2 Artemis Eukaryotic
Exercise 1
This exercise will look at a section of the Malaria genome. You will need to close down the last Artemis exercise if you haven't already done so. Then start a new Artemis session as before, using the file Malaria.embl in the current directory (Module_2 Artemis). Unlike the Salmonella exercise, in this instance the annotation and sequence are contained within the same file, Malaria.embl. The sequence you are going to look at is a small region of contrived sequence (21 kb) taken from Plasmodium falciparum chromosome 13. Yo
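The G+C content plot used to look for missing exons is a windowed percentage along the sequence. The sketch below shows the calculation in its simplest form. It is not the Artemis implementation: it uses non-overlapping windows for brevity, and the function name `gc_content_windows` and the default window of 120 bases are assumptions made for the example.

```python
def gc_content_windows(sequence, window=120, step=120):
    """Return (start_position, percent_GC) for each window along a sequence.

    Positions are 1-based. Windows that would run past the end of the
    sequence are not reported.
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start + 1, 100.0 * gc / window))
    return results


# Artificial sequence: an AT-only stretch followed by a GC-only stretch.
toy_sequence = "AT" * 60 + "GC" * 60
profile = gc_content_windows(toy_sequence)
```

In an AT-rich genome such as P. falciparum, coding regions tend to show up as windows with a higher value than the surrounding non-coding sequence, which is why a peak in the plot with no gene model underneath is worth a closer look.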
110. [Figure: Artemis window for S_typhi.dna with S_typhi.tab, showing the start of the chromosome and the feature list. The list includes CDS features annotated as orthologues of E. coli thrL, thrA and thrB, and misc_feature entries for Prosite signatures: PS00324 (Aspartokinase signature), a Homoserine dehydrogenase signature, and PS00627 (GHMP kinases putative ATP-binding domain)]
111. op and bottom layers are mini Artemis windows (with their inherited functionality) showing the linear representations of the genomes with their associated features. The middle window shows red blocks, which span this middle layer and link conserved regions within the two genomes above and below. Consequently, if you were comparing two identical genome sequences, you would see a solid red block extending over the length of the two sequences in this middle layer. If insertions were present in either of the genomes, they would show up as breaks between the solid red conserved regions. The data used to draw these red blocks and link conserved regions are generated by running pairwise BlastN or tBlastX comparisons of the genomes; details of how this is done are outlined in Appendix II and can be obtained from the ACT user manual (http://www.sanger.ac.uk/Software/ACT/manual).
Aims
The aim of this Module is for you to become familiar with the basic functioning of ACT by using a series of worked examples. Some of these examples will touch on exercises that were used in previous Modules; this is intentional. Hopefully, as well as introducing you to the basics of ACT, this Module will also show you how ACT can be used not only for looking at genome evolution but also to back up or question gene models, and so on.
Module 6 Comparative Genomics
1 Starting up the ACT software
Make sure you're in
112. or dispensable. Is it possible that the atypical base composition of this region is not a consequence of having originated from a foreign host? The base composition may actually reflect the tight sequence constraints under which this region has been maintained, in contrast to the background level of sequence variation in the rest of the genome.
Module 3 Artemis Advanced
Region 3
[Figure: Artemis window for S_typhi.dna with S_typhi.tab, zoomed out over Region 3 and showing its annotated genes and misc_feature entries]
113. [Figure: Artemis Feature Selector and the resulting feature list window. The list contains CDS features, many annotated as similar to Bacteriophage P2 genes, together with entries with no significant database hits. Callouts: "Click to select features containing search term", "Click to view selected features" and "Double click to bring feature into"]
114. [Figure: Save dialog listing the files in the Module 7 directory]
Save the file as N315.embl (save the EMBL sequence in the Module 7 directory). Repeat for the S. aureus MW2 genome (BA000033). Be careful when choosing the genome to download, as there is another S. aureus genome entry, for strain Mu50 (BA000017). Save as MW2.embl.
Module 7 ACT comparison files
Generate DNA files in FASTA format using Artemis for both the genome sequences, as previously done in exercise 1. Hint: in Artemis (each genome requires a separate Artemis window) select Write, Write All Bases, FASTA format. Save the DNA sequences as N315.dna and MW2.dna for the respective genomes.
Running Blast
In the previous exercise you used the blastall program to run BlastN on two plasmid sequences. As the genome sequences are larger (2.8 Mb), you are going to run megablast, another program from the NCBI Blast distribution that can generate comparison files in a format that ACT can read (see Appendix II). For a detailed description of the uses and options in megablast, see the megablast README file in the Blast software directory (Appendix X). Like Blast, megablast requires that one sequence is designated as a database sequenc
115. pathogenicity islands.
Aims
The aim of this Module is to extend your knowledge of Artemis. You will identify regions within the Salmonella genome that may have been acquired by lateral gene transfer, then edit one of these regions as a subsequence and save this information to a newly created file.
Module 3 Artemis Advanced
Artemis Exercise 1
Follow the same procedures for starting Artemis as described in Module 1. All the files you will need (S_typhi.dna and S_typhi.tab) are contained in the directory Module_3_Whole_genome_analysis. By a method of your choice (i.e. using the Navigator, Feature Selector or Goto functions discussed previously), go to the region located between bases 2188349 and 2199512 on the DNA sequence. This region is bordered by the fbaB gene, which codes for fructose bisphosphate aldolase. The region you arrive at should look similar to that shown below. In addition to looking at the annotation for this region, it is also possible to look at the characteristics of the DNA displayed. This can be done by adding to the display various plots showing different characteristics of the DNA. This information is generated dynamically by Artemis, and although this is a relatively speedy exercise for a small region of DNA, on a whole genome view (we will move onto this later) this may take a
116. pro
http://www.bioinf.man.ac.uk/dbbrowser/PRINTS
http://smart.embl-heidelberg.de
http://www.ebi.ac.uk/interpro/index.html
http://www.cbs.dtu.dk/services/TMHMM-2.0
http://www.cbs.dtu.dk/services/SignalP
http://psort.ims.u-tokyo.ac.jp/form.html
http://ecocyc.org
http://www.expasy.ch/enzyme
Kyoto Encyclopedia of Genes and Genomes (KEGG): http://www.genome.ad.jp/kegg
MetaCyc: http://ecocyc.org
Miscellaneous sites
NCBI BLAST website: http://www.ncbi.nlm.nih.gov/BLAST
The tmRNA website: http://www.indiana.edu/~tmrna
tRNAscan-SE Search Server: http://www.genetics.wustl.edu/eddy/tRNAscan-SE
Codon usage database: http://www.kazusa.or.jp/codon
RNAgenie (RNA gene prediction): http://rnagene.lbl.gov
GO (Gene Ontology Consortium): http://www.geneontology.org
Artemis homepage: http://www.sanger.ac.uk/Software/Artemis
ACT homepage: http://www.sanger.ac.uk/Software/ACT
Glimmer: http://www.tigr.org/software/glimmer
Orpheus: http://pedant.gsf.de/orpheus
Appendices
Appendix VI Prokaryotic Protein Classification Scheme used within the PSU
This scheme was adapted for in-house use from Monica Riley's protein classification <http genprotec mbl edu riley lab html>. More classes can be added depending on the microorganism that is being annotated (e.g. secondary metabolites, sigma factors (ECF or non-ECF), etc.).
0.0.0 Unknown function, no known homo
117. r codes
0 (white): Pathogenicity/Adaptation/Chaperones
1 (dark grey): energy metabolism (glycolysis, electron transport etc.)
2 (red): Information transfer (transcription/translation + DNA/RNA modification)
3 (dark green): Surface (IM, OM, secreted, surface structures)
4 (dark blue): Stable RNA
5 (sky blue): Degradation of large molecules
6 (dark pink): Degradation of small molecules
7 (yellow): Central/intermediary/miscellaneous metabolism
8 (light green): Unknown
9 (light blue): Regulators
10 (orange): Conserved hypo
11 (brown): Pseudogenes and partial genes (remnants)
12 (light pink): Phage/IS elements
13 (light grey): Some misc. information, e.g. Prosite, but no function
Appendix VIII List of degenerate nucleotide value/IUB Base Codes
R = A or G
Y = C or T
K = G or T
M = A or C
S = G or C
W = A or T
B = C, G or T
D = A, G or T
H = A, C or T
V = A, C or G
N = A, C, G or T
Appendix IX Splice site information
The splice acceptor and donor sequences for several P. falciparum genes (adapted from
Genes and intron numbers: 41-3 (1 to 8); RhopH3 (1 to 6); RNA pol III (1 to 4); SERA (1 to 3); SERP H (1 to 3); Ag15 (1 to 2); PfGPx (1 to 2); Calmodulin (1); PfPK1 (1); MESA (1); Aldolase (1); KAHRP (1); GBPH2 (1); GBP (1); FIRA (1); GARP (1)
Donor motif
GAA GTACACA CCTTCTTTTTCCATATTTAG CAA
AAT GTTAAAA TTTTTTTTTTTAAACTTAG CCG
GAG GTAAGAA ATTCATTATATATTTATAG GGA
TCG GTATGGA TTTTGAAATACTTCCTCAG TTA
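When comparing a gene model against splice site sequences like these, a quick sanity check is whether each intron starts with GT and ends with AG, and how AT-rich (low complexity) it is. The sketch below applies that check to the four sequence rows above. It assumes, as an interpretation of the flattened table, that the second column of each row is the start of the intron (donor side) and the third column is its end (acceptor side); the function names are hypothetical.

```python
def check_splice_sites(donor, acceptor):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    return donor.upper().startswith("GT") and acceptor.upper().endswith("AG")


def at_fraction(sequence):
    """Fraction of A and T bases in a sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


# (donor side, acceptor side) pairs taken from the rows listed above;
# the column interpretation is an assumption.
rows = [
    ("GTACACA", "CCTTCTTTTTCCATATTTAG"),
    ("GTTAAAA", "TTTTTTTTTTTAAACTTAG"),
    ("GTAAGAA", "ATTCATTATATATTTATAG"),
    ("GTATGGA", "TTTTGAAATACTTCCTCAG"),
]
all_canonical = all(check_splice_sites(donor, acceptor) for donor, acceptor in rows)
acceptor_at = [at_fraction(acceptor) for _, acceptor in rows]
```

The same two functions can be applied to the bases at each exon-intron boundary of a gene model copied out of Artemis, as a first check before looking at the boundary in more detail.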
118. [Figure: Artemis window for the P. falciparum chromosome 13 entry with 3 bases selected on the forward strand (796563 to 796565), the GC Content (%) plot (window size 120) above the feature display, and the sequence view positioned at about base 789000]
119. range. A new blue CDS feature will appear on the appropriate frame line. See below.
[Figure: Artemis window for Malaria.embl with the Edit menu open. Menu entries: Undo (Ctrl U), Edit Selected Features (Ctrl E), Edit Subsequence and Features, Edit Header Of Default Entry, Change Qualifiers Of Selected, Remove Qualifier Of Selected, Duplicate Selected Features (Ctrl D), Merge Selected Features (Ctrl M), Unmerge Selected Feature, Delete Selected Exons, Trim Selected Features To Met, Trim Selected Features To Any, Trim Selected Features To Next Met (Ctrl T), Trim Selected Features To Next Any (Ctrl Y), Extend to Previous Stop Codon, Extend to Next Stop Codon, Delete Selected Bases, Add Bases At Selection and Add Bases From File]
120. [Figure: Artemis window for S_typhi.dna with S_typhi.tab showing three plots, GC Content (%), GC Deviation (G-C)/(G+C) and Karlin Signature Difference, above the region containing SPI-1 (approximately bases 2834000 to 2912000)]
121. rmat. Artemis is designed to present multiple lines of information within a single context. This manifests itself as being able to zoom in to look for fine DNA motifs, as well as being able to zoom out and bring into view operons, several kilobases of a genome, or in fact an entire genome in one screen. It is also possible to perform quite sophisticated analyses and store the output within the Artemis environment to be accessed later.
Aims
The aim of this Module is for you to become familiar with the basic functioning of Artemis by using a series of worked examples. These examples are designed to take you through the most immediately useful functions. However, there will be time and encouragement for you to explore other menus and gain a basic understanding of Artemis. Like all the Modules in this workshop, the key is: if you don't understand, please ask.
Module 1 Artemis Prokaryotic
Artemis Exercise 1 Part I
1 Starting up the Artemis software
Navigate your way into the correct directory for this module. Then type art & and press return. A small start-up window will appear (see below). Now follow the sequence of numbers to load up the Salmonella typhi chromosome sequence. Ask a demonstrator for help if you have any problems.
[Figure: Artemis start-up window (Release 5 beta) with the File menu open, listing Open, Open from EBI and Quit; callout: "Click File then Open"]
122. se. If you also add a qualifier such as label, add text following it and then click OK, that text will be used as the feature label displayed in the main sequence view panel.
[Figure: Artemis window for the S. typhi entry, zoomed out, with feature labels such as STY4599 and STY4628 displayed in the main sequence view panel]
123. se ask. [Screenshot: Artemis window for the entry Malaria.embl with the GC Content graph (window size 120). Selected feature PF13_0119: bases 784, amino acids 261. Callouts: click to select the PF13_0119 exon to edit; click and drag with the cursor here to manually edit.] WHO TDR Bioinformatics Workshop
124. sh to add a colour to help navigate around the sequence. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 5 Small Scale Annotation. Exercise 1. Aspergillus fumigatus is the most common mould pathogen of humans and usually causes both invasive aspergillosis and allergies in immunocompromised patients, and allergic diseases in patients with atopic immune systems. We have provided you with a part of Aspergillus genomic DNA sequence that originated from a pilot project to sequence part of the A. fumigatus genome by construction of a bacterial artificial chromosome (BAC) library and subsequent BAC end sequencing and analysis, done at the Sanger Institute in collaboration with the University of Manchester. All you have to do is open the main DNA file containing the DNA sequence and the curated gene models, and annotate at least one gene. To extract the sequence, click on the CDS feature you wish to annotate, then click on View and choose 'view bases of selection' or 'view amino acids of selection'. Note that you can view the sequence in different formats. The sequence can then be cut and pasted into the Web tools you were introduced to in one of the previous modules. The file you will need is within the exercise directory: 1. Af 2004 genemodels embl: the A. fumigatus DNA with gene models. As mentioned in the introduction to this Module, for larger scale analysis we cannot use the cut and paste approach and need to insta
125. study the degree of conservation of gene order and identify new genes in the P. knowlesi genome. As part of the exercise you will also identify any gross dissimilarity visible between the two genomic fragments, and finally predict or modify the gene model for one multi-exon gene in the P. knowlesi genomic fragment. The files that you are going to need are: Pfal_chr13.embl (annotation file with sequence); Pknowlesi_contig.seq (sequence file without annotation); Pknowlesi_contig.embl (annotation file with sequence); Plasmodium_comp.crunch (tblastx comparison file). [Screenshot: ACT window with the P. falciparum chr 13 fragment (top) and the P. knowlesi contig (bottom), views LOCKED. Caption: Comparison of P. knowlesi contig and the annotated chromosome 13 fragment of P. falciparum.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 6 Comparative Genomics. Exercise 2, Part II. Conservation of gene order (synteny). In the ACT start-up window load up the files Pfal_chr13.embl, Pknowlesi_contig.seq and the comparison file Plasmodium_comp.crunch. Use the slider on either sequence view panel to ob
126. t 2 2 6 1a32 win32 exe [Screenshot: Windows Command Prompt window (Microsoft Windows XP) in which 'cd blast' has been typed at the Documents and Settings prompt.] Now that you are in the blast directory you can start to run BLAST from the command line. [Screenshot: command prompt in the blast directory.] There are several programs in the BLAST package that you have now downloaded that can be used for sequence comparison. For a detailed description of the uses and options see the appropriate README file.
127. tain a global view of the genome comparison. Also use the slider on the comparison view panel to remove the shorter similarity hits. What effect does this have? Can you see conserved gene order between the 2 species? Can you see any region where similarity is broken up? Zoom in and look at some of the genes encoded within this unique region in file Pfal_chr13.embl (top sequence). Example location: Pfal_chr13.embl, 815823..829969. What are the predicted products of the genes assigned to this unique location? View the details by clicking on the appropriate CDS feature and then selecting 'Edit selected feature' from the Edit menu. Can you identify a few putative genes in the P. knowlesi contig based on their conserved and syntenic nature with P. falciparum chromosome 13? Activate or inactivate stop and start codons in an entry using the right-click button on the mouse. This will allow you to see any potential ORFs. Any thoughts about the possible biological relevance of the comparison? [Screenshot: ACT window showing P. falciparum Pfal_chr13.embl (top sequence), views LOCKED, with the callout 'What is the gene product?']
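The effect of the comparison-view slider (hiding the shorter similarity hits) can be imitated outside ACT. The following is a minimal sketch, assuming the comparison has been saved as tab-delimited BLAST output with the usual 12 columns; the names `Hit`, `parse_tabular` and `filter_by_length` are illustrative and are not part of ACT or BLAST.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One similarity hit from tab-delimited BLAST output (12 columns)."""
    query: str
    subject: str
    percent_id: float
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse tabular BLAST lines, skipping blanks, comments and short rows."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) < 12:
            continue
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]),
                        int(f[8]), int(f[9]), float(f[11])))
    return hits


def hit_length(hit: Hit) -> int:
    """Length of the hit on the query sequence, in bases."""
    return abs(hit.q_end - hit.q_start) + 1


def filter_by_length(hits: Iterable[Hit], min_length: int) -> List[Hit]:
    """Keep only hits at least min_length bases long on the query."""
    return [h for h in hits if hit_length(h) >= min_length]
```

Raising `min_length` plays the same role as dragging the slider: short, scattered matches drop out and the long syntenic blocks remain.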
128. that remain intact in the closely related and larger M. tuberculosis genome sequence. Many of these can be seen using the BlastX comparison data. The region you will look at is equivalent to that you have just been looking at in M. tuberculosis. Note: gene prediction in M. leprae is very difficult. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 4 Gene Prediction. Exercise 4: Gene finding for spliced genes. In many eukaryotic organisms the principles covered in the earlier exercises still hold; however, some genes may contain introns, hence gene identification becomes more complicated. For the next exercises you will need to close the previous Artemis session. Start Artemis and load the sequence file Pfal_subseq.embl. Load the Phat gene predictions, pfal_subseq_phat.tab. Find which sequence plot would be most useful for this organism, Plasmodium falciparum. Load the Blastx file swall_blastx.crunch. Using the Fasta searches and information you have loaded, edit the gene models to fit the evidence you have. [Screenshot: Artemis Options menu (Re-read Options, Enable Direct Editing, Eukaryotic Mode, Highlight Active Entry, Black Belt Mode, Show Log Window, Hide Log Window), with a callout to select Options and enable direct editing.]
129. the correct directory. Then type 'act &' and press return. A small start-up window will appear. Now let's load up a S. typhi versus Escherichia coli comparison. The files you will need for this exercise are: S_typhi.dna, S_typhi.dna_vs_EcK12.dna.crunch and EcK12.dna. [Screenshot: ACT Release 2 (beta) start-up window and file chooser. Callouts: 1. click File; 2. then Open; click Choose and select the appropriate files for Sequence file 1 (S_typhi.dna), Comparison file 1 (S_typhi.dna_vs_EcK12.dna.crunch) and Sequence file 2 (EcK12.dna); use 'more files' for comparing more than two genomes; click Apply and wait.] Comparison files end with '.crunch'. For more info on comparison files see Appendix II. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 6 Comparative Genomics. 2. The basics of ACT. You should now have a window like this, so let's see what's there. [Screenshot: ACT main window for the S. typhi versus E. coli K12 comparison.]
130. those genes that are found in both organisms as well as spot the differences. You will also see how ACT can be used to study the different chromosome architecture of these two parasite species. The files that you are going to need are: Tbrucei.dna (T. brucei sequence); Tbrucei.embl (T. brucei annotation); Leish vs Tbrucei tblastx comparison file; Leish.dna (L. major sequence); Leish.embl (L. major annotation). First load up the sequence files for T. brucei and L. major and the comparison file in ACT. [Screenshot: ACT window with Leish.dna loaded, views LOCKED.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 6 Comparative Genomics. [Screenshot: zoomed-out ACT comparison. Callouts: zoom out and switch off stop codons to clarify the view; a selected hit is reported with score 57 and percent id 37; an hour-glass shape indicates an inversion.] Can you see conserved gene order between the 2 species? Can you see any region where sim
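The hour-glass shape appears because an inverted match runs in opposite directions on the two sequences, so the lines joining its ends cross over. A minimal sketch of that test follows, assuming each hit is described by start and end coordinates on both sequences, with start greater than end marking a reverse-strand match; `is_inverted` is an illustrative name, not an ACT function.

```python
def is_inverted(start1: int, end1: int, start2: int, end2: int) -> bool:
    """Return True when a hit runs in opposite directions on the two sequences.

    A hit is treated as inverted if the coordinates increase on one sequence
    and decrease on the other, which is what draws as a crossed (hour-glass)
    band between the two sequence views.
    """
    return (end1 - start1) * (end2 - start2) < 0


# Hypothetical coordinates, for illustration only.
same_direction = is_inverted(1000, 2000, 5000, 6000)
opposite_direction = is_inverted(1000, 2000, 6000, 5000)
```

Counting how many hits in a region return True gives a quick, rough indication of whether a whole block has been inverted or only a single gene.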
131. tire S. typhi genome compared against the entire E. coli K12 genome. As for the Artemis exercises, we should turn off the stop codons to clear the view and speed up the process of zooming out. The only difference between ACT and Artemis when applying changes to the sequence views is that in ACT you must click the right mouse button over the specific sequence that you wish to change, as shown above. Now turn the stop codons off in the other sequence too. Your ACT window should look something like the one below. [Screenshot: ACT window with stop codons switched off. Callout: Use the vertical sliders to zoom out. Drag or click the slider downwards from one of the genomes. The other genome will stay in synch.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 6 Comparative Genomics. [Screenshot: ACT window zoomed out to show both genomes, views LOCKED.] Once zoomed out, your ACT window should look similar to the one shown above. If the genomes in view fall out of view to the right of the screen, use the horizontal sliders to scroll the image and bring the whole sequence into view, as shown below. You may have to play around with the
132. ton within it. Table (columns: Genome location; Characteristics of DNA plots). Region 1: 2,560,000 bps; peak: Karlin; troughs for G+C and GC deviation. Region 2: (to be filled in). Region 3: (to be filled in). We will now zoom back into the genome to look in more detail at the first of these three peaks. Zoom into this position by first clicking on the DNA line at approximately the correct location. If you then use the vertical side slider to zoom back in, Artemis will go to the location you selected. Remember that in order to see the CDS features lying within this region you will need to turn the annotation (S_typhi.tab) entry back on. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 3 Artemis Advanced. The region you should be looking at is shown below and is a classical example of what is referred to as a Salmonella pathogenicity island (SPI). The definitions of what actually constitutes a pathogenicity island are quite diverse. However, below is a list of characteristics which are commonly seen within these regions, as described by Hacker et al. 1997: 1. Often inserted alongside stable RNAs. 2. Atypical G+C contents. 3. Carry virulence-related functions. 4. Often carry genes encoding transposase or integrase-like proteins. 5. Unstable and self-mobilisable. 6. Of limited phylogenetic distribution. Have a look in and around this region and look for some of these features. [Screenshot: Artemis window showing the region labelled SPI 1.]
133. u will see 7 CDSs, some with multiple exons. As a gentle introduction to splicing we would like you to look at the genes named PF13_0119, MAL13P1.294 and PF13_0061. They have only been partially characterised and may in fact be missing exons. Have a look at these CDSs and confirm, edit or dismiss the proposed gene models by using G+C content, database searches and looking for splice sites (Appendix IX). G+C content is a very good indicator of coding capacity in Malaria. On average the coding regions are 23% G+C and the non-coding regions are 19%. Have a look at the G+C content for this region by selecting the appropriate graph. Left click within the graph window and then select the exons by clicking on them to see how this relates to the G+C peaks on the graph. Note: we will cover the principles and methods of gene prediction in much more detail in module 3. [Screenshot: Artemis window for the entry Malaria.embl with the GC Content graph (window size 120), features PF13_0119 and PF13_0120, and a dialog reporting 'fasta process completed'.]
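The G+C graph is a sliding-window calculation, and it can be reproduced for any sequence string. The following is a minimal sketch, assuming a plain uppercase or lowercase DNA string and the window size of 120 shown in the screenshot; `gc_content` and `gc_windows` are illustrative names and are not Artemis functions, and Artemis may position or step its windows differently.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int = 120, step: int = 1) -> List[Tuple[int, float]]:
    """Return (1-based window start, %G+C) for each full window along seq."""
    return [(i + 1, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]
```

With values such as the 23% and 19% quoted above, the difference between coding and non-coding windows is small, so peaks in this profile are best read alongside the database hits and splice sites rather than on their own.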
134. uld now be marked in the graph windows that you previously clicked in. [Screenshot: Artemis sequence view showing the DNA and its translated reading frames.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 3 Artemis Advanced. Artemis Exercise 1, Part III. [Screenshot: whole-genome plots of GC Deviation (G-C)/(G+C) and Karlin Signature Difference, both with window size 20000, with a callout marking the third region to investigate.]
135. und the sequence. [Screenshot: Artemis Feature Edit window for gene_01. Key: CDS. Buttons include Add Qualifier, Complement, Grab Range, Remove Range, Goto Feature and Select Feature. Qualifiers shown: gene gene_01; product ribonucleoprotein, putative; fasta_file fasta no_name seq 00015 out; blastp_file blastp no_name seq 00015 out; colour 7.] Using View Selection from the View menu you will see your annotation for a given feature in EMBL format. This is the information that Artemis actually records. For example: [Figure: a feature in EMBL feature-table format; only the fragments 'FT CDS' and 'FT gene gene_01' are legible.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 5 Small Scale Annotation. Module 5: Small Scale Annotation. Introduction. In this short Module you will attempt to annotate a small region of genomic DNA, using the web analysis tools covered in the previous Module, such as Prosite and Pfam: cut and paste the nucleotide or amino acid sequence into the submission box of the relevant web page(s). Aims: This Module is also your opportunity to have a go at annotating one, or hopefully more, genes that have been predicted in the genomic segments detailed over the page. Note: It is not practical to rely on cut and paste searches for the analysis of whole genomes, and so for large scale genome a
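The EMBL feature-table layout behind that view is fixed-column text: the feature key starts in column 6, and the location and each qualifier start in column 22. The following is a minimal sketch of writing one feature in that layout. The helper name `format_feature` and the location 100..400 are hypothetical; the qualifier values are the ones visible in the Feature Edit window, and the exact text Artemis writes may differ in detail.

```python
from typing import Dict, List, Union


def format_feature(key: str, location: str,
                   qualifiers: Dict[str, Union[str, int]]) -> List[str]:
    """Format one feature as EMBL feature-table (FT) lines.

    The key line puts the feature key at column 6 and the location at
    column 22; each qualifier goes on its own line starting at column 22.
    Text values are quoted and integer values are left bare.
    """
    lines = ["FT   " + key.ljust(16) + location]
    for name, value in qualifiers.items():
        if isinstance(value, int):
            text = f"/{name}={value}"
        else:
            text = f'/{name}="{value}"'
        lines.append("FT" + " " * 19 + text)
    return lines


# Hypothetical location; qualifiers taken from the Feature Edit window.
example = format_feature(
    "CDS",
    "100..400",
    {"gene": "gene_01", "product": "ribonucleoprotein, putative", "colour": 7},
)
print("\n".join(example))
```

Keeping the qualifiers in a dictionary mirrors the way the Feature Edit window lists them one per line, and makes it easy to add the results of further searches as extra qualifiers.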
136. urrent model. If you think that there are additional exons that should have been included in the gene model, you should add them to it. Using GC content and results from your database search as guides, roughly draw in where you think the additional exon(s) lie. To create additional exons: select the region you think represents the exon by holding down the left mouse button and dragging the cursor over the region of interest. Then click the Create menu and select 'Create feature from base range'. A new blue CDS feature will appear on the appropriate frame line (see below). [Screenshot: Artemis Edit menu, with a callout to click Edit. Entries include Edit Selected Features, Edit Subsequence and Features, Edit Header Of Default Entry, Duplicate Selected Features, Merge Selected Features, Unmerge Selected Feature, Delete Selected Features, Delete Selected Exons and Trim Selected Features To Next Met.]
137. [Screenshot callout: What are these genes?] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 7 ACT comparison files. Module 7: Generating ACT comparison files using BLAST. Introduction. In the previous module you used ACT to visualize pairwise BlastN or TBlastX comparisons between DNA sequences. In order to use ACT to investigate your own sequences of interest you will have to generate your own pairwise comparison files. ACT is written so that it will read the output of several different comparison file formats; these are outlined in Appendix II. Two of the formats can be generated using Blast software freely downloadable from the NCBI (Appendix X). Both Windows and Linux versions of the software are available, which can be loaded onto a PC or Mac. For the purposes of this module the NCBI Blast distribution software has already been installed locally and is therefore ready to use. To give you an idea of how easy it is to download and install the software on a PC, we have included a step-by-step guide in the appendices (Appendix X). The example shown in Appendix X is for downloading onto a PC with Windows XP. The exercises in this module are based on the Linux version of the Blast software. Although the operating systems are different, the command lines used to run the programs are the same. One of the main differences between the two operating systems is that in Windows the Blast program command line
138. various features in the order that they occur on the DNA, with the selected gene highlighted. The list can be scrolled (8, below). [Screenshot callouts: sliders for zooming view panels; sliders for scrolling along the DNA; slider for scrolling feature list.] WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 1 Artemis Prokaryotic. 4. Getting around in Artemis. The 3 main ways of getting to where you want to be in Artemis are the Goto drop-down menu, the Navigator and the Feature Selector. The best method depends on what you're trying to do, and knowing which one to use comes with practice. 4.1 The Goto menu. The functions on this menu (ignore the Navigator for now) are shortcuts for getting to locations within a selected feature or for jumping to the start or end of the DNA sequence. This one's really intuitive, so give it a try. [Screenshot: Artemis Goto menu for the entry S_typhi.dna, with items including Navigator (Ctrl G), Start of Selection (Ctrl Left), End of Selection (Ctrl Right), Feature End, Start of Sequence, End of Sequence (Ctrl Down), Feature Base Position and Feature Amino Acid.]
139. w Delhi 2005 Module 6 Comparative Genomics. Upload the files in sequential order as described in the previous page. [Screenshots: ACT file-selection window at three stages, with Sequence file and Comparison file rows (up to Sequence file 5), each followed by a Choose button. Callouts: click 'more files' to load more files and select the appropriate file; click here to read all the files that you have selected.] [Screenshot: warning dialog, 'CDS can't have psu domain as a qualifier, ignore error and continue?', with Yes and No buttons.]
140. [Screenshot: Artemis graph windows (including a G+C plot) and the right-click menu of the feature viewer. Menu items include Raise Selected Features, Lower Selected Features, Smallest Features In Front, Zoom to Selection, Select Visible Range, Select Visible Features, Set Score Cutoffs, Feature Labels, One Line Per Entry, Forward Frame Lines, Reverse Frame Lines, Start Codons, Stop Codons, Feature Arrows, Feature Borders, All Features On Frame Lines, Show Source Features, Flip Display and Colourise Bases. Callouts: menu item for de-selecting stop codons; no stop codons shown on frame lines.] You will also need to temporarily remove all of the annotated features from the Artemis display window. In fact if you leave them on, which you can, they would be too small to see when you zoomed out to display the entire genome. To remove the annotation click on the S_typhi.tab entry button
141. you through the pseudogenes as they occur on the chromosome. tRNA genes: type tRNA in the 'Goto Feature With This Key' box. Regulator-binding DNA consensus sequence (real or made up): note that degenerate base values can be used (Appendix VIII). Amino acid consensus sequences (real or made up): you can use X's; note that it searches all six reading frames regardless of whether the amino acids are encoded or not. What are Keys and Qualifiers? See Appendix III. WHO TDR Bioinformatics Workshop at ICGEB New Delhi 2005, Module 1 Artemis Prokaryotic. Clearly there are many more features of Artemis which we will not have time to explain in detail. Before getting on with this next section it might be worth browsing the menus. Hopefully you will find most of them easy to understand. Artemis Exercise 1, Part II. [Screenshot: Artemis window with the entries S_typhi.dna and S_typhi.tab loaded and nothing selected, showing CDS, misc_feature and RBS features.]
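A degenerate consensus search of this kind can be approximated with a regular expression. The following is a minimal sketch, assuming the degenerate base values are the standard IUPAC nucleotide codes (the workshop's Appendix VIII is not reproduced here) and that both strands should be searched; `iupac_to_regex` and `find_motif` are illustrative names, and the Artemis Navigator may behave differently.

```python
import re
from typing import List, Tuple

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def iupac_to_regex(motif: str) -> str:
    """Translate a degenerate DNA motif into a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq: str, motif: str) -> List[Tuple[int, str]]:
    """Return (1-based forward-strand start, strand) for every match.

    Overlapping matches are reported. Reverse-strand matches are given by
    the lowest forward-strand coordinate that they cover.
    """
    seq = seq.upper()
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    hits = [(m.start() + 1, "+") for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        length = len(m.group(1))
        hits.append((len(seq) - m.start() - length + 1, "-"))
    return sorted(hits)
```

For example, `find_motif(genome, "TTGRCA")` would report every TTGACA or TTGGCA on either strand, where `genome` is a sequence string you have loaded yourself.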
