RaligNAtor User's Manual
Contents
1. 2 gt DQ288865 1 58026001 168 5 0710188 8 43 20 56 48 44 59 81 97 24 60 92 101 134 52 93 117 149 173 36 N GOTTE EE DD D DD UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN LL a Erg reel III reel LIHI 444444444 444 4444444 H a nn UGAACUUG UCUCUCAACAAAAAGCCACCGACAUUAAGAGAGAGA CCCUAUUUAGGGUUAC CAGGAUCUGCAACAGCAUUCCUGUAUCAUCCAG GG UGAGGAUUGAGUUGACCUCAUC EET 2 ooo DIE gt EF517520 1 55135715 167 5 0710188 8 43 19 56 46 44 59 82 98 24 60 92 102 137 50 93 117 152 177 39 ION OU IM C UGAWCUKD D NNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNN NDGCRKYCCHV HRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN 18 3 3 Using RaligN Ator IRD Do et meeld llrereldd Ill 4444444 44 444 44 44444 4444444444444 O A UGAUCUUG UCGCAGAGGCAAAAAUUUGCACAGUAUAAAAUCUGCA ACCUAUUUAGGUUUAC CAAGAUCGGUGGAUAGCAGCCCUAUCAAUAUCUAG UUUAGAAGAUUAGGUAGUCUCUAAA ME Ew ga EET EE HE III C gt EF517519 1 55125714 167 5 0710188 8 43 19 56 46 44 59 82 98 24 60 9
2. A A A T 1 1 TI I 4444444444444 T UGUUGUGU U UGCGCGAUAAAUGCUGACGUGAAAACGUUGCGUA AGCUAUUUAGCUUUAC CAAGACGCCGUCGUGCAGCCCACAAAAGUCUAG GAGCAUACGCUAGGUCGCGUUG AC FE GR EE III GC gt EU282007 1 69357121 162 5 073116 8 43 15 50 44 44 59 68 84 24 60 92 88 121 51 93 117 134 158 37 nates TOT III GC CCO CONDES OD DDD 9999 Ee BIELIE DDD ED UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN Eee une EEE EEE teed LIHI 1444444444 ee 4444444 4444444444444 444444 UGUUGUGU U UGCGCGAUAAAUGCUGACGUGAAAACGUUGCGUA AGCUAUUUAGCUUUAC CAAGACGCCAUCGUGCAGCCCACAAAAGUCUAG GAGCAUACGCUAGGUCGCGUUG AC Sinan ET ES III AI ID EED DD DED Ie Total number of chains 17 19 3 Searching with RaligNAtor Each chain contains the description of the sequence where the chain occurs followed by the chain score chain length and matched strand direction for forward or for reverse In addition it contains the fragments coordinates i e expected or stacked start and end matching positions of the fragment actual start and end matching positions of the fragment and fragment score and the matching substring of the fragments along with their sequence structure alignment to the corr
3. Arc breaking: 1
Arc altering: 1
Arc removing: 2
Time: 160822.0290 ms
Number of matches: 16033351

Searching for pattern ires2 in the forward sequence(s)
Cost threshold (edist): 4
Max. allowed indels: 1
Min/Max match length: 35/37
Max match score: 48
Costs: Replacement: 1, Deletion: 1, Arc breaking: 1, Arc altering: 1, Arc removing: 2
Time: 3607395.4620 ms
Number of matches: 8950417

Searching for pattern ires3 in the forward sequence(s)
Cost threshold (edist): 1
Max. allowed indels: 0
Min/Max match length: 16/16
Max match score: 24
Costs: Replacement: 1, Deletion: 1, Arc breaking: 1, Arc altering: 1, Arc removing: 2
Time: 96774.9180 ms
Number of matches: 1052

Searching for pattern ires4 in the forward sequence(s)
Cost threshold (edist): 3
Max. allowed indels: 2
Min/Max match length: 31/35
Max match score: 53
Costs: Replacement: 1, Deletion: 1, Arc breaking: 1, Arc altering, Arc removing
Time: 871779.0860 ms
Number of matches: 112

Searching for pattern ires5 in the forward sequence(s)
Cost threshold (edist): 3
Max. allowed indels: 1
Min/Max match length: 24/26
Max match score: 39
Costs: Replacement: 1, Deletion: 2, Arc breaking, Arc altering, Arc removing
Time: 798519.5760 ms
Number of matches: 1222639

Total number of matches: 26207571
Chaining matches... done
Time: 13660.1450 ms
4. Its default value is 0.

- indels: number of allowed indels. Its default value is the cost threshold divided by the cost of an indel, i.e. cost / deletion. Note that cost bounds the number of indels that can actually occur in a match: if indels * deletion > cost, RaligNAtor automatically sets indels = cost / deletion.
- weight: a weight assigned to a chain fragment corresponding to a match of the respective pattern. Its default value is the score associated with a match (see the match score definition in RaligNAtor's publication).
- startpos: used to compute the score of local chains; it denotes the expected matching position of the pattern in the searched sequences. It must be specified either for none or for all patterns. If not specified, the matching position of each pattern is computed automatically in a stacked way, i.e. the matching position is the sum of the lengths of all patterns defined before it + 1.

Supplemental options must be provided between two pipe symbols, and each keyword (e.g. weight) is followed by the equal sign and a value. These options can also be provided in the command-line call to RaligNAtor, overriding the respective option value given in the patterns file.

The second line of the pattern definition contains the sequence information, i.e. a sequence of bases possibly containing ambiguous IUPAC characters. RaligNAtor automatically recognizes ambiguous characters and tries to match the correspond
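The rule for the indels option described above can be sketched in a few lines of Python. This is only an illustration: the function name `effective_indels` is hypothetical, and rounding the quotient down is an assumption, since the manual does not state how a non-integer quotient is handled.

```python
def effective_indels(cost, deletion=1, indels=None):
    """Number of indels allowed for one pattern (illustrative sketch only)."""
    # Each indel costs `deletion`, so the cost threshold bounds their number.
    bound = cost // deletion  # assumption: the quotient is rounded down
    if indels is None:
        # Default: cost threshold divided by the cost of an indel.
        return bound
    # A larger user-supplied value cannot be reached within the threshold.
    return min(indels, bound)


print(effective_indels(cost=4))                        # default, unit indel cost
print(effective_indels(cost=3, deletion=2))            # default, indel cost 2
print(effective_indels(cost=3, deletion=2, indels=5))  # user value is capped
```

For a pattern defined with cost 3, deletion 2, and indels 1, the product 1 * 2 does not exceed 3, so the user-supplied value of 1 is kept.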
5. sequence chain score chain length strand gt AB183472 1 62866484 171 5 f 07 10 18 8 8 43 19 54 47 44 59 79 95 24 60 92 99 132 53 93 117 147 172 39 aja av ta CHE ni ee EE N LE E DD DDD UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCY AG EA Zune D 1444444444 Ee UGAUCUGA UAGAAGUAAGAAAAUUCCUAGUUAUAA UAUUUUUA AGUUAUUUAGCUUUAC CAGGAUGGGGUGCAGCGUUCCUGCAAUAUCCAG ES CU RA OE III ER DODE EE gt AB017037 1 62866484 171 5 0710188 8 43 19 54 47 44 59 79 95 24 60 92 99 132 53 93 117 147 172 39 ea CL ni ee EE N GOOG O XCCCO vi DD DDD LE sr DD DDD UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCY AG EUA a 44 nn D 1444444444 Ee UGAUCUGA UAGAAGUAAGAAAAUUCCUAGUUAUAA UAUUUUUA AGUUAUUUAGCUUUAC CAGGAUGGGGUGCAGCGUUCCUGCAAUAUCCAG Ed COCO tia de ae DE ID EL gt AF218039 1 60286228 171 5 0710188 8 43 19 55 48 44 59 80 96 24 60 92 100 133 53 93 117 149 173 38 ge VE ar er EICH IM xoa CO CC DD DDD GOOG ll DD DDD UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCY AG IIHI ae T 44 ll Dl 1444444444 Ee led UGAUCUUG UUGUAAAUACAAUUUUGAGAGGUUAAUAAAUUACAA AGCUAUUUAGCUUUAC CAGGAUGCCUAGUGGCAGCCCCACAAUAUCCAG EE EG ee ead t
6. call and its screen output:

sufconstruct path to fasta_file Rfam.fas rna lgslink s path to save index Rfam

Fasta file: Rfam.fas
Number of sequences: 2756313
Total length: 824991406
Computing suf ... done
Computing lcp ... done
Computing suf ... done

The program execution produces these files:

ls -goh
total 11.0G
-rw-r--r-- 1   68 2012-02-24 16:02 Rfam.alph
-rw-r--r-- 1  11M 2012-02-24 16:02 Rfam.base
-rw-r--r-- 1  67M 2012-02-24 16:02 Rfam.des
-rw-r--r-- 1 790M 2012-02-24 16:08 Rfam.lcp
-rw-r--r-- 1 2.1G 2012-02-24 16:08 Rfam.lcpe
-rw-r--r-- 1 790M 2012-02-24 16:02 Rfam.seq
-rw-r--r-- 1 3.1G 2012-02-24 16:08 Rfam.suf
-rw-r--r-- 1 3.1G 2012-02-24 16:08 Rfam.sufinv
-rw-r--r-- 1 790M 2012-02-24 16:02 Rfam.tseq

3 Searching with RaligNAtor

RaligNAtor can search for given sequence-structure patterns in (1) a precomputed index, using algorithm LESAAlign or LGSlinkAlign, or (2) directly in a plain FASTA file, using algorithm ScanAlign or LScanAlign. For computing an index, please refer to program sufconstruct above. All algorithms deliver the same results, differing for the user only in their running times. For faster index-based and online searches, we recommend algorithms LGSlinkAlign and LScanAlign, respectively. An overview of the options of RaligNAtor is given in Table 3.1; the options are explained in more detail below.

3.1 Search opti
7. indel

Index-based algorithmic variants
lgslink            Uses early-stop acceleration, enhanced suffix array, and generalized suffix links
lgslink_nof        Variant lgslink with disabled sequence-based filter
lesa               Uses early-stop acceleration and enhanced suffix array
lgslink requires tables suf, lcp, and sufinv; lesa requires only suf and lcp.

Online algorithmic variants
scan               Slides a window over the target sequence, reusing matrix entries
lscan              Scanning variant with early-stop acceleration
aligngl            Aligns globally, reporting the best alignment (no pattern matching)

Chaining options
global             Perform global chaining
local              Perform local chaining
wf <wf>            Apply weight factor > 0.0 to fragments
maxgap <width>     Allow chain gaps with up to the specified width
minscore <score>   Report only chains with at least the specified score
minlen <length>    Report only chains with number of fragments >= length
top <>             Report only top-scoring chains of each sequence
allglobal          Report for each sequence all global chains satisfying above criteria
show               Show chains in the report
show2              Print complete sequences and omit all other matching information

Table 3.1: Overview of options of RaligNAtor.

sequence or the pattern. The default cost for arc removing is 2, and for all others it is 1.

- cost: the cost threshold, i.e. the sequence-structure edit distance threshold for matches.
8. of the respective patterns in the patterns file.

• global: Option to perform global chaining of matches.
• local: Option to perform local chaining of matches.
• wf <wf>: wf takes as parameter a positive weight factor that is applied to all chain fragments. For instance, if a chain fragment of a pattern has score 2, a weight factor of 10 implies that the chain fragment will have score 20.
• maxgap <width>: maxgap takes as parameter the maximum distance, i.e. number of bases, allowed between chain fragments.
• minscore <score>: Report only chains with at least the specified score.
• minlen <len>: Report only chains with at least the specified number of chain fragments.
• top <>: Report only top-scoring chains. If this option is not used, all chains are reported.
• allglobal: Guarantees that all global chains are reported, without discarding any chains with the same score.
• show: Show chain fragments and their coordinates, i.e. start and end matching position and score, in the chaining report.
• show2: Print complete sequences for which at least one chain was found and omit all other matching information. A sequence is only printed once. Sequences are printed in their order of occurrence in the database.

We note that chains are reported in descending order of their chain score.

3.3 Using RaligNAtor

As an example, we used RaligNAtor to search for five pat
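The effect of the weight factor and of the reporting filters described above can be sketched as follows. This is a simplified illustration, not RaligNAtor's chaining algorithm. The function `filter_chains` and the representation of a chain as a plain list of fragment scores are hypothetical, and taking the chain score as the plain sum of weighted fragment scores is an assumption (the exact definition is given in RaligNAtor's publication).

```python
def filter_chains(chains, wf=1.0, minscore=None, minlen=None):
    """Apply a weight factor and report filters to already computed chains.

    chains: list of chains, each a list of fragment scores (hypothetical
    representation chosen for this sketch).
    """
    reported = []
    for fragments in chains:
        weighted = [score * wf for score in fragments]
        # Assumption: chain score = sum of weighted fragment scores.
        chain_score = sum(weighted)
        if minscore is not None and chain_score < minscore:
            continue  # fewer points than required by minscore
        if minlen is not None and len(weighted) < minlen:
            continue  # fewer fragments than required by minlen
        reported.append((chain_score, weighted))
    # Chains are reported in descending order of their chain score.
    reported.sort(key=lambda item: item[0], reverse=True)
    return reported


chains = [[2], [8, 24, 16], [8, 48]]
for score, fragments in filter_chains(chains, wf=10, minlen=2):
    print(score, fragments)
```

With a weight factor of 10, the single fragment of score 2 in the first chain becomes 20, as in the example given for option wf; that chain is then dropped because it has fewer than two fragments.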
9. 2 102 137 50 93 117 152 177 39 GERS od AE EE DEGIE DE C C UGAWCUKD D NNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNN NDGCRKYCCHV HRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN LL LAL ld meld O Ul 444444 tt 444 44 4444444444444 A UGAUCUUG UCGCAGAGGCAAAAAUUUGCACAGUAUAAAAUCUGCA ACCUAUUUAGGUUUAC CAAGAUCGGUGGAUAGCAGCCCUAUCAAUAUCUAG UUUAGAAGAUUAGGUAGUCUCUAAA ee ER EE EE DEGIE DE gt EF517521 1 55135715 167 5 0710188 8 43 19 56 46 44 59 82 98 24 60 92 102 137 50 93 117 152 177 39 Be red EE II C UGAWCUKD D NNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNN NDGCRKYCCHV HRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN Dll ld O a Ill 44 444 44 44444 4444444444444 4444444444 UGAUCUUG UCGCAGAGGCAAAAAUUUGCACAGUAUAAAAUCUGCA ACCUAUUUAGGUUUAC CAAGAUCGGUGGAUAGCAGCCCUAUCAAUAUCUAG UUUAGAAGAUUAGGUAGUCUCUAAA SEER ba AIS ED C C CG 2 gt AF178440 1 59256123 166 5 0710188 8 43 31 66 45 44 59 79 95 24 60 92 99 132 52 93 117 148 172 37 A Cl nennen III UGAWCUKD DNNNDN
10. DNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN ll T reel ll Pettey Ill 1444444444 ee 4444444 4 4444444444 tt UGAUCUUG AUUCUGUACAUAAAAGUCGAAAGUAUU GCUAUAGU GCCUAUUUAGGCAUAC CAGGAUGGCGCGUUGCAGUCCAACAAGAUCCAG UCCUAUACCUCGAGUCGGGUUU GG ara pa GATE EE III C gt AF536531 1 66416834 165 5 075136 8 43 15 50 46 44 59 75 91 24 60 92 95 128 51 93 117 143 168 38 ee Bana Ce ED DD UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN T LLLI IHI 1444444444 ee 4444444 4 4444444444 4444444444 UAAUUUGA U UUAGGUUAUAAUGUUAGGACUAUAAAAAUUAGCU AGUUAUUUAACUUUAC CAAGAUGGCCGUUGGCAGCCCCACGAAAUCUAG CUAUUUUGAUUAGGUGGUCAGAUAG and ALS EE GC C gt AF022937 1 69357121 162 5 073116 8 43 15 50 44 44 59 68 84 24 60 92 88 121 51 93 117 134 158 37 Bag Bane AG IM UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN
11. NN LELIE T reel 1 1444411101141 444444444 444 4444444 4444444444444 4444444444 UGAUCUUG UUCCUUAUACAAUUUUGAGAGGUUAAUAAGAAGGAA AACUAUUUAGUUUUAC CAGGAUGCCUAUUGGCAGCCCCAUAAUAUCCAG CUUAUAUGAUUAGGUUGUCAUUUAG ee TOET III 2 gt AB006531 1 60036204 170 5 0710188 8 43 20 56 47 44 59 82 98 24 60 92 102 135 53 93 117 150 175 38 ee ge TOET DD DDD RR 2 UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN VELDE HEER reed UIL 14444 LIHI 1444444444 444 En nn nn UGAUCUUA AAAAUUAGGUUAAAUUUCGAGGUUAAAAAUAGUUUU GUAUAUUUAUACUUAC CAAGAUGGACCGGAGCAGCCCUCCAAUAUCUAG GCUCAAACAUUAAGUGGUGUUGUGC Pe TOET e a 2 GG OO DDD 9999 9 gt EU680971 1 184383 169 5 0710188 8 43 19 54 47 44 59 80 96 24 60 92 100 133 51 93 117 147 172 39 ERRARE ve OE ED III 2 UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNN DGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN LL AL ld reel Ill 1444411101 dab 444444 EEE EEE a tetas UGAUCUUU AUCGGGACAUGCAAAUGCAAGG ACAAAACUCCGA
12. RaligNAtor User's Manual

Fernando Meyer
Center for Bioinformatics, University of Hamburg
Bundesstr. 43, 20146 Hamburg, Germany
June 20, 2013

Contents
1 Introduction
2 Database preprocessing with sufconstruct
2.1 Preprocessing options
2.2 Using sufconstruct
3 Searching with RaligNAtor
3.1 Search options
3.2 Chaining options
3.3 Using RaligNAtor

1 Introduction

RaligNAtor is a software package for fast approximate matching of RNA sequence-structure patterns. It searches sequence databases for occurrences of user-given patterns annotated with secondary structure. Its main features are:

• Implementations of new, efficient, user-selectable online and index-based matching algorithms.
• Matching computation based on a sequence-structure edit distance with a full set of edit operations on single bases and base pairs.
• Patterns can describe any branching, non-crossing RNA secondary structures. Sequence information can contain ambiguous IUPAC symbols.
• Search in DNA and RNA sequences is possible due to flexible alphabet handling.
• Matching on forward and reverse complement strands.
• Customizable base pairing rules.
• Integrated fast algorithms for global and local chaining of matches.
• Output of results including matching positions, sequence-structure alignments, scores, etc.

For index-based matching, RaligNAtor uses a data structure based on the suffix array, precomputed from the target sequence database. This p
13. U GGAUAUUUAUCCUUAC CAGGAU CAGCUCAGGCAGCCCCGAAAAAUCCAG CUUCGAAGAGAAGGUGCUCUAGAAG Ee Ses OE EE II gt AF183905 1 56475848 168 5 07 10 18 8 8 43 20 55 47 44 59 81 97 24 60 92 101 136 50 93 117 151 176 39 lt aqa SEE sesse DDD DD 2 C 2 UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNN NDGCRKYCCHV HRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN ll Do 44 PE petted LIHI tt leed 4444444444444 4444444444 UGAUCUUG UGCGGAGGCAAAAUUUGCACAGUAUAAAA UCUGCA ACCUAUUUAGGUUUAC CAAGAUCGGUGGAUAGCAGCCCUAUCAAUAUCUAG UUUAGAAGAUUAGGUAGUCUCUAAA La Cleaner III C 2 gt EF517515 1 55125714 168 5 0710188 8 43 20 56 47 44 59 82 98 24 60 92 102 137 50 93 117 152 177 39 EEE OER ER III C 2 UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNN NDGCRKYCCHV HRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNWNN LL EZB ld 4444 4H RD tte Ete LIHI 4444444 tt 444 44 EEE nn nn nn UGAUCUUG UGUGGAGGCAAAAAUUUGCACAGUAUAAAAUCUGCA ACCUAUUUAGGUUUAC CAAGAUCGGUGGAUAGCAGCCCUAUCAAUAUCUAG UUUAGAAGAUUAGGUAGUCUCUAAA ER TOT ER DDD DD C
14.
<data>                 Index name or FASTA file
alph <file>            Use alphabet defined by file (option applies only to FASTA file)
dna                    Use DNA alphabet A, C, G, T and IUPAC wildcards (default)
rna                    Use RNA alphabet A, C, G, U and IUPAC wildcards
pat <file>             Structural pattern(s) to search for
for                    Search in the forward sequence (default)
rev                    Search in the reverse complement sequence. For searching in the forward sequence as well, combine it with for
comp <file>            Load base pair complementarity rules from file
byseq                  Sort matches by sequence and matching position
byscore                Sort matches of the same pattern by descending score
byscorea               Sort matches of the same pattern by ascending score
table                  Print matches in table format
no-overlaps            Filter out low-scoring overlapping matches of the same pattern
silent                 Do not output matches
progress               Show progress message for each 5% processed data

Operation costs and thresholds. These do not override parameters set in the patterns file.
replacement <cost>     Cost of a base mismatch (default: 1)
deletion <cost>        Cost of base deletion/insertion (default: 1)
arc-breaking <cost>    Cost of an arc breaking (default: 1)
arc-altering <cost>    Cost of an arc altering (default: 1)
arc-removing <cost>    Cost of an arc removing (default: 2)
cost <x>               Allow edit distance <= x (default: 0)
indels <x>             Allow number of indels <= x; default: cost / cost of one
15. e it 99000 CCG DI ID Ee gt AF014388 1 60786278 170 5 0710188 8 43 19 55 48 44 59 80 96 24 60 92 100 133 52 93 117 150 174 38 ger VE ea ee DI CCG IM zu GEGEE AE EC ads DD DD CO CCC DD DDD UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCY AG ll HEER 44 A D 444444444 Ee 3 3 Using RaligN Ator CCC CCC DD BHKHDHDSNBHDRGUNSNSNNNWNN T CCUUGUAGUUUUAGUGGACUUUAGG Kelle DD CCC CCC DD BHKHDHDSNBHDRGUNSNSNNNWNN T CCUUGUAGUUUUAGUGGACUUUAGG DD BHKHDHDSNBHDRGUNSNSNNNWNN ttt HERE UUUUUCAGAUUAGGUAGUC GAAAA DD BHKHDHDSNBHDRGUNSNSNNNWNN T 17 3 Searching with RaligN Ator UGAUCUUG UUCCUUAUACAAUUUUGAGAGGUUAAUAAGAAGGAA AACUAUUUAGUUUUAC CAGGAUGCCUAUUGGCAGCCCCAUAAUAUCCAG UU AUAUGAUUAGGUUGUCAUUUAG EE PRs KIM Ca CCC DDD gt AF014388 1 60786278 170 5 07 10 18 8 8 43 19 55 48 44 59 80 96 24 60 92 100 133 52 93 117 149 174 38 saa pais TOET ED III 2 UGAWCUKD DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH VNHUAUUUADNBWUAC CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG BHKHDHDSNBHDRGUNSNSNNNW
16. epresentative of the class. Another, more explicit way to specify the class representative is to end the class definition with a whitespace followed by the desired representative character. As an example, observe that the representative of the class of non-matching characters of the target sequence above is B. To set it to N, define it instead as

BbNnRrYySsWwKkMmDdHhVv N

Below is an example of a complete alphabet file:

Aa A
Cc C
Gg G
UuTt U
AG R
CTU Y
CA M
UTG K
UTA W
CG S
CGUT B
AGUT D
ACUT H
ACG V
ACGUT N
NnRrYySsWwKkMmBbDdHhVv N

This alphabet file defines four matching character classes whose representatives are A, C, G, and U. The class with representative U, for example, allows both uppercase and lowercase Us and Ts to be used in the pattern, such that any of these characters will match both uppercase and lowercase Us and Ts in the target sequence. Because U is the class representative, alignments found with RaligNAtor will show U wherever these characters occur. The file also defines several wildcards that can be used in the pattern, e.g. R to match uppercase and lowercase As and Gs in the target sequence. Finally, it defines a class of non-matching characters of the target sequence. This class can contain characters of the previous two classes, e.g. R. However, Rs occurring in the target sequence will cause mismatches, whereas R used in the
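The alphabetical transformation performed by the matching character classes can be illustrated with a short Python sketch. It covers only the four matching classes of the example file; wildcard and non-matching classes are left out on purpose. The function names are hypothetical and the code is not taken from sufconstruct.

```python
def parse_matching_classes(lines):
    """Map every member character of a matching class to its representative."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore empty lines
        members = parts[0]
        # An explicit representative follows a whitespace; otherwise the
        # first character of the line is the representative.
        representative = parts[1] if len(parts) > 1 else members[0]
        for character in members:
            mapping[character] = representative
    return mapping


def transform(sequence, mapping):
    """Return the alphabetically transformed sequence."""
    # Characters outside all classes raise KeyError in this sketch.
    return "".join(mapping[character] for character in sequence)


classes = parse_matching_classes(["Aa A", "Cc C", "Gg G", "UuTt U"])
print(transform("acgTu", classes))
```

Both the pattern and the target sequence are mapped in this way, which is why uppercase and lowercase Us and Ts all appear as U in reported alignments.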
17. es. This can occur due to the different order in which matches are found and filtered out.

• silent: silent disables the output of matches.
• progress: progress shows a progress message for each 5% of processed data.
• replacement, deletion, arc-breaking, arc-altering, arc-removing: Options each taking a value that specifies the cost of the respective edit operation, with meaning and default value as detailed above for option pat. An option given on the command line holds for all patterns in a patterns file and overrides the respective value specified in that file. To specify different operation costs for each searched pattern, see option pat.
• cost, indels: Cost threshold and number of allowed indels for matches. As with the edit operation costs provided on the command line, the values given via these options hold for all patterns of a patterns file and override the respective values specified in that file. To specify different cost thresholds and numbers of allowed indels for each searched pattern, see option pat above.
• lgslink, lesa: Selects one of the index-based algorithms LGSlinkAlign or LESAAlign. These algorithms require an index of the target database, which can be generated with the sufconstruct tool above. Note: since version 1.1 of RaligNAtor, LGSlinkAlign performs in a first step sequence-based filtering with standard dynamic programming, considering only edit operations on single bases, i.e. i
18. es not require computing an index for the reverse complement sequences. RaligNAtor handles this by automatically computing the reverse complement of the patterns and using these patterns for search. The patterns will contain complement characters according to the IUPAC table. This holds for alphabets specified with option dna, rna, or alph. Characters not belonging to the IUPAC table cannot be complemented and remain unchanged. Base pairing rules are also automatically complemented. This means that, given Watson-Crick and wobble pairs, Watson-Crick pairs remain unchanged, but accepted pairs derived from wobble U-G and G-U pairs automatically become A-C and C-A. Note that A-C and C-A pairs must not be defined using option comp (see below), since these pairs are then allowed when searching the forward sequences.

• comp <file>: The parameter of option comp is a file specifying complementary bases. A line with two bases, given without any whitespace or punctuation, implies that matches to the patterns can contain such a base pair. It is not necessary to specify the pairing rule twice. For example, for pairs C-G and G-C it suffices to provide a line CG. Below is a sample file:

AU
CG
GA
GU

According to this file, these base pairs are possible: A-U, U-A, C-G, G-C, A-G, G-A, U-G, G-U. Note that if the option comp is not used, Watson-Crick base pairs are al
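The way a base pairing file expands into allowed pairs, and how the rules change for the reverse complement strand, can be sketched in Python. This is an illustration under assumptions: the function names are hypothetical, and only the four RNA bases are complemented here, whereas RaligNAtor works with the full IUPAC table.

```python
def load_pairing_rules(lines):
    """Expand lines such as "CG" into both ordered pairs (C,G) and (G,C)."""
    pairs = set()
    for line in lines:
        line = line.strip()
        if len(line) != 2:
            continue  # skip empty or malformed lines in this sketch
        first, second = line
        pairs.add((first, second))
        pairs.add((second, first))  # a rule need not be given twice
    return pairs


COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def complement_rules(pairs):
    """Complement both bases of every pair, as done for reverse strand search.

    The rule set is symmetric, so the order of the two bases does not matter.
    """
    return {(COMPLEMENT[a], COMPLEMENT[b]) for a, b in pairs}


sample = load_pairing_rules(["AU", "CG", "GA", "GU"])
print(sorted(sample))  # the eight pairs listed for the sample file

watson_crick_and_wobble = load_pairing_rules(["AU", "CG", "GU"])
print(sorted(complement_rules(watson_crick_and_wobble)))
```

In the second call the Watson-Crick pairs map onto themselves as a set, while the wobble pairs G-U and U-G turn into C-A and A-C, matching the behavior described for reverse complement search.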
19. esponding patterns.
20. f options of program sufconstruct.

resented by sequences of Ns. There can be only one such character class, specified in one line beginning with a special marker symbol. We emphasize that this class does not do any transformation of pattern characters. E.g.

BbNnRrYySsWwKkMmDdHhVv

All characters used in this example that occur in the target sequence cause mismatches to any pattern character. However, these characters can be used with a different behavior in the pattern; see the following character classes.

- Matching characters: a set of characters whose members are not distinguished from each other, mapping pattern characters to match the same set of characters in the target sequence. In other words, characters of both the pattern and the target sequence belonging to one such class are transformed to a single symbol. Hence, this character class can be used for alphabet reduction. Such a character class is specified in one line with a simple list of the member characters. E.g.

Aa

The class above indicates that A and a are not distinguished from each other. Another didactic example is

AaM

This class allows M to be used in the pattern even if it belongs to the non-matching characters of the target sequence. M will be able to match As and as of the target sequence, but it will not match Ms if M is a non-matching character in the target sequence. We observe that in the alignments reported by RaligNAtor an alignment column of two matchin
21. g characters of the same class is marked with a dedicated symbol, e.g. an alignment of A with a.

- Wildcards of the patterns: a class of this type specifies a special pattern symbol that can be used to match characters belonging to different matching character classes. A typical application is to specify a character, e.g. R, to match As and Gs in the target sequence, where A and G belong to two different matching character classes. Such a class is specified in one line beginning with a special marker symbol. E.g.

KRAG

This class defines a wildcard symbol R, i.e. the first symbol after the marker, to match As and Gs in the target sequence. In addition, it will match every character belonging to the classes to which A and G belong, for instance as and gs. Attention: make sure that all characters belonging to this class, except R, also belong to a matching character class. Otherwise, this wildcard class will not be accepted. We observe that a wildcard character aligned to a matching character of its class is annotated with a dedicated symbol in the RaligNAtor output, as in the following example:

Pattern C CCCAA CCUUAAUCCAUARGA
IILL CIEE LIT IHN
Target CGCAACCCUU AUC AAAGGA

Naturally, alignments found with RaligNAtor show for each non-gapped position a single character of the corresponding character class. Each such character is called a class representative. By default, the first character of each line that is not one of the marker symbols is the r
22. g with an index name, each table that is constructed is stored on disk in its own file. The name of each file is <index name>.<table name>. Additional files are also stored: one file with extension alph stores the alphabet, one with extension base stores basic information about the sequences, such as their length, and one with extension des stores the description of each sequence. The sequences and the alphabetically transformed sequences are stored in files with extension seq and tseq, respectively. Note that all the generated files are binary.

• x: This option prevents sufconstruct from saving alphabetically transformed sequences to file. This is useful for saving disk space, but it will require RaligNAtor to convert the sequences of the index for each search run.
• c: c outputs the constructed tables and the corresponding suffixes to screen. This option is only recommended for small databases, say with sequence length up to 100.
• t <file>: t works like the option c, but it directs the output to the specified file.
• time: With this option, the elapsed construction time of each table is displayed.

Be aware that the generated files may overwrite existing ones without warning.

2.2 Using sufconstruct

We show an example of preprocessing a database for search with algorithm LGSlinkAlign. The database, stored in file Rfam.fas, consists of sequences obtained from the full alignments of Rfam release 10.1. Below is the program
23. ing base, e.g. A or G in place of an R.

The third line contains the structure information in dot-bracket notation. In this notation, unpaired bases are represented by dots and paired bases are represented by ( and ). Observe that for specifying a completely single-stranded pattern it is necessary to provide a sequence of dots.

As an example, a patterns file may contain the following text:

>tRNA pat|replacement=2|deletion=3|arc-removing=5|
GSSVVYRURGYYYARYUGGUUARMRCRYYDSVYUBHHAMBCHRDWRRUYRYRGGUUCRAWUCCYDYHNBBNSYR
CEC Ca Bstg ES RAS CAE

Another example is a file containing multiple patterns, as follows:

>ires1|cost=2|indels=0|
UGAWCUKD
>ires2|indels=1|cost=4|
DNNNDNDNHNDMWWDYBVNVDNBWHDWADNNNNNNH
AA ns 113333
>ires3|indels=0|cost=1|
VNHUAUUUADNBWUAC
KEINE
>ires4|indels=2|cost=3|
CARGAYSNVNNNNDGCRKYCCHVHRWNRUCYAG
CCC CC
>ires5|indels=1|cost=3|deletion=2|
BHKHDHDSNBHDRGUNSNSNNNWNN
LTR LET

• for: Option for searching in the forward sequences. This option is selected by default.
• rev: Option for searching in the reverse complement sequences. If used in combination with the option for, search is performed in both the forward and reverse complement sequences; otherwise, search is only performed in the reverse complement sequences. Observe that searching in reverse complement sequences of a database do
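As an illustration of the dot-bracket notation used in the third line of a pattern definition, the following Python sketch recovers the base pairs of a structure string with a stack. It is a generic parser, not code from RaligNAtor; the function name `base_pairs` and the use of 0-based positions are choices made for this example.

```python
def base_pairs(structure):
    """Return the base pairs (i, j) of a dot-bracket string, 0-based."""
    open_positions = []
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % position)
            # The innermost open bracket pairs with this closing bracket,
            # which yields non-crossing (possibly branching) structures.
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if open_positions:
        raise ValueError("unbalanced '(' at position %d" % open_positions[-1])
    return sorted(pairs)


print(base_pairs("((..)).(.)"))  # [(0, 5), (1, 4), (7, 9)]
print(base_pairs("........"))    # [] for a completely single-stranded pattern
```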
24. lowed by default.

• byseq: With this option, matches are reported by sequence and matching position, such that matches at the beginning of a sequence are reported first. Note that with this option matches are not reported during search as they are found, but only once the search in the entire database is completed.
• byscore, byscorea: With byscore or byscorea, matches are sorted in descending or ascending order of their score, respectively. The match score is inversely proportional to the cost associated with a match (see the exact score definition in RaligNAtor's publication). Note that, since the score for different patterns is not normalized, matches of the same pattern are reported consecutively.
• table: Option for reporting the matches in a table format with one match per row.
• no-overlaps: no-overlaps filters out low-scoring overlapping matches of the same pattern. More precisely, if the starting and ending positions of a matched substring overlap with the starting and ending positions of another matched substring of the same pattern, only the matched substring with the higher score is reported. In the case of a tie, one of the matches is arbitrarily filtered out. RaligNAtor checks several times during search for overlapping matches, hence avoiding a memory overflow in the case of highly sensitive patterns. Note that this option, used with the different online and index-based search algorithms, does not guarantee an identical output of match
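The overlap filter described above can be approximated by a small greedy procedure. This is a sketch under assumptions, not RaligNAtor's implementation: matches are taken to be (start, end, score) tuples of a single pattern with inclusive coordinates, and the function name `filter_overlaps` is hypothetical. As in the manual, ties are resolved arbitrarily (here by input order).

```python
def filter_overlaps(matches):
    """Keep, among overlapping matches of one pattern, the higher scoring one.

    matches: list of (start, end, score) tuples with inclusive coordinates.
    """
    kept = []
    # Visit matches from highest to lowest score; a match survives only if
    # it does not overlap any match that was already kept.
    for start, end, score in sorted(matches, key=lambda m: m[2], reverse=True):
        if all(end < k_start or start > k_end for k_start, k_end, _ in kept):
            kept.append((start, end, score))
    return sorted(kept)


matches = [(0, 10, 5), (5, 15, 8), (20, 30, 3)]
print(filter_overlaps(matches))  # [(5, 15, 8), (20, 30, 3)]
```

Because the outcome depends on the order in which overlapping candidates are compared, different search algorithms that find matches in different orders may keep different matches.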
25. nd the sequences are then said to be alphabetically transformed. The index is constructed for the alphabetically transformed sequences. This option also allows for alphabet reduction (see below). Note that the alphabet will also be used to map pattern characters when the constructed index is searched with RaligNAtor. Each line in the file specifies a class of characters of the alphabet. These must be ASCII printable characters, i.e. they must have a character code between 32 and 127. A class of characters can be of three types:

- Non-matching characters of the target sequence: specifies characters that can occur in the target sequence but cannot match any pattern character. This is useful for cases in which stretches of the target sequence are unknown, commonly rep

<file>         Load FASTA file
alph <file>    Use alphabet defined in file
dna            Use DNA alphabet A, C, G, T and IUPAC wildcards (default)
rna            Use RNA alphabet A, C, G, U and IUPAC wildcards
lesa           Construct index for LESAAlign: tables suf and lcp
lgslink        Construct index for LGSlinkAlign and LESAAlign: tables suf, lcp, and suf^-1
s <index>      Save constructed structures to given index name
x              Do not save alphabetically transformed sequence
c              Output constructed structures to screen
t <file>       Output constructed structures to text file
time           Display elapsed times

Table 2.1: Overview o
26. nsertions, deletions, and replacements. In a second step, it also considers edit operations on base pairs. This filtering can considerably speed up search and affects neither sensitivity nor specificity, but the following condition must be fulfilled: if the cost of an insertion operation is set to, e.g., 2, then the cost of an arc altering (option arc-altering) and of an arc removing (option arc-removing) must be set to at least 2 and 4, respectively, since these imply one and two deletions. The user is responsible for this consistency.

• lgslink_nof: Selects algorithm LGSlinkAlign but does not perform sequence-based filtering.
• scan, lscan: Selects one of the online algorithms ScanAlign or LScanAlign. These algorithms operate directly on the database provided as a FASTA file.
• aligngl: Aligns globally each sequence-structure pattern and each sequence of the database, reporting the best alignment and the respective sequence-structure edit distance.

We remark that matches are reported on the standard output channel (stdout), whereas additional information, such as the set costs and thresholds, is redirected to the standard error channel (stderr).

3.2 Chaining options

The following options allow chaining matches of the different patterns specified in one patterns file. A chain of matches is a sequence of non-overlapping matches, where each match is then called a chain fragment, such that the order of the matches in the chain resembles the order
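Since the user is responsible for keeping the operation costs consistent when sequence-based filtering is active, a small check like the following may help before starting a search. It is a hypothetical helper written for this manual's condition only; it is not part of RaligNAtor.

```python
def filter_costs_consistent(indel_cost, arc_altering, arc_removing):
    """Check the cost condition stated for LGSlinkAlign's sequence-based filter.

    An arc altering implies one deletion and an arc removing implies two,
    so their costs must be at least 1x and 2x the indel cost, respectively.
    """
    return arc_altering >= indel_cost and arc_removing >= 2 * indel_cost


print(filter_costs_consistent(1, 1, 2))  # default costs: True
print(filter_costs_consistent(2, 1, 2))  # indel cost raised only: False
print(filter_costs_consistent(2, 2, 4))  # arc costs raised accordingly: True
```

If the check fails, one option consistent with the text above is to raise the arc costs; another is to select the variant lgslink_nof, which does not perform sequence-based filtering.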
27. ons

• <data>: <data> is the path and name of the target FASTA file, or the path and prefix name of the files (i.e. file name without extension) storing an index. RaligNAtor requires <data> to point to a FASTA file in case the user wants to perform an online search with algorithm ScanAlign or LScanAlign; see options scan and lscan below. For index-based searches with algorithm LESAAlign or LGSlinkAlign, RaligNAtor requires <data> to point to an index; see options lesa and lgslink below.
• alph: alph takes as parameter the path and name of the text file specifying an alphabet. See the full description of alphabet files above in the section about sufconstruct.
• dna, rna: Alphabet option for the respective kind of sequence. See the section about sufconstruct for details.
• pat <file>: pat takes as parameter a text file containing one or multiple sequence-structure patterns describing any branching, non-crossing RNA secondary structures. Each pattern is specified in three consecutive lines. The first line begins with the symbol > followed by the description of the pattern. Optionally, the description may be followed by pipe symbols separating these supplemental options:

- replacement, deletion, arc-breaking, arc-altering, arc-removing: cost of the respective edit operation, being the same whether the operation occurs in the target

<data>
alph <file>
dna
rn
28. pattern will match uppercase and lowercase As and Gs in the target sequence. Remember that all characters used to define patterns must belong to a matching character and/or wildcard class, and all characters occurring in the target sequence must belong to a matching character or non-matching character class.
• dna, rna: These options allow transforming the input sequences to predefined DNA or RNA alphabets. The alphabets are equal to the alphabet file shown above. The DNA alphabet only differs from the RNA alphabet by having T as class representative instead of U. If the target sequences contain other characters, one can create a new alphabet in a text file and use it with the option alph.
• lesa: lesa selects for construction the structures needed for searching the target database with algorithm LESAAlign. The structures consist of the suffix array suf and the longest common prefix table lcp. Note: suf and lcp are also constructed via option lgslink. Hence, it is not necessary to select option lesa if the database was already processed for search with the LGSlinkAlign algorithm.
• lgslink: lgslink selects for construction the structures needed for searching the target database with algorithms LGSlinkAlign and LESAAlign. The structures consist of the suffix array suf, the longest common prefix table lcp, and the inverse suffix array suf⁻¹.
• s <index>: By using option s alon
29. recomputation is performed by the sufconstruct tool distributed with RaligNAtor, which is described next. RaligNAtor's description follows subsequently. This software is available as open source under the GNU General Public License Version 3.
2 Database preprocessing with sufconstruct
sufconstruct preprocesses a sequence database, generating an index to be searched with RaligNAtor using algorithm LESAAlign or LGSlinkAlign. In summary, this procedure consists of reading the target database in FASTA format; mapping the sequences of the database to an alphabet consisting, e.g., of characters A, C, G, and U; computing the required index structures according to the desired search algorithm; and saving the structures to files on disk. The user only needs to set a few options. An overview of all possible options is given in Table 2.1, and their detailed description is given below.
2.1 Preprocessing options
• <file>: <file> is the path and name of the FASTA file for which the index is to be constructed. The file may contain one or more sequences, and all are selected for index construction. Note that index-based search in the forward and reverse complement sequences only requires the construction of a single index.
• alph <file>: alph takes as parameter the path and name of the text file specifying an alphabet. The sequences' characters are mapped to this alphabet a
30. terns derived from the consensus structure of the Rfam family Cripavirus internal ribosome entry site (Acc. RF00458). The patterns, called ires1, ires2, ires3, ires4, and ires5, are shown above in the description of option pat. Here we stored these patterns in a file called ires.pat. The searched database contained sequences obtained from the full alignments of Rfam 10.1. To search using algorithm LGSlinkAlign, we preprocessed this database with sufconstruct, generating an index called Rfam. The allowed base pairs were (A, U), (U, A), (C, G), (G, C), (G, U), and (U, G), which were specified in a text file and used with the option comp. We also set RaligNAtor to report global chains of matches with minimum length 5 by using the option minlen. Due to the large number of expected matches for single patterns, we used option silent to prevent matches from being printed out, but used option show to print out the resulting chains. The command call to RaligNAtor and the screen output are as follows:
RaligNAtor path to index Rfam10 pat path to patterns file ires.pat comp path to comp_file rna.comp lgslink silent global minlen 5 show
Number of sequences: 2756313
Total length: 824991406
Searching for pattern ires1 in the forward sequence(s)
Cost threshold (edist): 2
Max allowed indels: 0
Min Max match length: 8 8
Max match score: 8
Costs: Replacement 1, Deletion 1
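The manual leaves the cost consistency required by the sequence-based filter of LGSlinkAlign to the user: an arc altering implies one deletion and an arc removing implies two, so their costs must be at least one and two times the insertion/deletion cost. A minimal Python sketch of such a pre-flight check follows. The function name and its parameters are hypothetical and not part of RaligNAtor; the sketch also assumes that insertions and deletions share a single cost.

```python
def check_filter_costs(indel_cost, arc_altering_cost, arc_removing_cost):
    """Return a list of violated conditions (empty if the costs are consistent).

    Hypothetical helper, not part of RaligNAtor. Assumes insertions and
    deletions share one cost (indel_cost).
    """
    problems = []
    # An arc altering implies one deletion.
    if arc_altering_cost < indel_cost:
        problems.append(
            f"arc altering cost {arc_altering_cost} is below {indel_cost}"
        )
    # An arc removing implies two deletions.
    if arc_removing_cost < 2 * indel_cost:
        problems.append(
            f"arc removing cost {arc_removing_cost} is below {2 * indel_cost}"
        )
    return problems


if __name__ == "__main__":
    # The manual's example: insertion cost 2 requires at least 2 and 4.
    print(check_filter_costs(2, 2, 4))
    print(check_filter_costs(2, 1, 3))
```

Running such a check before a search with sequence-based filtering would flag cost settings for which the filter's guarantee on sensitivity and specificity no longer applies; with option lgslink_nof the filter is not used at all.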