Home

get homologues manual

1. required RAM MB genomes Figure 18 Computing time and RAM requirements of the BDBH COG and OMCL algorithms when processing input volumes of increasing size Performance is measured also with BerkeleyDB s flag 46 e BLAST jobs fail in my computer cluster how can I sort this out 5 2 When solving problems related with submitting jobs to the cluster it is necessary to check the generated queue files which capture any errors that might occur during job submission If you check one of these files an find a message such as blast error while loading shared libraries libbz2 so 1 cannot open shared object file this means that some cluster nodes will require the installation of 32 bit compatibility library libbz2 so 1 which can be done with root privileges typing this command from your cluster master node rocks run host compute yum y install bzip2 libs i386 Another solution would be to use BLAST binaries native to your cluster architecture as explained in section 2 2 Run options Could you please explain a bit more what option o is good for This option is useful in situations like the following you have access to a compute cluster with modern compute nodes 64bit cores and a good amount of memory but the master computer is an old 32bit machine with limited RAM say 1GB or less Under such a setting it would be convenient to distribute the first steps of the pipeline all against all BLAST jo
2. Figure 6 Two divergent ORFs flanking an intergenic region Only 180 bases from each ORF are taken for compiling intergenic clusters e f filters out cluster sequences with large differences in length This flag compares sequences within a cluster to the first arbitrary reference sequence Those with length difference either shorter or longer beyond the selected threshold will be removed This might cause the resulting cluster to be entirely removed if the final number of taxa falls below the t minimum e r allows the choice of any input genome of course included in d folder as the reference instead of the default smaller one If possible resulting clusters are named using gene names from this genome which can be used to select well annotated species for this purpose In addition when using the default 15 BDBH algorithm the reference proteome is the one chosen to start adding genes in the clustering process Therefore when using BDBH the choice of reference proteome can have a large impact on the resulting number of clusters By default the taxon with least genes is taken as reference It is possible to change the way clusters are named by editing subroutine extract gene name in file lib phyTools pm e excludes clusters with inparalogues defined as sequences with best hits in its own genome This option might be helpful to rule out clusters including several sequences from the same species which might be of interest
3. o a 4 8 Performing genome composition analyses o deana au baa n nedarai 48 1 Obtaining a p rigemome matrix ee a urea 0 ostas sula 48 2 Int rrogating a pangenome matrix s seo musica A 4 8 3 Calculating cloud shell and core genomes 4 8 4 Estimating core pan genome size by sampling genomes e 4 8 5 Estimating average identity matrices o a 4 8 6 Finding out best hits of a particular Sequence o 4 9 A script to test most get_homologues features with a sample dataset 5 Frequently asked questions FAQs SL a ETE G2 WE DTG tada a A AO a A Raed Do a EL AA a Re ee ee ee ee ee ee ae 6 Frequent warnings and error messages 7 Credits and references 45 45 47 49 51 52 1 Description This document describes the software get_homologues and provides a few examples on how to use it get_homologues is mainly written in the Perl programming language and includes several algorithms designed for three main tasks 1 Clustering protein and nucleotide sequences in homologous possibly orthologous groups on the grounds of sequence similarity 2 Identification of orthologous groups of intergenic regions flanked by orthologous open reading frames ORFs conserved across related genomes 3 Definition of pan and core genomes by calculation of overlapping sets of proteins While the program was mainly developed for the stu
4. are required to compute them What is the advantage of providing multiple cluster output directories to the d dir1 dir2 dir3 option of compare_clusters pl When provided with the output directories holding the clusters generated by 2 or 3 algorithms the script will select only those clusters that contain exactly the same sequences in each of the clusters This may be valuable for example if the user is interested in defining a very robust core genome set containing only those families with exactly the same members independently of the chosen algorithm However it is possible that several otherwise important families get lost for downstream analyses such as presence absence analyses of gene families in pairs of lineages which can be done with parse_pangenome_matriz pl We have sequenced 25 bacteria genomes and used get_homologues pl to get orthologs included in all the 25 strains As I already used BDBH method for core genome analysis now I cannot switch to COG method for pangenome matrix generation Could you indicate me how to get the full pangenome matrix using BDBH method BDBH results cannot be used for pangenome analysis but you could re run the software with the same 25 input genomes now adding t 0 M for OMCL probably the best choice for such an analysis This will re use all blast results previously calculated and resume until OMCL analysis is completed Usually core sets produced by BDBH COGS and OCML are very similar
5. It is important to note that during subsequent calculations these clusters are represented by the first sequence found in each However the output of the program will include all pre clustered sequences for convenience All remaining flags are options that can modify the default behavior of the program which is to use the bidirectional best hit algorithm BDBH in order to compile clusters of potential orthologous ORFs taking the smallest genome as a reference By default protein sequences are used to guide the clustering thus relying on BLASTP searches Perhaps the most important optional parameter would be the choice of clustering algorithm Table 2 name option BDBH default Starting from a reference genome keep adding genomes stepwise while storing the sequence clusters that result of merging the latest bidirectional best hits as illus trated in Figure 3 COGS G Merges triangles of inter genomic symmetrical best matches as described in PubMed 20439257 Note that a single sequence might occasionally be included in several COGS clusters with option x OMCL M OrthoMCL v1 4 uses the Markov Cluster Algorithm to group sequences with in flation F controlling cluster granularity as described in PubMed 12952885 Table 2 List of available clustering algorithms The remaining options are now reviewed Apart from showing the credits option v can be helpful after installation for it p
6. noted as misc_feature in the corresponding GenBank files Miscellaneous features report on diverse things such as cryptic prophages the target sites of resolvases involved in replicon replication and many more miscellaneous features Note that there is great heterogeneity both in the format and detail of annotating genome features By default get homologues pl will extract only the CDS feature i e the sequences for protein coding se quences from the GenBank file By using option a the user can select which features to parse such as RNA or TRNA Could you explain what the inflation parameter of orthoMCL is and how to decide what value to use The original OrthoMCL paper PubMed 12952885 explains it changing the inflation index affects cluster tightness Lower inflation values result in the inclusion of more seguences in fewer groups whereas increasing the inflation index fragments clusters and reduces the number of seguences included In our benchmarks with bacterial genomes this parameter shows a very modest negligible effect I ve run the get homologues pl pipeline some time ago I would like to add some new genomes to the previous analysis How do I proceed so as to reuse as much of the previous computations as possible If you still conserve the original input folder with FASTA or GenBank sequence files and the results _homologues directory both contained in the same directory all you will have
7. 40k 50k 79 22 198 225 231 245 a b a gt Figure 11 Map of plasmid OX105 highlighting 7 genes absent in pKP96 By default parse_pangenome_matriz pl requires present genes to be present in all genomes of A and none of B However as genomes might not be completelly annotated it is possible to make these tests more flexible by controlling the cutoff for inclusion by using flag P For instance the next command will require genes to be present only in 90 of A genomes and missing in 90 of B genomes parse_pangenome_matrix pl m sample_intersection pangenome_matrix_t0 tab A sample_plasmids_gbk A txt B sample_plasmids_gbk B txt g P 90 Note that the most flexible way of finding out genes absent in a set of genomes within a pangenome matrix is by using option a which does not require an A list rather a B list is sufficient It is called as in the example parse_pangenome_matrix pl m sample_intersection pangenome_matrix_t0 tab B sample_plasmids_gbk B txt a 37 4 8 3 Calculating cloud shell and core genomes parse_pangenome_matriz pl can also be employed to classify genes in these four compartments core Genes contained in all considered genomes taxa soft core Genes contained in 95 of the considered genomes taxa as in the work of Kaas and collaborators PubMed 23114024 cloud Genes present only in a few genomes taxa The cutoff is defined as the class next to the most populated non core cluster class
8. 546 357 357 546 574 357 574 1 420 82 466 466 466 523 324 324 317 523 324 466 2 327 36 319 319 434 315 310 315 313 318 309 319 3 310 4 313 313 313 311 304 304 311 313 304 313 same as first example As can be seen the output now contains two data frames which summarize the genome composition analysis done by sampling which are also stored as tab separated files These text files can be used to plot the core and pan genome with help from the accompanying script plot_pancore_matrix pl A suitable command would be plot_pancore_matrix pl i sample_buch_fasta_homologues core_genome_algBDBH tab f core_Tettelin The script also supports the core function as modified by Willenbrock and collaborators PubMed 18088402 as shown on the next figure in a more realistic set of 35 genomes Both fits can be superimposed by calling option f core both Besides standard core and pan genomes it is possible to estimate the evolution of the soft core genome which is a relaxed version of the core that considers genes found in a fraction by default 0 95 of genomes and thus accommodates some annotation or assembly errors This experiment can be done with either the OMCI or COGS algorithms by invoking options c z The resulting data file can be plotted the same way The genome composition analyses presented so far are actually random sampling experiments It is thus worth mentioning that the user can control the order in which genomes are s
9. 8 4 Please check section 2 3 for the requirements of this script check BDBHs pl is a script that can be used after a previous get homologues run to find out the bidirec tional best hits of a sequence identifier chosen by the user It can also retrieve the Pfam annotations of a sequence and its reciprocal best hits See section 4 8 6 add_pancore_matrices pl can be used to add pan core matrices produced by previous get_homologues c R runs on the same set of genomes with the aim of combining default CDS clusters and a TRNA tRNA results add_pangenome_matrices pl can be used similarly to add two pangenome matrices produced by com pare_clusters pl for instance from sets of CDS and rRNA tRNA clusters In addition two shell scripts are also included plot_matriz_heatmap sh calculates ordered heatmaps with attached row and column dendrograms from squared tab separated numeric matrices which can be presence absence pangenomic matrices or similarity identity matrices as those produced by get_homologues with flag A From the later type of matrix a distance matrix can optionally be calculated to drive a neighbor joining tree See example on section 4 8 1 hcluster_matriz sh generates a distance matrix out of a tab separated numeric matrix which is then used to call R functions hclust and heatmap 2 in order to produce a heatmap To check the options of any of these scripts please invoke them from the terminal with flag h For inst
10. M t 0 get_homologues pl d test_Streptococcus G t 0 2 2 build consensus pangenomic matrix compare_clusters pl d test Streptococcus homologues S f0 Otaxa algC0G e0 N test Streptococcus homologues S f0 Otaxa algOMCL e0 m T o test Streptococcus intersection 2 3 plot pangenomic compartments and estimate core and pan genome size with mixture models parse_pangenome_matrix pl m test Streptococcus intersection pangenome matrix t0 tab s 2 4 check accesory genes present in A genomes and absent in B parse_pangenome_matrix pl m test_Streptococcus_intersection pangenome_matrix_t0 tab A test_Streptococcus A list B test_Streptococcus B list g e Readers looking for a fully worked out example including a comprehensive interpretation of results can read our book chapter Robust Identification of Orthologues and Paralogues for Microbial Pan Genomics Using GET HOMOLOGUES A Case Study of pIncAC Plasmids PubMed 25343868 44 5 Frequently asked questions FAQs 5 1 Installation e Is it necessary to have a computer cluster to run get_homologues pl Not really It can be run on a single machine but if you have say more than microbial genomes the required all against all BLAST jobs would take some time to complete depending on your hardware If this is your case we recommend setting option n to the number of cores of your processor as this way BLAST HMMER jobs will be split and sent to different
11. for users employing these clusters for primer design for instance x allows COG generated sequence clusters to contain the same sequence in more than one cluster F is the inflation value that governs Markov Clustering in OMCL runs as explained in PubMed 12952885 As a rule of thumb low inflation values F 1 result in the inclusion of more sequences in fewer groups whilst large values produce more smaller clusters F 4 A tells the program to produce a tab separated file with average sequence identity values among pairs of genomes computed from sequences in the final set of clusters see also option t By default these identities are derived from BLASTP alignments and hence correspond to amino acid sequence identities However as explained earlier option a forces the program to use nucleotide sequences and run BLASTN instead and therefore a CDS combined A will produce genomic average nucleotide sequence identities ANI as used in the literature to help define prokaryotic species PubMed 19855009 z can be called when performing a genome composition analysis with clustering algorithms OMCL or COGS In addition to the core and pan genome tab separated files mentioned earlier see option c this flag requests a soft core report considering all sequence clusters present in a fraction of genomes defined by global variable SOFTCOREFRACTION with a default value of 0 95 This choice produces a composition report more robu
12. gt h 1 else fa h _ END foreach h sort keys 4 fa fa h Bio Seq gt new seq gt fa h gt translate gt seq print h n fa h n JJ your_CDS_file fna 3 4 Program options print version credits and checks installation directory with input FASTA files faa fna GenBank files gbk 1 per genome or a subdirectory subdir clusters subdir_ with pre clustered sequences faa fna allows for new files to be added later creates output folder named directory_homologues input amino acid FASTA file with taxon names in headers creates output folder named file_homologues Optional parameters 70 only run BLAST Pfam searches and exit report genome composition analysis set random seed for genome composition analysis save memory by using BerkeleyDB default parsing stores sequence hits in RAM runmode local cluster number of threads for BLAST HMMER in local runmode file with faa gbk files in d to be included Algorithms instead of default bidirectional best hits BDBH G M use COGtriangle algorithm COGS PubMed 20439257 use orthoMCL algorithm OMCL PubMed 12952885 Options that control seguence similarity searches C min coverage in BLAST pairwise alignments max E value require equal Pfam domain composition when defining similarity based orthology min seguence identity in BLAST query subj pairs min BLAST neighborhood correlatio
13. in the analysis 48 Within lib marfil_homology pm there is a global variable BLAST_DB_SIZE set to 100_000_000 for that purpose That value is the fixed effective search space during BLAST searches so that any resulting E values are comparable even across experiments or algorithm BDBH OMCL COGS I would like to get information of inparalogues and orthologues from each genome I found several files in the tmp directory such as inparalogues_Buch_aph_APS faa Could you explain about the files or a method to extract the information According to our working definition all sequences grouped together in the same faa fna cluster are likely orthologues although you should always keep in mind that orthology is an evolutionary concept and therefore sequence based approaches such as those in get_homologues pl are simpler approximations What is different with inparalogues They are supposed to be duplicated genes that appeared after species separation and therefore their orthology relationships are many to one or many to many Inparalogues are easy to spot in clusters produced by get_homologues pl because they are 2 sequences from the same genome in the same cluster cluster If you wish to know which inparalogue is most similar to an orthologous gene the best option is to run check_BDBHs pl which is explained on Section 4 8 6 I have been using GET yOMOLOGU ESandIcouldnot figureoutwhichisthede f aultvalue forsavingblastphits Imean 1 max argetseq
14. interested in finding plasmid genes present in Klebsiella orytoca which are not encoded in K pneumoniae KP96 In order to do this we first create a couple of text files to define sets A and B called A txt and B txt which we place inside folder sample_plasmids_gbk The content of A and B files should be one line per species In this example file A txt contains a single line K_oxytoca_plasmid_pK0X105 gb As well as B txt K_pneumoniae_KP96_plasmid_pKP96 gb We can now execute the script as follows parse_pangenome_matrix pl m sample_intersection pangenome_matrix_t0 tab A sample_plasmids_gbk A txt B sample_plasmids_gbk B txt g The output should be l taxa included in group A WW taxa included in group B matrix contains 76 clusters and 12 taxa 36 finding genes absent present in both groups file with genes present in set A and absent in B 7 pangenome_matrix_t0__pangenes_list txt It can be seen that 7 genes were found to be present in A and absent in B In the case of pangenome matrices derived from GenBank files as in this example it is possible to produce a map of these genes in the genomic context of any species included in A which should be queried using option 1 A valid syntax would be parse_pangenome_matrix pl m sample_intersection pangenome_matrix_t0 tab A sample_plasmids_gbk A txt B sample_plasmids_gbk B txt g p Klebsiella oxytoca K0X105 Ok 10k 20k 30k
15. obtain the core and pan genomes My plan is to run get homologues with BDBH and b to speed up the process How can I choose the most appropriate reference genome Option b is only suited for core genome calculation not pangenome If this is really the desired task the genome with the least number of annotated genes should be used as a reference which is what the program would do by default or else the best annotated among small genomes However note that this sort of core genome calculation is most sensitive to missing genes usually due to poor automatic annotations which is why compiling a pangenome matrix is recommended when possible see Section 4 8 1 so that a more robust soft core can be estimated Why does option t 0 not work with BDBH in get_homologues pl Is this also the reason BDBH cannot be used in a pangenome matrix analysis The reason is that BDBH uses a single reference genome and thus by definition cannot track genes not present in the reference Therefore pangenome matrices produced by the BDBH algorithm would be incomplete considering only clusters including genes from the reference genome For this reason BDBH is adequate for core genome calculation but not for pan genomes When the initial BLAST is being performed does get_homologues pl take into account the database size i e the number of genes being BLASTed or because the BLASTing is not done as an all vs all manner do you not consider this as a factor
16. output YYYY seems to be empty please remove in put_homologues and re run Another BLAST Pfam error which can happen if the pro grams fails to parse the results The simplest solution is usually to do as suggested and re run WARNING please remove rename re sults directory XXXX if you change the sequences in your gbk faa files or want to re run This warning is issued only to make it clear that the pro gram is recycling previous BLAST results which is usually a good idea unless you specifically changed the contents of your input files which should t be that common EXIT cannot compile intergenic clusters as not all input GenBank files are valid This message appears when the user requested intergenic clusters option g but not all parsed GenBank files con tained nucleotide sequences The solution is to check the input files and correct the offending one which likely is uncomplete and lacks the nucleotide sequence at the bot tom WARNING skipping cluster 123_XXX fna seems to duplicate 456_Y Y Y fna This is issued by compare_clusters pl when it finds usu ally singleton clusters with identical sequences produced by the COG or OMCL algorithms This can happen when such clusters contain short sequences or perhaps with com position biases that yield few or even no BLAST hits when compared to all other sequences in a given setup As these kinds of clusters can confound posterior ana
17. sequences taxa considered 12 sequences 756 residues 184339 MIN_BITSCORE_SIM 16 0 mask EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ _algBDBH sed running BLAST searches done concatenating and sorting blast results sorting _E_coli_ST131_plasmid_pKC394 gb results 0 026MB sorting _E_coli_plasmid_pMURO50 gb results 0 026MB sorting _IncN_plasmid_R46 gb results 0 026MB sorting _K_oxytoca_plasmid_pKOX105 gb results 0 031MB sorting _K_pneumoniae_12_plasmid_12 gb results 0 036MB sorting _K_pneumoniae_9_plasmid_9 gb results 0 027MB sorting _K_pneumoniae_KP96_plasmid_pKP96 gb results 0 026MB sorting _S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2 gb results 0 025MB sorting _Uncultured_bacterium_plasmid_pRSB201 gb results 0 029MB sorting _Uncultured_bacterium_plasmid_pRSB203 gb results 0 023MB sorting _Uncultured_bacterium_plasmid_pRSB205 gb results 0 026MB sorting _Uncultured_bacterium_plasmid_pRSB206 gb results 0 026MB done parsing blast result sample_plasmids_gbk_homologues tmp all blast 0 33MB parsing blast file finished creating indexes this might take some time lines 7 61e 03 construct_taxa_indexes number of taxa found 12 number of file addresses 7 6e 03 number of BLAST queries 7 6e 02 25 clustering orthologous sequences clustering inparalogues in E_coli_plasmid_pMURO50 gb reference 2 sequences EENI looking for valid ORF clusters n_of_taxa 12 number
18. shell Remaining genes present in several genomes taxa Table 3 Definitions of pangenome compartments as used by parse_pangenome_matricz pl The script is invoked as follows parse_pangenome_matrix pl m sample_intersection pangenome_matrix_t0 tab s The output is as follows matrix contains 79 clusters and 12 taxa cloud size 27 list sample_intersection pangenome_matrix_t0__cloud_list txt shell size 19 list sample_intersection pangenome_matrix_t0__shell_list txt soft core size 33 list sample_intersection pangenome_matrix_t0__softcore_list txt HH H core size 24 included in soft core list sample_intersection pangenome_matrix_t0__core_list txt shell bar plots sample_intersection pangenome_matrix_t0__shell png shell pdf shell circle plots sample_intersection pangenome_matrix_t0__shell_circle png circle pdf pan genome size estimates Snipen mixture model PMID 19691844 shell_estimates tab Core size Pan size BIC LogLikelihood components 24 79 521 631345479665 254 261500961132 components 23 82 369 668268427632 173 910514582648 components 12 82 367 79871965143 168 606292342080 components 12 82 376 537523230116 168 606246278956 components 11 82 385 276721241776 168 606397432319 components 11 82 394 015364162271 168 606271040100 components 12 82 402 754305849755 168 606294031375 components 11 82 411 493489104614 168 606437806337 10 components 11 82 420 232678626918 168 60658
19. to do is to copy the new sequence files to the input folder and re run get_homologues pl This will ensure that only new required BLAST PFam searches are completed conserving the previous results as much as possible How does get_homologues pl decide how to name a certain cluster file Is this affected by the use of r Cluster files are named using gene names from the reference genome It is possible to change the way clusters are named by editing subroutine extract gene name in file lib phyTools pm What happens if I perform the above explained steps but using a different reference genome The most obvious effect is that any resulting clusters will now be named according to gene names of the new reference genome A more subtle consequence for BDBH jobs is that now all genomes are compared to this reference see figure 3 and this will change the order in which bidirectional best hits are computed How are the gene clusters named if no r reference genome is specified In this case the genome with the least number of genes features will be taken as the reference and resulting sequence clusters will be named according to gene names of this reference genome which might not be the best annotated genome in your set For this reason it is often a good idea to set as reference genome one with a good annotation for instamce the species or strain described in RefSeq I have 40 draft genomes annotated in gbk format and I am using get_homologues to
20. 10 m local n 2 M 0 G 0 P 0 C 75 S 1 E 1e 05 F 1 5 N 0 B 50 s 0 D O g O a 0 x RO results_directory sample_buch_fasta_homologues parameters MAXEVALUEBLASTSEARCH 0 01 MAXPFAMSEQS 250 checking input files Buch_aph_APS faa 574 Buch_aph_Bp faa 507 Buch_aph_Cc faa 357 Buch_aphid_Sg faa 546 4 genomes 1984 sequences taxa considered 4 sequences 1984 residues 650959 MIN_BITSCORE_SIM 17 2 mask BuchaphCc_f0_alltaxa_algBDBH_e0_ _algBDBH running makeblastdb with sample_buch_fasta_homologues Buch_aph_APS faa fasta running makeblastdb with sample_buch_fasta_homologues Buch_aph_Bp faa fasta running makeblastdb with sample_buch_fasta_homologues Buch_aph_Cc faa fasta running makeblastdb with sample_buch_fasta_homologues Buch_aphid_Sg faa fasta running BLAST searches done concatenating and sorting blast results sorting _Buch_aph_APS faa results 0 12MB sorting _Buch_aph_Bp faa results 0 11MB sorting _Buch_aph_Cc faa results 0 084MB sorting _Buch_aphid_Sg faa results 0 11MB done parsing blast result sample_buch_fasta_homologues tmp all blast 0 42MB parsing blast file finished creating indexes this might take some time lines 9 30e 03 construct_taxa_indexes number of taxa found 4 number of file addresses 9 3e 03 number of BLAST queries 2 0e 03 19 clustering orthologous sequences clustering inparalogues in Buch_aph_Cc faa reference O sequences clustering inparalogues in Buch_
21. 4715022 Sample 24 79 NA NA O 0 SI 0 01 BAUN Apart from text files listing the cluster names that take part in each of the four compartments two types of plots are generated Note that the output also includes estimates of the pan and core genome sizes as calculated by the binomial mixture model of Snipen and collaborators PubMed 19691844 A simple interpretation is that as soon as likelihood converges then adding more components does not improve the mixture model Please check that paper for a full explanation 38 total clusters 79 cloud shell soft core O com o N 2 2 o 3 51 o E Q D o 5 o E JE al 1 2 3 4 5 6 7 8 9 10 11 12 number of taxa in clusters Figure 12 Barplot of the pangenome matrix created in section 4 8 1 Core clusters are in white for clarity but note that according to the definitions in Table 3 the soft core also includes the strict core cloud 27 taxa lt 3 shell 19 softcore 33 taxa gt 11 core 24 taxa 12 total gene clusters 79 taxa 12 Figure 13 Area plot of the pangenome matrix created in section 4 8 1 Note that the soft core compartment includes also the core as implied by the definition in Table 3 39 4 8 4 Estimating core pan genome size by sampling genomes The pioneer work of Tettelin and collaborators PubMed 16172379 unveiled that bacterial genomes are dy namic containers that harbour essential genes and also acces
22. A reduced binary matrix in a format suitable for PHYLIP discrete character analysis software which looks like this 12 180 lt 12 taxa 180 clusters 0000000000 0000000011001000000000100000111111101100000000000 0000000001 0000000010000000000000000000000000001000000000000 0000000002 0000000010000000000000100000000001111111111111111 0000000003 0000000010000000000000000000000000000000000000000 0000000004 0000000010000000000000000000000000000000010000000 0000000005 0000000110011100100000101000000000100001000000000 0000000006 0000000011000000000000100000000000000000000000000 0000000007 0000000001011000000001111111111110000000000000000 0000000008 0000000101111111111111000000000000000000000000000 0000000009 0000000000000000000000000000000000000000000000000 0000000010 0000000010001000000000000000000000000000000000000 0000000011 1111111110000000000000000000000000000000000000000 Indeed when option T is toggled as in the next example compare_clusters pl o sample_intersection m T d N sample_plasmids_gbk_homologues Uncultured _f0_Otaxa_algCOG_e0_ sample_plasmids_gbk_homologues Uncultured _f0_0taxa_alg0MCL_e0_ then the script calls program PARS from the PHYLIP suite to produce one or more alternative parsimony trees that capture the phylogeny implied in this matrix adding the following lines to the produced output parsimony results by PARS PHYLIP suite evolution genetics washington edu phylip doc pars html pangenome
23. AATACCCTCCCTGGTAGTCTTTAGCGTAACGATTCAGAAAGGACTGAATGAAGTGATCTGCGCTGAAGAAAGCG CCACGAAATGCCGCAGGCATGAAGTTCATGCGGGCGTTTTCAGAAATGTAGCGGGCGGTGATTTCGATAGTTTCCATgatacttcctctttaagccgataccg gcgatggttaagcggcaggcacatcacctgccactttttaattatcgtacaatggggcgttaaagtcaatacaagtacggattatatttacctaattttatgc ccgtcagagcatggaaggcgacctcgccggactccaccggacaccgggggcaaatcgccggaaactgcgggactgaccggagcgacaggccaccccectccct getagcccgccgccacgcggccggttacaggggacactgagaaagcagaaagccaacaaacactatatatagcgttcgttggcagctgaagcagcactacata tagtagagtacctgtaaaacttgccaacctgaccataacagcgatactgtataagtaaacagtgatttggaagatcgctATGAAGGTCGATATTTTTGAAAGC TCCGGCGCCAGCCGGGTACACAGCATCCCTTTTTATCTGCAAAGAATTTCTGCGGGGTTCCCCAGCCCGGCCCAGGGCTATGAAAAGCAGGAGTTAAACCTGC ATGAGTATTGTGTTCGTCACCCTTCAGCAACTTACTTCCTGCGGGTTTCTGGC gt 2 intergenic3 coords 9538 10293 length 756 neighbours GI 209574108 1 GI 209574109 1 CGCGCCATTGCTGGCCTGAAGGTATTCCCAATACCCTCCCTGGTAGTCTTTAGCGTAACGATTCAGAAAGGACTGAATGAAGTGATCTGCGCTGAAGAAAGCG CCACGAAATGCCGCAGGCATGAAGTTCATGCGGGCGTTTTCAGAAATGTAGCGGGCGGTGATTTCGATAGTTTCCATgatacttcctctttaagccgataccg gcgatggttaagcggcaggcacatcacctgccactttttaattatcgtacaatggggcgttaaagtcaatacaagtacggattatatttacctaattttatgc ccgtcagagcatggaaggcgacctcgccggactccaccggacaccgggggcaaatcgccggaaactgcgggactgaccggagcgacaggccaccccectccct getagcccgccgccacgcggccggttacaggggacactgagaaagcagaaagccaacaaacactatatatagcgttcgttggcagctgaagcagcactacata tagtagagtacctgtaaaacttgccaacctgaccataacagcgatactgtataagtaaacaGTGATTTGGAAGATCGCTATGAAGGTCGATATTTTTGAAAGC TCCGG
24. CGCCAGCCGGGTACACAGCATCCCTTTTTATCTGCAAAGAATTTCTGCGGGGTTCCCCAGCCCGGCCCAGGGCTATGAAAAGCAGGAGTTAAACCTGC ATGAGTATTGTGTTCGTCACCCTTCAGCAACTTAC 32 4 8 Performing genome composition analyses The next few examples illustrate how get homologues pl might be used to analyze the genomic evolution of a group of related organisms the core genome and the pan genome using the terms coined by Tettelin and collaborators PubMed 16172379 4 8 1 Obtaining a pangenome matrix First we will try option t 0 in combination with the OMCL or the COG algorithms By enforcing this option we are actually asking for all possible clusters including those which might not contain sequences from all input genomes taxa For this reason the use of this option usually means that a large number of clusters are reported This is particularly true for COG runs since this algorithm does not resolve clusters involving less than 3 genomes The default algorithm BDBH is not available with this option For instance by calling get_homologues pl d sample_plasmids_gbk t 0 G we obtain 199 clusters looking for valid ORF clusters n_of_taxa 0 number_of_clusters 199 cluster list EcoliplasmidpMUR050 f0 Otaxa algCOG e0 cluster list cluster directory sample plasmids gbk homologues EcoliplasmidpMUR050 f0 Otaxa algC0G e0 By choosing the OMCL algorithm we obtain a smaller set of clusters which we can test by typing on the terminal get_homologue
25. Cc_f0_alltaxa_algBDBH_e0_ to gether with file sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algBDBH_e0_ cluster_list which lists the found clusters and their taxa composition It can be seen that the folder name contains the key settings used to cluster the sequences contained therein 20 BuchaphCc_f0_alltaxa_algBDBH_e0_ l e option was not used inparalogues are in the clustering algorithm is BDBD default all clusters contain at least 1 sequence from each taxa default t behavior f option not used no length filtering reference proteome In this case a total of 305 protein sequence clusters are produced which include the original FASTA headers plus information of which segment was actually aligned by BLAST for inclusion in the cluster gt gil116515296 Rho Buchnera aphidicola str Cc Cinara cedri aligned 1 419 420 MNLTKLKNTSVSKLIILGEKIGLENLARMRKQDIIFSILKQHSKSGEDIFGDGVLEILQDGFGFLRSSDSSYLAGPDDIYVSPS gt gi 15617182 termination factor Rho Buchnera aphidicola str APS aligned 1 419 419 MNLTALKNMPVSELITLGEKMGLENLARMRKQDIIFAILKQHAKSGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPS gt gi 27905006 termination factor Rho Buchnera aphidicola str Bp aligned 1 419 419 MNLTALKNIPVSELIFLGDNAGLENLARMRKQDIIFSILKQHAKSGEDIFGDGVLEILQDGFGFLRSSDSSYLAGPDDIYVSPS gt gi 21672828 termination factor Rho Buchnera aphidicola str Sg aligned 1 419 419 MNLTALKNMPVSELITLGEKMGLENLARMRKQDIIFAILKQHAKSGEDIFGDGVLEI
26. LQDGFGFLRSADSSYLAGPDDIYVSPS If we wanted to test a different sequence clustering algorithm we could run get_homologues pl d sample_buch_fasta G which will produce 298 clusters employing the COG triangles algorithm see Table 2 in folder sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algC0G_e0_ Furthermore typing get_homologues pl d sample_buch_fasta M produces 308 clusters employing the OMCL algorithm in folder sample buch fasta homologues BuchaphCc f0 alltaxa algOMCL e0 Now we can make use of script compare clusters pl to get the intersection between these cluster sets and choose only the consensus subset We will need to type without any blanks between folder names in a single long line and execute compare_clusters pl o sample intersection d N sample buch fasta homologues BuchaphCc f0 alltaxa algBDBH e0 N sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algC0G_e0_ N sample buch fasta homologues BuchaphCc f0 alltaxa algOMCL e0 The following output is produced t number of input cluster directories 3 parsing clusters in sample buch fasta homologues BuchaphCc f0 alltaxa algBDBH e0 cluster list in place will parse it BuchaphCc_f0_alltaxa_algBDBH_e0_ cluster_list number of clusters 305 parsing clusters in sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algC0G_e0_ cluster_list in place will parse it BuchaphCc_f0_alltaxa_algC0G_e0_ cluster_list number of clusters 298 parsi
27. TATDNSGNRIKELQLVYNKVRQANITQELNEI VSGASAVSID gt gi 21672839 ref NP_660906 1 Buchnera aphidicola str Sg Schizaphis graminum MHLNKMKKVSLKTYLVLFFLIFFIFCSFWFIKPKEKKLKLEKLRYEEVIKKINAKNNQNLKSVENFITEN KNIYGTLSSLFLAKKY ILDKNLDKALIQLNNSLKYTKEENLQNILKIRIAKIKIQQNKNQDAIKILEEIK DNSWKNIVENMKGDIFMKNKEIKKAILAWKKSKYLEKSNASKEIINMKINEIKR It is possible to analyze the provided sample input file sample buchnera faa with the following command get_homologues pl i sample buchnera faa Obtaining t results directory sample buchnera homologues parameters MAXEVALUEBLASTSEARCH 0 01 MAXPFAMSEQS 250 t checking input files t sample buchnera faa created file sample buchnera homologues tmp all fa 4 genomes 1984 sequences taxa considered 4 seguences 1984 residues 650959 MIN BITSCORE SIM 17 2 mask BuchneraaphidicolastrCcCinaracedri3 f0 alltaxa algBDBH e0 _algBDBH running makeblastdb with sample buchnera homologues tmp all fa t running local BLAST search done parsing blast result sample_buchnera_homologues tmp all blast 0 44MB parsing blast file finished creating indexes this might take some time lines 9 30e 03 construct_taxa_indexes number of taxa found 4 number of file addresses 9 3e 03 number of BLAST queries 2 0e 03 clustering orthologous sequences 23 clustering inparalogues in Buchnera_aphidicola_str__Cc__Cinara_cedri__3 faa reference O sequences nal looking f
28. Therefore most of your previously 49 tested core genes should also be picked up by OCML on this second run and presumably you could now proceed to pangenome analysis I am a PhD student doing some work on different strains My input is gbk from draft genomes and my aim at this point is to see what are core genes in each of 3 groups and which of these are shared with the other 2 groups I have been working under the impression that I can make 3 different groups and compare them or can I only make 1 big group and compare the results for different methods You have calculated 3 core genomes from 3 different sets of strains and now you want to know the subset of core genes present in all individual core genomes If you check compare_clusters pl option you ll see that it says by default cluster dirs are expected to be derived from the same taxa which is exactly what s failing in your examples If you want to find common core clusters present if all 3 sets and also those shared by only some of the taxa I guess you should build a pangenome matrix by running get_homologues pl with all strains together with option t 0 The matrix itself will give you the clusters present absent in all and some of the strains which I guess is what you need The script parse pangenome matrix pl can read this matrix and further help you identify these clusters Could you please give a use case for compare_clusters pl r Option r can be used to compare a list of
29. _identity tab Color Key Pan genome tree gower dist Value Uncultured_bacterium_plasmid_pRSB206 gb_ Uncultured_bacterium_plasmid_pRSB205 gb S_enterica_subsp_enterica_serovar_Dublin_f Uncultured_bacterium_plasmid_pRSB201 gb_ Uncultured bacterium plasmid pRSB203 gb E coli plasmid pMURO050 gb features E coli ST131 plasmid pKC394 gb features IncN plasmid R46 gb features K pneumoniae KP96 plasmid pKP96 gb fe K oxytoca plasmid pKOX105 gb features e 3 a K pneumoniae 12 plasmid 12 gb features a o a K pneumoniae 9 plasmid 9 gb features b features features features features features features features IncN_plasmid_R46 gb_features _plasmid_pKP96 gb plasmid_pRSB203 gb 1 plasmid pMAK2 gb 1 plasmid pRSB205 gb 1 plasmid pRSB206 gb E coli plasmid pMUR050 gb features _plasmid_pRSB201 gb K_oxytoca_plasmid_pKOX105 g E_coli_ST131_plasmid_pKC394 gb_features K_pneumoniae_9 plasmid_9 gb_features K_pneumoniae_12 plasmid_12 gb features meumoniae_KP96 ured_bacterium ured_bacterium a_serovar_Dublin ured bacterium ured bacterium Figure 16 Example heatmap derived from an average nucleotide identity matrix calculated with get_homologues pl 42 4 8 6 Finding out best hits of a particular sequence After running get homologues with almost any set of parameters you will always end up with a lot of BLAST files of all against all involv
30. _of_clusters 24 cluster_list _homologues EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ cluster_list cluster directory sample_plasmids_gbk_homologues EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ runtime 28 wallclock secs 1 62 usr 0 33 sys 24 31 cusr 1 57 csys 27 83 CPU RAM use 19 5 MB This outcome is similar to that explained in example 4 1 with the notable difference that now both protein and nucleotide sequence clusters 24 are produced as GenBank files usually contain both types of sequences File EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ cluster_list cluster_list summarizes the contents and composition of the clusters stored in folder EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ For instance the data concerning cluster 100_traJ looks like this cluster 100_traJ size 12 taxa 12 file 100_traJ faa dnafile 100_traJ fna E_coli_plasmid_pMUROBO gb E_coli_ST131_plasmid_pKC394 gb IncN_plasmid_R46 gb K_oxytoca_plasmid_pKOX105 gb K_pneumoniae_12_plasmid_12 gb K_pneumoniae_9_plasmid_9 gb K_pneumoniae_KP96_plasmid_pKP96 gb S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAX2 gb Uncultured_bacterium_plasmid_pRSB201 gb Uncultured_bacterium_plasmid_pRSB203 gb Uncultured_bacterium_plasmid_pRSB205 gb Uncultured_bacterium_plasmid_pRSB206 gb The two FASTA files produced for this cluster are now dissected Note that each header includes the co ordinates of the sequence in the context of a g
31. _phylip tree sample_intersection pangenome_matrix_t0 phylip ph pangenome_phylip log sample_intersection pangenome_matrix_t0 phylip log 34 K pneumoniae 9 plasmid 9 gb E coli ST131 plasmid pKC594 gb E coli plasmid pMUROSO gb Incll_plasmid R46 gb K pneumoniae KP96 plasmid pKP96 gb Uncutured bacterium plasmid pRSB203 gb Uncutured bacterium plasmid pRS6206 96 Uncutured bacterium plasmid pRSB201 gb 8 enterica subsp enterica serovar Dublin plasmid pMAK2 gb Uncutured bacterium plasmid pRSB205 gb K oxytoca plasmid pKOXI05 gb K pneumoniae 12 plasmid 12 gb 5 0 Figure 9 Example of pangenomic tree of the consensus COG amp OMCL pangenomic matrix obtained for a few plasmids Such trees can be useful to create the A and B lists discussed in section 4 8 2 Plot produced with FigTree with midpoint root Please note that the resulting Newick format ph file can contain several trees separated by one per line In order to plot them as in the next figure it might be necessary depending on the software used to leave only one This is the case for instance for MEGA or TreeView A complementary view of the same data con be obtained with script plot matrix heatmap sh cd sample intersection plot matrix heatmap sh i pangenome matrix t0 tab 35 Color Key pangenome_matrix_t0 tab Uncultured_bacterium_plasmid_pRSB20 Uncultured_bacterium_plasmid_
32. ampled during these simulations 40 1 expl 91 0 72 2 g 490 5 Xg 1 1243 exp coregenes g 205 414 exp pangenee 9 Ag 1 2772 1 exp 550 600 500 67 35 550 450 500 450 400 residual standard error pan genome size genes core genome size genes 350 400 300 350 genomes g genomes g Figure 14 Core genome left and pan genome right estimates after ten random samples of 4 taxa Fitted curves follow functions first proposed by Tettelin in 2005 PubMed 16172379 Residual standard errors are reported on the right margin as a measure of the goodness of fit 5 1 02 coregenes g 1052 8707 exp genome size genes 2000 2500 3000 3500 4000 4500 5000 5500 13579 12 15 18 21 24 27 30 33 genomes g Figure 15 Core genome estimate after ten random samples of 35 taxa by enforcing a list of genomes with option I already introduced in section 4 7 or by setting the seed of the random number generator with option R In the first case only one sampling is performed and therefore the standard deviation of the core and pan values is zero The second strategy ensures that sampling order is conserved in different program executions and thus allows merging CDS core genomes and non coding genes such as rRNAs core genomes computed separately over the same set of taxa with help from accompanying script add_pancore_matrices pl 41 49 78 re
33. an be installed from the terminal sudo apt get y install r base r base dev Ubuntu Debian based distros yum y install R RedHat and derived distros zypper assume yes R patched R patched devel Suse Please visit CRAN to download and install R on macOSX systems which is straightforward In addition to R itself plot_matriz_heatmap sh and hcluster_matriz sh require some R packages to run which can be easily installed from the R command line with gt install packages c ape gplots cluster dependencies TRUE Finally the script compare_clusters pl might require the installation of program PARS from the PHYLIP suite which should be already bundled with your copy of get_homologues 3 User manual This section describes the available options for the get_homologues software 3 1 Input data The input required to run get_homologues can be of two types 1 A single file with amino acid sequences in FASTA format in which headers include a taxon name between square brackets including at least two words and a first capital letter as shown in the example gt protein 123 Species 1 MANILLLDNIDSFTYNLVEQLRNQKNNVLVYRNTVSIDIIFNSLKKLTHPILMLSPGPSLPKHAGCMLDL PEKFVINSYFEKMIMSVRNNCDRVCGFQFHPESILTTHGDQILEKIIHWASLKYITNKKQ gt gil10923402 Species 2 IKKVKGDIPIVGICLGHQAIVEAYGGIIGYAGEIFHGKASLIRHDGLEMFEGVPQPLPVARYHSLICNKI PEKFVINSYFEKMIMSVRNNCDRVCGFQFHPESILTTHGDQILEKIIHWASLKYITNKKQ 2 A directory or folder containing several file
34. ance typing compare_clusters pl h in the terminal will produce the following this message comma separated names of cluster directories which usually have associated dir cluster_list files These lists avoid parsing taxon names from FASTA headers which might be error prone output directory use nucleotide sequence fna clusters take first cluster dir as reference set which might contain a single representative sequence per cluster use only clusters with syntenic genes use only clusters with single copy orthologues from t taxa 17 produce produce produce produce clusters with single copy seqs from ALL taxa in file intersection pangenome matrices cluster report in OrthoXML format parsimony based pangenomic tree 18 4 A few examples of use This section presents a few different ways of running get_homologues pl and the accompanying scripts with provided sample input data 4 1 Clustering orthologous proteins from a few FASTA files This example takes the provided sample input folder sample_buch_fasta which contains the proteins sets of four Buchnera aphidicola strains and compiles clusters of BDBH sequences which are candidates to be orthologues with this command H HH H HF HH HOH FH get_homologues pl d sample_buch_fasta The output should look like this contained in file sample output txt get_homologues pl i 0 d sample_buch_fasta o 0 e 0 f 0 r 0 t all c 0
35. aph_APS faa 1 sequences finding BDBHs between Buch_aph_Cc faa and Buch_aph_APS faa 324 sequences clustering inparalogues in Buch_aph_Bp faa O sequences finding BDBHs between Buch_aph_Cc faa and Buch_aph_Bp faa 326 sequences clustering inparalogues in Buch_aphid_Sg faa O sequences finding BDBHs between Buch_aph_Cc faa and Buch_aphid_Sg faa 317 sequences looking for valid ORF clusters n_of_taxa 4 number_of_clusters 305 cluster_list sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algBDBH_e0_ cluster_list cluster_directory sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algBDBH_e0_ runtime 64 wallclock secs 0 74 usr 0 08 sys 61 49 cusr 0 47 csys 62 78 CPU RAM use 20 3 MB In summary the output details the processing steps required e Reading and parsing input files Buch_aph_APS faa Buch_aph_Bp faa Buch_aph_Cc faa Buch_aphid_Sg faa which contain 574 507 357 and 546 protein sequences respectively In total there are four input taxa and 1984 sequences e Preparing input sequences for BLAST e Running BLAST searches and sorting the results e Parsing the complete volume of sorted BLAST results e Searching for orthologous sequences using the BDBH algorithm which requires a reference taxon or proteome to start with see Figure 3 e Clustering orthologous sequences and put them in files inside an appropriate folder In this example the relevant output is directory sample_buch_fasta_homologues Buchaph
36. bs on the cluster using options o m cluster The downstream parsing of the BLAST results which is memory consuming could then be executed logged in from one of the more powerful compute nodes or from a powerful server Would it be possible after running the all against all BLAST jobs using option o to log into three different compute nodes and execute the downstream pipeline on each node for a different clustering algorithm i e BDBHs COG and OMCL No as each get_homologues pl job requires exclusive access to the data directory and prevents other jobs to access it simultaneously Does the I option have any impact on the calculation of the core and pan genomes including their graphical representations Of course as this option enforces the program to sample once the list of genomes in the implicit order of the list The fitting of pan core genome functions will be affected as there will be less points to do the noon linear fitting What is the minimum value for BLAST neighborhood correlation parameter This parameter captures the concept of BLAST neighborhood explained in PubMed 18475320 and is calculated as a a positive Pearson correlation coefficient In this context sequences A and B should have a similar list of BLAST matches if they truly are orthologues This parameter is by default set to 0 so it takes no effect As its values approaches to 1 it gets more difficult to call bidirectional best hits as they will be require
37. core genes from a single genome G that is with clusters containing only sequences from G to clusters of a larger group of taxa A B C G that includes G The parsimony tree produced by compare_cluster pls T cannot be opened by MEGA Program PARS from the PHYLIP suite often produces several alternative parsimony trees contained in the pangenome_matrix_t0 phylip ph file one per line Some phylogeny programs such as MEGA require splitting these trees in separate files in order to properly read and plot them Parsing a pangenome matrix with A amp B lists yields zero clusters When a command such as parse_pangenome_matrix pl m pangenome_matrix_t0 tab A A txt B is invoked the passed A amp B files must contain taxon names matching exactly those of corresponding input files including the extension Instead of a list such as taxonA taxonB_001 taxonZ244 the list should be taxonA faa taxonB_001 faa taxonZ244 faa 50 B txt g 6 Frequent warnings and error messages error message practical meaning EXIT cannot find previous input file XXXX please re run everything This can happen when re running the program with an input d directory which used to contain more sequences files or with different names This prevents the software to recycle previous results as it cannot ensure that sequences are still numbered consistently WARNING could not extract nucleotide sequences from file XXXX You ll se
38. could probably set up a windows machine to run get_homologues pl and most of its dependencies Nevertheless we have tested the package only on Linux and macOSX systems and currently the only option for Windows users would be to set up a VirtualBox Linux box and then installing get_homologues pl there NOTE If you don t know how to do this please check these resources Install Ubuntu on VirtualBox or virtualboxes e How much memory does my computer require to run get_homologues pl This depends directly on the size and number of the genomes to be analyzed In our benchmarks jobs with up to approximately 40 microbial genomes can be carried out on a 8Gb RAM Linux box On January 2014 a user reported that he had managed to analyze 176 bacterial genomes with option M on a cluster queue with 80GB of RAM As mentioned in section 3 4 the s flag option reduces the memory footprint of your get_homologues up to 20 fold allowing running large jobs in machines with small RAM resources Therefore if you are running a large job that fails to finish succesfully the most common explanation would be that it was killed by the operating system for taking too much memory In those cases it is advisable to re run the same job with the s flag BDBH OMCL COG BDBHs OMCLs COGs 45000 40000 35000 30000 25000 20000 15000 10000 5000 0 computing time s 10000 9000 8000 7000 6000 5000 4000 3000 2000 1000
39. d Lipman DJ 1997 Gapped BLAST and PSI BLAST a new generation of protein database search programs Nucl Acids Res 25 17 3389 3402 Stajich JE Block D Boulez K Brenner SE Chervitz SA Dagdigian C Fuellen G Gilbert JG Korf I Lapp H Lehvslaiho H Matsalla C Mungall CJ Osborne BI Pocock MR Schattner P Senger M Stein LD Stupka E Wilkinson MD Birney E 2002 The Bioperl toolkit Perl modules for the life sciences Genome Res 12 10 1611 8 hmmscan search sequence s against a profile database HMMER 3 0 March 2010 http hmmer org Copyright C 2010 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 RD Finn J Mistry J Tate P Coggill A Heger JE Pollington OL Gavin P Gunesekaran G Ceric K Forslund L Holm EL Sonnhammer SR Eddy A Bateman 2010 The Pfam protein families database Nucl Acids Res 38 D211 222 If you use the accompanying scripts the following references should also be cited e R Core Team 2013 R A Language and Environment for Statistical Computing http www R project org R Foundation for Statistical Computing Vienna Austria ISBN3 900051 07 0 52
40. d to have very similar lists of BLAST hits Could you explain a bit more about the meaning effect of the t option You should use t 0 zero when you need to get all clusters of homologous sequences This will generate files containing 2 or more homologous sequences from one or more taxa By default get_homologues pl sets t numberOfTaxa that is it will provide the user with clusters of homologous sequences that contain at least one sequence from each taxon Note that singleton clusters just one sequence per organism will be produced if combined with option e which excludes clusters containing inparalogues In this later case the resulting clusters will contain only single copy genes from each taxon i e the orthologues This is convenient if you want to use resulting gene families to do for example genome level phylogenetic analyses using only the repertoire of orthologous single copy genes Note however that such clusters are not suitable for pangenome analyses For such analyses instead please use auxiliary script compare_clusters pl with the set of clusters obtained with t 0 And what about option a what are these other GenBank file features Thoroughly annotated genomes see for example the GenBank file for Escherichia coli K12 MG1655 Es cherichia_coli_K_12_substr_MG1655_uid225 gbk have many features incluing ribosomal rRNAs integenic 47 spacers tRN Aas mat_peptide repeat_region and miscellaneous features
41. directory replacing INSTALL_PATH by the full path of the installation folder export GETHOMS INSTALL_PATH get_homologues_X Y export PATH GETHOMS PATH This change will be effective in a new terminal or after running source bash_profile The rest of this section might be safely skipped if installation went fine it was written to help solve instal lation problems 2 1 Perl modules A few Perl core modules are required by the get_homologues pl script which should be already installed on your system Cwd File Fetch Net FTP FindBin File Basename File Spec Getopt Std Benchmark and Storable In addition the Bio Seq Bio SeqlO Bio Graphics and Bio SeqFeature Generic modules from the Bioperl collection and module Parallel ForkManager are also required and have been included in the get_homologues bundle for your convenience Should this version of BioPerl fail in your system as diagnosed by install pl it might be necessary to install it from scratch However before trying to download it you might want to check whether it is already living on your system by typing on the terminal perl MBio Root Version e print Bio Root Version VERSION If you get a message Can t locate Bio Root Version then you need to actually install it which can sometimes become troublesome due to failed dependencies For this reason usually the easiest way of installing it provided that you have root privileges it
42. dy of bacterial genomes in GenBank format it can also be applied to eukaryotic sets of sequences although in this case the meaning of terms such as core or pan genome might change and intergenic regions might be much larger 2 Requirements and installation get_homologues pl is a Perl program bundled with a few binary files The software has been tested on 32 and 64 bit Linux boxes and on Intel MacOSX 10 8 4 systems Therefore a Perl interpreter is needed to run this software which is usually installed by default on these operating systems In addition the package includes a few extra scripts which can be useful for downloading GenBank files and for the analysis of the results In order to install and test this software please follow these steps 1 Unpack the software with tar xvfz get_homologues_X Y tgz 2 cd get_homologues_X Y 3 install pl Please follow the indications in case some required part is missing 4 Type get homologues pl v which will tell exactly which features are available 5 Test the main Perl script named get homologues p1 with the included sample input folder sample_buch_fasta by means of the instruction get_homologues pl d sample_buch_fasta You should get an output similar to the contents of file sample_output txt 6 Optionally modify your PATH environment variable to include get_homologues pl Please copy the following lines to the bash_profile or bashrc files found in your home
43. e bundled file sample genome list txt First it can be seen that all completed genomes have NC accession numbers and indeed they all have their own folder in the NCBI site for bacterial genomes ftp ftp ncbi nih gov genbank genomes Bacteria These genomes can be added to the download list as follows NC_009381 Yersinia_pestis_Pestoides Moreover there are also a few draft genomes with NZ accessions which can be browsed at ftp ncbi nih gov genomes Bacteria DRAFT These entries can be listed in our download file by adding the four letter code of their accession numbers as follows AAYU Yersinia_pestis_B42003004 Sometimes it might not be possible to successfully find a sequence as happened with ERA000177 when writing this part of the manual Finally the genome_list txt file will look as this Yersinia_pestis_Pestoides_F Yersinia_pestis_Pestoides NC_010159 Yersinia_pestis_Angola NC_005810 Yersinia_pestis_91001 NC_008150 Yersinia_pestis_Antiqua NC_003143 Yersinia_pestis_C092 NC_006155 Yersinia_pestis_1P32953 NC_004088 Yersinia_pestis_KIM AAYU Yersinia_pestis_B42003004 AAYR Yersinia_pestis_UG05 0454 AAYV Yersinia_pestis_E1979001 ABCD Yersinia_pestis_CA88 4125 AAUB Yersinia_pestis_FV 1 ABAT Yersinia_pestis_F1991016 AAOS Yersinia_pestis_IP275 AAYS Yersinia_pestis_MG05 1020 ACNQ Yersinia_pestis_Nepal516 AAYT Yersinia_pestis_K1973002 4ERA000177 Yersinia_pestis_1P674 unavailable Note that only the first two columns separa
44. e this warning when using an uncomplete input GenBank file lacking the nucleotide sequence at the bot tom WARNING can only extract genes not CDSs from file XXXX Occurs when reading a GenBank file lacking CDS features WARNING cannot use nucleotide se quences in file XXXX as they do not match those in file YYYY This warning occurs when a twin XXXX fna file see Table 1 contains a different number of sequences than the corresponding YYYY faa file and cannot therefore be safely used to compile DNA clusters EXIT XXXX does not exist Pfam search failed Occurs when a Pfam job submitted to the cluster option D failed to report back and terminate The solution is often to re run the program as it will only re submit the missing Pfam jobs When solving problems with submit ting jobs to the cluster queue it is helpful to check the queue files EXIT cannot format BLAST sequence base Happens when for some reason the collection of input se quences could not be formatted for BLAST This might surface hard drive trouble or simply an architecture issue EXIT XXXX blastout does not exist BLAST search failed Again a BLAST error spotted for failing to produce a BLAST output Often the solution is simply to re run as this might be simply a cluster overload problem When solving problems with submitting jobs to the cluster queue it is often helpful top check the queue files EXIT parsed XXXX
45. ed taxa The accompanying script check_BDBHs pl can help you find out which are the best BLAST hits of any sequence that might interest you First as cluster names are not always informative you ll need to find out the internal identifier used by the software to handle your target sequence For instance we might want to investigate protein CspE among our 4 Buchnera taxa The command to achieve this would be check_BDBHs pl d sample_buch_fasta_homologues g i CspE And we obtain Sequences containing label CspE 1360 Buch_aph_Cc faa gi 116515229 ref YP_802858 1 CspE Buchnera aphidicola str Cc Cinara cedri Now that we know the identifier 1360 we can check its best hits check_BDBHs pl d sample_buch_fasta_homologues i 1360 Output contains the identifiers of best hits in both directions their bit scores E values alignment coverages and annotated Pfam protein domains when available query 1360 query fullname gil 116515229 ref YP_802858 1 CspE Buchnera aphidicola str Cc Cinara cedri list of bidirectional best hits dir query sbjct bits Eval cover Pfam annotation Buch_aph_Bp faa gt 1360 972 135 le 41 100 0 NA gil27904911 cold shock protein E lt 972 1360 135 le 41 100 0 NA Buch_aph_APS faa gt 1360 467 136 4e 42 100 0 NA gil15617086 cold shock protein E lt 467 1360 136 4e 42 100 0 NA Buch_aphid_Sg faa gt 1360 1883 136 4e 42 100 0 NA gil21672738 cold shoc
46. enomes in GenBank format from NCBI FTP site cd test_Streptococcus download_genomes_ncbi pl test_Streptococcus_download_list txt cd 1 1 run BLAST jobs with 4 CPU cores and optionally HMMER get_homologues pl d test_Streptococcus n 4 o get_homologues pl d test_Streptococcus n 4 D o 1 2 calculate core genomes with all BDBH OMCL amp COG algorithms get_homologues pl d test_Streptococcus get_homologues pl d test_Streptococcus M get_homologues pl d test_Streptococcus G get_homologues pl d test_Streptococcus M D 1 3 calculate consensus core genome with syntenic genes compare_clusters pl s n o test_Streptococcus_intersection d test Streptococcus homologues S f0 alltaxa algCOG e0 N test Streptococcus homologues S f0 alltaxa algOMCL e0 N test Streptococcus homologues S f0 alltaxa algBDBH e0 1 4 calculate core genome with coverage and identity as in the Tettelin paper cover 50 1D 50 get_homologues pl d test Streptococcus M C 50 S 50 1 6 calculate core intergenic clusters get_homologues pl d test_Streptococcus g 1 7 estimate and plot core and pangenome sizes with all BDBH OMCL amp COG algorithms get_homologues pl d test_Streptococcus c get_homologues pl d test_Streptococcus M c get_homologues pl d test_Streptococcus G c 2 1 calculate pan genome with OMCL amp COG algorithms get_homologues pl d test_Streptococcus
47. enomic contig For instance the first sequence was extracted from the leading strand of GenBank contig AY522431 positions 44726 46255 out of a total 56634 nucleotides Furthermore the names of neighboring genes are annotated when available in order to capture some synteny information These syntenic data can be valuable when evaluating possible orthologous genes as conservation of genomic position also operon context strongly suggests orthology among prokaryots gt GI 109390522 Escherichia coli traJ 1530 AY522431 56634 44726 46255 1 neighbour_genes tral trak ATGGACGATAGAGAAAGAGGCTTAGCATTTTTATTTGCAATTACTTTGCCTCCAGTGATGGTATGGTTTCTAGTT and gt GI 109390522 Escherichia coli traJ 1530 AY522431 56634 44726 46255 1 aligned 1 509 509 MDDRERGLAFLFAITLPPVMVWFLV 26 4 4 Clustering genes and proteins that share Pfam domain architecture The BDBH algorithm in get_homologues pl can be modified by requiring bidirectional best hits to share the same domain architecture annotated in terms of Pfam domains For large volumes of sequences this taks should be accomplished on a computer cluster but of course can also be performed locally The command on the terminal could then be get_homologues pl d sample_plasmids_gbk D H HH HH H AREA A The generated output should be results_directory sample_plasmids_gbk_homologues parameters MAXEVALUEBLASTSEARCH 0 01 MAXPFAMSEQS 250 checking inp
48. er and speed up your calculations For cluster based operations three bundled Perl scripts are invoked _cluster_makeHomolog pl _cluster_makeInparalog pl and _cluster_makeOrtholog pl It is also possible to invoke Pfam domain scanning from get_homologues This option requires the bundled binary hmmscan which is part of the HMMER3 package whose path is set in file lib phyTools pm variable ENV EXE_HMMPFAM Should this binary not work in your system a fresh install might be the solution say in your path hmmer 3 0 In this case you ll have to edit file ib phyTools pm and modify the relevant if defined ENV EXE_HMMPFAM ENV EXE_HMMPFAM your path hmmer 3 0 src hmmscan noali acc cut_ga The Pfam HMM library is also required and the install pl script should take care of it However you can manually download it from the appropriate Pfam FTP site This file needs to be decompressed either in the default db folder or in any other location and then it should be formatted with the program hmmpress which is also part of the HMMER3 package A valid command sequence could be cd db wget ftp ftp sanger ac uk pub databases Pfam current_release Pfam A hmm gz gunzip Pfam A hmm gz your path hmmer 3 0 src hmmpress Pfam A hmm Finally you ll need to edit file lib phyTools pm and modify the relevant line to if defined ENV PFAMDB ENV PFAMDB db Pfam A hmm In order
49. er list in place will parse it Uncultured _f0_Otaxa_algOMCL_eO_ cluster_list number of clusters 193 3d oH HH H HH HF OF 33 intersection size 180 clusters intersection list sample_intersection intersection_t0 cluster_list pangenome_file sample_intersection pangenome_matrix_t0 tab pangenome_phylip file sample_intersection pangenome_matrix_t0 phylip input set sample_intersection Uncultured _f0_0taxa_algC0G_e0_ venn_t0 txt input set sample intersection Uncultured f0 Otaxa algOMCL e0 venn t0 txt H dk H Venn diagram sample_intersection venn_t0 pdf Venn region file sample_intersection unique_Uncultured _f0_0taxa_algC0G_e0_ venn_t0 txt 17 Venn region file sample_intersection unique_Uncultured _f0_0taxa_alg0MCL_e0_ venn_t0 txt 13 Note that skipped clusters correspond precisely to COG unresolved clusters This script produces two versions of the same pangenomic matrix e A full detailed matrix in tab separated columns with taxa genomes as rows and sequence clusters as columns in which cells with natural numbers indicate whether a given taxa contains one or more sequences from a given cluster Such files can be read and edited with any text editor or spreadsheet software source folder 3_EcoRII faa 8 tnpA faa 22_traA faa 36_beta lactamase K_pneumoniae_12_plasmid_12 gb 1 0 1 2 K_pneumoniae_9_plasmid_9 gb 1 0 0 2 K_oxytoca_plasmid_pKOX105 gb 1 0 1 1 Less e
50. genic regions are defined by three global variables variables within get_homologues pl as explained in section 3 4 These default values might be edited for specific taks for instance chloroplast intergenic regions are usually much smaller than 200 bases the default size and therefore variable MININTERGENESIZE should be set to a smaller value Moreover in this example we restrict the search for conserved intergenic segments to Klebsiella pneumoniae plasmids by creating a file sample_plasmids_gbk include_list txt with these contents K_pneumoniae_12_plasmid_12 gb K_pneumoniae_9_plasmid_9 gb K_pneumoniae_KP96_plasmid_pKP96 gb We can now execute get_homologues pl d sample_plasmids_gbk g I sample_plasmids_gbk include_list txt results_directory sample_plasmids_gbk_homologues parameters MAXEVALUEBLASTSEARCH 0 01 MAXPFAMSEQS 250 checking input files E_coli_ST131_plasmid_pKC394 gb 55 intergenes 7 E_coli_plasmid_pMURO50 gb 60 intergenes 12 IncN_plasmid_R46 gb 63 intergenes 11 K_oxytoca_plasmid_pKOX105 gb 69 intergenes 13 K_pneumoniae_12_plasmid_12 gb 92 intergenes 11 K_pneumoniae_9_plasmid_9 gb 87 intergenes 12 K_pneumoniae_KP96_plasmid_pKP96 gb 64 intergenes 18 S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2 gb 52 intergenes 9 Uncultured_bacterium_plasmid_pRSB201 gb 58 intergenes 9 Uncultured_bacterium_plasmid_pRSB203 gb 49 intergenes 7 Uncultured_bacterium_plasmid_pRSB205 gb 52 intergenes 8 Unc
51. get_homologues manual Bruno Contreras Moreira Laboratory of Computational Biology Estaci n Experimental de Aula Dei CSIC Fundaci n ARAID http www eead csic es compbio Zaragoza Espana Pablo Vinuesa Center for Genomic Sciences Universidad Nacional Aut noma de M xico http www ccg unam mx vinuesa Cuernavaca M xico August 14 2015 Contents 1 Description 2 Requirements and installation 21 Perlimodiles o raiot e wae ee SRS eee eee ta eae we ae be ae 22 ROGOVE DINAS 0 0 an aoe ee eA Ee ee Ee Re Re E e 2 3 Optional software dependencies 2 0 ce ee 3 User manual AL AGUAS a era ra REE RR OS ER Ree ek oe Eee eRe Aa 3 2 Obtaining bacterial GenBank input files o o 0002004 3 3 Eukaryotic FASTA amino acid input files o o e e e oe Program opos o eaa e a a s Mel Re A Le Ral 3 0 Actompanying SETIPES coca ewe PAA id AA A a 4 A few examples of use 4 1 Clustering orthologous proteins from a few FASTA files o o e e 4 2 Clustering orthologous proteins from a single FASTAfile 4 3 Clustering genes and proteins extracted from GenBank files o 4 4 Clustering genes and proteins that share Pfam domain architecture 4 5 Clustering syntenic neighbor genes ee 4 6 Comparing clusters with external sequence sets 2 2 2 ee emme na 4 7 Clustering intergenic segments from GenBank files
52. is example genes X and Z of species 1 2 and 3 are found to be syntenic regardless of their orientation 29 4 6 Comparing clusters with external sequence sets Sometimes we will need to compare clusters of possibly orthologous sequences produced by get_homologues pl in any of the ways explained earlier with a set of sequences defined elsewere for instance in a publication This can be done to validate a set of core clusters and to check that nothing important was left out We can accomplish just this with help from script compare_clusters pl invoking option r which indicates that the first parsed cluster folder is actually a reference to be compared To illustrate this application we have set a folder with 4 protein sequences from Buchnera aphidicola from strain Cinara cedri directory sample_buch_fasta sample_proteins each sequence in a single FASTA file Note that these clusters must contain sequences contained in the larger dataset which we want to compare with otherwise the script will not match them Headers are not used by the program only the sequences matter In order to check whether these sequences are clustered in any of the clusters generated earlier say with BDBH we will issue a command such as compare_clusters pl o sample_intersection r d sample_buch_fasta sample_proteins sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algBDBH_e0_ HOH OH HH A The following output should be produced nu
53. is to use the software manager of your Linux distribution such as synaptic apt get in Ubuntu yum in Fedora or YaST in openSUSE If you prefer the terminal please use the cpan program with administrator privileges sudo in Ubuntu cpan i C CJ CJFIELDS BioPerl 1 6 1 tar gz This form should be also valid perl MCPAN e install C CJ CJFIELDS BioPerl 1 6 1 tar gz Please check this tutorial if you need further help 2 2 Required binaries The Perl script install pl already mentioned in section 2 checks whether the included precompiled binaries for COGtriangles hmmer MCL and BLAST are in place and ready to be used by get_homologues However if any of these binaries fails to work in your system perhaps due a different architecture or due to missing libraries it will be necessary to obtain an appropriate version for your system or to compile them with your own compiler In order to compile MCL the GNU gcc compiler is required although it should most certainly already be installed on your system If not you might install it by any of the alternatives listed in section 2 1 For instance in Ubuntu this works well sudo apt get install gcc The compilation steps are as follows cd bin mcl1 02 063 configure make To compile COGtriangles the GNU g compiler is required You should obtain it by any of the alternatives listed in section 2 1 The compilation would then include several steps cd bin COGsoft cd COGlse ma
54. jobs and the required computing time proportionally reduced core m2cores te 4cores 8cores 250 a 2 200 2 a 2 150 a 8 E 100 E v i i 50 Ss 2 0 100 250 500 800 700 600 s S 500 gt y 400 2 g 300 200 REA AA A K A 100 4 0 100 250 500 batch size sequences Figure 17 Runtime of a typical bacterial genomic BLASTP jobs as calculated with 1 to 8 CPU cores when splitting the job in batches Note that the optimal batch size in this test is 100 sequences compared to 250 and 500 sequences e How do I set up get_homologues pl to run in my computer cluster Cluster jobs are submitted by invoking option m cluster instead of explicitely calling qsub The fist time such a job is run the following error can be seen running BLAST searches Unable to run job You have to specify a queue with the q option to qsub Exiting Liew ed This usually means that qsub jobs inside get_homologues pl lack a target queue in the cluster Please find out the name of the right queue for your jobs perhaps something as simple as such as q default and edit the relevant line in the header of get homologues pl my QUEUESETTINGS q default 45 e Does get_homologues pl run on Windows As Perl is not installed by default on Windows not BerkeleyDB is easily installed there we haven t ported this software to these systems However with some work you
55. junction with c to control the order in which genomes are considered during pan and core genome analyses Taking the sample_buch_fasta folder a valid list_file txt could contain these lines Buch_aph_APS faa Buch_aph_Bp faa Buch_aph_Cc faa e option C sets the minimum percentage of coverage required to call two sequences best hits as illustrated in the figure The larger these values get the smaller the chance that two sequences are found to be reciprocal best hits The default coverage value is set to 75 This parameter has a large impact on 13 the results obtained and its optimal values will depend on the input data and the anticipated use of the produced clusters LN re L SL m OL 2 AA 62 subject 1 subject 2 coverage el s1 e2 52 L segment coverage e2 s1 L Figure 4 Coverage BDBH OMCL and overall segment coverage COGS illustrated with the alignment of sequence query to two aligned fragments of sequence subject where 1 s1 e1 s2 e2 and L are alignment coor dinates e E sets the maximum expectation value E value for BLAST alignments This value is by default set to le 05 This parameter might be adjusted for nucleotide BLAST searches or for very short proteins under 40 residues e S can be passed to require a minimum sequence identity for two sequences to be called best hits This option does not affect COGS runs its default value is set to 1 e N sets a minimum neighborhood correla
56. k protein E lt 1883 1360 136 4e 42 100 0 NA If previous get_homologues jobs included the calculation of Pfam domains then option D can be added to produce a richer report that now includes the identifiers of Pfam domains such as PF00313 sorted on their position along the sequence dir query sbjct bits Eval cover Pfam annotation E Buch_aph_Bp faa gt 1360 972 135 le 41 100 0 PF00313 gil27904911 lt 972 1360 135 le 41 100 0 PF00313 Buch_aph_APS faa gt 1360 467 136 4e 42 100 0 PF00313 gil15617086 lt 467 1360 136 4e 42 100 0 PF00313 Buch_aphid_Sg faa gt 1360 1883 136 4e 42 100 0 PF00313 gil21672738 lt 1883 1360 136 4e 42 100 0 PF00313 Note that this script works by parsing files all p20 csv and all bpo which are created at run time by get_homologues in folder tmp within the results directory These are text files that can be inspected with help from any text editor 43 4 9 A script to test most get homologues features with a sample dataset File HOWTOTettelin is a shell script which performs typical uses of get_homologues pl This script can be made executable on the terminal with chmod x HOWTOTettelin and then executed with HOWTOTettelin The first task carried out by the script is to download the same GenBank files used in the landmark work of Tettelin and collaborators PubMed 16172379 afterwards several analyses are sequentially undertaken 1 0 optionally download g
57. ke cd COGmakehash make cd COGreadblast make cd COGtriangles make Regarding BLAST get_homologues can use binaries from both the BLAST and BLAST versions which can be easily downloaded from the NCBI FTP site The packed binaries are blastp and makeblastdb from version ncbi blast 2 2 27 If these do not work in your machine or your prefer to use older BLAST versions then it will be necessary to edit file lib phyTools pm First environmental variable ENV BLAST PATH needs to be set to the right path in your system inside subroutine sub set_phyTools_env Variables ENV EXE_BLASTP and ENV EXE_FORMATDB also need to be changed to the appropriate BLAST binaries which are respectively blastall and formatdb 2 3 Optional software dependencies It is possible to make use of get_homologues on a computer farm or high performance computing cluster managed by gridengine In particular we have tested this feature with versions GE 6 0u8 6 2u4 2011 11p1 in Rocks Cluster distributions 4 3 5 4 and 6 1 invoking the program with option m cluster For this command to work it might be necessary to edit the get_homologues pl file and add the right path to set global variable SGEPATH To find out the installation path of your SGE installation you might try the next terminal command which qsub In case you have access to a multi core computer you can follow the steps in this blog post to set up your own Grid Engine clust
58. luster In the second case global variable SGEPATH might need to be appropriately set as explained in section 2 3 12 G genomes sort genomes by size take smallest as reference reference 2 5 51 find inparalogues incremental clustering BDBH reference 2 G E A gt _ _ __ lt A A reference 2 3 67 genes core genes 1 x 2 x x x x x 2 3 XX x x XX x 3 4 x x ies x x x N 1 x x XX x x N 1 N x x x i sequence clusters Figure 3 Flowchart of BDBH algorithm with default parameters and G genomes First inparalogues defined as intra specific bidirectional best hits BDBHs are identified in each genome second new genomes are incrementally compared to the reference genome and their BDBHs annotated finally clusters containing at least 1 sequence per genome are conserved as well as QUEUESETTINGS that specificies for instance a particular queue name for your cluster jobs It s worth mentioning that get homologues takes advantage of the default home export in Rocks Cluster distributions so it should be run with input files folders located within home e n sets the number of threads CPUs to dedicate to each BLAST HMMER job run locally which by default is 2 e I list_file txt allows the user to restrict a get homologues job to a subset of the genomes included in the input d folder This flag can be used in con
59. lysis they are currently ignored by compare_clusters pl Table 4 Frequent warnings and error messages produced by get_homologues and kin scripts 51 7 Credits and references get homologues pl is designed created and maintained at the Laboratory of Computational Biology at Estaci n Experimental de Aula Dei CSIC in Zaragoza Spain and at the Center for Genomic Sciences of Universidad Nacional Aut noma de M xico CCG UNAM The code was written mostly by Bruno Contreras Moreira and Pablo Vinuesa but it also includes code and binaries from OrthoMCL v1 4 algorithm OMCL M COGtriangles v2 1 algorithm COGS G NCBI Blast 2 2 and BioPerl 1 5 2 We ask the reader to cite the main reference describing the get_homologues software e Contreras Moreira B and Vinuesa P 2013 GET_HOMOLOGUES a versatile software package for scal able and robust microbial pangenome analysis Appl Environ Microbiol 79 7696 7701 and also the original papers describing the included algorithms and databases accordingly e LiL Stoeckert CJ Jr Roos DS 2003 OrthoMCL identification of ortholog groups for eukaryotic genomes Genome Res 13 9 2178 89 Kristensen DM Kannan L Coleman MK Wolf YI Sorokin A Koonin EV Mushegian A 2010 A low polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches Bioinformatics 26 12 1481 7 Altschul SF Madden TL Schaffer AA Zhang J Zhang Z Miller W an
60. mber of input cluster directories 2 parsing clusters in sample_buch_fasta sample_proteins no cluster list in place checking directory content WARNING taxon names will be automatically extracted from FASTA headers please watch out for errors number of clusters 4 parsing clusters in sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algBDBH_eO_ cluster list in place will parse it BuchaphCc_f0_alltaxa_algBDBH_e0_ cluster_list number of clusters 305 intersection output directory sample_intersection intersection size 4 clusters intersection list sample_intersection intersection_t0 cluster_list input set sample_intersection sample_proteins venn_t0 txt input set sample_intersection BuchaphCc_f0_alltaxa_algBDBH_e0_ venn_t0 txt Venn diagram sample_intersection venn_t0 pdf Venn region file sample_intersection unique_sample_proteins venn_t0 txt 0 Venn region file sample_intersection unique_BuchaphCc_f0_alltaxa_algBDBH_e0_ venn_t0 txt 301 30 4 7 Clustering intergenic segments from GenBank files The use of input files in GenBank format also allows the extraction of clusters of flanked orthologous inter genic regions which might be of interest as these are expected to mutate at higher rates compared to coding sequences In this example this feature is illustrated by processing folder sample_plasmids_gbk with options g I sample_plasmids_gbk include_list txt The restraints that apply to the parsed inter
61. mits Advanced Help Display Settings Y GenBank full Send y ae a TA 7 A 9 Complete Record shown Yersinia pestis strain C790 plasmid pCD1 pMT1 partial sequence coding sequences ke a GenBank HQ612242 1 Choose Destination 7 X FASTA Graphics File Clipboard Go to y D Collections Analysis Tool CIO OO gt mm iquence 2 Locus HQ612242 67028 bp DNA linear BCT 16 JAN 2011 DEFINITION Yersinia pestis strain C790 plasmid pCD1 pMT1 partial sequence Download 1 Rems ACCESSION HQ612242 Format VERSION HQ612242 1 GI 317374452 fe Features y KEYWORDS Ge Bank ful j f SOURCE Yersinia pestis G te Fil pce ORGANISM Yersinia pestis Create File Bacteria Proteobacteria Gammaproteobacteria Enterobacteriales a p Enterobacteriaceae Yersinia LinkOut to external resources s Figure 1 NCBI download widget showing the choice of GenBank full format which contains the raw reference nucleotide sequences Often users take custom made GenBank files resulting from in house genome assemblies to be analysed In most cases genes from such files don t have GenBank identifiers assigned yet and so we recommend adding the field locus_tag to each CDS feature so that parsed sequences can be properly identified For their use with get_homologues GenBank files for the same species for example from the main chromo some and from a couple of plasmids must be concatenated For instance the genomic sequences of Rhizobium etli CFN 42 comprise se
62. n PubMed 18475320 compile core genome with minimum BLAST searches Options that control clustering t za report sequence clusters including at least t taxa report clusters of sequence features in GenBank files instead of default CDS GenBank features report clusters of intergenic sequences flanked by ORFs in addition to default CDS clusters filter by length difference within clusters reference proteome faa gbk file exclude clusters with inparalogues 10 overrides i use of pre clustered sequences ignores c g required unless d is set useful to pre compute searches follows order in I file if enforced ignores r t e optional requires c example R 1234 required for mixing c with c a runs default local default 2 takes all by default requires d requires 3 genomes taxa range 1 100 default 75 default 1e 05 max 0 01 recommended with m cluster range 1 100 default 1 BDBH OMCL range 0 1 default 0 BDBH OMCL ignores c BDBH default t number0fTaxa t 0 reports all clusters OMCLICOGS requires d and gbk files example a tRNA rRNA NOTE uses blastn instead of blastp ignores g D requires d and gbk files range 1 100 by default sequence length is not checked by default takes file with least sequences with BDBH sets first taxa to start adding genes by default inparalogues are included x allow seque
63. nces clustering inparalogues in E_coli_plasmid_pMURO50 gb reference 2 sequences re using previous results 4 looking for valid ORF clusters n_of_taxa 12 number of clusters 24 cluster list EcoliplasmidpMUR050 f0 alltaxa algBDBH Pfam e0 cluster list cluster directory sample plasmids gbk homologues EcoliplasmidpMUR050 f0 alltaxa algBDBH Pfam e0 28 4 5 Clustering syntenic neighbor genes The sequence clusters derived from a set of GenBank files can be further processed in order to select those that contain only syntenic genes defined as those having at least one neighbor included in other clusters Again we will invoke script compare_clusters pl for this task compare_clusters pl o sample_intersection s d N sample_plasmids_gbk_homologues EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ The following output is produced number of input cluster directories 1 parsing clusters in sample_plasmids_gbk_homologues EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ cluster list in place will parse it EcoliplasmidpMURO50_f0_alltaxa_algBDBH_e0_ cluster_list number of clusters 24 intersection output directory sample_intersection intersection size 21 clusters syntenic intersection list sample_intersection intersection_t0_s cluster_list cluster X cluster Z Figure 8 A cluster is called syntenic when it contains neighboring genes which are also contained in other single clusters In th
64. nces in multiple COG clusters F orthoMCL inflation value A calculate average identity of clustered sequences by default uses blastp results but can use blastn with a z add soft core to genome composition analysis by default sequences are allocated to single clusters COGS range 1 5 default 1 5 OMCL optional creates tab separated matrix recommended with t O OMCL COGS optional requires c OMCLICOGS faa gbk files extract sequences features intergenes i local cluster COGtriangles 121 oe sequence clusters fna faa files flanked intergene i clusters pangenome matrices an core genome A p nE Be ates syntenic clusters Figure 2 Flowchart of get_homologues Typing get_homologues pl h on the terminal will show the available options shown on the previous pages The only required option is either i used to choose an input file or d instead which indicates an input folder as seen in section 3 1 It is important to remark that in principle only files with extensions faa and gbk are considered when parsing the d directory By using faa input files in theory you might only calculate clusters of protein sequences In contrast the advantage of using gbk files is that you obtain both nucleotide and protein clusters If both types of input files are combined only protein clusters will be produced However if each input faa file has a twin fna file in place co
65. ng clusters in sample_buch_fasta_homologues BuchaphCc_f0_alltaxa_algOMCL_e0_ cluster list in place will parse it BuchaphCc_f0_alltaxa_alg0MCL_e0_ cluster_list number of clusters 308 HH HH HOH FH OF intersection output directory sample_intersection intersection size 295 clusters intersection list sample_intersection intersection_t0 cluster_list 21 input set sample_intersection BuchaphCc_f0_alltaxa_algBDBH_e0_ venn_t0 txt input set sample_intersection BuchaphCc_f0_alltaxa_algC0G_e0_ venn_t0 txt input set sample_intersection BuchaphCc_f0_alltaxa_alg0MCL_e0_ venn_t0 txt Venn diagram sample_intersection venn_t0 pdf Venn region file sample_intersection unique_BuchaphCc_f0_alltaxa_algBDBH_e0_ venn_t0 txt 5 Venn region file sample_intersection unique_BuchaphCc_f0_alltaxa_algC0G_e0_ venn_t0 txt 0 Venn region file sample_intersection unique_BuchaphCc_f0_alltaxa_algOMCL_eO_ venn_t0O txt 5 HH dk The 295 resulting clusters those present in all input cluster sets are placed in a new folder which was designated by parameter o sample_intersection Note that these are clusters that belong to the core genome as they contain sequence from all input taxa A Venn diagram such as the one in Figure 7 might also be produced which summarizes the analysis Figure 7 Venn diagram showing the overlap between clusters of orthologous sequences produced by three different algorithms and otherwise identical
66. ntaining the corresponding nucleotide sequences in the same order the program will attempt to produce the corresponding clusters of nucleotide sequences The possible input file combinations are summarized in Table 1 The use of an input folder or directory d is recommended as it allows for new files to be added there in the future reducing the computing required for updated analyses For instance if a user does a first analysis with 5 input genomes today it is possible to check how the resulting clusters would change when adding an extra 10 genomes tomorrow by copying these new 10 faa gbk input files to the pre existing d folder so that all previous BLAST searches are re used In addition to gbk and faa files the input directory can also contain one subfolder with pre clustered sequences This feature was designed so that users can add previously produced get homologues clusters or any other set of grouped sequences in FASTA format to be analysed For such a subfolder to be recognized it must be named subdir clusters or subdir_ Sample data folder sample_buch_fasta contains such an 11 input file extensions output clusters gbk amino acid DNA sequence faa amino acid sequence gbk amp faa amino acid sequence faa amp fna amino acid DNA sequence gbk amp faa amp fna amino acid DNA sequence Table 1 Valid input file combinations example subfolder which can be uncompressed to be tested
67. or valid ORF clusters n_of_taxa 4 number_of_clusters 305 cluster_list BuchneraaphidicolastrCcCinaracedri3_f0_alltaxa_algBDBH_e0_ cluster_list cluster_directory BuchneraaphidicolastrCcCinaracedri3_f0_alltaxa_algBDBH_e0_ runtime 55 wallclock secs 0 76 usr 0 04 sys 51 75 cusr 0 23 csys 52 78 CPU RAM use 21 3 MB 24 4 3 Clustering genes and proteins extracted from GenBank files The use of input files in GenBank format allows clustering nucleotide sequences in addition to proteins since this format supports the annotation of raw genomic sequences This example illustrates this feature by taking the input folder sample_plasmids_gbk which contains 12 GenBank files of plasmid replicons which we analyze by running get_homologues pl d sample_plasmids_gbk HH H HH HH HH H HOH H HHH HHH HH HH HH OH results_directory sample_plasmids_gbk_homologues parameters MAXEVALUEBLASTSEARCH 0 01 MAXPFAMSEQS 250 checking input files E_coli_ST131_plasmid_pKC394 gb 55 E_coli_plasmid_pMURO5O gb 60 IncN_plasmid_R46 gb 63 K_oxytoca_plasmid_pKOX105 gb 69 K_pneumoniae_12_plasmid_12 gb 92 K_pneumoniae_9_plasmid_9 gb 87 K_pneumoniae_KP96_plasmid_pKP96 gb 64 S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2 gb 52 Uncultured_bacterium_plasmid_pRSB201 gb 58 Uncultured_bacterium_plasmid_pRSB203 gb 49 Uncultured_bacterium_plasmid_pRSB205 gb 52 Uncultured_bacterium_plasmid_pRSB206 gb 55 12 genomes 756
68. pRSB20 Uncultured_bacterium_plasmid_pRSB20 Uncultured_bacterium_plasmid_pRSB20 IncN_plasmid_R46 gb E_coli_plasmid_pMURO50 gb K_pneumoniae_KP96_plasmid_pKP96 g E_coli_ST131_plasmid_pKC394 gb S enterica subsp enterica serovar Dut K oxytoca plasmid pKOX105 gb K pneumoniae 12 plasmid 12 gb K pneumoniae 9 plasmid 9 gb X298 sa 8 Xai x X606 type T E X382_ cons RIOS o0 X400_DNA poy St many accesi X240 puts im x26 putate Door pg X921 arsonical pi X348 group IL ntron eneoded n Figure 10 Heatmap of the previous pangenome matrix with dendrograms sorting genomes and seguence clusters 4 8 2 Interrogating a pangenome matrix Script parse_pangenome_matriz pl can be used to analyze a pangenome matrix such as that created in the previous section It was primarily designed to identify genes present in a group A of species which are absent in another group B but can also be used to find expansions contractions of gene families If you require the genes present expanded in B with respect to A just reverse them Expanded clusters are defined as those where all A taxa contain more sequences than the maximum number of corresponding sequences in any taxa of group B We now review these features with the same plasmid set of previous sections analyzing the pangenome matrix produced by intersecting several cluster sets on section 4 8 1 Let s say we are
69. resources required for calculating BLAST jobs grow quadratically Instead the BDBH algorithm with option b requires only 3G BLAST searches in grey for any reference genome e t is used to control which sequence clusters should be reported By default only clusters which include at least one sequence per genome are output However a value of t 2 would report all clusters containing sequences from at least 2 taxa A especial case is t 0 which will report all clusters found even those with sequences from a single genome e a forces the program to extract user selected sequence features typically contained in GenBank files such as tRNA or rRNA genes instead of default CDSs When using this option clusters are compiled by comparing nucleotide sequences with BLASTN Note that such BLASTN searches are expected to be less sensitive than default BLASTP searches e g can be used to request the compilation of clusters of intergenic sequences This implies the calculation of ORF clusters and then a search for pairs of orthologous ORFs which flanking conserved intergenic regions with the constraints set by three global variables in the header of get homologues pl my MININTERGENESIZE 200 minimum length nts required for intergenic segments to be considered my MAXINTERGENESIZE 700 my INTERGENEFLANKORF 180 length in nts of intergene flanks borrowed from neighbor ORFs 100 200 300 ado 500 sdo 700 flanking ORFs
70. rints the enabled features of the program which of course depend on the required and optional binaries mentioned in sections 2 2 and 2 3 o is ideally used to submit to a computer cluster the required BLAST and Pfam searches preparing a job for posterior analysis on a single computer c is used to request a pan and core genome analysis of the input genomes which will be output as a tab separated data file The number of samples for the genome composition analysis is set to 10 by default but this can be edited at the header of get homologues pl check the NOFSAMPLESREPORT variable In addition variables MIN_PERSEQID_HOM and MIN_COVERAGE_HOM with default values 0 and 20 respectively control how homologues are called These values can as well be edited to require more stringent identity and coverage requirements in order to call a sequence homologous and therefore redundant R takes a number that will be used to seed the random generator used with option c By using the same seed in different c runs the user ensures that genomes are sampled in the same order s can be used to reduce the memory footprint provided that the Perl module BerkeleyDB is in place please check section 2 3 This option usually makes get_homologues slower but for very large datasets or in machines with little memory resources this might be the only way to complete a job m allows the choice of runmode which can be either m local the default or m c
71. s Couldyouclari fythattome please BLAST parameter max_target_seqs is set to the number of sequences of the query proteome which usually is a large number that ensures all good quality hits are recovered Downstream analyses What are pancore and pangenome files matrices Pancore matrices contain estimates of core and pan genomes and they are produced by get_homologues pl with option c These files take names such as pan_genome_algBDBH_C75 tab which record the algorithm employed and are generated by random sampling genomes Sampling can be controlled and reproduced by using the same random number generator seed see section R in Section 3 4 Such files can be used to render plots and fitted functions with script plot_pancore_matrix pl as shown in Section 4 8 4 Instead pangenome matrices are generated by accessory script compare_clusters pl and contain informa tion about what which genomes contain sequences from gene clusters with no sampling involved see Section 4 8 1 They take names such as pangenome_matrix_t0 tab Is it possible to plot pangenome matrices with compare_clusters pl using a single cluster directory Yes no problem The script will generate the intersection_t0 cluster_list pangenome matrix t0 phylip and pangenome_matrix_t0 tab files based on the clusters found by the chosen algorithm The only thing you won t find in the directory are Venn diagrams since at least 2 cluster sets generated by 2 algorithms
72. s in either FASTA format extensions faa or fasta containing amino acid sequences or GenBank files extension gbk one file per organism The advantage of this option is that new files can be added to the input folder in the future and previous calculations will be conserved This might be useful to study a group of organisms for which a few genomes are available and then keep adding new genomes as they become available This directory can actually contain a mixture of FASTA and GenBank files 3 2 Obtaining bacterial GenBank input files The GenBank format is routinely used to describe genomic sequences usually taking one file per chromosome or genomic contig Each file contains a reference DNA genomic sequence plus a collection of genes and their products making it possible to extract simultaneously the sequence of every ORF and its corresponding protein products GenBank files are the recommended input format for bacterial sequences as they permit the compilation of DNA and protein sequences clusters which might have different applications There are many ways to obtain GenBank files starting by manual browsing and downloading from the NCBI site keeping in mind that full files which include the reference nucleotide sequences should be downloaded In fact get_homologues pl will fail to parse any ORF in summary GenBank files 2 NCBI Resources Y How To Y Sign in to NCBI Nucleotide cleotide EE Li
73. s pl d sample plasmids gbk t 0 M looking for valid ORF clusters n_of_taxa 0 number_of_clusters 193 t cluster_list EcoliplasmidpMURO50_f0_Otaxa_algOMCL_eO_ cluster_list cluster directory sample_plasmids_gbk_homologues EcoliplasmidpMURO50_f0_Otaxa_algUMCL_e0_ We can now take advantage of script compare_clusters pl and the generated cluster directories to compile the corresponding pangenome matrix This can be accomplished for a single cluster set compare_clusters pl o sample_intersection m d sample_plasmids_gbk_homologues Uncultured _f0_0taxa_algC0G_e0_ or for the intersection of several sets in order to get a consensus pangenome matrix compare_clusters pl o sample_intersection m d sample_plasmids_gbk_homologues Uncultured _f0_0taxa_algC0G_e0_ M sample_plasmids_gbk_homologues Uncultured _f0_0taxa_alg0MCL_e0_ The ouput of the latter command will include the following lines PEN number of input cluster directories 2 parsing clusters in sample_plasmids_gbk_homologues Uncultured _f0_0taxa_algC0G_e0_ cluster list in place will parse it Uncultured _f0_Otaxa_algCOG_e0_ cluster_list ERROR skipping cluster 62_transposase faa seems to duplicate 59_transposase faa ERROR skipping cluster 116_tnpA faa seems to duplicate 59_transposase faa number of clusters 197 parsing clusters in sample_plasmids_gbk_homologues Uncultured _f0_0taxa_algOMCL_e0_ clust
74. settings If we are interested only in clusters containing single copy proteins from all input species as they are probably safer orthologues we can add the option t 4 to our previous command as our example dataset contains 4 input proteomes 22 4 2 Clustering orthologous proteins from a single FASTA file A similar analysis could be performed with a single input FASTA file containing amino acid sequences provided that each contains a taxon name in its header as explained in section 3 1 gt gi 10957100 ref NP_057962 1 Buchnera aphidicola str APS Acyrthosiphon pisum MFLTEKRRKLIQKKANYHSDPTTVFNHLCGSRPATLLLETAEVNKKNNLESIMIVDSAIRVSAVKNSVKI TALSENGAEILSILKENPHKKIKFFEKNKSINLIFPSLDNNLDEDKKIFSLSVFDSFRFIMKSVNNTKRT SKAMFFGGLFSYDLISNFESLPNVKKKQKCPDFCFYLAETLLVVDHQKKTCLIQSSLFGRNVDEKNRIKK RTEEIEKKLEEKLTSIPKNKTTVPVQLTSNISDFQYSSTIKKLQKLIQKGEIFQVVPSRKFFLPCDNSLS AYQELKKSNPSPYMFFMQDEDF ILFGASPESSLKYDEKNRQIELYPIAGTRPRGRKKDGTLDLDLDSRIE LEMRTNHKELAEHLMLVDLARNDLARICEPGSRYVSDLVKVDKYSHVMHLVSKVVGQLKYGLDALHAYSS CMNMGTLTGAPKVRAMQLIAEYEGEGRGSYGGAIGYFTDLGNLDTCITIRSAYVESGVATIQAGAGVVFN SIPEDEVKESLNKAGAVINAIKKAHFTMGSS Exc gt g1115616637 ref NP_239849 1 Buchnera aphidicola str APS Acyrthosiphon pisum MTSTKEIKNKIVSVTNTKKITKAMEMVAVSKMRKTEERMRSGRPYSDIIRKVIDHVTQGNLEYKHSYLEE RKTNRIGMITISTDRGLCGGLNTNLFKQVLFKIQNFAKVNIPCDLILFGLKSLSVFKLCGSNILAKATNL GENPKLEELINSVGIILQEYQCKRIDKIFIAYNKFHNKMSQYPTITQLLPFSKKNDQDASNNNWDYLYEP ESKLILDTLFNRYIESQVYQSILENIASEHAARMIAMK
75. sidual standard error 4 8 5 Estimating average identity matrices If we recall for a moment the example GenBank files analyzed on section 4 3 we can demonstrate how to calculate average identity matrices which can then be used to compare genome members of a pangenome To do so we will add a few flags to the previous command in addition to flag A which specifically asks for an identity matrix to be calculated get_homologues pl d sample_plasmids_gbk A t 0 M This will produce the following output J number_of_clusters 193 cluster list UnculturedbacteriumplasmidpRSB203_f0_Otaxa_algOMCL_eO0_ cluster_list cluster directory UnculturedbacteriumplasmidpRSB203_f0_Otaxa_algOMCL_eO_ H HM average_identity_matrix_file UnculturedbacteriumplasmidpRSB203_f0_Otaxa_algOMCL_eO_Avg_identity tab NOTE matrix computed on blastp results Note that on this example the produced identity matrix was calculated with the BLASTP scores among protein sequences included on the resulting clusters 193 If average nucleotide identities are desired the command must be modified to get_homologues pl d sample_plasmids_gbk a CDS A t O M Such matrices can then be used to calculate heatmaps and dendrograms that capture how similar the coding sequences are among genomes An example of this would be cd sample_plasmids_gbk_homologues hcluster_matrix sh i EcoliST131plasmidpkC394_f0_Otaxa_CDS_algOMCL_eO_Avg
76. sory elements which might be unique to each community get_homologues pl can be used to perform such genome composition analyses The rationale is to sample a set of genomes present in the input folder and keep adding genome after genome keeping track of i the novel genes added to the pool and ii the genes that fall in pre existing clusters This sampling experiment can be done with any of the included 3 algorithms please see Table 2 by invoking option c For instance we could try get_homologues pl d sample_buch_fasta c same as first example genome composition report samples 10 permutations 24 genomic report parameters MIN_PERSEQID_HOM 0 MIN_COVERAGE_HOM 20 genome order O Buch_aph_APS faa 1 Buch_aph_Bp faa 2 Buch_aph_Cc faa 3 Buch_aphid_Sg faa HH H HH ik OF sample O Buch_aph_APS faa 0 1 2 3 adding Buch_aph_APS faa core 574 pan 574 Loved pan genome number of genes can be plotted with plot_pancore_matrix pl file sample_buch_fasta_homologues pan_genome_algBDBH tab genomes mean stddev samples 0 490 90 574 507 507 546 357 357 546 574 357 574 1 572 28 598 585 585 587 521 521 562 592 575 598 2 597 5 606 593 594 594 594 596 600 599 591 606 3 608 5 615 602 600 608 605 605 611 613 605 615 core genome number of genes can be plotted with plot_pancore_matrix pl file sample_buch_fasta_homologues core_genome_algBDBH tab genomes mean stddev samples 0 490 90 574 507 507
77. st to assembly or annotation errors than the core genome 16 3 5 Accompanying scripts The following Perl scripts are included in the bundle to assist in the interpretation of results generated by get homologues pl h d download_genomes_ncbi pl a script which is described in section 3 2 with examples compare_clusters pl primarily calculates the intersection between cluster sets which can be used to select clusters supported by different algorithms or settings This script can also produce syntenic clusters pangenome matrices OrthoXML reports and Venn diagrams this last optional feature requires R please check section 2 3 Examples of use of this script are presented in sections 4 1 4 5 and 4 8 1 parse_pangenome_matriz pl is a script that can be used to analyze pan genome sets in order to find genes present in a group A of species which are absent in set B The identified genes can be mapped onto the underlying genome contigs of a reference genome included in A Moreover this script can be used for calculating and plotting cloud shell and core genome compartments Please see examples in sections 4 8 2 and 4 8 3 make_nr_pangenome_matriz pl is provided to post process pangenome matrices in case the user wishes to remove redundant clusters plot_pancore_matriz pl a Perl script to plot pan core genome sampling results and to fit regression curves with help from R functions An example of use of this script is given in section 4
78. ted by a blank are read in and that lines can be commented out by adding a as the first character Now we can run the following terminal command to fetch these genomes download_genomes_ncbi pl genome_list txt which will put several _Yersinia_pestis_ gbk files in the current directory which are now ready to be used by get_homologues 3 3 Eukaryotic FASTA amino acid input files Due to the complexity of eukaryotic genomes which are split in many chromosomes and contigs and usually contain complex gene models the preferred format taken by get_homologues for their sequences is FASTA While eukaryotic GenBank files can be fed in during development we have not tested nor benchmarked the compilation of clusters of nucleotide eukaryotic sequences which can be more error prone due to the inclusion of for instance introns and pseudogenes Therefore we currently cannot recommend the use of eukaryotic GenBank input files Of course FASTA format can also be used for prokaryotic amino acid sequences as in the case of the example sample_buch_fasta folder which contains protein sequences found in four Buchnera aphidicola genomes If your data are DNA coding sequences you can translate them to protein sequences for use with get_homologues for instance by means of a Perl command in the terminal with a little help from Bioperl 2 1 It is a long com mand which is split in three chunks to fit in this page perl MBio Seq lne if
79. tion as defined in PubMed 18475320 for two sequences to be called best hits In this context neighborhood is the set of homologous sequences reported by BLAST with the idea that two reliable best hits should have similar sets of homologous sequences e D is an extra restriction for calling best hits that should have identical Pfam domain compositions Note that this option requires scanning all input sequences for Pfam domains and this task requires some software to be installed see section 2 3 and extra computing time ideally in a computer cluster m cluster While for BDBH domain filtering is done at the time bidirectional best hits are called this processing step is performed only after the standard OMCL and COGS algorithms have completed to preserve each algorithm features 14 e b reduces the number of pairwise BLAST searches performed while compiling core genomes with algo rithm BDBH reducing considerably memory and run time requirements for G genomes 3G searches are launched instead of the default G It comes at the cost of being less exhaustive in finding inparalogues but in our bacterial benchmarks this potential undesired effect was negligible reference JH ANS KK KI a Ee es 3 Mm M Figure 5 For G genomes a typical get homologues job requires running G BLAST searches in order to compare all against all sequences including against itself to help infer inparalogues Therefore the
80. to reduce the memory footprint of get_homologues it is possible to take advantage of the Berke ley_DB database engine which requires Perl core module DB_File which should be installed on all major Linux distributions Similarly in order to take full advantage of the accompanying script parse_pangenome_matriz pl particularly for option p the installation of module GD is recommended An easy way to install them provided that you have administrator privileges is with help from the software manager of your Linux distribution such as synaptic apt get in Ubuntu yum in Fedora or YaST in openSUSE This can usually be done on the terminal as well in different forms sudo apt get y install libgd gd2 perl Ubuntu Debian based distros yum y install perl GD Redhat and derived distros zypper assume yes install perl GD SuSE cpan i GD will require administrator privileges sudo perl MCPAN e install GD will require administrator privileges sudo The installation of perl GD on macOSX systems is known to be troublesome The accompanying scripts compare_clusters pl plot_pancore_matriz pl parse_pangenome_matriz pl plot_matrix_heatmap sh hcluster_matriz sh require the installation of the statistical software R which usually is listed by software man agers in all major Linux distributions In some cases some SuSE versions and some Redhat like distros it will be necessary to add a repository to your package manager R c
81. ultured_bacterium_plasmid_pRSB206 gb 55 intergenes 10 H HH HH HH HOH H AE OF 12 genomes 756 sequences included input files 3 K_pneumoniae_12_plasmid_12 gb K_pneumoniae_12_plasmid_12 gb 92 K_pneumoniae_9_plasmid_9 gb K_pneumoniae_9_plasmid_9 gb 87 K_pneumoniae_KP96_plasmid_pKP96 gb K_pneumoniae_KP96_plasmid_pKP96 gb 64 ssl looking for valid ORF clusters n_of_taxa 3 number_of_clusters 31 cluster_list sample_plasmids_gbk_homologues include_list txt_algBDBH_eO_ cluster_list cluster_directory sample_plasmids_gbk_homologues include_list txt_algBDBH_e0_ looking for valid clusters of intergenic regions n_of_taxa 3 parameters MININTERGENESIZE 200 MAXINTERGENESIZE 700 INTERGENEFLANKORF 180 number_of_intergenic_clusters 1 31 intergenic_cluster_list _intergenic200_700_180_ cluster_list intergenic_cluster_directory sample_plasmids_gbk_homologues _intergenic200_700_180_ runtime 1 wallclock secs 0 10 usr 0 01 sys 0 05 cusr 0 01 csys 0 17 CPU RAM use 67 8 MB Intergenic clusters illustrated by Figure 3 4 include upper case nucleotides to mark up the sequence of flanking ORFs with the intergenic region itself in lower case and the names of the flanking ORFs in the FASTA header with their strand in parentheses gt 1 intergenic18 coords 63706 64479 length 774 neighbours GI 165928631 1 GI 165928630 1 CGCGCCATTGCTGGCCTGAAGGTATTCCC
82. ut files E_coli_ST131_plasmid_pKC394 gb 55 E_coli_plasmid_pMURO5O gb 60 IncN_plasmid_R46 gb 63 K_oxytoca_plasmid_pKOX105 gb 69 K_pneumoniae_12_plasmid_12 gb 92 K_pneumoniae_9_plasmid_9 gb 87 K_pneumoniae_KP96_plasmid_pKP96 gb 64 S_enterica_subsp_enterica_serovar_Dublin_plasmid_pMAK2 gb 52 Uncultured_bacterium_plasmid_pRSB201 gb 58 Uncultured_bacterium_plasmid_pRSB203 gb 49 Uncultured_bacterium_plasmid_pRSB205 gb 52 Uncultured_bacterium_plasmid_pRSB206 gb 55 12 genomes 756 sequences taxa considered 12 sequences 756 residues 184339 MIN_BITSCORE_SIM 16 0 mask EcoliplasmidpMURO50_f0_alltaxa_algBDBH_Pfam_e0_ _algBDBH_Pfam skipped genome parsing sample_plasmids_gbk_homologues tmp selected genomes submitting Pfam HMMER jobs See done concatenating Pfam files _E_coli_ST131_plasmid_pKC394 gb fasta pfam done parsing Pfam domain assignments generating sample plasmids gbk homologues tmp a11 pfam skip BLAST searches and parsing WARNING please remove rename results directory nome contrera codigo cvs get_homologues sample_plasmids_gbk_homologues if you change the sequences in your gbk faa files or want to re run creating indexes this might take some time lines 7 61e 03 construct_taxa_indexes number of taxa found 12 number of file addresses 7 6e 03 number of BLAST queries 7 6e 02 27 creating Pfam indexes this might take some time lines 7 54e 02 clustering orthologous seque
83. ven files which can be concatenated into a single _Rhizobium_etli CFN42 gbk file In order to assist in this task this software package includes the accompanying script download_genomes_ncbi pl We will explain its use by fetching some of the Yersinia pestis genomic sequences used in a 2010 paper by Morelli et al Group Name Accession Number Status 0 PE2 Pestoides F NC_009381 Completed Sanger genome 0 PE3 Angola NC_010159 Completed Sanger genome 0 PE4 91001 NC_005810 Completed Sanger genome 0 ANT2 B42003004 NZ_AAYU00000000 Draft Sanger genome 1 ANT1 UGO5 0454 NZ AAYR00000000 Draft Sanger genome 12 3X coverage 1 ANT1 Antigua NC 008150 Completed Sanger genome 1 1N3 E1979001 NZ AAYV00000000 Draft Sanger genome 1 0RI1 CA88 4125 NZ ABCD00000000 Draft Sanger genome 1 0RI1 FV 1 NZ 44UB00000000 Draft Sanger genome 1 0RI1 C092 NC_003143 Completed Sanger genome 1 0RI2 F1991016 NZ ABAT00000000 Draft Sanger genome 1 0RI3 IP674 ERA000177 Draft 454 genome 82X coverage 1 0RI3 IP275 NZ_AA0S00000000 Draft Sanger genome 7 6X coverage 1 0RI3 MGO5 1020 NZ_AAYS00000000 Draft Sanger genome 12 1X coverage 2 ANT1 Nepal516 NZ ACNG00000000 Draft Sanger genome 2 MED1 KIM NC 004088 Completed Sanger genome 2 MED2 K1973002 NZ AAYT00000000 Draft Sanger genome In order to use download genomes ncbi pl is is necessary to feed it a text file listing which genomes are to be downloaded The next examples show the exact format required as does th

get homologues manual

Contents

Download Pdf Manuals

Related Search

Related Contents