CLC Genome Finishing Module
Contents
1. Chapter 1 Introduction to the CLC Genome Finishing Module

1.1 Genome finishing

High-throughput sequencing technologies enable rapid sequencing of full genomes. However, short read lengths and repetitive sequences often complicate full genome assembly and result in fragmented assemblies. The CLC Genome Finishing Module has been developed to help finish small genomes, such as bacterial genomes, in order to reduce the extensive workload previously associated with genome finishing and to facilitate as many steps in the procedure as possible. The CLC Genome Finishing Module is an add-on module to the CLC Genomics Workbench with a number of new tools.

1.2 CLC Genome Finishing Module

The CLC Genome Finishing Module (called "Finishing Module" in the following) is a collection of tools that can be used in different combinations. The individual tools are listed below and described in detail in the following chapters.

• Align Contigs: Aligns contigs to a reference sequence or, in the absence of a reference, to the contigs themselves.
• Analyze Contigs: Analyzes the contig read mappings for possible misassemblies, single-strandedness, coverage, broken pairs and unaligned ends.
• Annotate from Reference: Transfers annotations to contigs from one or more already annotated references.
• Collect Paired Read Statistics: Detects paired reads that map to
2. Figure 12.1: Select the contigs to use for joining.

The next dialog, shown in figure 12.2, contains options related to the four different types of analyses the tool can perform.

Contig analysis types
• Use paired reads: When this option is selected, paired reads mapped to the contigs are used to detect neighboring contigs. Minimum paired reads is the minimum number of paired reads required to span two contigs before a join is considered.

CHAPTER 12. JOIN CONTIGS 69

[Dialog screenshot: Join Contigs, Select join contigs parameters, with the contig analysis types Use paired reads, Use long reads, Align to reference(s) and Align contigs, followed by BLAST options and Match options]

Figure 12.2: Options for detection of possible joins.

• Use long reads: Enable the use of long reads for joining contigs. Click on the folder icon to select one or more sets of long reads.
• Align to reference(s): Align the contigs to one or more reference sequences using BLAST and identify neighboring contigs. Click on the folder icon to select the relevant reference(s).
• Align contigs: Align the contigs using BLAST and look for overlaps between contig ends.

BLAST options
BLAST is used to align contigs against reference sequences and for aligning contigs against each other.
• BLAST word size: Specifies the minimum number
3. CHAPTER 2. SYSTEM REQUIREMENTS AND INSTALLATION OF THE CLC GENOME FINISHING MODULE 13

[License Assistant dialog: "You need a license. In order to load the plug-in CLC Microbial Genome Finishing Module, you need a valid license. Please choose how you would like to obtain a license for this plug-in." The dialog offers these options:
• Request an Evaluation License: Choose this option if you would like to try out the plugin for 14 days. Please note that only a single evaluation license will be allowed for each computer.
• Download a License: Choose this option if you have a License Order ID and would like to download a license.
• Import a License from a File: Choose this option if you have a License File on your computer and would like to import it.
• Configure License Server Connection: Choose this option if your company or institution is using a central CLC License Server. This option also enables you to disable a license server connection.]

Figure 2.2: The license assistant showing you the options for getting started.

Select the appropriate option and click on the button labeled Next.

To use the Download option in the License Manager, your machine must be able to access the external network. If this is not the case, please see section 2.5.5.

2.3.1 Request an evalua
4. [Table screenshot: rows of unpaired_illumina_miseq contigs with match positions and match counts]

Figure 3.4: The Contig match table.

length of contig matches: This value is the sum of all the aligned contig bases of all the hits on the reference. Contig matches count describes the number of matches found for each contig.

The Contig table allows the following functions:

• Show Contigs: Shows the contigs. If the contigs used as input had reads mapped to them, this action displays the read mapping.
• Add/Extract: Makes it possible to add additional contigs or to extract contigs to be handled with other tools (described in section 3.3.5).
• Copy Contig: Makes one or more copies of the selected contig. Reads mapped to the origi
5. CLC Genome Finishing Module
USER MANUAL

User manual for CLC Genome Finishing Module 1.5.1
Windows, Mac OS X and Linux
August 20, 2015
This software is for research purposes only.

CLC bio, a QIAGEN Company
Silkeborgvej 2
Prismet
DK-8000 Aarhus C
Denmark

Contents

1 Introduction to the CLC Genome Finishing Module
  1.1 Genome finishing
  1.2 CLC Genome Finishing Module
  1.3 Genome finishing and working with shared data
  1.4 Latest improvements
2 System requirements and installation of the CLC Genome Finishing Module
  2.1 System requirements
    2.1.1 Special requirements for Join Contigs
  2.2 How to install a Workbench plugin
  2.3 Workbench Licenses
    2.3.1 Request an evaluation license
      Direct download
      Go to license download web page
      Accepting the license agreement
    2.3.2 Download a license using a license order ID
      Direct download
      Go to license download web page
      Accepting the license agreement
    2.3.3 Import a license from a file
6. [Dialog screenshot: Match options, Minimum match size 100]

Figure 3.2: Select the contig mapping parameters.

The parameters to be specified in this step are:

Reference(s)
• Use input contigs as reference: If no reference sequence is available, the contigs can be aligned using themselves as a reference.
• Use selected reference(s): When a reference sequence is available, the contigs can be aligned to the reference. Reference sequence(s) can be selected by clicking on the folder icon.

BLAST options
• BLAST word size: Specifies the minimum number of nucleotides that must be fully preserved before BLAST finds a match. Using a small value increases the sensitivity, but will also report more random matches and slow down the BLAST search on large data sets.
• Maximum BLAST e-value: The BLAST e-value describes the number of hits that are expected by chance. Hence, this option specifies the maximum e-value of matches from BLAST to be included in the alignment.

Match options

CHAPTER 3. ALIGN CONTIGS 32

• Minimum match size: Specifies the minimum match size allowed in the alignment.

After the Result handling step, click Finish.

Note: When contigs are used as reference(s), the most interesting matches are often small overlaps between contig ends. To prevent such small overlaps from being filtered out due to a high e-value, contig ends are aligned in a separate step. The alignment of contigs ends consider
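The two thresholds described above (maximum BLAST e-value and minimum match size) act as filters on the list of BLAST hits. The following is a minimal sketch of that filtering step, not the module's actual implementation. The `Hit` class, the `filter_hits` function and the default e-value of 1e-10 are hypothetical names and values chosen for illustration; only the minimum match size of 100 is taken from the dialog shown in figure 3.2.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A simplified BLAST hit (hypothetical structure for illustration)."""
    contig: str
    aligned_length: int
    e_value: float


def filter_hits(hits, max_e_value=1e-10, min_match_size=100):
    """Keep hits that pass both thresholds.

    max_e_value is an assumed example value; min_match_size=100 mirrors
    the value visible in the dialog.
    """
    return [
        h for h in hits
        if h.e_value <= max_e_value and h.aligned_length >= min_match_size
    ]


hits = [
    Hit("contig_1", 2500, 1e-50),  # long and significant: kept
    Hit("contig_2", 60, 1e-20),    # significant but too short: dropped
    Hit("contig_3", 400, 0.5),     # long but likely a chance hit: dropped
]
kept = filter_hits(hits)
print([h.contig for h in kept])
```

A stricter (smaller) maximum e-value removes more chance hits, while a larger minimum match size removes short overlaps, which is why the manual aligns contig ends in a separate step.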
7. • Go to a computer with internet access, open a browser window and go to the relevant network license download web page: https://secure.clcbio.com/LmxWSv3/GetLicenseFile
• Paste in your license order ID and the host ID that you noted down earlier into the relevant boxes on the webpage.
• Click on "download license" and save the resulting .lic file.
• Take this file to the machine with the host ID that you used when downloading the license file. Place it in the folder called "licenses" that can be found within the CLC Server installation directory.
• Restart the CLC Server software.

2.5.6 Network license installation

Network licenses are necessary to run CLC Genome Finishing Module analysis tasks on grid nodes. Network licenses are made available using a separate piece of software called the CLC License Server. This software is normally run as a service. CLC client software, such as Workbenches and gridworkers, contacts the CLC License Server to obtain a network license when needed.

For a description of how to download and install a license on a CLC License Server, please refer to the following section in the CLC License Server manual: http://clcsupport.com/clclicenseserver/current/index.php?manual=License_download.html

The same number of network plugin licenses as there are CLC gridworker licenses for the CLC Server setup are required. A lice
8. [Report excerpt, continued from the nucleotide distribution table: 1,157,663 (25.5%); Thymine (T): 1,118,847 (24.6%). The next report section is 1.2 Contig measurements.]

Figure 17.2: A de novo assembly report is useful for evaluating the quality of an assembly.

Chapter 18 Workflows

In CLC Genomics Workbench and Biomedical Genomics Workbench you can link tools to one another to be processed in sequential order, enabling repeated execution of a workflow. Working with workflows is described in detail in http://www.clcbio.com/files/tutorials/Workflow-intro.pdf

The CLC Genome Finishing Module contains a workflow that you can start here:

Toolbox | Workflows | PacBio De Novo Assembly Pipeline (beta)

To explore a workflow and see the tools it is made of, select the workflow and right-click on its name to select the Open Copy of Workflow option.

18.1 PacBio De Novo Assembly Pipeline (beta)

Please note that the tools Correct PacBio Reads (beta) and De Novo Assemble PacBio Reads (beta) were optimized for the use of PacBio data and readily support data generated with different generations of PacBio chemistry (sequencing reagents). Due to such algorithm optimizations, the use of these tools for other data types is not supported. Moreover, for the tool Correct PacBio Reads (beta) we are relying on certain methods which are the intellectual property of Pacific Biosciences. The use of the Correct PacBio Reads (beta) tool or the predefined workflow PacBio De Novo Assembly Pipeline with any data other than data generated on a Pa
9. [Contents, continued]

  10.2 How to run Reassemble Regions
11 Extend Contigs
  11.1 [title partly illegible in source] Extend Contigs
  11.2 How to run Extend Contigs
12 Join Contigs
  12.1 What is the Join Contigs tool
  12.2 How to run the Join Contigs tool
13 Remove Extension of Contigs
  13.1 What is the Remove Extension of Contigs tool
  13.2 How to run the Remove Extension of Contigs tool
14 Annotate from Reference
  14.1 What is the Annotate from Reference tool
  14.2 How to run the Annotate from Reference tool
15 Import of PacBio reads
16 Correct PacBio Reads (beta)
  16.1 What is the Correct PacBio Reads tool
  16.2 How to run the Correct PacBio Reads tool
  16.3 Error correction report
17 De Novo Assemble PacBio Reads (beta)
  17.1 What is the De Novo Assemble PacBio Reads tool
  17.2 How to run the De Novo Assemble PacBio Reads tool
  17.3 De Novo Assemble PacBio Reads report
18 Workflows
  18.1 PacBio De Novo Assembly Pipeline (beta)
19 Available tutorials
  19.1 Aligning contigs manually using the Genome Finishing Module
11. CHAPTER 6. CREATE PRIMERS 48

[Dialog screenshot: Create Primers, Set primer placement. Visible values: Maximum region length 5000; Type of primer: PCR; Minimum primer length 18; Maximum primer length 25; Minimum distance to annotation 0; Maximum distance to annotation 30; Relative to annotation: Outside]

Figure 6.2: Set primer placement parameters.

• Maximum annotation length: Allows specification of the maximal length of annotations that will be considered for primer design. Annotations above this length will not be considered for primer design.

Primer type
• Type of primer: Two types of primers can be created. The PCR primer option creates a primer pair around a target region (see figure 6.4). The sequencing primer option creates a single primer sequence for a target region on either the forward or the reverse strand (see figure 6.3).

[Sequence view: small_ecoli contig 1 (contigs joined) with annotations Amplicon1 to Amplicon4]

Figure 6.3: A region covered by evenly spaced seq
12. [Toolbox screenshot listing, among others: Find Sequence, Join Contigs, Reassemble Regions, Remove Extension of Contigs, Correct PacBio Reads (beta), De Novo Assemble PacBio Reads (beta)]

Figure 1.1: List of all the functionalities found in the toolbox.

1.3 Genome finishing and working with shared data

When running tools from the Finishing Module on data located on a shared system, such as a CLC Genomics Server or a shared file location, some precautions have to be taken. The following tools modify existing objects instead of outputting new objects, which means that two users cannot work concurrently on the same objects:

• Analyze Contigs
• Annotate from Reference
• Create Amplicons
• Create Primers

CHAPTER 1. INTRODUCTION TO THE CLC GENOME FINISHING MODULE 9

If an object is being modified while another user is accessing or modifying it, the result is often an error, but in some cases the result can be undefined. In the worst-case scenario, the object will become corrupted and cannot be used for further analysis.

1.4 Latest improvements

CLC Genome Finishing Module is constantly under development. A detailed list that includes a description of new features, improvements, bugfixes and changes for the current version can be found at http://www.clcbio.com/products/clc-microbial-genome-finishing-module/clc-microbial-genome-finishing-tool-latest-improvements

Chapter 2 System requirements and installation of the CLC Genome Finishing Module

2.1 Syste
13. are detected according to the following parameters:
• Max nonspecific coverage percentage is the allowed percentage of nonspecific coverage. Only regions above this percentage are detected.
• Minimum coverage percentage is the minimum amount of coverage required before checking for nonspecific coverage.

Broken pairs: When Detect broken pairs is checked, regions with broken pairs are detected according to the following parameters:
• Max broken pairs percentage is the allowed percentage of broken pairs.
• Minimum coverage requirement: Only regions above this value are detected.

The final step, shown in figure 4.4, is to specify the Output options and the Result handling.
• Add analysis annotations: When checked, annotations are added to the regions detected in the contig analysis.
• Create report: When checked, a report is generated containing statistics on the problems identified. This report is useful for quickly evaluating the quality of an assembly.

CHAPTER 4. ANALYZE CONTIGS 42

[Dialog screenshot: Analyze Contigs, Output options, with the checkboxes Add analysis annotations, Create report, Include contig specific statistics and Create table]

Figure 4.4: Set output parameters for contig analysis.

• Include contig specific statistics: Wh
14. [Table residue from a screenshot: rows pairing paired_illumina_miseq contig 1 with other paired_illumina_miseq contigs (for example contigs 19, 15, 14, 13, 4, 3, 61, 62, 49, 51, 41, 38, 37, 39, 33 and 36), each with an orientation (Forward or Reverse), the relative position "Before", and numeric values such as 131662, 98116 and 235128. The figure caption is not part of this excerpt.]
15. transfer the annotations from this reference to a set of contigs. This is useful both for detecting misassemblies and for speeding up the finishing process.

Annotations are transferred by identifying contigs that overlap with annotated regions in the reference. The overlaps are detected using a BLAST search, where matches are filtered based on user-defined thresholds, as explained below. The tool does not perform a BLAST search for each annotation. Instead, the result of the Align Contigs tool (see section 3.1) is used to identify contigs that match the reference and thus overlap with annotations in the reference. If multiple contigs match the same annotated region in the reference, the annotation is transferred to all matching contigs.

A table showing both the annotations that were transferred and the ones that were not can be generated. Figure 14.1 shows an example where a transferred annotation is selected. As a result, the corresponding match in the target contig becomes highlighted (note that this requires that the contig match view is open). Figure 14.2 shows an example where an annotation was not transferred because it was not possible to find a contig that matched the annotated region within the user-defined quality thresholds.

Statistics on annotation transfer can be output in a report, as shown in figure 14.3. Note that each annotation in the reference is only counted once in this report, even though it might be transferred to multiple contig
16. [Dialog screenshot: Create Primers, Mispriming parameters. Visible values: Check for mispriming; Exact matching; Required matches 15; Minimum 3' matches 3]

Figure 6.8: Set mispriming parameters.

Mispriming parameters
• Check for mispriming: Select whether a check for mispriming should be performed. When disabled, the running time of the tool is reduced.
• Exact matching: When ticked, only unique primers with a perfect match are created. When disabled, detailed parameters need to be specified for Minimum number of base pairs required for a match and for the Number of consecutive base pairs required in the 3' end. Disabling this option can increase the running time of the tool significantly.

Note: The check for mispriming is done on all input sequences, so one can check for mispriming on a reference genome by simply adding the genome to the input of the tool.

Adjust the parameters and click Next. This opens the dialog shown in figure 6.9.

[Dialog screenshot: Create Primers, Primer naming, with the checkboxes Use sequence name in primer name, Use strand in primer name, Use pair information in primer name and Use region in primer name; truncated in source]
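When Exact matching is disabled, the two detailed parameters described above can be read as a two-part test on a candidate binding site. The following is a minimal sketch under stated assumptions, not the tool's actual algorithm: the function name `could_misprime` is hypothetical, the primer and site are assumed to be equal-length strings written 5' to 3' along the primer, and the defaults 15 and 3 are the values visible in the dialog in figure 6.8.

```python
def could_misprime(primer, site, required_matches=15, min_3prime_matches=3):
    """Return True if `site` looks like a possible off-target binding site.

    Assumptions for this sketch: both strings have the same length and are
    compared position by position; the 3' end is the end of the string.
    """
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")

    # Total number of matching positions.
    total_matches = sum(p == s for p, s in zip(primer, site))

    # Number of consecutive matching positions counted from the 3' end.
    run_at_3prime = 0
    for p, s in zip(reversed(primer), reversed(site)):
        if p != s:
            break
        run_at_3prime += 1

    return total_matches >= required_matches and run_at_3prime >= min_3prime_matches


primer = "ACGTACGTACGTACGTAC"
print(could_misprime(primer, "TCGTACGTACGTACGTAC"))  # one mismatch at the 5' end
print(could_misprime(primer, "ACGTACGTACGTACGTAG"))  # mismatch at the 3' end
```

The 3' end is tested separately because a polymerase extends from that end, so a mismatch there matters more than one near the 5' end.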
17. [Figure residue: sequence view including contig 104]

Figure 11.1: Example of contigs that have been extended in both directions.

In figure 11.2, the reads used for the de novo assembly have been mapped again to an extended contig.

CHAPTER 11. EXTEND CONTIGS 66

[Read mapping view of an extended contig with consensus and coverage tracks]

Figure 11.2: Reads have been mapped to a contig that has been extended.

11.2 How to run Extend Contigs

Toolbox | Genome Finishing Module | Extend Contigs

This opens the dialog shown in figure 11.3, where at least one assembly must be selected.

[Dialog screenshot: Extend Contigs, Select one or more assemblies, with "paired_illumina_miseq tutorial assembly" selected]

Figure 11.3: Select de novo assembly.

If a read mapping is chosen rather than a de novo result, the extended contig will consist of the reference sequence being extended. Click Next.

The next step, in figure 11.4, shows the parameters that control when the extension of the contig should stop in cases where the number of supporting reads is too low or the fraction of unaligned ends is too high.

[Dialog screenshot: Extend Contigs, Select parameters, Extend contigs options; truncated in source]
18. [Contents, continued]

      Accepting the license agreement
    2.3.4 Configure license server connection
      Borrowing a license
      Common issues when using a network license
    2.3.5 Download a static license on a non-networked machine
  2.4 How to uninstall a Workbench plugin
  2.5 How to install a Server plugin
    2.5.1 Static license installation
    2.5.2 Windows license download
    2.5.3 Mac OS license download
    2.5.4 Linux license download
    2.5.5 Download a static license on a non-networked machine
    2.5.6 Network license installation
    2.5.7 Server plugin download, installation and removal
3 Align Contigs
  3.1 What is the Align Contigs tool
  3.2 How to run the Align Contigs tool
  3.3 How to use the Align Contigs tool
    3.3.1 [title illegible in source]
    3.3.2 The Contig match table
    3.3.3 Joining two contigs
    3.3.4 [title partly illegible in source] contigs
    3.3.5 Adding new data
4 Analyze Contigs
  4.1 What is the Analyze Contigs tool
  4.2 How to run the Analyze Co
19. Detect single-stranded regions is checked, regions with single-stranded coverage are detected using the specified parameters.

[Dialog screenshot: Analyze Contigs, Parameters for contig analysis 2. Visible values: Single-stranded coverage: Maximum single-stranded percentage 80, Minimum coverage requirement 10. Nonspecific coverage: Maximum nonspecific coverage percentage 20, Minimum coverage requirement 10. Broken pairs: Maximum broken pairs percentage 20, Minimum coverage requirement 10]

Figure 4.3: Set the parameters for contig analysis 2.

• Max single-stranded percentage: Specifies the maximum percentage difference between the coverage of either strand, with the extremes being 0%, which allows only the same number of reads in both directions, and 100%, which allows all reads to be in one direction. Hence, with a max single-stranded percentage of 80%, single-stranded regions will be detected when the difference in the number of reads in each direction exceeds 80%.
• Minimum coverage requirement: Specifies the minimum amount of coverage required before checking for single-stranded coverage.

Nonspecific coverage: When Detect nonspecific regions is checked, regions with nonspecific coverage (reads with ambiguous mapping
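The single-stranded check described above can be illustrated with a small calculation. This is a sketch only: the manual does not give the exact formula, so the percentage difference is assumed here to be |forward - reverse| / (forward + reverse) x 100, which reproduces the two stated extremes (0% for equal counts, 100% for all reads in one direction). The function names are hypothetical, and the defaults 80 and 10 are the values visible in figure 4.3.

```python
def strand_difference_percent(forward_reads, reverse_reads):
    """Assumed definition: absolute strand difference relative to total coverage."""
    total = forward_reads + reverse_reads
    if total == 0:
        return 0.0
    return 100.0 * abs(forward_reads - reverse_reads) / total


def is_single_stranded(forward_reads, reverse_reads,
                       max_percent=80.0, min_coverage=10):
    """Flag a position only if coverage is sufficient and the strand
    difference exceeds the allowed maximum."""
    if forward_reads + reverse_reads < min_coverage:
        return False  # too little coverage to judge
    return strand_difference_percent(forward_reads, reverse_reads) > max_percent


print(strand_difference_percent(19, 1))  # 90.0
print(is_single_stranded(19, 1))         # flagged: 90% exceeds 80%
print(is_single_stranded(12, 8))         # not flagged: 20% difference
print(is_single_stranded(5, 0))          # not flagged: coverage below 10
```

Under this assumed definition, a threshold of 80% corresponds to one strand carrying more than 90% of the reads at a position.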
20. In the table, it is possible to right-click on the search hit of interest, which enables you to open the relevant element.

Chapter 9 Collect Paired Read Statistics

9.1 What is the Collect Paired Read Statistics tool

The Collect Paired Read Statistics tool identifies paired reads between pairs of contigs and can be used to collect evidence for how contigs are positioned relative to one another. Hence, the Collect Paired Read Statistics tool provides information about potential overlaps and unknown gaps between pairs of contigs, which can be further visualized when combined with the Align Contigs tool.

The tool searches for broken paired reads in all contig read mappings, and for each broken paired read that is identified, the contig with the mate read is registered. The output is a table summarizing the occurrences of these events: the names of the involved contigs, as well as the orientation and distance between the contigs relative to each other.

Paired reads with one read in one contig and the mate read in another contig are often reported in cases with many sequencing errors or areas with repeats. In these cases, the de Bruijn graph has not been able to use the paired reads in the assembly process, and they are instead reported in the Paired Read Statistics table.

9.2 How to run the Collect Paired Read Statistics tool

Toolbox | Genome Finishing Module | Collect Paired Read Statistics

This opens the dialog shown in figure 9.1.
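The idea of registering, for each broken pair, the contig that holds the mate read can be sketched as a simple tally. This is an illustration under assumptions rather than the tool's implementation: the input format (a list of contig-name pairs, one per read pair) and the function name `count_contig_links` are hypothetical, and orientation and distance, which the real output table also reports, are left out.

```python
from collections import Counter


def count_contig_links(read_pairs):
    """Count read pairs whose two reads map to different contigs.

    read_pairs: iterable of (contig_of_read, contig_of_mate) tuples.
    Pairs within a single contig are not links between contigs and are skipped.
    """
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a == contig_b:
            continue
        # Sort the names so (a, b) and (b, a) count as the same link.
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


pairs = [
    ("contig_1", "contig_2"),
    ("contig_2", "contig_1"),
    ("contig_1", "contig_3"),
    ("contig_4", "contig_4"),  # both reads in the same contig
]
for (a, b), n in sorted(count_contig_links(pairs).items()):
    print(a, b, n)
```

Contig pairs supported by many such read pairs are the more plausible neighbors, which is the kind of evidence the summary table is meant to provide.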
21. [Screenshot residue: read mapping view of small_ecoli contig 26 with side panel settings (Read layout, Sequence layout, Annotation layout) and the annotation types Broken pair, Unaligned ends and Unstable coverage]

Figure 4.5: A split view showing the contig analysis table at the top and the reads mapped to the contig below. This example shows a possible misassembly, as several reads have unaligned ends and a sharp drop in coverage can be observed.

Chapter 5 Create Amplicons

When trying to finish a genome to completion, it can be necessary to resequence areas and generate supplementary sequences to close the gaps. After the initial de novo assembly, the result may be up to thousands of contigs, depending on the quality of the reads and the size of the genome. When a reference sequence is available, it may be necessary to sort out potential differences between the reference and the sequenced genome, or to fill in regions with missing data. In addition, with or without a reference genome available for alignment of the contigs, it may be necessary to extend the assembled reads. For these purposes the Cre
22. PacBio Reads report

In the last dialog of the de novo assembly, you can choose to create a report of the results (see figure 17.2). The report contains the following information:

Nucleotide distribution: Fraction of the assembly covered by each nucleotide (A, C, G and T).

Contig measurements: This section includes statistics about the number and lengths of contigs.
• Count: The total number of contigs.
• Total: The total number of bases in the result. This can be used for comparison with the estimated genome size to evaluate how much of the genome sequence is included in the assembly.
• N50, N75 and N90: The N50 contig set is calculated by summing the lengths of the longest contigs until you reach 50% of the total contig length. The minimum contig length in this set is the N50 value of a de novo assembly. The N75 and N90 values are computed in a similar fashion.
• Minimum, maximum and average: This refers to the contig lengths.

Contig length distribution: A graph showing the number of contigs of different lengths.

Accumulated contig lengths: This shows the summed contig length on the y-axis and the number of contigs on the x-axis, with the biggest contigs ranked first. This answers the question: how many contigs are needed to cover, e.g., half of the genome?

CHAPTER 17. DE NOVO ASSEMBLE PACBIO READS (BETA)

[Report excerpt]
1 Summary de novo report
1.1 Nucleotide distribution

Nucleotide     Count       Percentage
Adenine (A)    1,113,919   24.5
Cytosine (C)   1,142,129   25.2
Guanine (G)
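The N50, N75 and N90 description above translates directly into a short calculation. The sketch below follows that description (sum the longest contigs until the chosen fraction of the total length is reached, then report the shortest contig in that set). The function name `nxx_value` and the example lengths are made up for illustration.

```python
def nxx_value(contig_lengths, fraction=0.5):
    """Return the shortest contig in the set of longest contigs whose
    lengths add up to at least `fraction` of the total contig length."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    target = fraction * sum(contig_lengths)
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length


lengths = [100, 80, 60, 40, 20]  # made-up contig lengths, total 300
print(nxx_value(lengths))        # N50 -> 80  (100 + 80 = 180 >= 150)
print(nxx_value(lengths, 0.75))  # N75 -> 60  (100 + 80 + 60 = 240 >= 225)
print(nxx_value(lengths, 0.90))  # N90 -> 40  (280 >= 270)
```

A higher N50 for the same total length means the assembly is concentrated in fewer, longer contigs.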
23. attention by analyzing up to seven different parameters. Identified events, such as broken pairs, regions with low coverage and single-stranded coverage, are annotated and presented in a table.

4.2 How to run the Analyze Contigs tool

To run the Analyze Contigs tool:

Toolbox | Genome Finishing Module | Analyze Contigs

This opens the dialog shown in figure 4.1.

[Dialog screenshot: Analyze Contigs, Select read mappings, with "paired_illumina_miseq tutorial assembly" selected]

Figure 4.1: Select the contigs to be analyzed.

Select contigs and click Next. This leads to the Set parameters for contig analysis 1 step, shown in figure 4.2.

The parameters to be specified in this step are:

General parameters
• Minimum length: Specifies the minimum length of annotations. Does not apply to sudden changes in coverage and unaligned ends.

[Dialog screenshot: Analyze Contigs, Set parameters for contig analysis 1. General parameters: Minimum length, Minimum distance to contig ends, Ignore scaffold regions. Coverage: Detect sudden changes in coverage, Detect low coverage, Low coverage threshold, Detect high coverage, High coverage threshold. Truncated in source]
24. (b) In the web administrative interface on the master CLC Server, check that the plugin is enabled for each job node.

This is described in more detail in the CLC Server manual at http://www.clcsupport.com/clcgenomicsserver/current/admin/index.php?manual=Configuring_your_setup.html

[Screenshot: Plugins section of the web administrative interface, showing an installed plugin with an Uninstall button and an "Install new plugin" area]

Figure 2.23: Installing and uninstalling CLC Server plugins is done via the Plugins section of the web administrative interface.

To uninstall a CLC Server plugin, simply click on the button labeled Uninstall next to the relevant plugin.

Chapter 3 Align Contigs

3.1 What is the Align Contigs tool

The Align Contigs tool provides a platform to easily visualize and edit contigs. It is one of the most important tools in the finishing package and also the too
25. directly from the length of the gap and the insertion or deletion cost. This model often favors small, fragmented gaps over long contiguous gaps.
Insertion cost: Can be set at 1, 2 or 3.
Deletion cost: Can be set at 1, 2 or 3.
• Affine gap cost: An extra cost associated with opening a gap is introduced, such that long contiguous gaps are favored over short gaps.
Insertion open cost: Cost of opening an insertion in the read (a gap in the reference sequence).
Insertion extend cost: Cost of extending an insertion in the read (a gap in the reference sequence) by one column.
Deletion open cost: Cost of opening a deletion in the read (a gap in the read sequence).
Deletion extend cost: Cost of extending a deletion in the read (a gap in the read sequence) by one column.
• Length fraction: Minimum length fraction of a read that must match the reference sequence.
• Similarity fraction: Minimum fraction of similarity between read and reference sequence.
• Color space alignment and Color error cost: When working with data in color space (data from SOLiD systems), the color space checkbox is enabled and a corresponding cost for color errors can be set. If you do not have color space data, these options are disabled and not relevant.
• Auto-detect paired distances: Determine the insert size of paired data sets.
• Global alignment: If selected, end gaps are treated as mismatches. If not checked, end gaps ha
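The difference between the two gap cost models in the snippet above can be illustrated with a small calculation. This is only a sketch under a stated assumption: in the affine model, the first column of a gap is charged the open cost and every further column the extend cost. The function names and the example cost values are hypothetical and are not taken from the Workbench.

```python
def linear_gap_cost(gap_lengths, cost_per_column):
    """Linear model: every gap column costs the same, however the gaps are split."""
    return sum(length * cost_per_column for length in gap_lengths)


def affine_gap_cost(gap_lengths, open_cost, extend_cost):
    """Affine model (assumed convention): the first column of each gap costs
    open_cost, each additional column costs extend_cost."""
    return sum(open_cost + (length - 1) * extend_cost for length in gap_lengths)


# Six gap columns, either as one contiguous gap or as three short gaps.
one_long_gap = [6]
three_short_gaps = [2, 2, 2]

# Linear cost does not distinguish the two layouts.
print(linear_gap_cost(one_long_gap, 3), linear_gap_cost(three_short_gaps, 3))

# Affine cost charges the open cost three times for the fragmented layout,
# so the single contiguous gap is cheaper.
print(affine_gap_cost(one_long_gap, 6, 1), affine_gap_cost(three_short_gaps, 6, 1))
```

With these illustrative values the linear model scores both layouts identically, while the affine model makes the fragmented layout noticeably more expensive, which is why it favors long contiguous gaps.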
26. • Go to license download web page: In a browser window, show the license download web page, which can be used to download a license file. This option is suitable in situations where, for example, you are working behind a proxy, so that the Workbench does not have direct access to the CLC Licenses Service.
If you select the option to download a license directly and it turns out that the Workbench does not have direct access to the external network (because of a firewall, proxy server, etc.), you can click the Previous button to try the other method.
After selecting your method of choice, click on the button labeled Next.
Direct download
After choosing the Direct Download option and clicking on the button labeled Next, the dialog shown in figure 2.4 appears.
Figure 2.4: A license has been downloaded.
The progress of the license download is shown, and when the license has been downloaded, you will be able to click Next.
Go to license download web page
After choosing the Go to license download web page option and clicking on the button l
27. license you will need to connect the Workbench to the network again so it can contact the CLC License Server to obtain one.
Note: Your CLC License Server administrator can choose to disable the option allowing the borrowing of licenses. If this has been done, you will not be able to borrow a network license using your Workbench.
servers respond to the broadcast. The Workbench then uses TCP communication to get a license, assuming one is available. Automatic server discovery works only on local networks and will not work on WAN or VPN connections. Automatic server discovery is not guaranteed to work on all networks. If you are working on an enterprise network where local firewalls or routers cut off UDP broadcast traffic, you may need to configure the details of the CLC License Server manually instead.
Screenshot: License Manager dialog for CLC Genomics Workbench. License overview columns: Product ID, License type, Expires in, Status, Borrow limit, Borrow. License Borrowing: If you use a license server and need to work outside of your organization network, you can borrow a copy of your licenses from the license server. The borrowed license will allow you to use the application for the specified number of hours. Borrow the se
28. Figure 16.3: The distribution of error rates after error correction on a whole genome E. coli dataset from PacBio RS II P5-C3 (x axis: error rate, y axis: reads). The average error rate is 0.27.
Figure 16.4: Set Coverage percentage of reads to correct for the error correction.
The remaining shorter reads are used to perform the correction. For example, if the Coverage percentage of reads to correct is set to 25%, the tool will correct a subset of the longest reads that amounts to 25% of the total coverage, using the remaining shorter reads.
The De Novo Assemble PacBio Reads tool (see Chapter 17) needs at least 25-30x coverage on microbial genomes in order to obtain a high quality assembly. Thus, the Coverage percentage of reads to correct should be chosen such that the corrected reads supply a coverage of at least 25-30x. This means that if your dataset has a coverage of about 200x, you should set Coverage percentage of reads to correct to 12-15%. For datasets with very high coverage, you can get a better error correction by lowering the Coverage percentage of reads to correct, and at the same time get a sufficiently high coverage by the corrected reads to obtain a good assemb
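The rule of thumb in the snippet above is a simple ratio between the coverage wanted after correction and the coverage of the whole dataset. The following sketch reproduces the 200x example; the function name is hypothetical and the calculation is only an aid for choosing a value, not something the tool exposes.

```python
def coverage_percentage_to_correct(total_coverage, target_coverage):
    """Percentage of the total coverage that the corrected reads must supply
    in order to reach target_coverage (both given as fold coverage, e.g. 200 for 200x)."""
    if total_coverage <= 0:
        raise ValueError("total_coverage must be positive")
    # The corrected reads can never supply more than the whole dataset.
    return min(100.0, 100.0 * target_coverage / total_coverage)


# A 200x dataset needs 12.5% to 15% of its coverage corrected
# to supply the 25x to 30x required for assembly.
low = coverage_percentage_to_correct(200, 25)
high = coverage_percentage_to_correct(200, 30)
print(low, high)
```

For a dataset whose total coverage is already below the target, the function caps the result at 100, which simply signals that there is not enough data to reach the recommended coverage.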
29. of nucleotides that must have a perfect alignment before BLAST finds a match. A small value increases the sensitivity, but will result in more random matches and slow down the BLAST search on large data sets.
• Maximum BLAST e-value: The BLAST e-value indicates the number of hits that are expected by chance, where an e-value of 0 indicates a unique hit, while an e-value of 10 is a random match. Lowering the e-value threshold gives a more stringent alignment, which helps avoid misassemblies, but it also decreases the chance of identifying neighboring contigs that can be joined.
Match options
• Minimum match size: Specifies the minimum match size allowed in alignments.
When contigs are aligned against each other, the most interesting matches are often small overlaps between contig ends. To prevent such small overlaps from being filtered out by a low e-value threshold or the minimum match size, contig ends are aligned in a separate step. The alignment of contig ends allows matches of length > 8 bp, and matches that are close to the contig ends are considered more significant than matches far from the contig ends.
When it is possible to perform more than one of the four types of analyses described above, it is often a good idea to start out by performing each analysis separately. This will give an indication of how much each analysis contributes to improvements in the assembly. An analysis that cannot improve the assembly s
30. separate contigs.
Create Amplicons: Tool for placing amplicon annotations on sequences. Used before the Primer Creator to subdivide regions of interest into fragments of suitable sizes.
Create Primers: Automated primer design for re-sequencing purposes.
Add Reads to Contigs: Allows addition of additional sequence data to existing contigs.
Sample Reads: Allows a user-defined reduction of the number of reads.
Find Sequence: Tool to search for names, sequences or annotations in sequencing data.
Reassemble Regions: Reassembly of selected regions in contigs. Useful for solving small misassemblies.
Extend Contigs: Extends contigs with existing reads.
Join Contigs: An automated way of joining contigs.
Remove Extension of Contigs: Allows the user to remove the extensions from the contigs after the extended contigs have been joined.
Import PacBio Reads: An automated way to import the 2 file formats containing PacBio reads.
Correct PacBio Reads (beta): Corrects sequencing errors, and detects and resolves untrimmed adapter sequences and chimeric reads in PacBio SMRT reads.
De Novo Assemble PacBio Reads (beta): Assembles error-corrected long reads into high quality contigs.
Screenshot: Toolbox folder Genome Finishing Module, listing Add Reads to Contigs, Align Contigs, Analyze Contigs, Annotate from Reference, Collect Paired Read Statistics, Create Amplicons, Create Primers, Extend Contigs
31. sequence or annotation.
• Name: Search for the specified text string in sequence object names.
• Annotation: Search for the specified text string in annotations on selected sequences.
• Sequence: Search for the specified text string in selected sequences.
When a search is to be performed in a sequence, three new options become available. Tick the relevant parameters:
• Search both strands
• Treat ambiguous symbols as wildcards in search term
• Treat ambiguous symbols as wildcards in sequence
Figure 8.2: Select the parameters for name, sequence or annotation search.
Sequence selection
• All sequences: Search for the specified text string in all sequences.
• Reads: Search for the specified text string in only the reads of selected contigs.
• References and consensus: Search for the specified text string in reference and consensus sequence of the selected contigs.
8.2.1 The Find Sequence output
The output is a table showing the search hits with name, location and involved objects.
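To make the options Search both strands and Treat ambiguous symbols as wildcards in search term more concrete, the following sketch shows one way such a search could work. It is an illustration only, not the tool's implementation: the function names are hypothetical, ambiguity codes follow the standard IUPAC nucleotide alphabet, and the option for wildcards in the searched sequence is deliberately left out.

```python
import re

# IUPAC nucleotide codes expanded to regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement, also for IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sequence(term, sequence, both_strands=True, wildcards_in_term=True):
    """Return sorted (1-based start position, strand) hits of term in sequence."""
    sequence = sequence.upper()
    queries = [("+", term.upper())]
    if both_strands:
        # A hit on the minus strand equals a hit of the reverse complement
        # of the search term on the plus strand.
        queries.append(("-", reverse_complement(term)))
    hits = []
    for strand, query in queries:
        if wildcards_in_term:
            pattern = "".join(IUPAC[base] for base in query)
        else:
            pattern = re.escape(query)
        # The lookahead lets overlapping hits be reported as well.
        for match in re.finditer("(?=(" + pattern + "))", sequence):
            hits.append((match.start() + 1, strand))
    return sorted(hits)


print(find_sequence("GATTC", "AAGAATCAAGATTC"))
```

In the example, the search term is found on the plus strand near the end of the sequence and, as its reverse complement GAATC, on the minus strand near the start.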
32. the button labeled Next.
Direct download
After choosing the Direct Download option and clicking on the button labeled Next, the dialog shown in figure 2.9 appears.
The progress of the license download is shown, and when the license has been downloaded, you will be able to click Next.
Go to license download web page
After choosing the Go to license download web page option and clicking on the button labeled Next, the license download web page appears in a browser window, as shown in 2.10.
Figure 2.9: A license has been downloaded.
Figure 2.10: The license download web page.
Click the Request Evaluation License button. You can then save the license on your system. Back in the Workbench window, you will now see the dialog shown in 2.11.
Screenshot: License Wizard dialog, Import a license from a file: Please cl
33. Screenshot: De novo options dialog. Graph parameters: Automatic word size, Word size (20), Minimum word coverage (3). Contig polishing: No contig polishing, Contig polishing using input reads, Contig polishing using separate reads (Reads for error correction). Contig length: Minimum contig length (1,000).
Figure 17.1: Select assembly parameters.
Graph parameters
• Automatic word size: The word size is automatically estimated by default, but can also be set manually. We recommend to use a word size of 1 24. A small word size should be used for small genomes, while a large word size should be used for large genomes. When using an automatically estimated word size, you can see the actual word size in the history of the result files. Please note that the range of word sizes is limited to 12-24 on 32-bit machines and 12-64 on 64-bit machines.
• Minimum word coverage: Specifies the minimum number of times a given word must occur in the input reads in order for it to be included in the de Bruijn graph used by the assembler. The default minimum word coverage is 3. Using a smaller minimum word coverage will result in fewer contigs, while it may reduce the contig quality. Similarly, using a larger minimum word coverage will result in more contigs with a higher contig quality. If you have very high coverage, you may obtain a better assembly by choosing a larger minimum word cover
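The interplay between word size and minimum word coverage described above can be sketched by counting words (k-mers) in the reads and keeping only those that occur often enough to enter the graph. This is a simplified illustration with hypothetical function names: it counts words on one strand only and does not build the graph itself, so it should not be read as the assembler's actual procedure.

```python
from collections import Counter


def count_words(reads, word_size):
    """Count every word (k-mer) of length word_size across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - word_size + 1):
            counts[read[start:start + word_size]] += 1
    return counts


def graph_words(reads, word_size, min_word_coverage=3):
    """Words that occur at least min_word_coverage times and would be kept."""
    counts = count_words(reads, word_size)
    return {word for word, n in counts.items() if n >= min_word_coverage}


# Three identical reads plus one read with a likely sequencing error at the end.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTT"]
print(sorted(graph_words(reads, word_size=4, min_word_coverage=3)))
print(sorted(graph_words(reads, word_size=4, min_word_coverage=4)))
```

With a minimum word coverage of 3, the words that only occur in the erroneous read are dropped; raising the threshold to 4 also removes correct words, which illustrates why a higher value only makes sense when coverage is high.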
34. Figure 13.1: Overlap between contig 14 and contig 98 that was created when the selected region was extended. If these two contigs aren't joined, the overlap region will include some nucleotides twice.
13.2 How to run the Remove Extension of Contigs tool
To run the Remove Extension of Contigs tool: Toolbox | Genome Finishing Module | Remove Extension of Contigs
In the dialog that appears, select the contigs that have been extended, and open or save the result. Figure 13.2 shows an example of contigs that have been extended and the result after the extended contigs have been subjected to the Remove Extension of Contigs tool.
Screenshot: sequence view of paired_illumina_miseq contigs 30 to 34 around positions 26,580 to 26,640, with the annotation types Broken pair, Contigs joined and Extended region.
35. Figure 12.4: Table containing details on each join made by the tool.
An annotation on the sequence also indicates whether the join was performed using an overlap or a gap (figure 12.5).
Figure 12.5: An example of a gap between two contigs that has been filled based on long reads.
The second output is a table of contigs not joined (see figure 12.6). The column Reason differentiates between two sorts of contigs:
• Not part of any join describes contigs that were not joined at all. This can happen if the contigs are a result of contamination in the sample, or if there was insufficient information to join the contig correctly.
• Repeat not included enough times describes contigs that were identified as repetitive and joined in some contigs, but not in all the contigs expected based on the estimated copy number of the repeat contig (calculated by the Contig Joiner tool).
Screenshot: table of contigs not joined (coli reads 180 250 contigs 15, 24, 35, 38, 42, 46 and 47) with the Reason column showing Repeat not included enough times or Not part of any join.
36. Figure 3.10: Dialog for distributing reads between split contigs.
Figure 3.11: Left contig of a split where the contig shares a small region with the right contig.
3.3.5 Adding new data
If more contigs become available, they can be added later. To import more contigs, possibly with reads mapped to them, click the Add/Extract button in the Contig Table and select Add Contigs. This brings up a dialog where the contigs can be selected, and when Finish is clicked, both tables will be updated with the new contigs and matches from these.
Chapter 4 Analyze Contigs
4.1 What is the Analyze Contigs tool
The Analyze Contigs tool identifies problematic regions that need further
37. Figure 9.3: Paired read statistics table.
The table lists:
• Contig: The name of the first contig in the contig pair that shares paired reads.
• Mate Contig is Before/After: The localization of the mate contig relative to the first contig.
• Mate Contig: The name of the mate contig in the contig pair that shares paired reads.
• Mate Contig Orientation: Orientation of the mate contig. The first contig is always in forward direction.
• Occurrences: The number of paired reads shared by the two contigs.
• Average Distance: The average distance between the two contigs. A negative number indicates the size of an overlap.
• Standard Deviation: The standard deviation of the average distance.
The table can be used to identify contigs that can potentially be joined, or at least positioned relative to one another. Misassemblies may also be detected in cases with several shared reads, a large overlap (indicated by a large negative distance) and a small standard deviation.
One way to start using the table is to look at the contigs with the most shared reads, by clicking twice on the Occurrences column to sort by the most abundant paired reads. Entries with only a few occurrences can be ignored, or discarded by creating a filter that hides the least frequent entries. When potentially interesting conti
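The way of working with the paired read statistics table described above (hide rare entries, sort by Occurrences, read a negative Average Distance as an overlap) can be mimicked outside the Workbench, for example on an exported table. The sketch below assumes a hypothetical row structure and an arbitrary threshold of five occurrences; neither comes from the tool.

```python
from dataclasses import dataclass


@dataclass
class PairLink:
    """One row of a paired read statistics table (hypothetical field names)."""
    contig: str
    mate_contig: str
    occurrences: int
    average_distance: float
    standard_deviation: float


def candidate_joins(rows, min_occurrences=5):
    """Hide rows with few shared paired reads and sort the rest, most shared first."""
    kept = [row for row in rows if row.occurrences >= min_occurrences]
    return sorted(kept, key=lambda row: row.occurrences, reverse=True)


def overlap_size(row):
    """A negative average distance indicates an overlap of that size."""
    return max(0.0, -row.average_distance)


# Made-up example rows, for illustration only.
rows = [
    PairLink("contig 1", "contig 7", 2, 350.0, 40.0),
    PairLink("contig 3", "contig 4", 58, -120.0, 15.0),
    PairLink("contig 2", "contig 9", 17, 210.0, 60.0),
]
for row in candidate_joins(rows):
    print(row.contig, row.mate_contig, row.occurrences, overlap_size(row))
```

Rows that survive the filter with many occurrences, a large overlap and a small standard deviation are the ones the text flags as worth inspecting for possible misassemblies.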
38. Figure 14.1: The Annotation transferred table shows all annotations which could be transferred to the contigs.
Figure 14.2: A table can be output of the annotations not transferred. In the example, the table lists the reference (NC_010473), the reference region and the reason (No intersection or Low overlap).
thresholds for when matches between contigs and annotated refere
39. Both tools are designed for microbial genomes and small eukaryotic genomes (for example C. elegans, with a 100 Mb genome).
Assembly of the error-corrected PacBio reads is done using a de Bruijn graph based approach (Pevzner et al., 2001), but uses a number of novel techniques to close gaps in the graph, correct discrepancies in the graph, and finally solve the graph. The use of a de Bruijn graph, in contrast to a string overlap graph as in, for example, PacBio's HGAP (Chin et al., 2013), results in an extremely fast and memory-efficient assembler.
17.2 How to run the De Novo Assemble PacBio Reads tool
If your input is raw SMRT sequencing reads, you should start by running the Correct PacBio Reads tool (see Chapter 16) to correct the reads.
To start the assembly tool, go to: Toolbox | Genome Finishing Module | De Novo Assemble PacBio Reads (beta)
This will open a dialog where you can select sequences to assemble. If you already selected sequences in the Navigation Area, these will be shown in Selected Elements. You can alter your choice of sequences to assemble by using the arrows to move sequences between the Navigation Area and the Selected Elements box. You can also add sequence lists.
Click Next to set the parameters for the assembly. This will show a dialog similar to the one in figure 17.1.
Screenshot: De Novo Assemble PacBio Reads (beta) dialog, De novo options. 1 Choose
40. Figure 2.12: Read the license agreement carefully.
Please read the EULA text carefully before clicking in the box next to the text I accept these terms to accept, and then clicking on the button labeled Finish.
2.3.3 Import a license from a file
If you already have a license file associated with the host ID of your machine, it can be imported using this option. When you have clicked on the Next button, you will see the dialog shown in 2.13.
Figure 2.13: Selecting a license file.
Click the Choose License File button and browse to find the license file. When you have selected the file, click on the Next button.
Accepting the license agreement
Part of the installation of the license involves checking and accepting the end user license agreement (EULA). You should now see a window like that in figure 2.14.
Screenshot: License Wizard dialog, License Agreement: Please read and accept the license agreement below t
41. End User License Agreement ("EULA") is a legal agreement between you (either an individual person or a single legal entity, who will be referred to in this EULA as "You") and CLC bio A/S, CVR no. 28 30 50 87, for the software products that accompany this EULA, including any associated media, printed materials and electronic documentation (the "Software Product").
Figure 2.7: Read the license agreement carefully.
Please read the EULA text carefully before clicking in the box next to the text I accept these terms to accept, and then clicking on the button labeled Finish.
2.3.2 Download a license using a license order ID
Using a license order ID, you can download a license file via the Workbench or using an online form. When you have chosen this option and clicked the Next button, you will see the dialog shown in 2.8.
Enter your license order ID into the text field under the title License Order ID. The ID can be pasted into the box after copying it and then using menus or key combinations (like Ctrl+V on some systems or ⌘+V on Mac).
Screenshot: License Wizard dialog, Download a license.
42. • Running the tool on highly fragmented assemblies
• large genome
To help estimate the required memory consumption, both for bacterial-sized genomes and larger genomes, some examples are given below. The memory consumption was measured on a machine with four cores, and the memory consumption for the long reads scaffolding can be larger for machines with more cores.
as Siya Long read scaffolding 2 3 232 454 reads 5GB _ 45 Me Reference ased sf avg tengo 0 s cerevisiae Paired read scaffolding 22 262 192 pia oe E coli Long read scaffolding 163 478 PacBio reads SGB asno CC pleno B lactucae Long read scaffolding 6 086 612 PacBio reads 10GB so O a teng zan O
2.2 How to install a Workbench plugin
Workbench plugins are installed using the Plugin Manager. To start up the Plugin Manager, go to: Help in the Menu Bar | Plugins and Resources, or Plugins in the Toolbar
The Plugin Manager has three tabs at the top:
• Manage Plugins: This is an overview of plugins that are installed.
• Download Plugins: This is an overview of plugins available to download.
• Manage Resources: This is an overview of installed resources.
To install a plugin, first click the Download Plugins tab to see a list of plugins available for download (see figure 2.1). When a particular plugin entry is selected by clicking on it in the left-hand column, a button labeled Download and Install should appe
43. Figure 18.1: The PacBio De Novo Assembly Pipeline workflow. The corrected reads are mapped to the contigs with Map Reads to Contigs (outputs: Corrected reads mapped to contigs, Corrected reads mapped to contigs report), and the read mapping is passed to Analyze Contigs (outputs: Contig analysis report, Contig analysis table).
Chapter 19 Available tutorials
19.1 Aligning contigs manually using the Genome Finishing Module
Using a publicly available E. coli data set, this tutorial is an introduction to joining, splitting and extending contigs manually using the Align Contigs tool, the Analyze Contigs tool and the Extend Contigs tool of the CLC Genome Finishing Module.
The tutorial can be downloaded from our website: http://www.clcbio.com/clc-plugin/genome-finishing-module/
Bibliography
[Allawi and SantaLucia, 1997] Allawi, H. T. and SantaLucia, J. (1997). Thermodynamics and NMR of internal G-T mismatches in DNA. Biochemistry, 36:10581-10594.
[Allawi and SantaLucia, 1998a] Allawi, H. T. and SantaLucia, J. (1998a). Nearest neighbor thermodynamic parameters for internal G-A mismatches in DNA. Biochemistry, 37:2170-2179.
[Allawi and SantaLucia, 1998b] Allawi, H. T. and SantaLucia, J. (1998b). Nearest-neighbor thermodynamics of internal A-C mismatches in DNA: Sequence dependence and pH effects
44. Server manual at http://www.clcsupport.com/clcgenomicsserver/current/admin/index.php?manual=Starting_stopping_server.html
There are three different server setups. A short description of each setup and a summary of the plugin licensing requirements are below.
• Single server setup: A single machine is running the CLC Server software. Jobs are submitted to this server, which receives and executes them. In this setup, a single machine acts both as a master and an executor of jobs. Here, a single static license for the plugin is installed in the CLC Server software.
• Job node setup: More than one machine is running the CLC Server software. The system acting as the master server receives job requests and then submits these jobs to other machines, the job nodes, for execution. Here, a single static license is installed on each machine running the CLC Server software. That is, a static license is installed on the master node and on each job node.
• Grid setup: One machine runs the CLC Server software and receives job requests. It then submits these to a third party scheduler. The scheduler then chooses an appropriate grid machine, or node, to submit a given job to for execution. Here, a single static license for the plugin is installed on the master server, and the same number of network plugin licenses as there are network gridworker licenses needs to be made available by installing these in the CLC License Server software.
For a more detailed d
45. Figure 13.2: Top: Contigs that have been extended. Extended regions can be identified by ticking Extended region under Annotation types. Bottom: The result after the extended contigs have been subjected to the Remove Extension of Contigs tool. The extended regions that have not been used to perform a join have been removed.
Chapter 14 Annotate from Reference
14.1 What is the Annotate from Reference tool
When a closely related reference which has already been annotated is available, this tool can
46. abeled Next, the license download web page appears in a browser window, as shown in 2.5.
Click the Request Evaluation License button. You can then save the license on your system. Back in the Workbench window, you will now see the dialog shown in 2.6.
Click the Choose License File button and browse to find the license file you saved. When you have selected the file, click on the button labeled Next.
Figure 2.5: The license download web page.
Figure 2.6: Importing the license file downloaded from the web page.
Accepting the license agreement
Part of the installation of the license involves checking and accepting the end user license agreement (EULA). You should now see a window like that in figure 2.7.
Screenshot: License Wizard dialog, License Agreement: Please read and accept the license agreement below to begin using your license. END USER LICENSE AGREEMENT FOR CLC BIO SOFTWARE, CLC Genomics Workbench 1.0. 1. Recitals. 1.1 This
47. adjusting the parameters, click Next. This opens the dialog shown in figure 6.7.
Figure 6.7: Set G/C anchor parameters. In the dialog, both anchors are shown with End length 2, Min. no. of G/C 0 and Max. no. of G/C 2.
G/C anchor parameters
• Use 3' end G/C anchor parameters: Checking the box makes it possible to specify the preferred number of G/C occurrences at the 3' end of the primer.
End length: The number of consecutive bases to consider at the 3' end.
Min. no. of G/C: The minimum number of G/Cs in the considered interval.
Max. no. of G/C: The maximum number of G/Cs in the considered interval.
• Use 5' end G/C anchor parameters: Checking the box makes it possible to specify the preferred number of G/C occurrences at the 5' end of the primer.
End length: The number of consecutive bases to consider at the 5' end.
Min. no. of G/C: The minimum number of G/Cs in the considered interval.
Max. no. of G/C: The maximum number of G/Cs in the considered interval.
Adjust the parameters and click Next. This opens the dialog shown in figure 6.8.
Screenshot: Create Primers dialog, Set mispriming parameters
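The G/C anchor parameters above amount to counting G and C bases in a window at one end of the primer and checking the count against a minimum and a maximum. A minimal sketch follows, assuming that primers are written 5' to 3'; the function name is hypothetical, and the defaults mirror the values visible in the dialog (End length 2, minimum 0, maximum 2).

```python
def gc_anchor_ok(primer, end, end_length=2, min_gc=0, max_gc=2):
    """Check the number of G/C bases in the last (3') or first (5') end_length
    bases of a primer written 5' to 3'."""
    if end_length < 1:
        raise ValueError("end_length must be at least 1")
    primer = primer.upper()
    if end == "3'":
        window = primer[-end_length:]
    elif end == "5'":
        window = primer[:end_length]
    else:
        raise ValueError("end must be \"3'\" or \"5'\"")
    gc_count = sum(base in "GC" for base in window)
    return min_gc <= gc_count <= max_gc


# Require at least one G/C among the last two bases at the 3' end.
print(gc_anchor_ok("ATGCATGC", "3'", end_length=2, min_gc=1, max_gc=2))
print(gc_anchor_ok("ATGCATAT", "3'", end_length=2, min_gc=1, max_gc=2))
```

With the dialog defaults (minimum 0, maximum 2 over two bases) every primer passes, so the check only becomes restrictive once the minimum is raised or the maximum lowered.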
48. age. Otherwise, we recommend that you leave it at 3.
Contig polishing
Contig polishing is the last step of the assembly algorithm, in which putative assembly errors in the contigs are resolved by mapping a set of reads to the contigs and building a consensus of this read mapping.
• No contig polishing will speed up the assembly process.
• Contig polishing using input reads uses the error-corrected input reads that were used for the actual assembly.
• Contig polishing using separate reads uses another set of reads.
Including the contig polishing step improves the assembly quality significantly, but it may also double the execution time. To obtain optimal assembly quality, we recommend using raw PacBio reads for contig polishing, by selecting these as input for the Contig polishing using separate reads option. However, if these are not available, the assembly quality is also improved greatly when the error-corrected input reads are used.
Minimum contig length
Contigs below the specified length will not be reported. The default value is 1,000 bp. For very large assemblies, the number of contigs can be large, in which case the contig polishing step will be slow. In this case, it is an advantage to raise the minimum contig length to reduce the number of contigs that have to be considered.
Click Next to set the output options, and finally click Finish to start the assembler.
17.3 De Novo Assemble
49. ar in the plugin description area. Additional information about that plugin is displayed in the right-hand panel. Click the CLC Genome Finishing Module and click on the Download and Install button. A dialog displaying the progress of the download and installation of the plugin will be shown. If you have downloaded the .cpa file for the CLC Genome Finishing Module Plugin, you can install it by clicking the Install from File button at the bottom of the Plugin Manager window. This will open a dialog where you can browse for the plugin .cpa file and choose to install it.
How to do this differs between operating systems. To run the program in administrator mode on Windows Vista or 7, right-click the program shortcut and choose Run as Administrator.
50. Figure 15.1: Importing data from PacBio. .bas.h5, .fastq and .fasta files are supported.
We support import of three file formats containing PacBio reads:
• H5 files (.bas.h5/.bax.h5), which contain one of two things: .bas.h5 files produced by instruments prior to PacBio RS II contain sequencing data such as reads and quality scores, whereas .bas.h5 files from more recent PacBio instruments contain a list of .bax.h5 files where the actual sequencing data is stored. When importing H5 files, the user needs to select both the .bas.h5 file and all the accompanying .bax.h5 files belonging to a data set.
• Fastq files (.fastq), which contain sequence data and quality scores. Compressed Fastq (.fastq.gz) files are also supported.
• Fasta files (.fasta), which contain sequence data. Compressed Fasta (.fasta.gz) files are also supported.
Under General options you have the following choices:
• Discard read names: For high-throughput sequencing data, the naming of the individual reads is often irrelevant given the huge number of reads. This option allows you to discard read names to save disk space.
• Discard quality scores: Quality scores can be visualized in the mapping view and used for SNP detection. If this is not relevant for your work, you can choose to Discard quality scores. Discarding q
51. assembly
Chapter 11 Extend Contigs
11.1 What is Extend Contigs
Contig joining is often based on overlaps between contigs. However, in some cases the de novo assembler creates contigs with no or only small overlaps between neighboring contigs. In such cases, the Extend Contigs tool can be used to create large overlaps, which makes identification of possible joins easier. When reads are mapped to contigs, reads will often continue outside the start or end of a contig. There can be many reasons for this, but one common cause is repeat regions which the de novo assembler has failed to connect to a contig. The Extend Contigs tool extends a contig with the consensus of the reads that continue outside the ends of the contig. This will often result in large overlaps between neighboring contigs and enable such contigs to be joined with the automatic join tool. Care should be taken whenever the extended region of a contig constitutes a repeat, and a join should, if possible, be confirmed by other evidence, such as paired reads spanning the overlapping region or an alignment of the contigs to a reference sequence. See figure 11.1 for an example of contigs that have been extended.
[Figure 11.1: contigs 100, 101, 102 and 103 with Extended region annotations]
52. ate Amplicons tool and Create Primers tools can be useful.
5.1 What is the Create Amplicons tool
Create Amplicons is a tool that allows the addition of amplicon annotations to a sequence of interest. These annotations can subsequently be used as targets for the Create Primers tool. The advantage of using the Create Amplicons tool prior to primer design is that the Create Amplicons tool can subdivide regions of interest into fragments of suitable sizes.
5.2 How to run the Create Amplicons tool
To run the Create Amplicons tool:
Toolbox | Genome Finishing Module | Create Amplicons
This opens the dialog shown in figure 5.1.
Figure 5.1: Select a contig or sequence.
Select a sequence or contig and click Next. Amplicon creation is directed by annotation types. This means that it is possible to create amplicons for, e.g., all regions with a certain annotation, such as scaffolds or genes in the input sequence. However, it is also possible to narrow down the region to be used for amplicon creation to, for example, single gene lev
53. back on the sequence within the start and end positions that were specified in the algorithm (figure 5.4).
Figure 5.4: Amplicon annotations are added to the sequence back to back.
Chapter 6 Create Primers
6.1 What is the Create Primers tool
The Create Primers tool is an automated way of creating primers for specific regions using settings specified by the user. The Create Primers tool is useful whenever resequencing is required, e.g. in regions with poor read quality, repeats or low coverage.
6.2 How to use the Create Primers tool
To run the Create Primers tool:
Toolbox | Genome Finishing Module | Create Primers
This opens the dialog shown in figure 6.1.
Figure 6.1: Select any number of contigs or sequences.
Select any number of sequences or contigs and click Next. This opens the dialog shown in figure 6.2. The parameters to be specified in this step are:
Set regions to amplify
• Start out by clicking on the Select annotation type icon to specify which annotation types are to be included in the primer design.
54. Figure 2.8: Enter a license order ID for the software.
In this dialog, there are two options:
• Direct download: Download the license directly from CLC bio. This method requires that the Workbench has access to the external network.
• Go to license download web page: In a browser window, show the license download web page, which can be used to download a license file. This option is suitable in situations where, for example, you are working behind a proxy, so that the Workbench does not have direct access to the CLC Licenses Service.
If you select the option to download a license directly and it turns out that the Workbench does not have direct access to the external network (because of a firewall, proxy server, etc.), you can click the Previous button to try the other method. After selecting your method of choice, click on
55. cific Biosciences instrument constitutes a violation of the end user license agreement that users of the CLC Genome Finishing Module agree to during installation.
The PacBio De Novo Assembly Pipeline workflow (see Figure 18.1) takes raw PacBio reads in FASTQ or H5 format as input and produces a high-quality assembly together with a number of reports that can be used to evaluate the quality of both the input data and the assembly. It consists of seven steps running six different tools from the CLC Genome Finishing toolbox and the general CLC Genomics Workbench toolbox:
1. Raw PacBio reads import: Raw PacBio reads are imported from FASTQ or H5 files (see Chapter 29).
2. Correct PacBio Reads: Sequencing errors are corrected, and chimeric reads and untrimmed adapters are resolved, in a subset of the longest reads in the input data set (see Chapter 16). The corrected reads are output in a file named Corrected reads, and a summary of the error correction is saved in a file named Corrected reads report. This report can be used both to evaluate the quality of the input reads and to assess the error correction and assembly parameters.
3. De Novo Assemble PacBio Reads: The error-corrected reads are assembled into high-quality contigs (see Chapter 17).
4. Map Reads to Contigs: The corrected reads are mapped to the contigs in order to be able to run the Join Contigs tool.
5. Join Contigs: Contigs are joined by auto
56. Figure 9.1: Select the read mappings to analyze.
Select the relevant read mappings and click Next. The next wizard window (figure 9.2) makes it possible to choose how the paired reads statistics are collected. The default option is to only consider reads that map to the contig ends, which helps filter out noise from reads that are erroneously mapped or reads that map to repetitive regions, and thus makes it easier to determine whether two contigs are neighbors. Alternatively, statistics can be generated from all read pairs mapped to the contigs, which can make misassemblies evident as large overlaps between contigs. It is also possible to restrict collection of paired statistics to reads from specific paired libraries. This is done in step 2 of the wizard by selecting the option Include subset of libraries and then selecting one or more libraries which have reads mapped to the contigs. Please note that the libraries are named after the file from which the reads were imported.
57. Figure 4.2: Set parameters for contig analysis 1.
• Minimum distance to contig ends: Specifies the minimum distance an annotation must have to the contig ends.
• Ignore scaffold regions: By ticking the box, regions between scaffolded contigs are ignored.
Coverage
• Detect sudden changes in coverage: A sudden change in coverage in adjacent regions can imply a misassembly.
• Detect low coverage: Regions with low coverage can indicate a misassembly. Ticking the box allows specification of a threshold value for the minimum number of required overlapping reads.
• Detect high coverage: Regions with high coverage can indicate a misassembly. Ticking the box allows specification of a threshold value for the maximum number of accepted overlapping reads.
Unaligned read ends
• Detect unaligned read ends: Unaligned ends of reads can imply a misassembly. Ticking the box allows specification of a threshold value for unaligned ends, which is the maximum percentage of unaligned read ends allowed at a position compared to neighboring positions.
• Minimum coverage requirement: Specifies the minimum amount of coverage required before checking for unaligned ends.
After adjusting the parameters, click Next. This opens the dialog shown in figure 4.3. The parameters to be specified in this step are:
Single stranded coverage: When
58. e: Percentage of the contig nucleotides covered by the match.
Identity: Percentage of matching nucleotides in the match.
The Contig match table describes the mapping of the contigs relative to the selected reference. Two functions are available in the Contig match table:
• Show Contig Matches: Shows a visualization of the matches.
• Refresh Contig Matches: Updates the contig matches after manual editing of the contigs.
Note: After manual editing of the contigs, you must manually refresh the contig matches; otherwise the match table and the match view will not be up to date.
3.3.3 Joining two contigs
It can be relevant to join two contigs for several reasons, e.g. if you:
1. detect two overlapping contigs using the contig aligner,
2. have contigs which map to the reference genome and are separated by a gap,
3. have resequenced regions, made a de novo assembly with the resequenced reads included, and want to join the new contigs with the existing ones.
It is possible to join two contigs in different ways:
• Joining contigs using the Join Contigs button in the Contig table view (figure 3.3) is performed without using a reference sequence. You can select the two contigs you wish to join in the Contig table by holding down the Ctrl key and clicking on the two contigs. Alternatively, you can select two contigs from the Contig match view and then select a region in the reference containing matches from the two contigs. Because the Contig match vi
59. e reference is shown at the top and all matches are shown below. The match that was double-clicked is highlighted, as shown in figure 3.5.
Figure 3.5: The contigs aligned to a reference sequence. Note that the Compactness in the Side Panel is set to Low, which makes it possible to see the names of the contigs.
When no reference sequence is available, the contigs will be aligned against each other as shown in figure 3.6.
60. e(s) the Workbench is currently using. To do this, open the License Manager using the menu option:
Help | License Manager
The License Manager is shown in figure 2.19. This dialog can be used to:
• See information about the license (e.g. what kind of license, when it expires).
• Configure how to connect to a license server (Configure License Server, the button at the lower left corner). Clicking this button will display a dialog similar to figure 2.15.
• Upgrade from an evaluation license by clicking the Upgrade license button. This will display the dialog shown in figure 2.2.
• Export license information to a text file.
• Borrow a license.
If you wish to switch away from using a network license, click on the button to Configure License Server and uncheck the box beside the text Enable license server connection in the dialog. When you restart the Workbench, you can set up the new license as described in section 2.3.
2.3.5 Download a static license on a non-networked machine
To download a static license for a machine that does not have direct access to the external network, you can follow the steps below.
61. Figure 10.3: Select annotations to consider for reassembly.
If the Reassemble Region tool has been capable of solving the problem, the sequence will now be reassembled as shown in figure 10.4. If the Reassemble Region tool was incapable of correcting the problem, the black pop-up box will announce this and the sequence will remain unchanged.
Figure 10.4: The region from figure 10.1 after re
62. ed reads.
4. For each seed read, compute a consensus sequence and output this sequence as a corrected read.
The longest reads are selected as seed reads because they give the assembler the most information to resolve large repeats. Figures 16.1 to 16.3 illustrate the error rates of an E. coli dataset before and after error correction.
Figure 16.1: Error rates before and after error correction on a whole-genome E. coli dataset from PacBio RS II (P5-C3). Axes: original reads error rate versus corrected reads error rate.
Figure 16.2: The distribution of error rates before error correction on a whole-genome E. coli dataset from PacBio RS II (P5-C3). The average error rate is 12.80%.
16.2 How to run the Correct PacBio Reads tool
To start the error correction tool, go to:
Toolbox | Genome Finishing Module | Correct PacBio Reads (beta)
In this dialog, you can select one or more sequence lists containing the raw PacBio reads that should be corrected. Click Next to set the parameters for the error correction. This opens the dialog shown in Figure 16.4. In this dialog, you can set the Coverage percentage of reads to correct. The error correction tool will correct a number of long reads amounting to the entered fraction of the total coverage.
[Figure: Distribution, corrected reads]
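The selection of the longest reads up to a fraction of the total coverage can be pictured with the following Python sketch. It is only an illustration under stated assumptions, not the tool's algorithm: the function name `select_seed_reads` is hypothetical, and "total coverage" is interpreted here as the total number of bases in the input reads.

```python
def select_seed_reads(read_lengths, coverage_fraction):
    """Pick the longest reads until they hold a fraction of all bases.

    Hypothetical sketch: `read_lengths` is a list of read lengths in bp
    and `coverage_fraction` is a value between 0 and 1. Returns the
    indices of the selected reads, longest first.
    """
    if not 0.0 <= coverage_fraction <= 1.0:
        raise ValueError("coverage_fraction must be between 0 and 1")
    target = coverage_fraction * sum(read_lengths)
    # Visit reads from longest to shortest.
    by_length = sorted(range(len(read_lengths)),
                       key=lambda i: read_lengths[i], reverse=True)
    selected, accumulated = [], 0
    for index in by_length:
        if accumulated >= target:
            break  # the requested share of bases has been reached
        selected.append(index)
        accumulated += read_lengths[index]
    return selected


lengths = [100, 500, 300, 100]
print(select_seed_reads(lengths, 0.6))  # indices of the two longest reads
```

Raising the fraction selects progressively shorter reads, so under this interpretation more reads are corrected at the cost of a longer run.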
63. Figure 3.6: The contigs aligned to themselves. In this example, the top match is the contig itself with a perfect alignment. There is a big overlap with contig 78, which seems to share a region with contig 50. The bottom match from contig 35 is faded, which means that contig 35 does not match contig 50 in the region shown, but there is a match somewhere else.
The contig match table contains the following columns:
Reference: The name of the reference sequence.
Contig: The name of the contig.
Reference start: Start position of the match in the reference sequence.
Reference end: End position of the match in the reference sequence.
Contig start: Start position of the match in the contig sequence.
Contig end: End position of the match in the contig sequence.
Contig span length: Span size in the underlying contig for the match, including regions between linked matches.
Match count: Number of linked sub-matches contained in this match.
Aligned nucleotides: The number of aligned nucleotides in the match, excluding regions between linked matches.
Contig percentag
64. el. This is done using the Restrict by qualifiers function. The dialog shown in figure 5.2 allows specification of which regions should be used for amplicon creation.
Figure 5.2: Specify parameters for the Create Amplicons tool.
The parameters to be specified in this step are:
Amplicon options
• Amplicon length: Allows specification of the desired length of the amplicon annotations to be created.
• Overlap size: A positive value specifies the number of nucleotides by which the amplicon annotations should overlap if tiling amplicons are desired. A negative overlap designates the number of nucleotides by which amplicon annotations should be separated.
Amplicon placement
• Annotation type: Contains a drop-down list that makes it possible to select the annotation type of the problematic regions for which amplicons are created.
• Offset relative to annotation ends: A positive value will extend each amplicon by that number in both directions, and a negative value will shrink it.
• Restrict by qualifier: E
65. en checked, the report will contain a section for each contig with statistics for only that contig.
• Create table
Click Finish.
4.3 How to use the Analyze Contigs tool
4.3.1 The contig analysis table
The contig analysis generates a table that lists the start and end position, as well as the length, of all problematic regions detected for each contig. The function of the table is to provide an overview that can form the basis for manually discriminating actual misassemblies from correct assemblies. The table by itself does not give access to editing the data; this needs to be done either directly in the contig sequence (possibly with reads mapped) or through the contig aligner. A good starting point for further analysis is to look in the top left corner of the table, where the number of rows in the table is shown. In cases with many rows, it can be a good idea to adjust some of the parameters in order to remove potential false positive results and thereby reduce the number of rows. When the parameter settings have been optimized, the table can be used for manual evaluation of the problematic regions, e.g. using the filter tool.
4.3.2 How to edit data following contig analysis
To edit data, the relevant contig must be opened from the read mapping results. By selecting the row of interest in the contig analysis table, this region will automatically be highlighted in the read mapping. For clarity, it can be a good idea to enable annotation types correspondi
66. Figure 9.2: Select whether to collect paired reads only from the ends of contigs or from the entire contig. Optionally, restrict the collection of paired statistics to a subset of paired libraries.
Finally, click Next and Finish.
Note: The Collect Paired Reads Statistics tool should only be run on de novo assemblies where the contigs have not been edited. If run on modified contigs, the distance estimates will not be accurate. If your contigs have been modified, you can extract the contig sequences by opening the de novo assembled data, selecting all contigs and clicking on Extract Contig. The extracted contig sequences can then be used as reference in a new read mapping using the NGS core tool Map Reads to Contigs. This new read mapping can now be used as input to the Collect Paired Read Statistics tool.
9.3 How to use the Collect Paired Read Statistics tool
The output of the Collect Paired Read Statistics tool is the paired statistics table shown in figure 9.3.
[Figure 9.3: paired statistics table with the columns Contig, Mate contig is before/after, Mate contig, Mate contig orientation, Occurrences, Average distance and Standard deviation]
67. Figure 12.6: Table containing a list of contigs that were not part of any join, or not part of enough joins in the case of repeat contigs.
Chapter 13 Remove Extension of Contigs
13.1 What is the Remove Extension of Contigs tool
When using the Extend Contigs tool to create larger overlaps between contigs, these overlaps remain unless the contigs are actually joined. In the process of joining, the overlapping nucleotides are reduced to being included only once. However, extended ends of contigs not forming part of a join will remain. The Remove Extension of Contigs tool removes extensions that were not included in a join. See figure 13.1 for an example of an overlap of a contig that should be removed if the region isn't included in a join.
68. ersects the split region, the two contigs which are the result of the split will be extended with nucleotides from the other contig to preserve read alignments. As a simple example, consider a split where a single read intersects the split position. If the read is placed to the left of the split, the left contig is extended with nucleotides from the right contig such that the alignment of the read is preserved in the left contig. Consequently, a split at a position with intersecting reads will result in two contigs containing overlapping regions. Besides preserving the alignment of intersecting reads, the extension of split contigs is often convenient, as the extended area will often overlap with the correct neighboring contig. Figure 3.11 illustrates the left half of a split, where the split function has annotated the split position together with the region of the contig that overlaps with the right half of the split.
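The extension rule in the example above can be sketched in Python as follows. This is a simplified illustration, not the module's code: the function name `split_with_extension`, the half-open coordinate convention and the explicit "left"/"right" assignment of each read are assumptions.

```python
def split_with_extension(contig_length, split_pos, reads):
    """Compute the coordinates of the two contigs produced by a split.

    Hypothetical sketch: `reads` is a list of (start, end, side) tuples
    in half-open contig coordinates, where `side` ("left" or "right")
    says which new contig the read is placed in. Reads crossing the
    split position pull extra nucleotides into their own contig.
    """
    left_end = split_pos      # left contig covers [0, left_end)
    right_start = split_pos   # right contig covers [right_start, contig_length)
    for start, end, side in reads:
        if start < split_pos < end:  # the read intersects the split position
            if side == "left":
                left_end = max(left_end, end)
            else:
                right_start = min(right_start, start)
    return (0, left_end), (right_start, contig_length)


# One read spanning positions 450 to 550 is kept in the left contig.
left, right = split_with_extension(1000, 500, [(450, 550, "left")])
print(left, right)  # the two contigs share the region 500 to 550
```

Under these assumptions the shared region between the two resulting contigs is exactly the part of the intersecting read that lies beyond the split position.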
69. Figure 14.5: Specify parameters for deciding in which cases an annotation should be transferred.
Figure 14.6: Output options for the Annotate from Reference tool.
Chapter 15 Import of PacBio reads
Choosing the PacBio import will open the dialog shown in figure 15.1.
70. escription of the different server setups, please refer to the CLC Server manual at http://clcsupport.com/clcgenomicsserver/current/admin/index.php?manual=Job_Distribution.html
2.5.1 Static license installation
In each of the server models described above, a static license is installed in the CLC Server on a master machine. In the case of a job node setup, static licenses are also installed on each machine acting as a job node. Static licenses for the Server Finishing Module are downloaded and installed into the licenses folder in the CLC Server installation area. Downloading a license is similar for all supported platforms, but varies in certain details. Please see the platform-specific instructions below for details on how to download a license file on the system you are running the CLC Server on. See section 2.5.5 for a description of how to download a license for a machine that does not have access to the internet.
For the master machine, and for each machine in a job node setup:
1. Log on to the machine that is running the CLC Server.
2. Move into the CLC Server installation directory, where the license download script can be found.
3. Download and install the Finishing Module license as described in the relevant section below.
2.5.2 Windows license download
License files are downloaded using the licensedownload.bat script. To run the sc
71. ew is synchronized with the Contig table, contigs in the selected region will be selected in the match table. In both cases, clicking the Join Contigs button opens a wizard with the following options:
Automatic find overlap and align: A function that identifies the overlap between two contigs using BLAST, followed by an alignment to calculate the consensus contig. This function favors overlaps at the ends of the contigs.
Figure 3.7: Contig Table Join contigs wizard.
Manual gap: A function that can be used to join sequential and non-overlapping contigs when the orientation and gap size are known. When ticked, gap size and contig orientation must be specified.
• It is also possible to join two contigs from the Contig match view by selecting a region in the reference sequence where two contigs overlap and right-clicking that selection. Select Join Two Contig
72. gs have been identified, this information can be used to edit the contigs. This can be done in different ways. If a reference sequence is available, the Align Contigs tool can be used to join or split contigs. Splitting of contigs can also be performed directly on read mappings or de novo assembled data. Hence, no gold standard exists for how to process the data following detection of paired reads, as it will depend on whether a reference sequence is available and on the type of problem to be solved. Additionally, the Collect Paired Read Statistics tool can be used together with the Align Contigs tool to see whether they support the same conclusions. An example of this is shown in figure 9.4.
[Figure 9.4: paired statistics table for contig 44 with the columns Contig, Mate contig is before/after, Mate contig, Mate contig orientation, Occurrences, Average distance and Standard deviation]
73. he location of the .lic file you have just downloaded. If the License Manager does not start up by default, you can start it up by going to the Help menu and choosing License Manager.
• Click on the Next button and go through the remaining steps of the license manager wizard.
CHAPTER 2 SYSTEM REQUIREMENTS AND INSTALLATION OF THE CLC GENOME FINISHING MODULE 24
2.4 How to uninstall a Workbench plugin
Workbench plugins are uninstalled using the Plugin Manager: Help in the Menu Bar | Plugins and Resources... or Plugins in the Toolbar. This will open the dialog shown in figure 2.20.
[Screenshot: the Manage Plugins tab of the Plugin Manager, listing installed plugins with Uninstall and Disable buttons.]
74. [Screenshot: license import dialog with a Choose License File button.]
Figure 2.11: Importing the license file downloaded from the web page.
Click the Choose License File button and browse to find the license file you saved. When you have selected the file, click on the button labeled Next.
Accepting the license agreement
Part of the installation of the license involves checking and accepting the end user license agreement (EULA). You should now see a window like that in figure 2.12.
[Screenshot: License Wizard showing the end user license agreement with an "I accept these terms" checkbox.]
75. icking Next leads to the next set of parameters to be specified (figure 6.6).
[Screenshot: Create Primers dialog with example values: Preferred GC content 50.0 %, Maximum self annealing 50, Maximum self end annealing 30, Salt concentration 100, Primer concentration 200, Minimum temperature 52, Target temperature 57.0, Maximum temperature 63.]
Figure 6.6: Set parameters for primer conditions.
CHAPTER 6 CREATE PRIMERS 50
Primer parameters
• Preferred GC content: Specify the desired percentage of guanine and cytosine nucleotides in the primer.
• Maximum self annealing: Specify the maximum accepted number of hydrogen bonds in case of self annealing.
• Maximum self end annealing: Specify the maximum accepted number of hydrogen bonds in case of self end annealing.
Buffer parameters
• Salt concentration (mM): Specify the desired salt concentration in the buffer in mM.
• Primer concentration (nM): Specify the desired primer concentration in nM.
Melting temperature parameters
• Minimum temperature: Primers with a melting temperature below this limit are rejected.
• Target temperature: The desired melting temperature of the primers.
• Maximum temperature: Primers with a melting temperature above this limit are rejected.
After
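The GC content and melting temperature limits described above can be illustrated with a small sketch. It is not the tool's implementation: the function names are hypothetical, and the Wallace rule is used here only as a simple stand-in for the melting temperature calculation that the manual describes elsewhere. The default limits of 52 and 63 are the example values visible in the figure 6.6 dialog.

```python
def gc_content(primer: str) -> float:
    """Percentage of G and C nucleotides in the primer."""
    primer = primer.upper()
    return 100.0 * sum(base in "GC" for base in primer) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature by the Wallace rule: 2*(A+T) + 4*(G+C).

    Assumption: a stand-in only; the Create Primers tool uses its own
    temperature calculation.
    """
    primer = primer.upper()
    at = sum(base in "AT" for base in primer)
    gc = sum(base in "GC" for base in primer)
    return 2.0 * at + 4.0 * gc


def passes_tm_limits(primer: str, min_tm: float = 52.0, max_tm: float = 63.0) -> bool:
    """Reject primers whose estimated melting temperature is outside the limits."""
    return min_tm <= wallace_tm(primer) <= max_tm


candidate = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer
print(gc_content(candidate), wallace_tm(candidate), passes_tm_limits(candidate))
```

A real selection would also compare each accepted primer against the target temperature and the annealing limits; those steps are left out to keep the sketch short.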
76. Figure 2.20: The Plugin Manager with plugins installed.
The installed plugins are shown in this dialog. To uninstall a plugin, click on the entry for the CLC Genome Finishing Module and click on the Uninstall button. If you do not wish to completely uninstall the plugin but want to stop it from being loaded when you start the Workbench, click on the Disable button. When you close the dialog, you will be asked whether you wish to restart the Workbench. The plugin will be uninstalled when the Workbench is restarted.
2.5 How to install a Server plugin
If you wish to use the tools and functionalities of the CLC Genome Finishing Module with a CLC Genomics Server, you must purchase a Microbial Genome Finishing Extension license and install it on your CLC Server as explained in the following steps:
1. Install plugin licenses on each machine with the CLC Server software installed, as described below.
2. Install the Server plugin on only the master CLC Server in the server setup, as described in section 2.5.7.
3. Restart all CLC Servers in the setup. How to stop and start CLC Servers is covered in the CLC
77. ignificantly on its own will usually contaminate the graph built by the Join Contigs tool with bad information and thus make it hard to identify the correct joins. For example, if both long reads and a reference sequence are available, then running the Join Contigs tool with both can give an inferior result compared to using just the long reads. This usually happens when the reference sequence contains too many structural variations compared to the organism that was sequenced. In other words, the reference and the long reads will not agree on the set of possible joins.
In the Result handling step shown in figure 12.3, specify which tables to output before clicking Finish.
[Screenshot: Join Contigs Result handling dialog with the output options "Create table of joined contigs" and "Create table of contigs not joined".]
Figure 12.3: Specify which tables to output with details of the join process.
The tool proposes the creation of two output tables. The primary output is a table of joined contigs. It lists all contigs that result from a join between two or more input contigs, as well as details about the join itself (figure 12.4).
[Screenshot: Joined contigs table, 134 rows.]
78. [Screenshot: License Manager window showing license information and, at the bottom, the host ID of the local machine.]
Figure 2.19: The license manager.
• Install the CLC Genome Finishing Module on the machine you wish to run the software on.
• Start up the software as an administrative user and find the host ID of the machine that you will run the CLC Workbench on. You can see the host ID the machine reported at the bottom of the License Manager window in grey text.
• Make a copy of this host ID so that you can use it on a machine that has internet access.
• Go to a computer with internet access, open a browser window and go to the relevant network license download web page.
• For Workbenches released from January 2013 and later (e.g. the Genomics Workbench version 6.0 or higher and the Main Workbench version 6.8 or higher), please go to https://secure.clcbio.com/LmxWSv3/GetLicenseFile
• Paste in your license order ID and the host ID that you noted down in the relevant boxes on the webpage.
• Click "download license" and save the resulting .lic file.
• Open the Workbench on your non-networked machine. In the Workbench license manager, choose Import a license from a file. In the resulting dialog, click "choose license file" to browse t
79. l with most functionalities. An alignment of contigs is performed using BLAST against either a reference sequence or, if no reference sequence is available, the contigs themselves. When aligned to a closely related reference sequence, it becomes visible how the contigs are located relative to each other, which makes misassemblies, repeats and overlaps between contigs clear. When contigs are aligned to themselves, the main application of the contig alignment is identification of potentially overlapping contigs that can be merged. The result of the alignment can be viewed both as a list of matches and as a read mapping where contigs are represented as reads. Through different views in the Align Contigs tool it is possible to join, split and edit contig sequences, view the read mapping of a contig, remap all mapped reads to one or more contigs, and replace all mapped reads with reads from one or more datasets.
3.2 How to run the Align Contigs tool
The best way to perform the contig alignment depends on the problem to be solved. One way to start is to align all contigs from a de novo assembly to a known or related reference. How to perform a de novo assembly is explained in the CLC Genomics Workbench manual, which can be accessed at http://clcsupport.com/clcgenomicsworkbench/current/. It is possible to align contig sequences to multiple references, and contigs both with and without reads mapped to them. If a read mapping is used as input for the Align Con
80. [Screenshot: dialog for borrowing the selected licenses for a chosen period.]
Figure 2.16: Borrow a license.
Common issues when using a network license
• No license available at the moment: If all the network licenses for the Finishing Module are in use, you will see a dialog like that shown in figure 2.17 when you start up the Workbench.
[Screenshot: "No valid license found" dialog with the buttons License Assistant, Limited Mode, Retry and Quit.]
Figure 2.17: This window appears when there are no available network licenses for the software you are running.
This means others are using the network licenses. You will need to wait for them to return their licenses before you can continue to work with a fully functional copy of the software. If this is a frequent issue, you may wish to discuss this with your CLC License Server administrator. Clicking on the Limited Mode button in the dialog allows you to start the W
81. ly quality. Click Next to set the output options and click Finish to start the error correction.
16.3 Error correction report
In the last dialog of the wizard you can choose to create a report of the results (see figure 16.5). The report contains the following information for the input reads and the corrected reads:
CHAPTER 16 CORRECT PACBIO READS BETA 83
[Screenshot of the report: read length distribution plot (x axis: read length in kb, y axis: number of reads), error statistics, and read statistics with rows for low coverage regions trimmed away, seed reads after splitting and trimming, and final corrected reads.]
Figure 16.5: The error correction report is useful for evaluating the quality of the input data and the performance of the error correction.
• Nucleotide distribution: Fraction of the reads covered by each nucleotide (A, C, G and T).
• Count: The total number of reads.
• Minimum, maximum, average, N50 and N90: Read length statistics.
• Total: The total number of bases.
• Read length distribution: A graph showing the number of reads of different lengths.
In addition to this, some statistics about the error correction are given:
• Seed read length threshold: The length of the shortest read used as a seed read, picked according to the Coverage percentage of reads to correct (see above).
• Average correction coverage: The a
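The N50 and N90 read length statistics listed in the report can be computed as in the following sketch. It uses the common definition of Nxx (the length L such that reads of length L or longer contain at least that fraction of all bases); the excerpt does not spell out the definition, so treat this as an assumption, and the function name is hypothetical.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx statistic for a collection of read lengths.

    Reads are visited from longest to shortest; the result is the length
    of the read at which the running total first reaches `fraction` of
    all bases (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be non-empty")


read_lengths = [10, 8, 6, 4, 2]  # hypothetical read lengths
print(nxx(read_lengths, 0.5), nxx(read_lengths, 0.9))
```

Because N90 requires covering more of the total bases, it is never larger than N50 for the same set of reads.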
82. m requirements
The system requirements of the Finishing Module are:
• Windows Vista or Windows 7, Windows Server 2003 or Windows Server 2008
• Mac OS X 10.7 or later, 64 bit
• Linux: Red Hat 5 or later, SUSE 10.2 or later
• 2 GB RAM required, 4 GB RAM recommended
• 1024 x 768 display recommended
• CLC Genomics Workbench
2.1.1 Special requirements for Join Contigs
Most types of analyses in the Join Contigs tool run in a single thread. An exception is the long reads scaffolding option, which utilizes the CLC read mapper and is therefore able to use all available cores in a system. As mapping reads to contigs is one of the most time-consuming steps when performing long reads scaffolding, it is often an advantage to use a machine with many cores for this type of analysis.
The memory requirements for the Join Contigs tool can exceed the recommended memory requirements for the Finishing Module. The memory required for joining contigs depends on several factors, as described below, and it is not possible to predict the maximum memory consumption for an analysis. For most bacterial data sets it will be possible to run the Join Contigs tool on a machine that fulfills the system requirements for the Finishing Module. Some examples where more memory can be needed:
• Long reads scaffolding using long reads with a high error rate, such as PacBio reads, on a machine with many cores.
83. matic scaffolding based on the read mapping created above (see Chapter 12). The final contigs are saved to a file named "Contig sequences".
6. Map Reads to Contigs: The corrected reads are mapped to the final contigs in order to be able to run the Analyze Contigs tool. This read mapping can, together with the output from the Analyze Contigs tool, also be used to evaluate the support for each contig and to manually identify and resolve possible assembly errors. The read mapping is saved to a file named "Corrected reads mapped to contigs", and a report that summarizes the read mapping is saved to a file named "Corrected reads mapped to contigs report".
7. Analyze Contigs: The final contigs are analyzed in order to find problematic regions that may need manual curation (see Chapter 4). A summary of the analysis is saved to a file named "Contig analysis report", and the problematic regions are reported in a file named "Contig analysis table".
CHAPTER 18 WORKFLOWS 91
[Workflow diagram showing the elements Correct PacBio Reads (beta), De Novo Assemble PacBio Reads (beta), Map Reads to Contigs and Join Contigs with their inputs and outputs.]
84. may improve the mapping, as reads that potentially could have interrupted the first assembly have been removed. The Reassemble Regions tool is a stand-alone, wizard-driven action; however, reassembly can also be performed by right-clicking on a selected reference contig.
10.2 How to run Reassemble Regions
The use of the Reassemble Regions tool is best demonstrated with an example. Figure 10.1 illustrates a region with a small gap and only one read spanning the region around the gap.
[Screenshot: read mapping of a contig around positions 4,660 to 4,740, with a run of N bases in the consensus and a single read spanning the gap.]
Figure 10.1: Region with a gap that potentially can be closed with the Reassemble Region tool.
However, this single read contains sequencing errors, and the region would be impossible to assemble for the de
85. [Screenshot: Create Primers dialog with the primer naming options.]
Figure 6.9: Set primer naming parameters.
Primer naming
• Use sequence name in primer name: The sequence name is the name of the sequence the primer is created from.
• Use strand in primer name: Add the primer strand to the primer name.
• Use pair information in primer name: Add pair numbering to the primer name.
• Use region in primer name: Add the primer region to the primer name, e.g. 9842 9942.
• Use annotation name in primer name: Add the annotation name to the primer name.
After the Result handling step, click Finish.
6.2.1 Create Primers output
The Create Primers tool creates four different outputs:
• Primer sequence list: A sequence list with the created primers.
• Missing primers table: A table that lists information about rejected primers that did not fulfil the criteria, including an explanation of why the primer was not created.
• Primer table: A table with information about each primer. If several primers are valid according to the requirements, the primer with the best score is used (see section 6.2.2).
• The input objects: The input supplied to the tool, with primers annotated on the sequence.
The primers created are the best possible according to the given parameters. If no primers are created, another attempt can be made after adjusting some of the primer settings or the input sequence. Figure 6.10 shows an example of p
86. n and the position of the overlap. Overlaps close to the edge of a contig give rise to higher weights than an overlap located in the middle of a contig.
The Join Contigs tool builds a graph over all possible joins based on the four analyses above, where edges represent possible joins and nodes represent contigs. Each edge is assigned a weight as described above. If a join is ambiguous, i.e. two or more analyses disagree on a join, one of the following events can happen:
• The weights of two or more joins are within the same range. In this case nothing is done.
• The weight for one join is significantly higher than the weights for all alternative joins. The join with the highest weight is performed.
12.2 How to run the Join Contigs tool
To run the Join Contigs tool, find it in the toolbox:
Toolbox | Genome Finishing Module | Join Contigs
This opens the dialog shown in figure 12.1. Select the input contigs and click Next.
[Screenshot: Join Contigs dialog, Select contigs step.]
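The rule for ambiguous joins in the excerpt above can be sketched as follows. This is an illustration under stated assumptions, not the tool's algorithm: the data layout (one candidate list per contig end), the function name and the `ratio` threshold that stands in for "significantly higher" are all hypothetical.

```python
from collections import defaultdict


def pick_joins(candidate_joins, ratio=2.0):
    """Choose joins from weighted candidates.

    candidate_joins: iterable of (contig_end, neighbour, weight) tuples.
    A join is performed only if it is the sole candidate for its contig
    end, or if its weight is at least `ratio` times the runner-up
    (a hypothetical reading of "significantly higher"). Candidates with
    weights in the same range are left alone.
    """
    by_end = defaultdict(list)
    for end, neighbour, weight in candidate_joins:
        by_end[end].append((weight, neighbour))

    chosen = {}
    for end, options in by_end.items():
        options.sort(reverse=True)  # highest weight first
        best_weight, best_neighbour = options[0]
        if len(options) == 1 or best_weight >= ratio * options[1][0]:
            chosen[end] = best_neighbour
    return chosen


joins = [
    ("c1:right", "c2", 10.0),  # similar weights: ambiguous, nothing is done
    ("c1:right", "c3", 9.0),
    ("c4:right", "c5", 12.0),  # clearly heavier than the alternative
    ("c4:right", "c6", 3.0),
    ("c7:right", "c8", 1.0),   # only one candidate
]
print(pick_joins(joins))
```

Keying the candidates by contig end keeps the sketch short; a fuller model would store the same information as weighted edges between contig nodes, as the manual describes.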
87. n of DMSO, dNTP or Magnesium is greater than zero, the temperature correction defined in von Ahsen et al., 2001 is used.
Chapter 7 Add Reads to Contigs
7.1 What is Add Reads to Contigs
It is possible to add reads to existing contigs if extra reads are available, e.g. after resequencing of problematic regions. This is useful in regions with extremely low coverage (figure 7.1). The advantage of adding reads to the existing read mappings, rather than making a new read mapping of old and new reads together, is that all modifications that may have been made in the old reads will be preserved.
[Screenshot: read mapping of a region covered by only a few reads.]
Figure 7.1: Example showing a region with low coverage that will benefit from adding reads to contigs. Extra reads for this region can be generated using the Design Primers tool.
7.2 How to run the Add reads to contigs
Toolbox | Genome Finishing Module | Add reads to contigs
This opens the dialog shown in figure 7.2. Select sequence reads and click Next. This opens the dialog shown in figure 7.3. Select the contig or the list of contigs that you want to add by clicking on the folder icon. Next, set the mapping options (figure 7.4).
CHAPTER 7 ADD READS TO CONTIGS
[Screenshot: Add Reads to Contigs dialog.]
88. nables restriction of annotations by qualifier (figure 5.2 and figure 5.3).
• Qualifier key: Amplicons are only applied to annotations when the selected qualifier key (e.g. gene, product, etc.) has the specified qualifier value.
• Qualifier value: Amplicons are only applied to annotations when the selected qualifier key has the specified qualifier value (e.g. TIMP2, metalloproteinase inhibitor 2 precursor, etc.).
[Screenshot: annotation table for NC_000017 (5,688 rows) with the columns Name, Type, Region and Qualifiers, showing the TIMP2 CDS annotation and its qualifiers.]
Figure 5.3: Annotation type, qualifier key and qualifier value.
Amplicon annotations are created back to
89. nal contig are extracted and mapped again with the original contig and all copies as a reference. The result of this mapping is an even distribution of the reads across all copies of the contig.
• Map Reads: Allows the read mapping of contigs to be updated in two different ways. The Map Reads Again function extracts reads from the selected contigs and re-maps each read to its source contig. The Replace all reads function allows the user to select one or more data sets containing reads, which are then used to replace all reads mapped to all contigs.
• Join Contigs: Function for joining contigs in two different ways. The automatic join uses BLAST to find overlaps between two contigs, and the manual gap method can be used to join sequential, non-overlapping contigs when the orientation and gap distance are known. The Join Contigs function is described in section 3.3.3.
• Remove Contigs: Makes it possible to remove contigs, e.g. when no mapping to the reference is seen or if very low coverage is observed.
3.3.2 The Contig match table
The Contig match table has a row representing one or more matches of a contig from the BLAST search. When a reference sequence is used, each row represents the match of a contig, or part of a contig, to the reference sequence. Consecutive matches are linked to make the view cleaner. One contig can result in several matches in the table. Double-clicking the match will open a view where th
90. nce regions are used to transfer annotations. The thresholds which can be adjusted are the fraction of annotated regions that must match a contig and the identity of matches. Click Next.
Figure 14.6 shows the output options, which include generation of reports and tables containing information on the annotations that were transferred and those that were not. You can also add annotations to aligned contigs and create contigs with annotations. Click Finish to transfer annotations.
CHAPTER 14 ANNOTATE FROM REFERENCE
[Screenshot: annotation summary report with reference summary statistics and per-reference type statistics for reference NC_010473.]
Figure 14.3: A report can be output showing statistics for each reference and each type of annotation.
[Screenshot: Annotate from Reference dialog with the steps Choose where to run, Select contigs that have been aligned, and Select annotations to transfer.]
91. ng to the type of the row selected under Annotation types in the right pane. An example is shown in figure 4.5. Right-clicking on a highlighted region of the sequence allows editing directly in the sequence and splitting of the contig.
CHAPTER 4 ANALYZE CONTIGS 43
[Screenshot: contig analysis table (105 rows) with the columns Sequence name, Annotation type, Start position, End position and Length, listing Broken pair and Unstable coverage regions, shown together with the read mapping.]
92. nload Plugins tab of the Plugin Manager. When you close the dialog, you will be asked whether you wish to restart the software. The plugin will be ready for use after the software is restarted.
2.3 Workbench Licenses
When you have installed the CLC Genome Finishing Module and start it for the first time, or after installing a new major release, you will meet the license assistant shown in figure 2.2. To manually start up the License Manager for a Workbench plugin, first open the Plugin Manager (see Section 2.2), select the relevant plugin or module, and press the button labeled Import a new license.
To install a license, you must be running the program in administrative mode².
The following options are available. They are described in detail in the sections that follow.
• Request an evaluation license: Request a fully functional, time-limited license (see below).
• Download a license: Use the license order ID received when you purchase the software to download and install a license file.
• Import a license from a file: Import an existing license file, for example a file downloaded from the web-based licensing system.
• Configure license server connection: If your organization has a CLC License Server, select this option to configure the connection to it.
² How to do this differs for different operating systems. To run the program in administrator mode on Windows Vista or 7, right-click the program shortcut and choose Run as Administrator.
93. novo assembler. To use the Reassemble Regions tool, start out by marking a region around the area to reassemble, right-click, and assign an annotation to the selected region by clicking Add annotation. The annotation will be used to define the region to reassemble if using the wizard-driven version of the Reassemble Region tool. Alternatively, it is possible to click on the selected sequence and select Reassemble. In both cases the reassemble tool will autonomously expand the region used for reassembly, which will then be highlighted with an annotation.
CHAPTER 10 REASSEMBLE REGIONS 63
Toolbox | Genome Finishing Module | Reassemble Regions
This opens the dialog shown in figure 10.2.
[Screenshot: Reassemble Regions dialog, Select one or more read mappings with annotations step.]
Figure 10.2: Select the annotated read mappings to reassemble.
Next, select annotations for the regions to reassemble by clicking on the icon (figure 10.3). Click Finish.
[Screenshot: Reassemble Regions dialog, Select annotations step.]
94. nse order ID is used when downloading a single license file. This license file includes information about how many network licenses are associated with the license order ID.
2.5.7 Server plugin download, installation and removal
1. Download the Finishing Module plugin for the CLC Server as a .cpa file from http www clcbio com clc plugin Server
2. Install the plugin .cpa file on the master CLC Server using the Server web administrative interface. The plugin should only be installed on the master server in all server setup models. It does not need to be manually installed on any machine acting as an execution node. To install the plugin:
(a) Go to the Plugins section under the Admin tab (see figure 2.23).
(b) Click on the Browse button and locate the .cpa file for the plugin to install.
Logging into the web administrative interface is described in the CLC Server manual at http://www.clcsupport.com/clcgenomicsserver/current/admin/index.php?manual=Logging_into_administrative_interface.html
3. Restart the master CLC Server. Starting, stopping and restarting the CLC Server software is described in the CLC Server manual at http://www.clcsupport.com/clcgenomicsserver/current/admin/index.php?manual=Starting_stopping_server.html
4. For job node setups only:
(a) Wait until the master CLC Server is up and running normally. Then restart each job node CLC Server so that the plugin is ready to run on each node.
95. ntigs tool
4.3 How to use the Analyze Contigs tool
4.3.1 The contig analysis table
4.3.2 How to edit data following contig analysis
5 Create Amplicons
5.1 What is the Create Amplicons tool
5.2 How to run the Create Amplicons tool
6 Create Primers
6.1 What is the Create Primers tool
6.2 How to use the Create Primers tool
6.2.1 Create Primers output
6.2.2 Primer scoring
6.2.3 Temperature calculation
7 Add Reads to Contigs
7.1 What is Add Reads to Contigs
7.2 How to run the Add reads to contigs
8 Find Sequence
8.1 What is the Find Sequence tool
8.2 How to run the Find Sequence tool
8.2.1 The Find Sequence output
9 Collect Paired Read Statistics
9.1 What is the Collect Paired Read Statistics tool
9.2 How to run the Collect Paired Read Statistics tool
9.3 How to use the Collect Paired Read Statistics tool
10 Reassemble Regions
10.1 What is Reassemble Regions
96. o Assemble PacBio reads tool to increase the quality and thereby obtain a better assembly. Both tools are designed for assembly of microbial genomes and small eukaryotic genomes, for example C. elegans.
SMRT sequencing technologies, as implemented by Pacific Biosciences, have the potential to vastly improve the completeness of genome sequence assemblies, as read lengths often exceed the length of most repeats in the genome. A major obstacle is the high (10-15%) rate of sequencing errors in SMRT reads. A second obstacle is the presence of chimeric reads and sequences derived from untrimmed adapters, which can be hard to recognize given the rate of errors and truncations. However, because sequencing errors are mostly random and reads are randomly sampled across the genome, it is possible to (i) correct SMRT sequencing reads, if coverage is sufficiently high, with the Correct PacBio Reads tool and (ii) assemble the error-corrected reads into high-quality contigs with the De Novo Assemble PacBio Reads tool.
The Correct PacBio Reads tool takes raw PacBio reads as input and produces error-corrected reads as output. The overall strategy for correcting PacBio reads consists of the following four steps:
1. Partition the reads into long seed reads and shorter correction reads.
2. Map all correction reads to all seed reads.
3. Detect and handle hairpin sequences, untrimmed adapters and chimeras in the se
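Step 1 of the strategy above, partitioning reads into long seed reads and shorter correction reads, might look like the following sketch. The selection rule is an assumption made for illustration: the longest reads are taken as seeds until they hold a chosen fraction of all bases. The function name and the `seed_fraction` parameter are hypothetical and are not taken from the tool.

```python
def partition_reads(read_lengths, seed_fraction=0.3):
    """Split reads into long seed reads and shorter correction reads.

    The longest reads are added to the seed set until the seeds hold at
    least `seed_fraction` of all bases (hypothetical rule). Returns
    (seed_lengths, correction_lengths, seed_length_threshold), where the
    threshold is the length of the shortest seed read.
    """
    ordered = sorted(read_lengths, reverse=True)
    target = seed_fraction * sum(ordered)

    seeds, running = [], 0
    for length in ordered:
        if running >= target:
            break
        seeds.append(length)
        running += length

    correction = ordered[len(seeds):]
    threshold = seeds[-1] if seeds else None
    return seeds, correction, threshold


lengths = [1000, 9000, 3000, 5000, 2000]  # hypothetical read lengths
print(partition_reads(lengths, seed_fraction=0.5))
```

The returned threshold corresponds in spirit to a seed read length threshold: every read at least that long ended up in the seed set in this simplified model.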
97. o begin using your license
[Screenshot: License Wizard showing the End User License Agreement for CLC bio software with an "I accept these terms" checkbox.]
Figure 2.14: Read the license agreement carefully.
Please read the EULA text carefully, then tick the box next to "I accept these terms" to accept and click the button labeled Finish.
2.3.4 Configure license server connection
If your organization is running a CLC License Server, you can configure your Workbench to connect to it to get a license. To do this, select this option and click on the Next button. A dialog like that shown in figure 2.15 then appears. Here you configure how to connect to the CLC License Server.
[Screenshot: License Wizard, "Configure License Server connection" dialog.]
98. Figure 3.9: Splitting a contig.
In case of misassemblies made by the de novo assembler, it can be necessary to split a contig. For example, if the scaffolder has produced an erroneous scaffold, or if two fragments that do not belong together have been joined into one contig, this tool can be used to split the scaffold or contig, respectively. Splitting contigs is performed by selecting two nucleotides in a contig using a contig read mapping, or by selecting two nucleotides in a match in the match view. After selecting two nucleotides, right-clicking the selection will bring up a menu where Split Contig between the two selected nucleotides can be selected (figure 3.9). This brings up a dialog where reads intersecting the split can be distributed between the resulting two contigs (figure 3.10). Click Finish to perform the split. The contig will be split between the two selected nucleotides. If a contig contains reads that int
99. ontigs wizard it is possible to specify a minimum number of paired reads that must span two contigs before they are considered in a join. A weight is computed for each possible join based on the number of paired reads spanning the two contigs and the standard deviation of the distance estimate, as follows:
weight = readcount / log(max(2, stddev + abs(libdist) / 5))
where readcount is the number of paired reads supporting the join, stddev is the standard deviation and libdist is the expected paired library distance.
• An alignment of the contigs to a closely related reference. Contigs are first aligned to the reference using the Align Contigs tool. Next, spurious matches are filtered as follows: Matches which only cover a small fraction of contigs are ignored. Overlapping matches are evaluated with respect to the match size and the identity of the match. If one match is significantly larger than the other match, or has significantly higher identity, we ignore the smallest or lowest-identity match if > 25% of this match is overlapped by the other match. The remaining matches are used to join contigs where the reference suggests a small overlap between the contigs, or where the contigs appear to be close neighbors.
• Overlapping contigs are detected by aligning contigs against each other using the Align Contigs tool. A weight for each possible join is computed based on the number of mismatches in the overlapping regio
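The paired-read weight described in this excerpt can be sketched in Python. This is only an illustration under stated assumptions, not the module's implementation: it assumes the flattened formula reads readcount / log(max(2, stddev + |libdist| / 5)) with a natural logarithm, and the function name `paired_join_weight` is hypothetical.

```python
import math


def paired_join_weight(readcount: int, stddev: float, libdist: float) -> float:
    """Weight of a candidate join supported by paired reads.

    Assumed reading of the formula in the text:
        readcount / log(max(2, stddev + |libdist| / 5))
    """
    # Clamping at 2 keeps the logarithm strictly positive.
    spread = max(2.0, stddev + abs(libdist) / 5.0)
    return readcount / math.log(spread)


# Same read support, two different standard deviations of the distance estimate.
print(paired_join_weight(readcount=20, stddev=30.0, libdist=250.0))
print(paired_join_weight(readcount=20, stddev=90.0, libdist=250.0))
```

Under this reading, more supporting read pairs raise the weight, while a noisier or larger distance estimate lowers it.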
100. orkbench with functionality equivalent to the CLC Sequence Viewer. This includes the ability to access your CLC data.
Lost connection to the CLC License Server
If the Workbench connection to the CLC License Server is lost, you will see a dialog as shown in figure 2.18.
Figure 2.18: This message appears if the Workbench is unable to establish a connection to a CLC License Server.
If you have chosen the option to Automatically detect license server and you have not succeeded in connecting to the License Server before, please check with your local IT support that automatic detection is possible at your site. If it is not possible at your site, you will need to manually configure the CLC License Server settings using the License Manager, as described earlier in this section. If you have successfully contacted the CLC License Server from your Workbench previously, please consider discussing this issue with your CLC License Server administrator or your local IT support, to make sure that the CLC License Server is running and that your Workbench can connect to it. There may be situations where you wish to use a different license or view information about the licens
101. ot received an Order ID. Note that if you are upgrading an existing license file, this needs to be deleted from the licenses folder. When you run the downloadlicense script, it will create a new license file.
2.5.5 Download a static license on a non-networked machine
To download a static license for a machine that does not have direct access to the external network, you can follow the steps below after the Server software has been installed.
Figure 2.22: Download a license based on the Order ID.
• Determine the host ID of the machine the server will be running on by running the same tool that would allow you to download a static license on a networked machine. The name of this tool depends on the system you are working on:
Linux: downloadlicense
Mac: downloadlicense.command
Windows: licensedownload.bat
When you run the license download tool, the host ID for the machine you are working on will be printed to the terminal.
• Make a copy of this host ID such that you can use it on a machine that has internet access
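As a small illustration of the platform-dependent tool names listed above, the lookup can be written as a Python helper. The file names come from the text (with the dots in the Mac and Windows names restored from the flattened source, which is an assumption); the function name and the mapping from `platform.system()` values are hypothetical conveniences, not part of the CLC software.

```python
import platform
from typing import Optional

# Tool names as listed in the text; verify them against the actual installation.
LICENSE_TOOLS = {
    "Linux": "downloadlicense",
    "Darwin": "downloadlicense.command",  # platform.system() reports macOS as "Darwin"
    "Windows": "licensedownload.bat",
}


def license_tool_name(system: Optional[str] = None) -> str:
    """Return the license download tool name for an OS (default: this machine)."""
    system = system or platform.system()
    if system not in LICENSE_TOOLS:
        raise ValueError(f"No license download tool known for {system!r}")
    return LICENSE_TOOLS[system]
```

Calling `license_tool_name()` without arguments looks up the tool for the machine the script runs on.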
102. Figure 2.15: Connecting to a CLC License Server.
• Enable license server connection: This box must be checked if the Workbench is to contact the CLC License Server to get a license for Finishing Module.
• Automatically detect license server: By checking this option, the Workbench will look for a CLC License Server accessible from the Workbench. (Footnote 3: Automatic server discovery sends UDP broadcasts from the Workbench on a fixed port, 6200.)
• Manually specify license server: If there are technical limitations such that the CLC License Server cannot be detected automatically, use this option to provide details of the machine the CLC License Server software is on and the port used by the software to receive requests. After selecting this option, please enter:
Host name: The address of the machine the CLC License Server software is running on.
Port: The port used by the CLC License Server to receive requests.
• Use custom username when requesting a license: A username entered here will be passed to the CLC License Ser
103. Figure 6.10: Primers have been designed to a region containing a scaffold. Prior to primer creation, an amplicon has been created using the Create Amplicons tool and annotated "For primer creation".
6.2.2 Primer scoring
In cases where several primers fulfil the defined requirements, the suggested primers are the ones with the best score. The score is calculated from the melting temperature, GC content, self-annealing and self-end annealing. A good score is a low score, which is obtained when the values of the suggested primer are close to the user-defined target values.
6.2.3 Temperature calculation
The primer melting temperature is calculated using a nearest-neighbor approach similar to the one used by MELTING [Novère, 2001]. However, the Primer Creator uses the nearest-neighbor model and interaction parameters given in [SantaLucia et al., 2000], which gives rise to some differences in the melting temperatures calculated by the two tools. Temperatures are corrected for salt concentration, and dangling-end parameters are used [Bommarito et al., 2000]. Nucleotide mismatches are handled using the parameters defined in [Allawi and SantaLucia, 1997], [Allawi and SantaLucia, 1998a], [Allawi and SantaLucia, 1998c], [Allawi and SantaLucia, 1998b] and [Peyret et al., 1999]. If the concentratio
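The primer scoring idea in section 6.2.2 (lower is better, best when the primer's values sit close to the user-defined targets) can be sketched as follows. The manual does not give the actual scoring formula, so this example assumes a plain unweighted sum of absolute deviations; the names `PrimerTargets`, `primer_score` and `gc_content` are hypothetical, and the melting temperature and annealing values are taken as precomputed inputs rather than derived here.

```python
from dataclasses import dataclass


def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


@dataclass
class PrimerTargets:
    melting_temp: float               # target melting temperature, degrees Celsius
    gc: float                         # target GC fraction
    self_annealing: float = 0.0       # target self-annealing value
    self_end_annealing: float = 0.0   # target self-end annealing value


def primer_score(tm: float, gc: float, self_ann: float, self_end_ann: float,
                 targets: PrimerTargets) -> float:
    """Assumed score: summed absolute distance from the targets (lower is better).

    The terms have different units; a real scorer would weight them, but the
    weights are not stated in the text.
    """
    return (abs(tm - targets.melting_temp)
            + abs(gc - targets.gc)
            + abs(self_ann - targets.self_annealing)
            + abs(self_end_ann - targets.self_end_annealing))


targets = PrimerTargets(melting_temp=60.0, gc=0.5)
candidates = {"ATGCGTACGTTAGC": 58.0, "GCGCGGCCGCGGCC": 71.0}  # made-up Tm values
best = min(candidates, key=lambda p: primer_score(candidates[p], gc_content(p), 0.0, 0.0, targets))
print(best)
```

The point of the sketch is only the selection rule: among primers that satisfy the hard requirements, the one with the lowest score is suggested.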
104. ript, right-click on the file and choose Run as administrator. This will present a window as shown in figure 2.21.
Figure 2.21: Download a license based on the Order ID.
Paste the Order ID supplied by CLC bio (right-click to Paste) and press Enter. Please contact support-clcbio@qiagen.com if you have not received an Order ID. Note that if you are upgrading an existing license file, this needs to be deleted from the licenses folder. When you run the license download script, it will create a new license file.
2.5.3 Mac OS license download
License files are downloaded using the downloadlicense.command script. To run the script, double-click on the file. This will present a window as shown in figure 2.22. Paste the Order ID supplied by CLC bio and press Enter. Please contact support-clcbio@qiagen.com if you have not received an Order ID. Note that if you are upgrading an existing license file, this needs to be deleted from the licenses folder. When you run the downloadlicense.command script, it will create a new license file.
2.5.4 Linux license download
License files are downloaded using the downloadlicense script. Run the script and paste the Order ID supplied by CLC bio. Please contact support-clcbio@qiagen.com if you have n
105. rtain methods which are the intellectual property of Pacific Biosciences. The use of the Correct PacBio Reads (beta) tool or the predefined workflow PacBio De Novo Assembly Pipeline with any data other than data generated on a Pacific Biosciences instrument constitutes a violation of the end user license agreement that users of the CLC Genome Finishing Module agree to during installation.
17.1 What is the De Novo Assemble PacBio Reads tool
SMRT sequencing technologies, as implemented by Pacific Biosciences, have the potential to vastly improve the completeness of genome sequence assemblies, as read lengths often exceed the length of most repeats in the genome. A major obstacle is the high (10-15%) rate of sequencing errors in SMRT reads. A second obstacle is the presence of chimeric reads and sequences derived from untrimmed adapters, which can be hard to recognize given the rate of errors and truncations. However, because sequencing errors are mostly random and reads are randomly sampled across the genome, it is possible to (i) correct SMRT sequencing reads, if coverage is sufficiently high, and (ii) assemble the error-corrected reads into high-quality contigs. The Correct PacBio Reads tool (see Chapter 16) performs the first of these two tasks: it takes raw PacBio reads as input and produces error-corrected reads as output. The De Novo Assemble PacBio Reads tool performs the second task, assembling the error-corrected reads into high-quality contigs.
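The claim that random errors can be corrected when coverage is high enough can be made concrete with a small calculation. The sketch below is a simplified model, not the algorithm used by the Correct PacBio Reads tool: it assumes errors are independent between reads, that a position is corrected by majority vote, and, pessimistically, that all erroneous reads agree with each other. The function name `majority_error_bound` is hypothetical.

```python
from math import comb


def majority_error_bound(error_rate: float, coverage: int) -> float:
    """Upper bound on the chance that a majority vote at one position is wrong.

    Sums the binomial probability that more than half of `coverage`
    independent reads carry an error at that position.
    """
    need = coverage // 2 + 1  # smallest strict majority
    return sum(
        comb(coverage, k) * error_rate**k * (1 - error_rate) ** (coverage - k)
        for k in range(need, coverage + 1)
    )


# Per-read error rate at the upper end of the range quoted in the text.
for coverage in (1, 5, 15, 31):
    print(coverage, f"{majority_error_bound(0.15, coverage):.2e}")
```

Even with a 15% per-read error rate, the bound drops quickly as coverage grows, which is the intuition behind correcting reads before assembling them.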
106. s
14.2 How to run the Annotate from Reference tool
Toolbox | Genome Finishing Module | Annotate from Reference
This opens the dialog shown in figure 14.4, where at least one Align Contigs result must be selected. Click Next. The next step, in figure 14.5, allows you to choose between transferring all annotations found on the input reference sequences or a subset of these. It is also possible to adjust two
[Screenshot: "Annotations transferred" table (4,544 rows) with the columns Reference, Reference region, Contig and Contig region.]
107. [Screenshot: plugin manager listing the plugins available for download.]
Figure 2.1: The plugins that are available for download are listed in the Dow
108. s matches of length > 8 bp, and matches that are close to contig ends are considered to be more significant compared to matches far from the ends.
3.3 How to use the Align Contigs tool
Following the alignment of contigs, two tables are created:
1. The Contig table, which gives an overview of the contigs (figure 3.3). This table will be the one that opens by default when running the contig alignment.
2. The Contig match table, which lists all matches found by BLAST between the contigs and the reference sequences (figure 3.4). This table can be opened by clicking on the icon in the bottom left corner.
[Screenshot: Contig table (114 rows) with the columns Contig, Contig length, Total read count, Average coverage, Contig matches length and Contig matches count.]
109. s from the drop-down menu and specify the contigs to be joined in the dialog window (figure 3.8). The wizard lists all contig matches in the selected region, and the contigs to use in the join are chosen by selecting the corresponding matches.
Select first contig match: Select the first contig match from the list to use for the join.
Select second contig match: Select the second contig match from the list to use for the join.
Figure 3.8: Match view.
This method is very useful in cases where an overlap between two contigs is very short. Indeed, this method only considers overlaps that are present in the selection made by the user. The automatic join method described earlier would fail to con
110. Figure 9.4: Paired read statistics table and contigs aligned to a reference in the Align Contigs tool. This shows that both tools agree on how contig 14 and contig 33 are positioned before and after contig 44.
Chapter 10 Reassemble Regions
10.1 What is Reassemble Regions
When problematic areas in contigs (mapping or de novo) are encountered, the Reassemble Regions tool can be used as an alternative to manually editing the sequence. This tool is not always capable of fixing problems in the assembly, but may be worth a try. The Reassemble Regions tool adjusts the read mapping and makes changes in the consensus sequence based on reads in the selected region. Reassembly of only an isolated part of the reads
111. sider the short overlap, favoring other, more significant ones instead. With this method, the user has control over the location of the overlap, which makes it possible to join contigs that overlap by only a single nucleotide. For all join methods described above, it is possible to keep the old contigs. This is done by ticking Keep contig under Old contigs, which is useful when joining contigs that represent repetitive elements needed for joining other contigs elsewhere in the mapping.
Note: When joining two contigs, the orientation of the result is not guaranteed to follow the orientation of the original contigs; e.g., two contigs with reverse orientation relative to the reference can result in a contig with forward orientation, depending on the join function used. However, the orientation of contigs is usually of no importance, and the CLC de novo assemblers will output contigs with a somewhat arbitrary orientation.
3.3.4 Splitting a contig
112. Figure 3.3: The Contig table.
The two tables complement each other and are both very useful in the finishing procedure. Besides listing contigs and matches between contigs and references, the tables also give access to a number of functions for manipulating contigs, such as editing the contig sequence, joining contigs and splitting them. One of the most important features is the visualization of contig matches, which can give a quick overview of how contigs align to a reference genome. The visualization also gives direct access to several tools for manipulating the contigs, thus providing a quick and intuitive way of working with the contigs.
3.3.1 The Contig table
The Contig table is almost identical to the table generated by the de novo assembly tool, with the difference that two extra columns have been added. Contig matches length describes the
113. [Screenshot: Join Contigs result table with the columns Contig, Join details, Distance, Paired reads support, Reference support, Contig overlap support and Long reads support.]
114. tigs tool, the consensus sequence will be used for the alignment. However, using the consensus of a read mapping can be slow in some cases, so if no manual editing of the input read mapping has been performed, consider mapping the reads using Map Reads to Contigs. This chapter will focus on how to perform a contig alignment when a reference sequence is available. To run the Align Contigs tool:
Toolbox | Genome Finishing Module | Align Contigs
This opens the dialog shown in figure 3.1.
Figure 3.1: Select one or more contigs to analyze.
Select the relevant file containing the contigs and click Next. This leads to the Select contig mapping parameters step shown in figure 3.2.
[Screenshot: Select contig mapping parameters dialog with the options Use input contigs as references and Use selected reference(s), and the BLAST options BLAST word size (20) and Maximum BLAST e-value (0.0001).]
115. tion license
We offer a fully functional version of the CLC Genome Finishing Module for evaluation purposes, free of charge. Each user is entitled to a 14-day demo of the CLC Genome Finishing Module. If you are unable to complete your assessment in the available time, please send an email to sales@clcbio.com to request an additional evaluation period. When you choose the option Request an evaluation license, you will see the dialog shown in figure 2.3.
Figure 2.3: Choosing between direct download or going to the license download web page.
In this dialog, there are two options:
• Direct download: Download the license directly from CLC bio. This method requires that the Workbench has access to the external network.
116. Figure 7.2: Select sequence reads.
Figure 7.3: Select a contig.
[Screenshot: Mapping options dialog showing Mismatch cost 2, Insertion cost 3, Deletion cost 3, Insertion open cost 6, Insertion extend cost 1, Deletion open cost 6, Deletion extend cost 1, Length fraction 0.5, Similarity fraction 0.8 and Color error cost 3.]
Figure 7.4: Set mapping options.
Read alignment
• Mismatch cost: The cost of a mismatch between the read and the reference sequence. Ambiguous nucleotides such as N, R or Y in read or reference sequences are treated as mismatches, and any column with one of these symbols will therefore be penalized with the mismatch cost.
• Linear gap cost: The cost of a gap is computed
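The mismatch rule described above, including the treatment of ambiguity symbols, can be illustrated with a short Python function. This is a sketch for a gapless, equal-length alignment only and is not the CLC read mapper: the function name `mismatch_cost` is hypothetical, and the default cost of 2 is taken from the value shown in the Mapping options dialog.

```python
UNAMBIGUOUS = set("ACGT")


def mismatch_cost(read: str, reference: str, cost_per_mismatch: int = 2) -> int:
    """Total mismatch cost of a gapless alignment of two equal-length sequences.

    A column is penalized if the two bases differ, or if either base is an
    ambiguity symbol such as N, R or Y (even when both sides show the same symbol).
    """
    if len(read) != len(reference):
        raise ValueError("a gapless alignment requires equal-length sequences")
    mismatches = sum(
        1
        for r, q in zip(read.upper(), reference.upper())
        if r != q or r not in UNAMBIGUOUS or q not in UNAMBIGUOUS
    )
    return mismatches * cost_per_mismatch


print(mismatch_cost("ACGTACGT", "ACGTANGT"))  # one ambiguous column
```

Note that an N aligned against an N is still penalized under this rule, because any column containing an ambiguity symbol counts as a mismatch.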
117. uality scores will reduce both disk space usage and memory consumption. As PacBio quality scores currently contain very little information, we recommend that you discard them. When importing FASTA files, this option is not available, since FASTA files do not contain quality scores. Click Next and choose how the result of the import should be handled. We recommend choosing Save, which will save the results directly to disk.
Chapter 16 Correct PacBio Reads (beta)
Please note that the tools Correct PacBio Reads (beta) and De Novo Assemble PacBio Reads (beta) were optimized for the use of PacBio data and readily support data generated with different generations of PacBio chemistry (sequencing reagents). Due to such algorithm optimizations, the use of these tools for other data types is not supported. Moreover, for the tool Correct PacBio Reads (beta), we are relying on certain methods which are the intellectual property of Pacific Biosciences. The use of the Correct PacBio Reads (beta) tool or the predefined workflow PacBio De Novo Assembly Pipeline with any data other than data generated on a Pacific Biosciences instrument constitutes a violation of the end user license agreement that users of the CLC Genome Finishing Module agree to during installation.
16.1 What is the Correct PacBio Reads tool
The Correct PacBio Reads tool should be used as a preprocessing step prior to assembly of SMRT sequencing reads with high error rates with the De Nov
118. uencing primers on the forward strand. The target region (Contigs joined) is covered by 400 bp amplicons, and for each amplicon a sequencing primer has been created inside the start of the amplicon.
Edge primers
• Create edge primers for all input sequences: When ticking this box, primers pointing out of all input sequences are created.
Primer placement
• Minimum primer length: Allows specification of the preferred minimum primer length.
• Maximum primer length: Allows specification of the preferred maximum primer length.
• Minimum distance to annotation: Allows specification of the preferred minimum distance from primer to target region.
• Maximum distance to annotation: Allows specification of the preferred maximum distance from primer to target region.
• Relative to annotation: Allows specification of whether primers should be targeted inside (figure 6.4) or outside (figure 6.5) the selected annotation.
Figure 6.4: Illustration of how inside PCR primers are positioned.
Figure 6.5: Illustration of how outside PCR primers are positioned.
Note: It is not possible to create primers that span two exons. Cl
119. um coverage 3, Maximum unaligned ends coverage 30.
Figure 11.4: Specify parameters for deciding how much the contig should be extended.
After clicking Finish in the Result handling step, the contigs will be extended if possible. To see the results of the contig extension, and to join the contigs that are now overlapping, run the Align Contigs tool again on the extended assembly.
Chapter 12 Join Contigs
12.1 What is the Join Contigs tool
The Join Contigs tool provides an automated way of joining contigs based on the following types of analyses:
• Long reads, such as PacBio reads, can be used to join contigs if they span more than one contig. Long reads are mapped to the contigs iteratively using the CLC read mapper, by using unmapped regions of reads from one iteration as input reads to the following iteration. If the tool estimates that two contigs should be joined with a gap in between, an attempt is made to fill the gap using an alignment of the reads spanning the gap. If the quality of this read alignment is too low, the gap is filled with Ns instead. A weight is computed for each possible join based on how well the reads map to the two contigs.
• Paired reads that span multiple contigs are used to identify possible neighboring contigs which can be joined. The Join Contigs tool only considers reads that map close to the contig ends, to prevent spurious matches from repeat regions embedded in the contigs. Through the Join C
120. ve no cost. Color space alignment and Auto-detect paired distances are only accessible when using the relevant data sets.
Non-specific match handling
• Map randomly: Reads with more than one match are assigned randomly.
• Ignore: Reads with more than one match are ignored.
After clicking Finish in the Result handling step, the reads will be added to the existing mapping of reads to contigs.
Chapter 8 Find Sequence
8.1 What is the Find Sequence tool?
The Find Sequence tool makes it possible to search for sequence names, sequence strings, or annotations in a set of objects.
8.2 How to run the Find Sequence tool
Toolbox | Genome Finishing Module | Find Sequence
This opens the dialog shown in figure 8.1.
Figure 8.1: Select the elements to search in.
Select the relevant assembled reads and click Next. This leads to the Set search string step, shown in figure 8.2. The parameters to be specified in this step are:
Search text: Type or paste the relevant sequence name that should be used in the search, and select whether the search should be performed in a name
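To make the idea of searching either sequence names or sequence strings concrete, here is a small Python sketch. It is not how the Find Sequence tool is implemented: the function `find_sequences`, the `search_in` argument, the use of a plain dictionary of sequences, and the case-insensitive substring matching are all assumptions made for illustration.

```python
def find_sequences(records, search_text, search_in="name"):
    """Return the names of records whose name or sequence contains search_text.

    records: dict mapping sequence name -> sequence string (hypothetical input).
    search_in: "name" or "sequence"; matching is a case-insensitive substring
               test, which is an assumption of this sketch.
    """
    if search_in not in ("name", "sequence"):
        raise ValueError("search_in must be 'name' or 'sequence'")
    needle = search_text.lower()
    hits = []
    for name, sequence in records.items():
        haystack = name if search_in == "name" else sequence
        if needle in haystack.lower():
            hits.append(name)
    return hits
```

With three records named `contig_1`, `contig_2` and `plasmid_A`, searching names for `contig` would return the first two, while searching sequences returns only the records whose bases contain the query string.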
121. ver instead of the username of your account on this machine.
• Disable license borrowing on this computer: If you do not want users of the computer to borrow a license from the set of licenses available, select this option (see section 2.3.4).
Borrowing a license
A network license can only be used when you are connected to a license server. If you wish to use the Finishing Module when you are not connected to the CLC License Server, you can borrow an available license for a period of time. During this time, there will be one less network license available for other users. The Workbench must have a connection to the CLC License Server at the point in time when you wish to borrow a license.
The procedure for borrowing a license is:
1. Go to the Workbench menu option Help | License Manager.
2. Click on the Borrow License tab to display the dialog shown in figure 2.16.
3. Use the checkboxes at the right-hand side of the table in the License overview section of the window to select the license(s) that you wish to borrow.
4. Select the length of time you wish to borrow the license(s).
5. Click on the button labeled Borrow Licenses.
6. Close the License Manager when you are done.
You can now go offline and work with the Finishing Module. When the time period you borrowed the license for has elapsed, the network license you borrowed is made available again for other users to access. To continue using the Finishing Module with a
122. verage coverage by correction reads on seed reads.
Hairpin splits: The number of splits performed due to putative untrimmed hairpin adapter sequences.
Chimeric splits: The number of splits performed due to putative chimeras.
Mismatches corrected: The number of mismatches that have been corrected in the output reads.
Insertions corrected: The number of insertions that have been corrected in the output reads.
Deletions corrected: The number of deletions that have been corrected in the output reads.
Errors per 100kb trimmed input read: The total number of errors (mismatches, insertions and deletions) that have been corrected per 100 kb in the output reads.
Finally, the number and total size of the following elements are given:
• Input reads longer than 100 bp
• Seed reads longer than threshold
• Correction reads shorter than threshold
• Low coverage regions trimmed away
• Seed reads after splitting and trimming
• Final corrected reads
Chapter 17 De Novo Assemble PacBio Reads (beta)
Please note that the tools Correct PacBio Reads (beta) and De Novo Assemble PacBio Reads (beta) were optimized for the use of PacBio data and readily support data generated with different generations of PacBio chemistry (sequencing reagents). Due to such algorithm optimizations, the use of these tools for other data types is not supported. Moreover, for the tool Correct PacBio Reads (beta), we are relying on ce
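The "Errors per 100kb" statistic listed above is a simple normalization, which the following Python sketch illustrates. The function name and arguments are hypothetical, and the choice of denominator (`total_bases`, the combined length of the reads being summarized) is an assumption, since the excerpt does not spell out exactly which read set the tool divides by.

```python
def errors_per_100kb(mismatches, insertions, deletions, total_bases):
    """Corrected errors normalized to a rate per 100,000 bases.

    total_bases is assumed to be the combined length of the reads being
    summarized; the exact denominator used by the tool is not stated here.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    total_errors = mismatches + insertions + deletions
    return total_errors * 100_000 / total_bases
```

For instance, 120 mismatches, 300 insertions and 80 deletions corrected over 2,000,000 bases would correspond to 25 errors per 100 kb.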
123. wer C T and Schutz E (2001). Oligonucleotide melting temperatures under PCR conditions: nearest-neighbor corrections for Mg2+, deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative empirical formulas. Clin Chem, 47(11):1956-1961.