
454 Sequencing System Software Manual, v 2.5p1 General

1. For a full description of the various data analysis applications, see Parts C and D of this manual. The software package described in this manual also includes the GS Reporter and the GS Run Browser applications (available on the GS Junior Attendant PC, on the Genome Sequencer FLX Instrument, and for off-instrument use on a cluster or DataRig), used to view and troubleshoot the results of a completed sequencing Run; the GS Support Tool, used to package sequencing Run data to send to Roche Customer Support for further help and troubleshooting; and the SFF Tools, a set of commands used to create, manipulate and access sequencing trace data from SFF files. However, these applications and commands are not required steps of data processing and analysis.
2.2 Data Processing Options
2.2.1 GS FLX System
The GS FLX System gives the user four options for processing the data from a sequencing Run, by selecting among the available processing types during sequencing Run set up (see Figure 1; for more detail on data processing options, see Part B, Section 1 of this manual; for details on sequencing Run set up, see Part A, Section 3 of this manual, GS FLX System version, and the GS FLX Titanium Sequencing Method Manual). In most cases, the user will perform the image processing step of data processing on the instrument, copy the partially processed data to a dedicated computational node, and carry out the time-consuming signal processing step there. T
2. Transcriptome assembly projects now generate two new text output files, 454Isotigs.faa and 454IsotigOrfAnalysis.txt, that provide ORF (open reading frame) information for isotigs. The GS Reference Mapper now generates two new files, 454HCStructRearrangements.txt and 454AllStructRearrangements.txt, which provide explicitly labeled classifications and information for observed Structural Rearrangements. For cDNA mapping projects, the reference type (cDNA or gDNA) can be automatically detected under certain circumstances, thus removing the need to specify the reference type in the GUI or on the command line. These circumstances are specified in the user manual. The 454ReadStatus.txt file has been updated to contain more information about chimeric reads. These changes only apply if running without regions.
1.2.5 Amplicon Variant Analyzer
Both the Graphical User Interface and the Command Line Interface now support the export of sequence alignments in FASTA, CLUSTAL, ACE or Table (csv or tsv) formats. The graphical interfaces for file filtering and file choosing have been improved. The Computation tab of the Graphical User Interface now allows the user to select the number of CPUs to use for computation.
2 OVERVIEW OF THE 454 SEQUENCING SYSTEM SOFTWARE
The 454 Sequencing System, developed by 454 Life Sciences Corporation, a Roche company, is an ultra-high-throughput automated DNA sequencing system capable of carrying out and monitoring sequenci
3. 454AlignmentInfo.tsv, 454ContigGraph.txt, 454PairAlign.txt, 454ReadStatus.txt, 454Contigs.ace (or ace/ContigName.ace, or consed), 454AssemblyProject.xml, 454Scaffolds.fna, 454Scaffolds.qual, 454Scaffolds.txt; sff (sff files); 454Project.xml
Figure 6: File output of the GS De Novo Assembler application when using the newAssembly and related commands (or their GUI equivalents) for project-based assembly. All result files (specifying the actual contig names) are placed in a folder within the user's current working directory when running the command, or in a directory specified by the user. This is identified as a Project folder by the presence of a 454Project.xml file within it. All assembly status and result files are organized in the "assembly" sub-folder, and all input SFF files used in the assembly (or symbolic links to them) are organized in the "sff" sub-folder. The 454Scaffolds files are produced only when the Paired End option is used; the Paired End option also adds sections to the 454NewblerMetrics.txt file. See Part C, Section 1 of this manual for full file descriptions.
Current working directory (or directory specified): P_yyyy_mm_dd_hh_min_sec_runMapping; mapping: 454AllContigs.fna, 454AllContigs.qual, 454LargeContigs.fna, 454LargeContigs.qual, 454AllDiffs.txt, 454HCDiffs.txt, 454NewblerMetrics.txt, 454NewblerProgress.txt, 454MappingQC.xls, 454AlignmentInfo.tsv, 454PairAlign.txt, 454ReadStatus.txt, 454RefStatus.txt, 454Contigs
4. GS Junior System refers to the whole system for Following an overview of data processing and analysis in the 454 Sequencing System, this manual provides a full description of these applications and commands, including how they are invoked through their Graphical User Interface (GUI) and at the UNIX command line level (on the GS Junior Attendant PC, a DataRig or a computer cluster resource), and information on the format of the output files of all the applications.
System Protection
Connection to computer networks contains an inherent risk of infection by viruses and worms, and of malicious targeted attacks through the network. It is the customer's responsibility to protect the system against such threats, i.e. by keeping up to date the protection of any network to which the customer chooses to connect the GS Junior Instrument, the Genome Sequencer FLX Instrument, or any DataRig or computer cluster. This protection might include measures such as a firewall to separate these devices from uncontrolled networks, as well as measures to ensure that the connected network is free of malicious code.
Assistance
If you have questions or experience problems with the 454 Sequencing System, please call, write, fax or email us. When calling for assistance, be prepared to provide the serial number of your GS Junior Instrument or Genome Sequencer FLX Instrument, and/or the lot number of the kit(s) you are using. The instrument's serial numb
5. The number of wells in each block is arbitrary and should not be hard-coded. There is no assumption that all blocks in the file are stored in the same format, nor that all blocks have the same number of wells. For example, Control DNA reads and failed wells may be archived in a lower-fidelity format, while the library fragments are stored as full floating-point numbers. However, the format cannot currently store discontinuous ranges of elements in a block, which constitutes a limitation of this feature. Users of the CWF format are encouraged to use libcwf to insulate them from the complexity of extracting the well information from the CWF file. The block types are as follows:
• r: This is a raw well block type, identical to the .wells format of software versions 1.1.03 and earlier. Values are stored in Little Endian byte order, as generated by Intel-brand x86 processors.
  • Header
    o 32 bits: numWells, as unsigned integer
    o 16 bits: numFlows, as unsigned integer
    o numFlows bytes: flowLabels, one of A, T, G, C, P
    o 32 bits: rank, as unsigned integer
    o 16 bits: xCoord, as unsigned integer
    o 16 bits: yCoord, as unsigned integer
    o 32 bits x numFlows: flowValues, as IEEE Single Precision Float
• h: Half-precision floating point. Each block is made up of four arrays, stored back to back without padding. The size of the first three arrays is equal to the data type size times the number of wells in the block. The last array consumes the res
6. or the byte array "\0\0\0\1".
index_offset (uint64_t), index_length (uint32_t): The index_offset and index_length fields are the offset and length of an optional index of the reads in the SFF file. If no index is included in the file, both fields must be 0.
number_of_reads (uint32_t): The number_of_reads field should be set to the number of reads stored in the file.
header_length (uint16_t): The header_length field should be the total number of bytes required by this set of header fields, and should be equal to "31 + number_of_flows_per_read + key_length" rounded up to the next value divisible by 8.
key_length (uint16_t): The key_length field should be set to the length of the key sequence used for these reads.
number_of_flows_per_read (uint16_t): The number_of_flows_per_read should be set to the number of flows for each of the reads in the file.
flowgram_format_code: The flowgram_format_code should be set to the format used to encode each of the flowgram values for each read. Currently, only one flowgram format has been adopted, so this value should be set to 1. The flowgram format code 1 stores each value as a uint16_t, where the floating point flowgram value is encoded as "(int) round(value * 100.0)", and decoded as "storedvalue / 100.0". In other words, the values are stored as an integer encoding of a limited-precision floating point value, keeping 2 places to the right of the decim
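The header length rule and the format code 1 encoding described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, rounding is taken to be half-up for non-negative values, and the upper clamp is simply the largest value a uint16_t can hold.

```python
import math

# Largest value a uint16_t can hold; decodes to 655.35 (assumed upper clamp).
FLOWGRAM_MAX_STORED = 65535


def round_up_to_8(n: int) -> int:
    """Round n up to the next value divisible by 8."""
    return (n + 7) // 8 * 8


def common_header_length(number_of_flows_per_read: int, key_length: int) -> int:
    """header_length = 31 + number_of_flows_per_read + key_length, padded to 8."""
    return round_up_to_8(31 + number_of_flows_per_read + key_length)


def encode_flow_value(value: float) -> int:
    """Encode a flowgram value as a uint16_t with two decimal places.

    Half-up rounding is an assumption of this sketch, as is clamping
    negative inputs to 0.
    """
    stored = math.floor(value * 100.0 + 0.5)
    return max(0, min(FLOWGRAM_MAX_STORED, stored))


def decode_flow_value(stored: int) -> float:
    """Decode a stored uint16_t back to a floating point flowgram value."""
    return stored / 100.0
```

For example, a file with 400 flows per read and a 4-base key needs 435 header bytes, which pads to 440.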
7. 8
DC Element / 454 Usage: Title: Run name. Description: User defined. flowgrams. Serial number of instrument performing the Run. Original Run name, e.g. R_2007_06_27_15_44_21_rig3_ccelone_1007075seqkit93555420PELTxxEX2xxVERIIF2. Instrument operator's name. dcterms:created: Date analysis was performed. UUID version of job.
Table 4: Dublin Core metadata elements used in CWF files
<?xml version="1.0" encoding="utf-8"?>
<Metadata xmlns:tns="http://purl.org/dc/terms/" xmlns:tnsa="http://purl.org/dc/elements/1.1/" xmlns:tnsb="http://purl.org/dc/dcmitype/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="GSDataProcessing 1 0 xsd">
<tnsa:title>R_2007_06_27_15_44_21_rig3_ccelone_1007075seqkit93555420PELTxxEX2xxVERIIF2</tnsa:title>
<tnsa:type>flowgrams</tnsa:type>
<tnsa:creator>Chris Celone &lt;ccelone@454.com&gt;</tnsa:creator>
<tns:created>2007-06-27T15:44:21Z</tns:created>
<Run> <Name>0</Name> <Project>Applications I</Project> <Kit>LR7OKIT</Kit> <Script>100x TACG 70x75 LR7TORIT icl</Script> <RegionCount>2</RegionCount> <RegionLayoutName>2 Regions</RegionLayoutName>
<PTP> <ID>630751</ID> <WellSize unit="um">50</WellSize> <Size unit="mm"> <Width>70</Width> <Height>75<
8. Name></Filter>
<Filter basic="true"><ID>8</ID><Name>Trimmed Too Short Primer</Name></Filter>
<Filter basic="true"><ID>9</ID><Name>Low Quality</Name></Filter>
</Filters>
Figure 13: Example filters.xml stream
3.3.1.8 filterResults.uint8.dat
This file contains a binary list of wells that failed filtering and the specific filter that the well did not pass. The filters are defined in file filters.xml. This stream consists of one byte per well, sorted by rank. See the discussion of Other Streams (sections 3.3.1.10 and 3.3.1.11) for a description of the layout of this stream.
The majority of the CWF payload consists of the well flow values themselves. In order to support rapid, random flow extraction from a CWF file, wells are stored in blocks. The data format for the blocks can vary depending on the application. The size of each block is equal to the largest multiple of the data size times the number of flows that is smaller than 32 Megabytes. In other words, no well's flow values will be broken between blocks, and no block will be larger than 32 MB. The naming convention for each block of flowgrams is TY Z wel, where:
• T is a letter indicating the type of the block (see below),
• Y is the starting well index, and
• Z is the final well index.
Y and Z are integer values and should not be padded with zeros.
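The block sizing rule above amounts to a small integer calculation. The sketch below is illustrative only: it assumes that "32 Megabytes" means 32 x 1024 x 1024 bytes, that each well occupies exactly data size times number of flows bytes, and that no per-block header needs to be accounted for. The function name is hypothetical.

```python
# Assumed interpretation of "32 Megabytes" for this sketch.
BLOCK_LIMIT_BYTES = 32 * 1024 * 1024


def wells_per_block(data_size: int, num_flows: int,
                    limit: int = BLOCK_LIMIT_BYTES) -> int:
    """Return the largest whole number of wells whose flow values fit in a
    block strictly smaller than `limit` bytes.

    data_size is the number of bytes per flow value (4 for a single
    precision float, 2 for half precision).
    """
    bytes_per_well = data_size * num_flows
    if bytes_per_well <= 0:
        raise ValueError("data_size and num_flows must be positive")
    # (limit - 1) enforces "smaller than", not "smaller than or equal to".
    return (limit - 1) // bytes_per_well
```

With 403 flows stored as 4-byte floats, each well takes 1612 bytes, so a block holds 20815 wells (33,553,780 bytes).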
9. The trim points from the end of the reads. This number is in flowgram space, instead of base space like the information in the baseCallerSeq.dat stream.
cfValues: The carry forward corrections for each well.
ieValues: The incomplete extension correction factors for each well.
signalPerBase: The average value of the keypass flows for this well. Can be used to do a simple base calling. Also can be used to judge the strength of the well.
keyPassDensity: An image, referenced to the region, showing the density of wells that pass key. This is normalized for each PTP pitch, so 0 = 0% loading and 255 = 100% loading.
rawWellDensity: An image, referenced to the region, showing the density of the bead loading. This is normalized for each PTP pitch, so 0 = 0% loading and 255 = 100% loading.
Table 6: Current stream types
lossless Portable Anymap format, specifically the P5 PGM (Portable Graymap) variant (http://netpbm.sourceforge.net/doc/pgm.html). To save space, only the area encompassed by the region is included in the image. To properly register the image against the original PTP device, you must offset the image slice by the region boundary. This region boundary can be read in the RevisedBounds element of the Region block in the meta.xml stream. 454 Life Sciences Corporation may introduce other image formats in future variants of the CWF file, so it is important to read the magic number and/or file extension of the image to determine the corr
10. a Post-Run Analysis sub-directory to include all the files they generate, using the following naming convention: P_year_month_day_hour_minute_second_runCommandName. The rest of the files within these directories, or within the current working directory, have either fixed names or simple, standard naming conventions (e.g. files specific to a region or key are named with the region or key at the beginning of the name). Exact nomenclature for all the files and other sub-folders produced by the data processing applications is provided in the output subsection of the description of each application, in the various Sections of Parts A, B, C and D of this manual.
3.2 Format Requirements for Input FASTA Files
The GS De Novo Assembler, the GS Reference Mapper, and the fnafile command from the SFF Tools can all take FASTA files as input. For the reference or input read FASTA file(s) to be readable by the 454 Sequencing System software, they must follow the industry standards for a FASTA file. In particular:
• The first line (descriptor line) of each sequence entry in the file should begin with a ">".
• There may be one or more additional header lines for a sequence entry, each beginning with a ">" or ";" character. The first line not beginning with a ">" or ";" starts the sequence region of the entry.
• The sequence region may contain any characters, but only the alphabetic characters will be used to form the sequence. A
11. consists of one or more of the following:
• A C-style comment, using /* and */ to delineate the comment text.
• A group, whose syntax is "groupname { }" and where one or more comments, sub-groups or name-value pairs can occur between the braces.
• A name-value pair, whose syntax is 'name = "value";' and where the quotes around the value are optional but the equals sign and semi-colon are required.
The parser file format is free-form, i.e. the syntactic elements can appear with any style of white space or line division. However, the output files generated by the 454 Sequencing System software use a standard indentation convention, where the group names appear by themselves on a line with the braces below it, all the text between the braces is indented, and each name-value pair appears on a single line. Several examples of parser files are shown in the Output sub-sections of the various applications' descriptions.
3.3.3 Image Files (.pif)
The .pif file format was developed at 454 Life Sciences Corporation for storing image data from the Genome Sequencer FLX Instrument or the GS Junior Instrument. The file consists of a header followed by data. The byte order is little-endian. The header is 12 bytes long, comprised of three 4-byte integers: the first integer value is the number of bits per pixel of data, the second integer is the width of the image in pixels, and the third integer is the height of the image in pixels. The following dat
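The 12-byte .pif header described above maps directly onto a single struct format string. The following is a minimal sketch, assuming the three integers are unsigned (the text only says "integers"); the function name is hypothetical.

```python
import struct

PIF_HEADER_SIZE = 12  # three 4-byte little-endian integers


def parse_pif_header(data: bytes):
    """Return (bits_per_pixel, width, height) from the start of a .pif file.

    Treating the integers as unsigned is an assumption of this sketch.
    """
    if len(data) < PIF_HEADER_SIZE:
        raise ValueError("a .pif header needs at least 12 bytes")
    bits_per_pixel, width, height = struct.unpack_from("<3I", data, 0)
    return bits_per_pixel, width, height
```

A caller would typically read the first 12 bytes of the file, pass them to `parse_pif_header`, and then use the width and height to size the pixel buffer that follows.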
12. last base of the insert is min((clip_qual_right == 0 ? number_of_bases : clip_qual_right), (clip_adapter_right == 0 ? number_of_bases : clip_adapter_right)).
clip_adapter_right (uint16_t)
name (char[name_length]): The name field should be set to the string of the read's accession or name. Note that the name field is not null-terminated.
eight_byte_padding (uint8_t[*]): If any eight_byte_padding bytes exist in the section, they should have a byte value of 0.
3.3.8.3 Read Data Section
The read data section consists of the following fields (field name, format, properties):
flowgram_values (uint*_t[number_of_flows]): The flowgram_values field contains the homopolymer stretch estimates for each flow of the read. The number of bytes used for each value depends on the common header flowgram_format_code value (where the current value uses a uint16_t for each value).
flow_index_per_base (uint8_t[number_of_bases]): The flow_index_per_base field contains the flow positions for each base in the called sequence (i.e. for each base, the position in the flowgram whose estimate resulted in that base being called). Note that these values are "incremental" values, i.e. the stored position is the offset from the previous flow index in the field. All position values (prior to their incremental encoding) use 1-based indexing, so the first flow is flow 1.
bases (char[number_of_bases]): The bases field contains the basecalled nucleotide sequence.
quality_scores: The quality_scores field contains the quality scores for
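As an illustration of the clipping rules (0 meaning "not computed", positions being 1-based), the first and last base of the insert can be derived as follows. The function name and tuple return value are hypothetical choices for this sketch; the max/min expressions follow the SFF read header description.

```python
def insert_range(number_of_bases: int,
                 clip_qual_left: int, clip_adapter_left: int,
                 clip_qual_right: int, clip_adapter_right: int):
    """Return (first_base, last_base) of the insert, both 1-based inclusive.

    A clip value of 0 means the clipping point was not computed.
    """
    first_base = max(1, max(clip_qual_left, clip_adapter_left))
    # A right clip of 0 falls back to the full read length.
    qual_right = clip_qual_right if clip_qual_right != 0 else number_of_bases
    adapter_right = (clip_adapter_right if clip_adapter_right != 0
                     else number_of_bases)
    last_base = min(qual_right, adapter_right)
    return first_base, last_base
```

For a 100-base read with clip_qual_left = 5 and clip_qual_right = 90 and no adapter clips, the insert spans bases 5 to 90; with no clips at all it spans bases 1 to 100.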
13. line
• The user enters a unique name for the Run during set up. A good choice for a unique name could be structured as follows: PicoTiterPlate device size and barcode_kit lot_genome_run, e.g. "70x75_123456_081708_ecoli_Run1". To form the complete name of the Run files, the date stamp, time stamp, instrument name and user name will be added automatically in front of this unique Run name. The name structure is: R_yyyy_mm_dd_hh_min_sec_machineName_userName_uniqueRunName.
Before any data processing or data analysis application can be run on a DataRig or a cluster (GS Run Processor, GS De Novo Assembler, GS Reference Mapper or GS Amplicon Variant Analyzer), the necessary input data (see Table 1 and Table 2) for the sequencing Run(s) being processed or analyzed must be made available on that DataRig or cluster. The Genome Sequencer FLX Instrument or GS Junior Attendant PC can be configured to automatically transfer the output files that result from the GS Run Processor application to a remote disk (see the section on data transfer scripts in the Genome Sequencer FLX System Site Preparation Guide or the GS Junior System SysAdmin Guide). Alternatively, the Data tab of the GS Sequencer application (on the Genome Sequencer FLX Instrument) or of the GS Junior Sequencer application (on the GS Junior Attendant PC) can be used to transfer the raw images of sequencing Run(s) to a pre-configured destination. For more information on the Data tab
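A Run name following the structure above can be split back into its components. The sketch below assumes that the fields are separated by underscores (as in the example Run names elsewhere in this manual) and that the machine name and user name themselves contain no underscores. The function name and dictionary keys are hypothetical.

```python
from datetime import datetime


def parse_run_name(run_name: str) -> dict:
    """Split a Run name of the form
    R_yyyy_mm_dd_hh_min_sec_machineName_userName_uniqueRunName.

    Assumes underscore separators and underscore-free machine and user
    names; the unique Run name may itself contain underscores.
    """
    parts = run_name.split("_", 9)
    if len(parts) != 10 or parts[0] != "R":
        raise ValueError(f"not a recognizable Run name: {run_name!r}")
    timestamp = datetime(*(int(p) for p in parts[1:7]))
    return {
        "timestamp": timestamp,
        "machineName": parts[7],
        "userName": parts[8],
        "uniqueRunName": parts[9],
    }
```

Applied to "R_2007_06_27_15_44_21_rig3_ccelone_1007075seqkit93555420PELTxxEX2xxVERIIF2", this yields the machine name "rig3", the user name "ccelone" and a timestamp of 27 June 2007, 15:44:21.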
14. org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="GSDataProcessing 1 0 xsd">
<Job> <ID>edf2c3fe-24fe-11dd-a9ef-0014d60a92f87</ID> <Name></Name>
<ProcessingDirectoryName>D_2008_05_18_13_22_00_zappa_allButMetrics</ProcessingDirectoryName>
<OS>Linux</OS> <StartTime>2008-05-18T17:22:17Z</StartTime>
<TotalJobSeconds>3884</TotalJobSeconds> <PartialJobSeconds>3884</PartialJobSeconds>
<GsRunProcessorVersion>20080308172036</GsRunProcessorVersion>
<Host>zappa.bw0l.labrat.com</Host> <NumDataSetsInJob>2</NumDataSetsInJob> <NumProcessors>10</NumProcessors>
<Type>allButMetrics</Type> <Pipeline>allButMetrics.xml</Pipeline>
<ParamsUsed>
<WellFinder>
<kernelSize>21</kernelSize>
<upsampleHighDensityPtps>true</upsampleHighDensityPtps>
<upsampleFactor>1</upsampleFactor>
<minPPISignal>80</minPPISignal>
<minConsensusSignal>70</minConsensusSignal>
<minWellSpacing>2</minWellSpacing>
<secondSearchPass>true</secondSearchPass>
<maskConstant>0.1054</maskConstant>
<maskAlpha>0.6</maskAlpha>
<maskHoleSize>2</maskHoleSize>
<numPixelsPerWell>4</numPixelsPerWell>
<morphologyThresholdMultiplier>0</morphologyThresholdMultiplier>
<morphologyNumInARow>5</morphologyNumInARow>
</WellFinder>
15. the generation of FASTA (FNA and QUAL) files on demand. Like the flowgrams, the reads are stored in blocks. Unlike the flowgrams, however, each block consists of a variable number of reads. A special stream called baseCalledSeq.dat is used to index the basecalled blocks. This stream contains 6 bytes per well:
Bytes 0-1: Total stored read length
Bytes 2-3: Number of bases to trim from the distal end (3')
Bytes 4-5: Number of bases to trim from the local end (5')
The complete data for any read is stored in three separate streams, identified by the .dna, .qual and .flow extensions, containing the reads, the PHRED-based quality values, and the offset flow indices, respectively. Each base of each well uses one byte in each file. The total number of bytes consumed by a read is reflected in the first field of baseCalledSeq.dat. Therefore, these two bytes can be used as an index of sorts. For example, the byte offset in the .dna file for read 100 can be found by summing the first two bytes of the first 99 entries in baseCalledSeq.dat. The 100th entry can then be used to tell how many bytes are available for read 100. It is important to note that since the basecalls are stored in blocks, one must first find the appropriate block, then compute the offset from there. Again, users of the CWF format are encouraged to use libcwf to insulate them from errors in extracting the base information.
3.3.1.10 Other Public Streams
The cwf file can also con
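The offset calculation in the example above (summing the stored lengths of the preceding entries) might look like this. It is a sketch only: it assumes the two-byte length field is an unsigned little-endian integer, like the other CWF streams described as Intel Little Endian, and that the index bytes passed in cover a single block starting at its first read. The function name is hypothetical.

```python
import struct

ENTRY_SIZE = 6  # bytes per well in baseCalledSeq.dat


def read_span(index: bytes, read_number: int):
    """Return (byte_offset, byte_length) of a read within its block.

    read_number is 1-based and counted from the first read of the block.
    Little-endian, unsigned 16-bit lengths are an assumption of this sketch.
    """
    if read_number < 1 or read_number * ENTRY_SIZE > len(index):
        raise ValueError("read_number is outside the index")
    lengths = [struct.unpack_from("<H", index, i * ENTRY_SIZE)[0]
               for i in range(read_number)]
    # The offset is the sum of the lengths of all preceding reads.
    return sum(lengths[:-1]), lengths[-1]
```

The same offset and length apply to the .dna, .qual and .flow streams, since each base uses one byte in each of them.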
16. the streams in a CWF file following image and signal processing of a sequencing Run

Uncompressed Size    Compressed Size
24                   24
8376341              1519187
3749552              3328881
3749552              3021477
8376341              1103849
204695               24439
684743               40138
468694               130450
937388               444081
204695               22734
263879               20221
684743               48430
1064508              60786
33554213             10135865
33554213             9872137
33554213             21064266
33554192             10070875
33554192             9787599
33554192             20866664
33554389             10066101
33554389             9787253
33554389             20863713
10323379             3139442
10323379             3056298
10323379             6530831
33553894             29209536
33553894             29206671
33553894             29126347
33553894             29069883
33553894             29068170
33553894             29082037
33553894             29120018
33553894             29184495
33553894             29262051
33553894             29361393
33553894             29461534
1874776              1539471
12424082             10939973
2812164              1126775
2334                 612
4686940              2962553
2162                 701
839                  286
130111               18528
3286                 1284
752 MB               482 MB

3.3.1.1 mimetype
This is a single-line file containing the words "application/vnd.454.cwf". It must be the first stream in the file and must be stored uncompressed.
3.3.1.2 meta.xml
Like the OpenDocument format, meta.xml stores information about the Run itself. As a convenience, the schema defining the XML data stored in the CWF file references the Dublin Core (DC) metadata elements. The DC elements used and their interpreted meaning are summarized in Table 4. An example meta.xml file is shown in Figure
17. 454 Sequencing System Software Manual, v 2.5p1
General Overview and Data File Formats
For life science research only. Not for use in diagnostic procedures.
454 SEQUENCING
454 Sequencing System Software Manual, Software v 2.5p1, August 2010
General Overview and Data File Formats
Table of Contents
About This Manual
System Protection
Assistance
1.1 454 Sequencing System Software Manual
1.1.1 Organization of the User Manual
1.1.2 Formats of the User Manual
1.2 454 Sequencing System applications
1.2.1 GS Junior Sequencer
1.2.2 GS Run Processor
1.2.3 GS Run Browser
1.2.4 GS De Novo Assembler and GS Reference Mapper
1.2.4.1 Graphical User Interface
1.2.4.2 Command Line Options
1.2.4.3 Output Files
1.2.5 Amplicon Variant Analyzer
18. /Height> </Size> </PTP>
<Flow> <ActualOrder>SSSSPSSTACGT CGPSS</ActualOrder> <FlowCount>403</FlowCount> <CycleCount>100</CycleCount> <FlowOrder>PTACGTACGTACGT CGP</FlowOrder> </Flow>
</Run>
<Region> <Name>Region0</Name> <Number>1</Number>
<TemplateBounds unit="pixel"> <Center><X>1024</X><Y>2048</Y></Center> <Dimension><Width>2046</Width><Height>4094</Height></Dimension> </TemplateBounds>
<RevisedBounds unit="pixel"> <Center><X>1024</X><Y>2048</Y></Center> <Dimension><Width>2046</Width><Height>4094</Height></Dimension> </RevisedBounds>
</Region>
<WellCount>468698</WellCount>
</Metadata>
Figure 8: Example meta.xml stream
3.3.1.3 history.xml
This is a single XML file showing what processing has been done to these wells. It contains reference copies of the pipeline parameters that were used to create the final result, as well as processing times, dates, software revision numbers, and a Universally Unique Identifier (UUID) for each processing step. Each analysis performed on the data set adds a new Job element to the stream. An example history.xml file is shown in Figure 9.
<?xml version="1.0" encoding="utf-8"?>
<History xmlns:xsi="http://www.w3
19. S De Novo Assembler, GS Reference Mapper and GS Amplicon Variant Analyzer), on the other hand, are always carried out separately and are supported via a Graphical User Interface, in the GS Run Browser, or at the command line level. Applications launched by direct user input are shown in Bold Italics; others are embedded in the Run. The GS Run Browser application (not shown) can be invoked either on the Genome Sequencer FLX Instrument or on a DataRig or a cluster, to view raw images and other Run data (e.g. for troubleshooting purposes) or to reanalyze the dataset using different settings; this is done after a Run completes or is aborted.
2.2.2 GS Junior System
In the GS Junior System, the data sets are smaller and the computing power aboard the Attendant PC is sufficient to fully process the data rapidly. Therefore, there is no need to carry out the image processing and signal processing steps separately, and the "Image Processing Only" option is not offered. Users would typically use the appropriate Full Processing option, Standard or Amplicon (see Figure 1); the "No Processing" option is also available.
2.3 Data Output and Folder Structure
This section provides an overview of the listing and organization of the files that are generated at each step of data processing or data analysis and made available to the user. After a sequencing Run, results can be found in the data directory on the Genome Sequence
20. S Junior System and the GS FLX System:
• with the GS Junior System, all of data acquisition and data processing can be handled at once, as part of the sequencing Run. All processing can be done on the Attendant PC.
• with the GS FLX System, the two steps of data processing are typically executed independently, running the image processing step on board the instrument, concurrently with the sequencing Run, and the more time-consuming signal processing step on a dedicated data processing computer (cluster). See section 2.2 for a description of data processing pipeline options.
In either case, the general functions carried out are as follows:
1. the GS Junior Sequencer or GS Sequencer application records a set of raw digital images representing the light detected over the surface of the PicoTiterPlate device during each reagent flow of the sequencing Run (data acquisition);
2. the first step of the GS Run Processor application (image processing) performs initial pixel-level calculations and then groups pixels from the image set into a representation of the PicoTiterPlate wells where sequencing reactions were detected;
3. the second step of the GS Run Processor application (signal processing) performs well-level calculations across the whole series of images, to generate well "flowgrams" and the basecalls of the DNA fragments being sequenced in all the active wells of the PicoTiterPlate device ("reads").
Application: GS Junior Sequencer or GS Seq
21. <WellBuilder>
<kernelSize>21</kernelSize>
<minPPISignal>80</minPPISignal>
<scaleFactor>1</scaleFactor>
</WellBuilder>
<NukeSignalStrengthBalancer>
<computeMedianRange>41</computeMedianRange>
</NukeSignalStrengthBalancer>
<CafieCorrector>
<maxAcceptableDroop>0.25</maxAcceptableDroop>
</CafieCorrector>
<IndividualWellScaler>
<startMinSinglet>0.75</startMinSinglet>
<startMaxSinglet>1.25</startMaxSinglet>
<windowEachSide>13</windowEachSide>
<stepSize>1l</stepSize>
<minSingletsPerWindow>3</minSingletsPerWindow>
<useAverage>true</useAverage>
<interpolateVacantWindows>true</interpolateVacantWindows>
<reSeedThresholds>true</reSeedThresholds>
<padEnd>false</padEnd>
</IndividualWellScaler>
<WellScreener>
<enable>false</enable>
<useStdDevThresholding>true</useStdDevThresholding>
</WellScreener>
</ParamsUsed>
</Job>
</History>
Figure 9: Example history.xml stream
3.3.1.4 location.idx
This is a binary index to the wells files. It contains common data about each well that can be used to support a well browser-style application. Items in this stream are stored in Intel (Little Endian) format, i.e. the rank is stored as Byte3 Byte2 Byte1 Byte0. The wells file contains one fie
22. a in the file are the pixel intensity values, presented in row-major order, starting in the upper-left corner. Currently, all image data is stored in 16-bit unsigned integers, or 2 bytes per pixel. Valid image data is limited to the first 14 bits. As these are binary format files, an example cannot be provided in this text-based document.
3.3.4 Well-Level Signal Data Files (.wells)
The .wells data file is a legacy binary file containing "counts" values for the light collected on the nucleotide and PPi images, at each active well location. The file consists of a header and a body, where the header contains the following fields:
unsigned int numWells;
unsigned short numFlows;
char flowOrder[numFlows];
where numWells is the number of wells (or reads) in the file, numFlows is the number of flows in the sequencing Run script, and flowOrder are characters indicating the reagent for each flow: A, C, G and T specify each nucleotide flow, and P signifies a PPi flow. The body of the .wells data file contains numWells records of the following fields:
unsigned int rank;
unsigned short x;
unsigned short y;
float flowValues[numFlows];
where rank is the general ranking of the well by signal intensity, x and y are the coordinates of the well center pixel, and flowValues are the signal values for all the flows in this well. All the multi-byte values in the header and body are written using little-endian byte or
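The .wells header and body layout can be read with Python's struct module. This is a minimal sketch rather than a reference parser: it assumes the fields are packed back to back with no alignment padding, with a 4-byte unsigned int, 2-byte unsigned shorts and 4-byte IEEE floats. The function name and the dictionary layout are hypothetical.

```python
import struct


def parse_wells(buf: bytes):
    """Parse a .wells buffer into (flow_order, list of well records).

    Assumes little-endian values packed without padding.
    """
    # Header: unsigned int numWells, unsigned short numFlows.
    num_wells, num_flows = struct.unpack_from("<IH", buf, 0)
    offset = 6
    flow_order = buf[offset:offset + num_flows].decode("ascii")
    offset += num_flows

    # Body record: rank, x, y, then numFlows single precision floats.
    record = struct.Struct(f"<IHH{num_flows}f")
    wells = []
    for _ in range(num_wells):
        rank, x, y, *flow_values = record.unpack_from(buf, offset)
        wells.append({"rank": rank, "x": x, "y": y,
                      "flowValues": flow_values})
        offset += record.size
    return flow_order, wells
```

Because the record size depends on numFlows, the header must be read before any well can be located; record i then starts at 6 + numFlows + i x (8 + 4 x numFlows) bytes under the packing assumption above.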
23. ace (or ace/ContigName.ace, or consed), 454MappingProject.xml; sff (sff files); 454Project.xml
Figure 7: File output of the GS Reference Mapper application when using the newMapping and related commands (or their GUI equivalents) for project-based mapping. All result files (specifying the actual reference sequence accession numbers) are placed in a folder within the user's current working directory when running the command, or in a directory specified by the user. This is identified as a Project folder by the presence of a 454Project.xml file within it. All mapping status and result files are organized in the "mapping" sub-folder, and all input SFF files used in the mapping (or symbolic links to them) are organized in the "sff" sub-folder. See Part C, Section 2 of this manual for full file descriptions.
3 DATA FILES AND FORMATS
Section 2.3 lists all the files and folders that constitute the deliverable output of the 454 Sequencing System data processing and data analysis software for a generic sequencing Run, including the results of the GS De Novo Assembler and GS Reference Mapper applications. The actual directories generated may contain a number of additional files, but those are intermediate or log files generated for use only by Roche Customer Support personnel in the event that a Run might require additional investigation. The 454 Sequencing System software uses fixed names for the files it generates, and the structure and nam
24. ad header and read data sections. Each read header section consists of the following fields (field name, format, properties):
read_header_length (uint16_t): The read_header_length should be set to the length of the read header for this read, and should be equal to "16 + name_length" rounded up to the next value divisible by 8.
name_length (uint16_t): The name_length field should be set to the length of the read's accession or name.
number_of_bases (uint32_t): The number_of_bases should be set to the number of bases called for this read.
clip_qual_left (uint16_t), clip_adapter_left (uint16_t): The clip_qual_left and clip_adapter_left fields should be set to the position of the first base after the clipping point, for quality and/or an adapter sequence, at the beginning of the read. If only a combined clipping position is computed, it should be stored in clip_qual_left.
clip_qual_right (uint16_t): The clip_qual_right and clip_adapter_right fields should be set to the position of the last base before the clipping point, for quality and/or an adapter sequence, at the end of the read. If only a combined clipping position is computed, it should be stored in clip_qual_right.
Note that the position values use 1-based indexing, so the first base is at position 1. If a clipping value is not computed, the field should be set to 0. Thus, the first base of the insert is "max(1, max(clip_qual_left, clip_adapter_left))" and the
25. ader_length.
4. While the file contains more bytes, do the following:
  a. If the file pointer position equals index_offset, either read or skip index_length bytes in the file, processing the index if read.
  b. Otherwise:
    i. Read 16 bytes and extract the read_header_length, name_length and number_of_bases values.
    ii. Read the next "read_header_length - 16" bytes to read the name.
    iii. At this point, a test of the name field can be performed, to determine whether to read or skip this entry.
    iv. Compute the read data length as "number_of_flows * flowgram_bytes_per_flow + 3 * number_of_bases", rounded up to the next value divisible by 8.
    v. Either read or skip read_data_length bytes in the file, processing the read data if the section is read.

Published by 454 Life Sciences Corp., A Roche Company, Branford, CT 06405. © 2010 454 Life Sciences Corp. All rights reserved. For life science research only. Not for use in diagnostic procedures. 454, 454 LIFE SCIENCES, 454 SEQUENCING, GS FLX, GS FLX TITANIUM, GS JUNIOR, EMPCR, PICOTITERPLATE, PTP, REM, NIMBLEGEN, FASTSTART, CASY and INNOVATIS are trademarks of Roche. Other brands or product names are trademarks of their respective holders. 5 0810
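The scanning procedure of section 3.3.8.4 can be sketched in Python as shown below. This is only an illustrative sketch, not the SFF Tools implementation. The names scan_sff, round_up_8 and want are hypothetical. The order and widths of the fields inside the 31-byte fixed part of the common header are not spelled out in this part of the manual, so the layout used here (magic number, version, index offset, index length, number of reads, header length, key length, number of flows, format code) is an assumption that should be checked against the common header table.

```python
import io
import struct

SFF_MAGIC = 0x2E736666  # big-endian bytes spell ".sff"


def round_up_8(n):
    """Round n up to the next value divisible by 8."""
    return (n + 7) // 8 * 8


def scan_sff(stream, want=lambda name: True):
    """Yield (name, number_of_bases, read_data) for every wanted read."""
    # Steps 1-2: rewind, then decode the fixed 31-byte part of the header.
    # ASSUMPTION: the field order and widths below are not taken from this
    # section of the manual.
    stream.seek(0)
    (magic, _version, index_offset, index_length, _number_of_reads,
     header_length, _key_length, number_of_flows,
     format_code) = struct.unpack(">I4sQIIHHHB", stream.read(31))
    if magic != SFF_MAGIC:
        raise ValueError("magic number mismatch: not an SFF file")
    if format_code != 1:
        raise ValueError("unsupported flowgram format code")
    flowgram_bytes_per_flow = 2  # step 2a: format code 1 means 2 bytes

    # Step 3: flow_chars and key_sequence are not needed here; skip them.
    stream.seek(header_length)

    # Step 4: walk the remainder of the file.
    while True:
        if index_length and stream.tell() == index_offset:
            stream.seek(index_length, io.SEEK_CUR)  # 4a: skip the index
            continue
        fixed = stream.read(16)  # 4b-i: fixed part of the read header
        if len(fixed) < 16:
            return  # the file contains no more bytes
        read_header_length, name_length, number_of_bases = struct.unpack(
            ">HHI", fixed[:8])
        rest = stream.read(read_header_length - 16)  # 4b-ii: name + padding
        name = rest[:name_length].decode("ascii")
        data_length = round_up_8(  # 4b-iv
            number_of_flows * flowgram_bytes_per_flow + 3 * number_of_bases)
        if want(name):  # 4b-iii: decide whether to read or skip
            yield name, number_of_bases, stream.read(data_length)
        else:
            stream.seek(data_length, io.SEEK_CUR)  # 4b-v: skip the data
```

Passing a predicate as want (for example, a set-membership test on accession names) lets a caller extract a few reads without decoding the data of all the others, which is the purpose of the name test in step 4b-iii.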
26. al point, and capping the values at 655.35.

flow_chars, char[number_of_flows_per_read]: The flow_chars should be set to the array of nucleotide bases ('A', 'C', 'G' or 'T') that correspond to the nucleotides used for each flow of each read. The length of the array should equal number_of_flows_per_read. Note that the flow_chars field is not null-terminated.

key_sequence, char[key_length]: The key_sequence field should be set to the nucleotide bases of the key sequence used for these reads. Note that the key_sequence field is not null-terminated.

eight_byte_padding, uint8_t[*]: If any eight_byte_padding bytes exist in the section, they should have a byte value of 0.

If an index is included in the file, the index_offset and index_length values in the common header should point to the section of the file containing the index. To support different indexing methods, the index section should begin with the following two fields:

index_magic_number, uint32_t
index_version, char[4]

and should end with an eight_byte_padding field, so that the length of the index section is divisible by 8. The format of the rest of the index section is specific to the indexing method used. The index_length given in the common header should include the bytes of these fields and the padding.

3.3.8.2 Read Header Section

The rest of the file contains the information about the reads, namely number_of_reads entries consisting of re
27. available for another Run. Run troubleshooting may also be time-intensive, so the GS Reporter and the GS Run Browser applications are provided both on the Genome Sequencer FLX Instrument and as separate applications that can be run on a DataRig or the computer cluster.

[Figure: data processing options. Columns: Full Processing (Standard or Amplicons); Image Processing Only; No Processing. In each column, the GS Sequencer or GS Junior Sequencer produces raw images; Run Processor Step 1, Image Processing, produces preliminary CWF files; Run Processor Step 2, Signal Processing (Std. or Amp.), produces CWF and SFF files. The steps performed during the sequencing Run depend on the processing type selected.]

Use read flowgram data and basecalls in either of the following applications: GS De Novo Assembler: consensus sequence assembled into contigs (de novo sequencing), with per-base quality scores and ACE file of the multi-alignment; with the Paired End option, contig scaffolds are also provided (requires special Paired End Sig
28. d to the extent specified in the processing type selected, and deposited in the Data Processing folder concurrently with the Run. When the "Run Completed" window appears on the screen, the processing of the sequencing Run has completed and the results are ready for further processing or transfer.

R_yyyy_mm_dd_hh_min_sec_machineName_userName_uniqueRunName
  D_yyyy_mm_dd_hh_min_sec_machineName_analysisType
    gsRunProcessor.log
    gsRunProcessor_err.log
    dataRunParams.xml
    regions/region.cwf
    sff/uaccnoRegion.sff

Figure 3: General organization of a Data Processing folder. Data Processing (D_) folders are created within the corresponding Run's R_ folder; an R_ folder can contain multiple D_ folders if the dataset was re-processed. Words in italics are generic. The superscripts indicate the application by which the folders and files are generated: GS Sequencer or GS Junior Sequencer; GS Run Processor. The set of SFF files are generated during the signal processing step of the GS Run Processor, using the universal accession prefix described in section 3.3.7. See Part B, Section 1 of this manual for full file descriptions.

2.3.3 Data Analysis Applications Results

As indicated above, the data analysis applications are often performed on a pool of sequencing Runs rather than on any single Run, and/or can require additional information beyond the Run data. For this reason, they are carried out off-instru
29.
2 Overview of the 454 Sequencing System Software
2.1 Data Acquisition, Data Processing, and Data Analysis
2.2 Data Processing Options
2.2.1 GS FLX System
2.2.2 GS Junior System
2.3 Data Output and Folder Structure
2.3.1 Data Acquisition (GS Sequencer and GS Junior Sequencer) Results: the Run Folder
2.3.2 Data Processing (GS Run Processor) Results: the Data Processing Folder
2.3.3 Data Analysis Applications Results
3 Data Files and Formats
3.1 Directory Naming Conventions
3.2 Format Requirements for Input FASTA Files
3.3 Standard File Formats
3.3.1 Composite wells file format
3.3.1.1
3.3.1.2
3.3.1.3 history.xml
3.3.1.4 location.idx
3.3.1.5 metrics.xml
30. dering, consistent with the Linux operating system. As these are binary format files, an example cannot be provided in this text-based document.

3.3.5 Exportable Metrics Files (.csv)

The comma-separated values text file (.csv) is an alternate format in which a number of the metrics files are output. This format is suitable for automated parsing by programs or for loading into a spreadsheet program like Microsoft Excel. The contents of these files, where generated, are identical to those of the corresponding files formatted in the 454 parser format described above (.txt).

3.3.6 DNA Sequence (FASTA; .fna) and Base Quality Score (.qual) Files

Three of the 454 Sequencing System data processing applications output DNA sequences: Signal Processing, for the basecalls of individual reads; GS De Novo Assembler, for the de novo assembled consensus sequence of the sample DNA library; and GS Reference Mapper, for the sample's consensus sequence mapped to a reference sequence. These use the FASTA standard file format (.fna) and are always accompanied by a corresponding base quality scores file in the .qual format. Examples are shown in the Output sub-sections of these applications' descriptions (e.g. region.key.454Reads and 454AllContigs). Note that the description lines are slightly different depending on whether the FASTA file outputs contain reads or contigs and, for contigs, whether they were generated by the GS De Novo Assembler or by the GS Reference Mapper a
31.
3.3.1.6 sequences.xml
3.3.1.7 filters.xml
3.3.1.8 filterResults.uint8.dat
3.3.1.9 Base Called Data
3.3.1.10 Other Public Streams
3.3.1.11 Other Private Streams
3.3.1.12 Other Files
3.3.2 Parameters and Viewable Metrics Files (454 Parser Format; .parse, .txt)
3.3.3 Image Files (.pif)
3.3.4 Well-Level Signal Data Files (.wells)
3.3.5 Exportable Metrics Files (.csv)
3.3.6 DNA Sequence (FASTA; .fna) and Base Quality Score (.qual) Files
3.3.7 454 Universal Accession Numbers
3.3.8 Standard Flowgram Files (.sff)
3.3.8.1 Common Header Section
3.3.8.2 Read Header Section
3.3.8.3 Read Data Section
3.3.8.4 Computing Lengths and Scanning the File

PREFACE

For life science research only. Not for use in diagnostic procedures.

About this Manual

The 454 Sequencing Syst
32. e * 60 + second. As a result of this calculation, the first character of read accessions will always be a letter for Runs performed from now until 2038. The timestamp values are taken from the rigRunName found in the analysisParms.parse file in the specified analysis directory. This rigRunName is the "R_" name that is generated by the instrument software, and is also used as the standard directory name for the Run. Thus, a Run whose name begins with R_2004_09_22_16_59_10 generates "C3U5GW" as its encoded timestamp value.
• Since two Runs may be started at the same second, an additional base-36 character is generated by hashing the full rigRunName to a base-31 number (the highest prime below 36), as in:

    chval = 0;
    for (s = rigRunName; *s; s++)
        chval += (int) *s;
    chval %= 31;
    ch = (chval < 26 ? 'A' + chval : '0' + chval - 26);

• The X,Y location is encoded by computing a total value of "X * 4096 + Y" and encoding that as a five-character base-36 string.

3.3.8 Standard Flowgram Files (.sff)

The Standard Flowgram File is used to store the information on one or many 454 Sequencing reads and their trace data. Sequencing reads obtained using the 454 Sequencing System differ from reads obtained using more traditional methods (Sanger sequencing) in that the 454 Sequencing data does not provide individual base measurements from which basecalls can be derived. Instead, it provides measurements that estimate the length of each
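The run-name hash character and the X,Y location encoding of the universal accession numbers (section 3.3.7) can be sketched in Python as shown below. This is a hedged illustration, not the instrument software: the function names are hypothetical, and the digit alphabet (A to Z for the values 0 to 25, then 0 to 9 for the values 26 to 35) is taken from the character mapping of the hashing fragment and assumed to apply to the multi-character base-36 strings as well. That assumption is consistent with the documented example, since "C3U5GW" corresponds to the integer 170614750 under this alphabet.

```python
# ASSUMPTION: digit alphabet inferred from the hash-character mapping
# ('A' + chval for values below 26, '0' + chval - 26 otherwise).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def encode_base36(value, width):
    """Encode a non-negative integer as a fixed-width base-36 string."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 36)
        digits.append(ALPHABET[remainder])
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))


def hash_char(rig_run_name):
    """Hash the full run name to one character: character sum modulo 31."""
    chval = sum(ord(c) for c in rig_run_name) % 31
    return ALPHABET[chval]


def encode_location(x, y):
    """Encode a well's X,Y location as a five-character base-36 string."""
    return encode_base36(x * 4096 + y, 5)


def decode_location(code):
    """Recover (x, y) from a five-character location string."""
    total = 0
    for ch in code:
        total = total * 36 + ALPHABET.index(ch)
    return divmod(total, 4096)
```

Because Y is packed into the low 12 bits of the total (4096 = 2^12), decode_location only round-trips correctly for Y values below 4096.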
33. e number of flows can be retrieved from the meta.xml stream. Values are stored in Little Endian byte order, as generated by Intel brand x86 processors.
o X coordinates as unsigned 16-bit numbers
o Y coordinates as unsigned 16-bit numbers
o flow ranks as unsigned 32-bit numbers
o flow values as IEEE 754 binary full-precision floating point numbers (http://grouper.ieee.org/groups/754/)

i = Integer. Each block is made up of four arrays, stored back to back without padding. The size of the first three arrays is equal to the data type size times the number of wells in the block. The last array consumes the rest of the block and is equal to 2 bytes times the number of wells times the number of flows. The number of wells is derived from the name of the stream, and the number of flows can be retrieved from the meta.xml stream. Values are stored in Little Endian byte order, as generated by Intel brand x86 processors. Note that this block is rarely used because it cannot store values less than zero (which can occur during signal processing routines) and lacks precision for values near one.
o X coordinates as unsigned 16-bit numbers
o Y coordinates as unsigned 16-bit numbers
o flow ranks as unsigned 32-bit numbers
o flow values as IEEE 754 binary full-precision floating point numbers

3.3.1.9 Base Called Data

If a data set has been processed through the baseCaller section of the GS Run Processor, the actual bases are written into the CWF file. This allows for
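The block layout described for the wells streams (four arrays stored back to back, little-endian, without padding) can be sketched in Python as shown below. This is a minimal sketch under stated assumptions: split_wells_block and its parameters are hypothetical names, the caller is assumed to already know the number of wells (from the stream name) and the number of flows (from meta.xml), and the flow values are returned as raw bytes because their element type depends on the block type.

```python
import struct


def split_wells_block(block, n_wells, n_flows, value_size=2):
    """Split one wells block into X, Y, rank arrays and raw flow values.

    value_size is the size in bytes of one flow value; 2 matches the
    "2 bytes times wells times flows" rule given for the Integer block.
    """
    x_end = 2 * n_wells            # X coordinates: unsigned 16-bit
    y_end = x_end + 2 * n_wells    # Y coordinates: unsigned 16-bit
    rank_end = y_end + 4 * n_wells  # flow ranks: unsigned 32-bit
    expected = rank_end + value_size * n_wells * n_flows
    if len(block) != expected:
        raise ValueError(
            "block is %d bytes, expected %d" % (len(block), expected))
    # "<" selects little-endian byte order with no alignment padding.
    xs = struct.unpack("<%dH" % n_wells, block[:x_end])
    ys = struct.unpack("<%dH" % n_wells, block[x_end:y_end])
    ranks = struct.unpack("<%dI" % n_wells, block[y_end:rank_end])
    return xs, ys, ranks, block[rank_end:]
```

Checking the total length up front is a cheap way to catch a wrong well or flow count before any values are misinterpreted.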
34. each of the bases in the sequence, where the values use the standard -log10 probability scale. If any eight_byte_padding bytes exist in the section, they should have a byte value of 0.

Field name, Format:
flowgram_values
flow_index_per_base, uint8_t[number_of_bases]
bases
quality_scores, uint8_t[number_of_bases]
eight_byte_padding, uint8_t[*]

3.3.8.4 Computing Lengths and Scanning the File

The length of the various reads' sections will vary, because of different length accession numbers and different length nucleotide sequences. However, the various flow, name and bases lengths given in the common and read headers can be used to scan the file, accessing each read's information or skipping read sections in the file. The following pseudocode gives an example method to scan the file and access each read's data:

1. Open the file and/or reset the file pointer position to the first byte of the file.
2. Read the first 31 bytes of the file, confirm the magic_number value and version, then extract the number_of_reads, number_of_flows_per_read, flowgram_format_code, header_length, key_length, index_offset and index_length values.
  a. Convert the flowgram_format_code into a flowgram_bytes_per_flow value (currently, with format_code 1, this value is 2 bytes).
3. If the flow_chars and key_sequence information is required, read the next "header_length - 31" bytes, then extract that information. Otherwise, set the file pointer position to byte he
35. ect image decoder to use.

Note on Image Formats: Currently, the only image format stored in CWF is the

3.3.1.11 Other Private Streams

There are other streams of data that can be stored in the CWF file that contain intermediate results from various pipeline and post-processing functions. These are left unspecified to avoid restricting the development of new algorithms in the data processing applications. CWF file users should neither depend on their existence nor attempt to parse them, as their contents may change between revisions of the software or invocations of different processing streams.

3.3.1.12 Other Files

There may be other files in the CWF container, the contents of which may include log files and other binary data. A proper CWF writer, like the GS Run Processor, will copy any other unknown stream verbatim to the destination. CWF file readers should not balk at these extra streams, but should not depend on their existence either. This allows advanced users to add additional payloads to the CWF container before moving them to the data analysis system.

3.3.2 Parameters and Viewable Metrics Files (454 Parser Format; .parse, .txt)

The parser format developed at 454 Life Sciences Corporation is a standard format used for all the software parameter files and for most of the metrics files. This is a text-based format that organizes the text in titled groups that contain either sub-groups or name-value pairs of strings. A parser file
36. elow: the one-step runAssembly (Figure 4) or runMapping (Figure 5) commands, for the standard (non-incremental) assembly or mapping of one or more Runs; and newAssembly (Figure 6) or newMapping (Figure 7), for the project-based (incremental) assembly or mapping of one or more Runs. See Part C, Sections 1 and 2 of this manual for full file descriptions and for more information on the GS De Novo Assembler and GS Reference Mapper applications. The GS Amplicon Variant Analyzer application can also operate either via a GUI or via its own Command Line Interface (the AVA CLI), but its output is not structured like that of the other two data analysis applications. See Part D of this manual for details on this application.

current working directory (or directory specified)
  P_yyyy_mm_dd_hh_min_sec_runAssembly
    454AllContigs.fna
    454AllContigs.qual
    454LargeContigs.fna
    454LargeContigs.qual
    454NewblerMetrics.txt
    454NewblerProgress.txt
    454AlignmentInfo.tsv
    454ContigGraph.txt
    454PairAlign.txt
    454ReadStatus.txt
    454Contigs.ace (or ace/ContigName.ace, or consed)
    sff/*.sff
    454Scaffolds.fna
    454Scaffolds.qual
    454Scaffolds.txt

Figure 4: File output of the GS De Novo Assembler application, using the runAssembly command (or its GUI equivalent). All result files (specifying the actual contig names) are placed in a folder with a "P_" prefix within the user's current working directory when running the command, or in a directory s
37. em Software Manual describes the software package of the GS Junior System and the GS FLX System for DNA Sequencing, developed by 454 Life Sciences Corporation. It is divided into 5 parts:
• General Overview and Data File Formats
• Part A
  o GS Junior Sequencer, for the GS Junior System
  o GS Sequencer and Other On-Instrument Applications, for the GS FLX System
• Part B: GS Run Processor, GS Reporter, GS Run Browser, and GS Support Tool
• Part C: GS De Novo Assembler, GS Reference Mapper, and SFF Tools
• Part D: GS Amplicon Variant Analyzer

Since the software is generally common for both systems, the manual covers applications for both. The only exception to this rule is the GS Sequencer or GS Junior Sequencer application, so users should make sure to refer to the manual (Part A) that matches their instrument.

DNA sequencing developed by 454 Life Sciences Corp., including the GS Junior Instrument and its Attendant PC components; all the kits for the preparation, amplification and sequencing of a DNA sample; the methods to use the kits, as described in the Manuals and Guides; and the software provided to process and analyze the data from sequencing Runs. Likewise, "GS FLX System" refers to our similar high-throughput system based on the Genome Sequencer FLX Instrument. The phrase "454 Sequencing System" refers to the common technology that underlies both systems. 454 Life Sciences Corporation is a Roche company.

In this documentation, the phrase
38. ences.xml) and one optional one (filters.xml). Each of these files references a single XML schema, which is available on request. Table 3 shows an example listing of a CWF file's streams that might exist at the end of signal processing. This file represents the data from one region of a high-quality 2-region sequencing Run (GS FLX System), and the Table shows the size savings provided by the CWF compressed format. Each stream is described separately below.

Stream Name
mimetype
rawWellDensity.pgm
cfValues.double.dat
ieValues.double.dat
keyPassWellDensity.pgm
histogram.unfiltered.ATGC.pgm
histogram.unfiltered.TCAG.pgm
filterResults.uint8.dat
trimInfo.uint16.dat
histogram.filteredCounts.ATGC.pgm
histogram.nmers.ATGC.pgm
histogram.filteredCounts.TCAG.pgm
histogram.nmers.TCAG.pgm
0-149421.char.dna
0-149421.uint_8.flow
0-149421.uint_8.score
149422-288120.char.dna
149422-288120.uint_8.flow
149422-288120.uint_8.score
288121-424287.char.dna
288121-424287.uint_8.flow
288121-424287.uint_8.score
424288-468693.char.dna
424288-468693.uint_8.flow
424288-468693.uint_8.score
h0-41220.wel
h41221-82441.wel
h82442-123662.wel
h123663-164883.wel
h164884-206104.wel
h206105-247325.wel
h247326-288546.wel
h288547-329767.wel
h329768-370988.wel
h370989-412209.wel
h412210-453430.wel
signalPerBase.float.dat
h453431-468693.wel
baseCalledSeq.dat
sequences.xml
location.idx
meta.xml
filters.xml
metrics.xml
history.xml
TOTAL

Table 3: List of
39. eq>
</Sequence>
</Sequences>

Figure 12: Example sequences.xml stream

3.3.1.7 filters.xml

A list of filters referred to by the values in the filterResults.uint8.dat stream (see section 3.3.1.8, below). Note that the order of filters in this file is not guaranteed. It is also likely that the filters will be reorganized in a future release of the software to provide more detail. An example filters.xml file is shown in Figure 13.

<?xml version="1.0" encoding="utf-8"?>
<Filters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="GSDataProcessing-1.0.xsd">
<Filter basic="true"><ID>0</ID><Name>Pass</Name></Filter>
<Filter basic="true"><ID>1</ID><Name>No Key</Name></Filter>
<Filter basic="true"><ID>2</ID><Name>Bad Band</Name></Filter>
<Filter basic="true"><ID>3</ID><Name>Trimmed Too Short Quality</Name></Filter>
<Filter basic="true"><ID>4</ID><Name>Low Pass Filter</Name></Filter>
<Filter basic="true"><ID>5</ID><Name>Classifier Filter</Name></Filter>
<Filter basic="true"><ID>6</ID><Name>Dot Filter</Name></Filter>
<Filter basic="true"><ID>7</ID><Name>Mixed Filter<
40. er is located on the label found on the back of the instrument.

If you are located in / Please contact Roche Applied Science Technical Support via:
• USA or Canada: phone (toll-free) 1 800 262 4911; e-mail us.gssupport@roche.com
• Europe, Middle East, Mexico, South America or Africa: phone +49 8856 60 6457 or (toll-free) 800SEQUENCE; e-mail service.sequencing@roche.com
• Asia Pacific: phone (toll-free) 800 820 0577 (China, Mainland); 008 018 63123 (China, Taiwan); 800 966 851 (China, Hong Kong); 800 852 3686 (Singapore); 1800 814 958 (Malaysia); 180 064 5619 (Australia); 007 988 620647 (Korea); 001 800 861 0660 (Thailand); 1208 6101 (Vietnam); 1800 186 10007 (Philippines); e-mail asc.support@roche.com
• Japan: phone 03 5443 5287; e-mail tokyo.biochemicals@roche.com

1 WHAT'S NEW

1.1 454 Sequencing System Software Manual

1.1.1 Organization of the User Manual

The user manual for the 454 Sequencing System Software V 2.5/2.5p1 is divided into five parts, as it was in V 2.3. The section "Overview and File Formats", as well as Part B, Part C and Part D, which document the Data processing and Data analysis applications, have been updated to reflect the latest features and enhancements in version 2.5/2.5p1. These sections are common to both the GS FLX System and the GS Junior System, but have also been revised to reflect the differences between the two Systems. The sectio
41. es of the directories make it possible to differentiate individual sequencing Runs or post-Run analyses. This section describes the nomenclature conventions and file formats used by the software. Examples of the various file types are given in the Sections of this manual that describe each application in detail. Note that the content of this Section does not apply to the GS Amplicon Variant Analyzer software, whose output structure is completely distinct beyond the basic data processing files and folders.

3.1 Directory Naming Conventions

When a sequencing Run is performed on a Genome Sequencer FLX Instrument or a GS Junior Instrument, its results are placed in a Run folder, where the format of the Run name is generated by the software and includes the following components:

R_year_month_day_hour_minute_second_instrument_user_runname

A similar naming convention is used for the Data Analysis folder(s), which are deposited inside the corresponding Run folder by the Run-time data processing applications (either on the Genome Sequencer FLX Instrument or GS Junior Attendant PC, or on a DataRig). Data Analysis folders contain all the flow signal and signal processing files. Their names include the following components, except that instrument and user are not included when using the command line software:

D_year_month_day_hour_minute_second_instrument_user_analysisname

Finally, the GS De Novo Assembler and GS Reference Mapper applications create
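As an illustration of the Run and Data Analysis folder naming conventions of section 3.1, the following Python sketch splits a folder name into its prefix, timestamp and trailing components. It is a hedged example only: parse_folder_name is a hypothetical helper, the underscore separator is assumed from the "R_" and "D_" prefixes used elsewhere in this manual, and the folder name in the usage note is invented. The trailing components are returned unsplit, because instrument and user are omitted for command line analyses and names may themselves contain underscores.

```python
import re
from datetime import datetime

# ASSUMPTION: components are separated by underscores.
FOLDER_RE = re.compile(
    r"^(?P<kind>[RD])_"
    r"(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})_"
    r"(?P<hour>\d{2})_(?P<minute>\d{2})_(?P<second>\d{2})_"
    r"(?P<rest>.+)$"
)


def parse_folder_name(name):
    """Return (kind, timestamp, remainder), or None if the name does not fit."""
    match = FOLDER_RE.match(name)
    if match is None:
        return None
    timestamp = datetime(
        int(match.group("year")), int(match.group("month")),
        int(match.group("day")), int(match.group("hour")),
        int(match.group("minute")), int(match.group("second")),
    )
    return match.group("kind"), timestamp, match.group("rest")
```

For example, parse_folder_name("R_2004_09_22_16_59_10_rig1_user_run42") returns the kind "R", the timestamp 2004-09-22 16:59:10 and the remainder "rig1_user_run42".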
42. esults suitable for use by downstream data analysis applications. Data processing is done in two main steps: image processing and signal processing. The signal processing step, in turn, exists in two options: standard, for the sequencing of Rapid, cDNA or Paired End libraries; and an alternative set of algorithms especially tuned to process sequencing data from Amplicon libraries or test Runs performed with only Control DNA Beads. The data analysis phase offers a choice of several downstream analysis paths to generate the desired final output: a consensus sequence of the DNA sample, generated by the assembly of reads into contigs, with or without Paired End analysis to order and orient the contigs into scaffolds (GS De Novo Assembler); a consensus sequence along with a list of high-confidence differences, obtained by mapping the reads to a known reference sequence (GS Reference Mapper); or the identification and quantitation of sequence variants by the ultra-deep sequencing of amplicons (GS Amplicon Variant Analyzer). All data analysis outputs also include base-per-base quality scores (Phred-equivalent) and other specific metric files. Table 1 lists the inputs and outputs of the three main early components of data handling, from data acquisition through data processing, as well as the individual functions carried out by each application. It is important to note that the preferred scheme for the execution of data processing is different between the G
43. homopolymer stretch in the sequence (e.g. in "AAATGG", "AAA" is a 3-mer stretch of A, "T" is a 1-mer stretch of T and "GG" is a 2-mer stretch of G). A basecalled sequence is then derived by converting each estimate into a homopolymer stretch of that length and concatenating the homopolymers.

The .sff file format consists of three sections: a common header section occurring once in the file; then, for each read stored in the file, a read header section and a read data section. The data in each section consists of a combination of numeric and character data; the specific fields for each section are defined below. The sections adhere to the following rules:
• The standard Unix types uint8_t, uint16_t, uint32_t and uint64_t are used to define 1-, 2-, 4- and 8-byte numeric values.
• All multi-byte numeric values are stored using big-endian byte order (same as the SCF file format).
• All character fields use single-byte ASCII characters.
• Each section definition ends with an "eight_byte_padding" field, which consists of 0 to 7 bytes of padding, so that the byte length of each section is divisible by 8 (and hence the next section is aligned on an 8-byte boundary).

3.3.8.1 Common Header Section

The common header section consists of the following fields:

Field name, Format, Properties:
magic_number, uint32_t: The magic_number field value is 0x2E736666, the uint32_t encoding of the string ".sff".
version, char[4]: The version number is 0001
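Two of the rules above, big-endian storage and eight-byte padding, can be checked with a few lines of Python. This is a small sketch rather than part of the SFF Tools; the helper names padding_length and looks_like_sff are hypothetical.

```python
import struct

SFF_MAGIC = 0x2E736666


def padding_length(section_length):
    """Number of padding bytes (0 to 7) needed to reach a multiple of 8."""
    return -section_length % 8


def looks_like_sff(first_bytes):
    """True if the data starts with the SFF magic number, read big-endian."""
    if len(first_bytes) < 4:
        return False
    (magic,) = struct.unpack(">I", first_bytes[:4])
    return magic == SFF_MAGIC


# Packed big-endian, the magic number is the four ASCII bytes of ".sff".
assert struct.pack(">I", SFF_MAGIC) == b".sff"
```

Reading the same four bytes with a little-endian format ("<I") would give a different integer, which is a quick way to detect a byte-order mistake in a reader.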
44. imum brightness, etc.) have been modified for viewing images captured from the GS Junior Instrument.
• Low-resolution versions of captured images are now stored in the composite well (.cwf) files. These images can be viewed in the GS Run Browser if the original captured images are removed from the data set.
• Several bugs have been fixed.

1.2.4 GS De Novo Assembler and GS Reference Mapper

1.2.4.1 Graphical User Interface
• The interface for adding read data files to a project has been improved.
• The GS Reference Mapper HC Diffs and Structural Variants sub-tabs now support the ability to export the table, in its current sort order, in tsv, csv or png format.
• The GS Reference Mapper Reference Status and Gene Status sub-tabs now support the ability to export the table, in its current sort order, in tsv, csv or png format.
• The Minimum Overlap Length parameter on the Parameters, Computation sub-tab can now be entered as either a number of bases or a percent read length.
• The "zoom to selected" button on the Alignment Results sub-tab now functions even if no column is selected. In this case, the zoomed-in view is centered on the center column of the zoomed-out view.
• Several bugs have been fixed.

1.2.4.2 Command Line Options
• The command line option "-force" has been added. This option is used with the newMapping, newAssembly, runMapping and runAssembly commands, and is now required to overwrite an existing project directory.

1.2.4.3 Output Files
45. ld per well, and each field is made up of the packed structure shown in Figure 10.

[Figure 10 layout, fields in order: P; Sequence ID; Rank (unsigned integer); X Coordinate (integer part, fraction); Y Coordinate (integer part, fraction); Reserved]

Figure 10: Packed structure of the location.idx file
• P is a bit showing if the well has passed filtering (1) or has been discarded (0).
• Sequence ID is an index to the table contained in sequences.xml (see below), showing which sequence, if any, is matched.
• The reserved field is for future use and should be set to all 0's.
• The X and Y coordinates are stored with two bits of fractional information. This allows the storage of well coordinates with sub-pixel resolution in the CWF file format. Applications accessing the coordinates may simply choose to map bytes 5-6 and 7-8 to 16-bit integers and divide by 4, preserving or discarding the subpixel data as needed. Also note that all coordinates are relative to a common (0,0) location representing the upper left-hand corner of the PicoTiterPlate device (the corner opposite the DataMatrix code), no matter what the region. In other words, the coordinate reflects the actual location of the well on the original PicoTiterPlate device, and not the offset into the region itself.

3.3.1.5 metrics.xml

This file contains all the derived statistics created during data proce
46. lds and specific metrics.

GS Reference Mapper
Input: flowgrams, basecalls and per-base quality scores.
Output: Sample consensus sequence mapped to a reference sequence, and list of differences.
Main processing steps:
• For each read, search for alignment(s) to the reference sequence (in nucleotide space).
• Construct contigs and compute a consensus basecall sequence from the signals of the aligned reads (flowspace).
• Identify the positions where the consensus sequence (or subsets of the reads that comprise it) differ from the reference sequence, or reads from one another; these are the putative differences.
• Evaluate the putative differences to identify high-confidence differences.
• Output contig consensus sequence(s) and corresponding quality scores, an ACE file of the multiple alignments of the reads and contigs to the reference, the list of identified differences, and mapping metrics files.

GS Amplicon Variant Analyzer
Output: Identity and quantitation of sequence variants.
Main processing steps:
• Trim reads (remove primer sequences).
• Assign reads to Samples (demultiplex datasets).
• Align Sample reads to their reference sequences.
• Quantitate variant frequency for each Sample.

Table 2: The 3 applications of the data analysis phase of the 454 Sequencing System, with their inputs, outputs, and main processing steps. Note that all data analysis applications use as input the reads and flowgrams output in SFF format by the data processing (GS Run Processor) application.
47. ll alphabetic characters are converted to uppercase, and any alphabetic character that is not A, C, G or T will be treated as an N.
• Multiple sequences may be included in the file, each starting with a ">".
• Only the characters between the ">" and the first whitespace character are used to identify the sequence (i.e. the "accno" for the sequence). For clarity, each sequence in a project should be identified uniquely within the characters prior to the first whitespace in their respective descriptor lines.

For example:

>Ecolik12 4300K bp
CCTTGTGCAGTAGCACTTAATCATCATGTTTTAGCATTTTGATCTTCTGCTCAATTTCT
AAGCTAGACGCTCAATCTTCTTATGATGAACGATTTCTTCTTCATGGTGTTTTTTCATA

3.3 Standard File Formats

Most of the file formats are specific to the type of data being stored, such as the image files or the wells data files. Other files adhere to standard formats used throughout the generation and processing of sequencing Run data, assembly data, and mapping data. Example files in many of these formats are provided in the Output sub-sections of the applications' descriptions.

3.3.1 Composite wells file format

The CWF file is a container format which stores multiple streams of information. The container itself is a ZIP file (http://www.pkware.com/documents/casestudies/APPNOTE.TXT) with a single-level hierarchy. Each stream is named and compressed separately, allowing for rapid access to any informa
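Because the CWF container is described as an ordinary single-level ZIP file, its streams can be listed and read with a standard ZIP library. The Python sketch below illustrates this with the standard zipfile module. The function names are hypothetical, and the sketch assumes nothing about the contents of any particular stream.

```python
import zipfile


def list_cwf_streams(cwf):
    """Map each stream name to its (compressed, uncompressed) size in bytes.

    cwf may be a path or a seekable binary file object.
    """
    with zipfile.ZipFile(cwf) as archive:
        return {
            info.filename: (info.compress_size, info.file_size)
            for info in archive.infolist()
        }


def read_cwf_stream(cwf, stream_name):
    """Return the uncompressed bytes of a single named stream."""
    with zipfile.ZipFile(cwf) as archive:
        return archive.read(stream_name)
```

Since each stream is compressed separately, read_cwf_stream only decompresses the requested entry; the other streams in the container are not touched.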
48. m type="signalPerBase">
<StreamName>signalPerBase.float.dat</StreamName>
<DataType>float</DataType>
</Stream>
</Streams>
<Other>
<NukeSignalStrengthBalancer>
<medianOneMerA>1.09085</medianOneMerA>
<medianOneMerT>0.909846</medianOneMerT>
<medianOneMerG>0.898174</medianOneMerG>
<medianOneMerC>0.894138</medianOneMerC>
</NukeSignalStrengthBalancer>
<BlowByCorrector>
<droopLambda>0.00171434</droopLambda>
<MedianSignal>1375.71</MedianSignal>
<MaximumSignal>5086.11</MaximumSignal>
<MedianDensity>12</MedianDensity>
<MinimumDensity>1</MinimumDensity>
<MaximumDensity>19</MaximumDensity>
<num_low_density_low_signal_wells>14742</num_low_density_low_signal_wells>
<num_high_density_low_signal_wells>10481</num_high_density_low_signal_wells>
<num_low_density_high_signal_wells>13645</num_low_density_high_signal_wells>
<num_high_density_high_signal_wells>14899</num_high_density_high_signal_wells>
<mask_averaging_used>true</mask_averaging_used>
<FinalMask>
<class density="high" signal="high" class="0">
<epsilon>0.174537</epsilon>
<beta>0.964658</beta>
</class>
<class density="low" signal="high" class="1">
<epsilon>0.188713</epsilon>
<beta>0.988837</beta>
</class>
<clas
49. ment via a Graphical User Interface (GUI) or from the command line, on a DataRig rather than on the Genome Sequencer FLX Instrument. For the GS Junior System, this can also be performed on a separate computer resource (a DataRig), but it is typically performed on the Attendant PC, though still separately from the data acquisition and data processing. As a consequence of this separation, the result files generated by these applications are not deposited in a Run folder. For the GS De Novo Assembler and the GS Reference Mapper, one of the following will apply instead:
• A folder with a "P_" prefix (for Post-Run Analysis) is created to receive them in the user's current working directory on the DataRig or Attendant PC at the time the application is launched, or they are written to a directory specified by the user via the application's GUI or on the command line.
• Mapping and Assembly can also be carried out in a project-based fashion, whereby datasets can be added to existing results, or a new reference sequence can be specified for an existing Assembly or Mapping project. This uses the corresponding application's GUI, or can be done using the newAssembly, newMapping and associated commands; the data is then stored in a Project folder. A Project folder is identified by the 454Project.xml file it contains.
The folder and file structures generated for each of these commands (or GUI equivalents) are shown in the Figures b
50. ml file to find their data, and not depend on a particular file naming convention. See the Sections on Other Streams (3.3.1.10 and 3.3.1.11) for information on the data types and stream names. An example metrics.xml file is shown in Figure 11.

<?xml version="1.0" encoding="utf-8"?>
<Metrics xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="GSDataProcessing 1.0.xsd">
  <RunMetrics>
    <MaxWellCount>529820</MaxWellCount>
    <RawWellCount>468698</RawWellCount>
    <SampleKeyPassWellCount>451747</SampleKeyPassWellCount>
    <ControlKeyPassWellCount>7818</ControlKeyPassWellCount>
    <ControlKeys>
      <Key>ATGC</Key>
    </ControlKeys>
    <SampleKeys>
      <Key>TCAG</Key>
    </SampleKeys>
    <Streams>
      <Stream type="rawWellDensity">
        <StreamName>rawWellDensity.pgm</StreamName>
        <DataType>image</DataType>
      </Stream>
      <Stream type="carryForwardCorrections">
        <StreamName>cfValues.double.dat</StreamName>
        <DataType>double</DataType>
      </Stream>
      <Stream type="incompleteExtensionCorrections">
        <StreamName>ieValues.double.dat</StreamName>
        <DataType>double</DataType>
      </Stream>
      <Stream type="filterResults">
        <StreamName>filterResults.uint8.dat</StreamName>
        <DataType>byte</DataType>
      </Stream>
      <Strea
51. n that documents Data acquisition (Part A) has been divided into two system-specific versions. This reflects the differences between the two instruments themselves, and also whether the applications are installed on-instrument or off-instrument (see section 1.2, below).

1.1.2 Formats of the User Manual
As was done in V 2.3, the user manuals are available in PDF format (print-friendly). For V 2.5 / 2.5p1, they are also available as electronic help files that can be launched from within the applications via the Help button, or accessed on the my454.com web site. The e-manual format offers various convenient navigation aids (from the TOC frame, in-text cross-links, index and search field), pop-up definitions of glossary terms, and other new features.

1.2 454 Sequencing System applications

1.2.1 GS Junior Sequencer
The Data acquisition application, GS Junior Sequencer, has been re-written for the new GS Junior System. While the overall functionality is similar to the GS Sequencer application for the GS FLX System, the instrument sensors and the Run Wizard used to set up sequencing Runs and other procedures have changed to reflect differences in the GS Junior System:
• The number of instrument sensors has been reduced to three: Heater temperature, CCD temperature and Enzyme Chiller temperature.
• Choices for Sequencing kits and PTP types are not present.
• The choices for number of nucleotide cycles are 42, 100 and 200 cycles.
• There are three se
52. nal Processing

[Figure 1, block text: GS Reference Mapper: consensus sequence as contigs mapped to a reference sequence (resequencing), with per-base quality scores, an ACE file of the multi-alignment, and a list of high-confidence mutations (differences between the sample consensus and the reference). Invoked via the GS Run Browser or command line; performed off-line, on a DataRig or a cluster. GS Amplicon Variant Analyzer: identification and quantitation of sequence variants in an Amplicon library (requires special Amplicon Signal Processing).]

Figure 1: Data processing options in the GS Junior and GS FLX Systems. The blocks identify the various data acquisition, data processing or data analysis applications, and their outputs. Raw images are captured as part of the sequencing Run (data acquisition). Depending on the processing type selected during Run set-up (top of each column), the image processing and signal processing steps can either be performed as part of the sequencing Run (above the dotted line) or they can be carried out separately, following the Run, using the off-instrument version of the software on a DataRig or a cluster (below the dotted line). Note the two options for signal processing: the standard algorithm is for sequencing of General, Rapid, Rapid cDNA or Paired End libraries, and an alternative one exists for Amplicon sequencing and test Runs. The preferred data processing path for the GS FLX System is highlighted. Data analysis applications G
53. nalysis of any kind). In all cases, however, the output DNA sequence is supplied as a set of FASTA files with associated Quality Scores, and other Run and data metrics files useful for troubleshooting and determining the overall quality of the sequencing Run. ACE-formatted files are also produced by each of the data analysis applications, to allow users to view alignment results using third-party software tools.

Application: GS De Novo Assembler
Input: SFF files from one or multiple sequencing Runs containing read
Output: Sample consensus sequence assembled de novo, and scaffold information (with Paired End option)
Application processing:
• Identify pairwise overlaps between reads (in nucleotide space)
• Construct multiple alignments of reads that tile together (i.e. form contigs) based on the pairwise overlaps
• Generate consensus basecalls of the contigs by averaging the processed flow signals for each nucleotide flow included in the alignment (in flowspace)
• Output the contig consensus sequences and corresponding quality scores, along with an ACE file of the multiple alignments and assembly metrics files
Additional steps with Paired End option:
• Identify pairwise overlaps between Paired End tags and the shotgun contigs
• Organize the contigs into scaffolds (order, orientation and approximate distance)
• Output the scaffolded consensus sequences and corresponding quality scores, along with an AGP file of the scaffo
54. ng reactions in a massively parallel fashion (tens to hundreds of thousands of simultaneous reactions) in the wells of a PicoTiterPlate device. During DNA-directed DNA synthesis, pyrophosphate (PPi) is released with each nucleotide addition; the system's chemistry generates an amount of light commensurate with the amount of PPi released; this light is captured by a charge-coupled device camera and converted into a digital signal. For more information on the basics of the 454 Sequencing System, please refer to the GS Junior Instrument Owner's Manual or the Genome Sequencer FLX Instrument Owner's Manual.

Note: The 454 Sequencing System software is fully backward compatible with, and can process datasets generated under, any of the 454 Sequencing System's chemistries (GS 20 chemistry, GS FLX standard chemistry, GS FLX and GS Junior Titanium chemistry). This manual describes all the functionalities of the software, even those that apply only to datasets generated with older chemistries. Note, however, that it comprises a separate Part A for the GS Junior System and the GS FLX System.

2.1 Data Acquisition, Data Processing and Data Analysis
Data handling in the 454 Sequencing System occurs in three main phases:
• Data Acquisition
• Data Processing
• Data Analysis
Each phase is governed by one or more specific applications. The data acquisition phase occurs during a sequencing Run on the GS Junior Instrument or the Genome Sequencer FLX Instrument
55. o do this, the user would select the "Image processing only" processing type. This minimizes the amount of time the Genome Sequencer FLX Instrument is busy processing data and thus unavailable for another sequencing Run. (The image processing step, by contrast, takes place concurrently with image acquisition.)
• In some cases, the user will elect to include both the image and signal processing steps in the sequencing Run (select one of the "Full processing" types). This is the simplest option from the standpoint of instrument operation, as all data processing up to the generation of read flowgrams and basecalling of the reads is carried out during the Run, without user intervention. However, the on-instrument computing resources can take up to 80 hours to process a 200-cycle Run performed with the GS FLX Titanium chemistry. When this is complete, the user can proceed with the post-Run analysis step(s) (which must await the acquisition of all the read data), as appropriate for the experiment.
• An alternate "Full processing for Amplicons" processing type exists in the GS FLX System, which uses data processing algorithms that are specially tuned for Amplicon sequencing; this option should be selected if you are sequencing an Amplicon library and you want to carry out the full Run-time data processing on it. It is also used for test Runs that use only Control DNA beads. If "Image processing only" is carried out during the sequencing Run, a signal proce
56. of the GS Junior Sequencer application, see Part A of this manual, Section 2 in the GS Junior System version or Section 3 in the GS FLX System version.

2.3.1 Data Acquisition (GS Sequencer and GS Junior Sequencer) Results: the Run Folder
The organization of a generic Run folder (R_) is depicted in Figure 2. All the raw data (raw images, log files, etc.) remain in temporary local storage on the Genome Sequencer FLX Instrument or the GS Junior Attendant PC in case the user chooses to re-analyze them, e.g. using the reanalysis function of the GS Run Browser (see Part B, Section 3 of this manual). In addition, if Backup is selected during Run set-up, raw and processed data files from the Run can be transferred to a network location specified by the System Administrator, for long-term storage.

R_yyyy_mm_dd_hh_min_sec_machineName_userName_uniqueRunName
    dataRunParams.parse
    imageLog.parse
    PTP_flowOrder_cycleCount.icl
    runlog.parse
    rawImages
        00001.pif
        00002.pif
        00003.pif

Figure 2: General organization of a Run folder. On the Genome Sequencer FLX Instrument or the GS Junior Attendant PC, this is located inside data/date/, while on a DataRig or a cluster it can be placed anywhere. Words in italics are generic.

Several other files or directories may appear in the R_ directory, which are created for internal use of the software. They are not described in this manual (and may indeed be reorganized or eliminated in fu
57. one">
    <ID>0</ID>
    <Name>unknown</Name>
    <Key></Key>
    <Seq></Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>1</ID>
    <Name>ATGC control</Name>
    <Key>ATGC</Key>
    <Seq>ATGC</Seq>
  </Sequence>
  <Sequence Type="Library">
    <ID>2</ID>
    <Name>TCAG key</Name>
    <Key>TCAG</Key>
    <Seq>TCAG</Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>3</ID>
    <Name>TF2LonG</Name>
    <Key>ATGC</Key>
    <Seq>ATGCCA...TGTGTG</Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>4</ID>
    <Name>TF7LonG</Name>
    <Key>ATGC</Key>
    <Seq>ATGC...TTCCTGTGTG</Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>5</ID>
    <Name>TF90LonG</Name>
    <Key>ATGC</Key>
    <Seq>ATGCCGCA...GTGTG</Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>6</ID>
    <Name>TF100LonG</Name>
    <Key>ATGC</Key>
    <Seq>ATGCAT...GTGTG</Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>7</ID>
    <Name>TF120LonG</Name>
    <Key>ATGC</Key>
    <Seq>ATGCA...CCTGTGTG</Seq>
  </Sequence>
  <Sequence Type="Control">
    <ID>8</ID>
    <Name>TF150MMP7A</Name>
    <Key>ATGC</Key>
    <Seq>ATGCGC...ATGG</Seq>
  </S
58. pecified by the user. All input SFF files used in the assembly are organized in the sff sub-directory. (Files marked as such are produced only when the Paired End option is used; the Paired End option also adds sections to the 454NewblerMetrics.txt file.) See Part C, Section 1 of this manual for full file descriptions.

current working directory, or directory specified
    P_yyyy_mm_dd_hh_min_sec_runMapping
        454AllContigs.fna
        454AllContigs.qual
        454LargeContigs.fna
        454LargeContigs.qual
        454AllDiffs.txt
        454HCDiffs.txt
        454NewblerMetrics.txt
        454NewblerProgress.txt
        454MappingQC.xls
        454AlignmentInfo.tsv
        454PairAlign.txt
        454ReadStatus.txt
        454RefStatus.txt
        454Contigs.ace (or ace/refaccno.ace, or consed/)
        sff/ (.sff files)

Figure 5: File output of the GS Reference Mapper application when using the runMapping command or its GUI equivalent. All result files (specifying the actual reference sequence accession numbers) are placed in a folder with a "P_" prefix within the user's current working directory when running the command, or in a directory specified by the user. All input SFF files used in the mapping are organized in the sff sub-directory. See Part C, Section 2 of this manual for full file descriptions.

current working directory, or directory specified
    P_yyyy_mm_dd_hh_min_sec_runAssembly
        assembly
            454AllContigs.fna
            454AllContigs.qual
            454LargeContigs.fna
            454LargeContigs.qual
            454NewblerMetrics.txt
            454NewblerProgress.txt
59. pes) prepared from the same DNA sample to be analyzed together with Shotgun sequencing Run(s), and help order and orient the resulting contigs into scaffolds. (Paired End reads do not necessarily need to be analyzed together with Shotgun reads.)
2. The GS Reference Mapper application generates the consensus DNA sequence by mapping (aligning) the reads to a reference sequence, as well as a list of high-confidence differences (individual bases or blocks of bases that differ between the consensus DNA sequence of the sample and the reference sequence). Robust cDNA analysis is also available.
3. The GS Amplicon Variant Analyzer application compares reads from an Amplicon library to corresponding reference sequences and allows the user to detect, identify and quantitate the prevalence of sequence variants.

It may take multiple Runs to generate enough data for a given sequencing project, e.g. a project requiring several-fold depth sequencing of a large genome. In such cases, the data sets of all the Runs can be combined at the time of data analysis. Furthermore, contig consensus calling in mapping and assembly is carried out in flowspace, i.e. it operates directly on the processed signals measured from the wells, followed by basecalling to produce a consensus sequence for the sample. The final output of the 454 Sequencing System thus varies depending on what kind of analysis is performed (Assembly, Mapping, or Amplicon Variant Analysis; or no a
60. pplication:
1. For individual reads, the description lines are formatted as ">rank_x_y length=XXbp uaccno=accession", where rank_x_y is the identifier (or accession number) of the read (the rank, x and y values are as described in section 3.3.4), XXbp is the length in bases of the read, and accession is the full universal accession number for the read.
2. For contigs generated by the GS De Novo Assembler application, the description lines are formatted as follows: ">contigXXXXX length=abc numReads=xyz", where contigXXXXX is the identifier of the contig (XXXXX being a sequential numbering of the contigs in the assembly), and where the length and numReads values are the length in bases of the contig and the number of reads that were used in that contig's multiple alignment.
3. For contigs generated by the GS Reference Mapper application, the description lines are formatted as follows: ">contigXXXXX refaccno YYY ZZZ length=abc numReads=xyz", where contigXXXXX is the identifier of the contig (XXXXX being a sequential numbering of the contigs along the reference), refaccno is the accession of the reference sequence where this contig aligns, YYY and ZZZ are the start and end positions of the contig on that reference sequence, and the length and numReads values are the length in bases of the contig and the number of reads that were used in that contig's multiple alignment.

3 3 454 Universal Accession Numbers
The standa
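The description-line layouts above lend themselves to a small parser. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that the length, numReads and uaccno fields are written as key=value tokens (the punctuation is not fully recoverable from the text above). Tokens that do not follow that pattern, such as the reference accession and positions in mapping contigs, are collected unparsed.

```python
def parse_description(line):
    """Split a FASTA description line into an identifier and its fields.

    Assumption: fields appear as whitespace-separated key=value tokens.
    """
    if not line.startswith(">"):
        raise ValueError("description lines start with '>'")
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("missing sequence identifier")
    info = {"id": tokens[0], "fields": {}, "other": []}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            # Not key=value (e.g. reference accession or positions).
            info["other"].append(token)
        elif key in ("length", "numReads"):
            # Read lengths carry a "bp" suffix; contig lengths do not.
            digits = value[:-2] if value.endswith("bp") else value
            info["fields"][key] = int(digits)
        else:
            info["fields"][key] = value
    return info


contig = parse_description(">contig00012 length=1523 numReads=42")
read = parse_description(">003048_1034_0651 length=120bp uaccno=C3U5GWL01CBXT2")
```

The example identifiers and values are made up for illustration; only the field names come from the text above.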
61. quencing Run processing types: None, Full processing for Shotgun or Paired End, and Full Processing for Amplicons. Data management and Configuration options are similar to the GS Sequencer application for the GS FLX System.

1.2.2 GS Run Processor
When performing on-the-fly analysis on the GS Junior Attendant PC, the "Image Processing Only" option is not present. The Attendant PC is sufficiently powerful to process an entire GS Junior System sequencing Run in one step. Data processing jobs starting on the GS Junior Attendant PC during sequencing will not interrupt currently running jobs. The Attendant PC is sufficiently powerful to perform a sequencing Run and a processing job simultaneously. Global signal droop correction (correction for signal reduction during a Run) is enabled for all non-amplicon sequencing Runs. A recursive form of CAFIE (CArry Forward & Incomplete Extension) is enabled for all sequencing Runs performed on the GS Junior Instrument. For the GS FLX System, there is a new and improved algorithm for finding regions. Several bugs have been fixed.

1.2.3 GS Run Browser
When opening a project, the user now has the option of opening individual CWF and Wells Files in addition to Run Data sets and Processor Data sets. Data export files no longer use a .xls extension. They are still tab-delimited text files, but are now saved with a .txt or .csv extension. Several image viewing parameters (minimum brightness, max
62. r FLX Instrument or the GS Junior Attendant PC, grouped by date, where each date folder contains individual Run folders. Each Run folder is identified by the Run name specified during Run set-up, i.e. with an "R_" prefix, denoting "Run" (see first Note, below). Run folders contain the raw data of the sequencing Run, i.e. the results of the GS Sequencer or GS Junior Sequencer application, as well as any Data Processing folder(s) for that Run, identified by a "D_" prefix, for "Data". Data Processing folders in turn contain the results of the image processing and/or signal processing steps.

Note: tmp sub-directory required
The data directory in the Genome Sequencer FLX Instrument on-instrument computer or the GS Junior Attendant PC contains a tmp sub-directory. Do not delete the tmp sub-directory, as it is required for the proper functioning of the instrument software.

Because the data analysis applications are typically performed on a pool of sequencing Runs rather than on any single Run, their results are not associated with a specific sequencing Run or data processing invocation; rather, assembly, mapping and amplicon variant analysis results are deposited in the current working directory at the time the application is launched, or written to a directory specified by the user (see section 2.3.3). The SFF Tools commands also usually deposit their output into the current working directory or a directory specified on the command
63. rd 454 read identifiers used in 454 Sequencing System data analysis software versions prior to 1.0.52 (early GS 20 System) have the format "rank_x_y", as in "003048_1034_0651", where rank is a ranking of the well in a region by signal intensity, and x and y are the pixel location of the well's center on the sequencing Run images. This identifier is guaranteed to be unique only within the context of a single sequencing Run, and may or may not be unique across specific sets of Runs. To allow for the combination of reads across larger data sets, an accession number format with stronger uniqueness has been developed. An accession in this format is a 14-character string, as in "C3U5GWL01CBXT2", and consists of 4 components:
• "C3U5GW": a six-character encoding of the timestamp of the Run
• "L": a randomizing "hash" character to enhance uniqueness
• "01": the region the read came from, as a two-digit number
• "CBXT2": a five-character encoding of the X,Y location of the well
The timestamp, hash character and X,Y location use a base-36 encoding, where the values 0-25 are the letters "A"-"Z" and the values 26-35 are the digits "0"-"9". An accession thus consists only of letters and digits, and is case-insensitive.
• The timestamp is encoded by computing a total value as shown below, then converting it into a base-36 string:
total = (year - 2000) * 13 * 32 * 24 * 60 * 60 + month * 32 * 24 * 60 * 60 + day * 24 * 60 * 60 + hour * 60 * 60 + minut
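As an illustration of the accession layout and base-36 alphabet described in this excerpt, the following Python sketch splits a 14-character accession into its four components and converts a base-36 string to an integer. It is a hedged sketch rather than the software's own routine: the function names are hypothetical, and it deliberately stops at the raw integer value, because the formulas that map that value back to a timestamp or an X,Y location are not fully reproduced in the text above.

```python
# Alphabet from the text: values 0-25 are A-Z, values 26-35 are 0-9.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def decode_base36(text):
    """Convert a base-36 string (454 alphabet) to an integer.

    Accessions are case-insensitive, so input is uppercased first.
    Raises ValueError for characters outside the alphabet.
    """
    value = 0
    for ch in text.upper():
        value = value * 36 + ALPHABET.index(ch)
    return value


def split_accession(accession):
    """Split a 14-character accession into its four components."""
    if len(accession) != 14:
        raise ValueError("a universal accession has 14 characters")
    acc = accession.upper()
    return {
        "timestamp": acc[0:6],    # six-character Run timestamp encoding
        "hash": acc[6],           # randomizing hash character
        "region": int(acc[7:9]),  # two-digit region number
        "location": acc[9:14],    # five-character X,Y location encoding
    }


parts = split_accession("C3U5GWL01CBXT2")
location_value = decode_base36(parts["location"])
```

Note that this alphabet differs from the common convention in which digits come first, so Python's built-in int(text, 36) would give different values.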
64. rther releases of the software); they may include the following:
• Additional files:
  o aaLog
  o fpgaReadWriteLog.bin
  o flowCalibrtionLog.nfc
  o tempControlLog.ntc
  o dmesg.txt
  o debugMessageLog.txt
• Additional directories:
  o Calibrate
  o prime
  o prewash

2.3.2 Data Processing (GS Run Processor) Results: the Data Processing Folder
The organization of a generic Data Processing folder (D_) is depicted in Figure 3. D_ folders are created by the GS Run Processor application, within the R_ folder of the sequencing Run whose data is being processed. Since a dataset can be re-processed multiple times via the GS Run Browser (see Part B, Section 3 of this manual), a given R_ folder can contain multiple D_ folders. To the extent that they are generated on-instrument (per the processing type selected; see section 2.2), all the processed data (basecalls and quality scores, Run metrics, log files, etc.) remain in temporary local storage on the Genome Sequencer FLX Instrument or the GS Junior Attendant PC. In addition, if Backup is selected during Run set-up, raw and processed data files from the Run can be transferred to a network location specified by the System Administrator, for long-term storage. See Part B, Section 1 for full file descriptions and for more information on the GS Run Processor application. The Genome Sequencer FLX Instrument or the GS Junior Attendant PC processes the sequencing data "on the fly", i.e. the data is processe
65. s density="high" signal="low" class="2">
            <epsilon>0.171184</epsilon>
            <beta>0.950913</beta>
          </class>
          <class density="low" signal="low" class="3">
            <epsilon>0.184087</epsilon>
            <beta>0.900547</beta>
          </class>
        </FinalMask>
      </BlowByCorrector>
      <CafieCorrector>
        <droopLambda>0.00158241</droopLambda>
      </CafieCorrector>
      <NukeSignalStrengthBalancer>
        <medianOneMerA>1.01745</medianOneMerA>
        <medianOneMerT>0.986296</medianOneMerT>
        <medianOneMerG>0.987408</medianOneMerG>
        <medianOneMerC>0.980024</medianOneMerC>
      </NukeSignalStrengthBalancer>
    </Other>
  </RunMetrics>
</Metrics>

Figure 11: Example metrics.xml stream

3.3.1.6 sequences.xml
A list of sequences referred to by the Sequence ID field in the locations.idx node. This stream can also be used to identify which sequences denote Control DNA reads and which are library reads. By convention, library keys are named "ATGC library", where ATGC is the four-letter library key. An example sequences.xml file is shown in Figure 12. Note that the sequences were truncated (ellipses on the Figure) for brevity; the full sequence of each Control DNA is included in the actual sequences.xml file.

<?xml version="1.0" encoding="iso-8859-1"?>
<Sequences>
  <Sequence Type="N
66. ssing. This file can be used to create all the ancillary output files, including all the reports that were produced by the 454 Sequencing System software versions earlier than 2.0.00. The data is divided into various sections, of four main types:
• The first is the header section, which contains metrics that are valid across all keys and sequences.
• The next sections are the MetricsPerKey blocks, each containing metrics that cover one key, either library or control. There will be one MetricsPerKey block per key used in the experiment.
• The next sections are the MetricsPerSequence blocks. The metrics for each Control DNA sequence are contained in separate blocks.
• The Other block is a free-form container for metrics that may be used by Roche and 454 troubleshooters to evaluate problematic sequencing Runs. These metrics can and will change between releases of the software, and therefore users should not depend upon them for the assessment of Runs.
An additional block, the Streams block, acts as a manifest for data stored in auxiliary streams inside the CWF file. Each data stream is individually tagged with a type identifier. The Streams block contains the exact stream name and the type of the binary data contained in each stream. While the exact stream name and data type may change with future releases of the software, the type name will remain constant. Users implementing CWF file readers should use the streams information contained in metrics.x
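The recommendation to look streams up by their type identifier, rather than by file name, can be sketched in Python as follows. This is an illustrative sketch, not part of the 454 software: the function name is hypothetical, and the element and attribute names (Stream, type, StreamName, DataType) are assumptions taken from the example metrics.xml figure in this manual, as is the small sample document.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the example metrics.xml figure (assumed layout).
SAMPLE = """<Metrics><RunMetrics><Streams>
<Stream type="rawWellDensity">
  <StreamName>rawWellDensity.pgm</StreamName><DataType>image</DataType>
</Stream>
<Stream type="filterResults">
  <StreamName>filterResults.uint8.dat</StreamName><DataType>byte</DataType>
</Stream>
</Streams></RunMetrics></Metrics>"""


def stream_manifest(metrics_xml):
    """Map each stream type to its (stream name, data type) pair."""
    root = ET.fromstring(metrics_xml)
    manifest = {}
    for stream in root.iter("Stream"):
        manifest[stream.get("type")] = (
            stream.findtext("StreamName"),
            stream.findtext("DataType"),
        )
    return manifest


manifest = stream_manifest(SAMPLE)
# Look the stream up by its stable type name, never by a hard-coded file name.
name, data_type = manifest["filterResults"]
```

Keying the lookup on the type identifier keeps a reader working if a later software release renames the stream or changes its data type.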
67. ssing for Amplicons option is available off-instrument.
• Finally, the user can choose to only acquire the raw images during the Run (select the "No processing" processing type). If this is selected, all other steps of data processing and analysis can be carried out separately, afterwards. This provides users with maximum flexibility in data handling, to conform to their desired processing architecture.

As shown in Figure 1, data processing or data analysis applications that are carried out separately from the sequencing Run (i.e. not determined by the processing type selection) are invoked via the GS Run Browser or at the command line level. This must be done off-instrument, on a separate computer called a DataRig, or on a computer cluster. The GS Run Processor will produce comparable results whether it is run on-instrument or off-instrument.

The GS De Novo Assembler, GS Reference Mapper and GS Amplicon Variant Analyzer applications, and the SFF Tools commands, are always run separately from the sequencing Run, i.e. off-instrument, on a DataRig or a computer cluster. The rationale for this is that these applications either are not usually applied to individual Runs but rather draw on multiple Runs, or require additional information beyond the Run data, such as a reference genome against which to map the sequencing reads. In addition, these applications can take hours to complete, during which time the Genome Sequencer FLX Instrument would not be
68. t of the block, and is equal to 2 bytes times the number of wells times the number of flows. The number of wells is derived from the name of the stream, and the number of flows can be retrieved from the meta.xml stream. Values are stored Little Endian (byte order as generated by Intel-brand x86 processors):
  o X coordinates as unsigned 16-bit numbers
  o Y coordinates as unsigned 16-bit numbers
  o flow ranks as unsigned 32-bit numbers
  o flow values as half-precision binary floating point numbers
The half-precision floating point is a relatively new binary floating point format that uses 2 bytes and which is not covered by the IEEE 754 standard for encoding floating point numbers, but is included in the IEEE 754r proposed revision (http://www.validlab.com/754H). The format uses 1 sign bit, a 5-bit excess-15 exponent, 10 mantissa bits (with an implied 1 bit), and all the standard IEEE rules. The minimum and maximum representable values are 2.98x10^-8 and 65504, respectively. Libcwf includes a half-to-full precision floating point conversion routine.

f. Full precision floating point
Each block is made up of four arrays stored back-to-back without padding. The size of the first three arrays is equal to the data type size times the number of wells in the block. The last array consumes the rest of the block, and is equal to 4 bytes times the number of wells times the number of flows. The number of wells is derived from the name of the stream and th
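The half-precision layout described above (1 sign bit, a 5-bit excess-15 exponent, 10 mantissa bits with an implied leading 1, little-endian storage) can be decoded by hand as in the following Python sketch. It is an illustration of the format only, not the conversion routine shipped in libcwf; the function name is hypothetical. Python's standard struct module also offers the "e" format code for the same 2-byte format, which is used here as a cross-check.

```python
import struct


def half_to_float(raw):
    """Decode 2 little-endian bytes holding a half-precision value."""
    (bits,) = struct.unpack("<H", raw)
    sign = -1.0 if bits >> 15 else 1.0
    exponent = (bits >> 10) & 0x1F   # 5-bit exponent, excess-15
    mantissa = bits & 0x3FF          # 10 stored mantissa bits
    if exponent == 0:
        # Zero and subnormals: no implied leading 1.
        return sign * mantissa * 2.0 ** -24
    if exponent == 0x1F:
        # Standard IEEE special values.
        return sign * float("inf") if mantissa == 0 else float("nan")
    return sign * (1 + mantissa / 1024.0) * 2.0 ** (exponent - 15)


largest = half_to_float(b"\xff\x7b")             # bit pattern 0x7BFF
one = half_to_float(b"\x00\x3c")                 # bit pattern 0x3C00
builtin = struct.unpack("<e", b"\x66\x2e")[0]    # same format via struct
manual = half_to_float(b"\x66\x2e")
```

In practice a reader would probably unpack a whole array of flow values at once with struct or a numeric library; the manual decoding is shown only to make the bit layout concrete.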
69. tain other binary streams. These are usually identified by the .dat suffix, except in the case of the "image" type (see note below). Many are used for storing intermediate results of various processing stages, but there are three notable streams that contain metrics data that may be of interest to end users:
• The first, filterResults, can be used to find which wells have passed which filter.
• The second, rawWellDensity, is a grayscale image containing the bead loading density plot.
• The last is the keyPassDensity stream, a grayscale image showing the keypass density.
With the exception of the image format stream, each stream contains one entry per read, sorted in rank order. The location.idx can be used to find which offset corresponds to which read. The possible data types are listed in Table 5, and the current stream types in Table 6.

Data Type        Size in Bytes/Well   Description
byte             1
short            2                    Signed short, Intel byte order
unsignedShort    2                    Unsigned short, Intel byte order
int              4                    Standard integer, Intel byte order
unsignedInt      4                    Unsigned integer, Intel byte order
float            4                    IEEE 754 Single Precision floating point number
double           8                    IEEE 754 Double Precision floating point number
image                                 PGM format graphics. See note below on resolution/registration.

Table 5: Possible data types

Stream Type      Contents
filterResults    The information about which well failed which filter in the qualityFiltering section of code
trimInfo
70. tion in the file. The CWF file format is inspired by the OpenDocument format described in ISO/IEC 26300:2006 (http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43485). The OpenDocument format benefits from the segregation of concerns by separating the content, styles, metadata and application settings into four separate XML streams; the CWF format maintains a similar separation: flowgrams, called bases, meta-data, processing history and run metrics are stored individually. The user should not normally unpack the CWF file, as each stream is read as needed. Nonetheless, a C library (libcwf) is available that can read the CWF file format, for convenience. The data for each region of the PicoTiterPlate device is stored in a separate CWF file (GS FLX System only). Because this file is fully self-contained, it is the only file that needs to be moved between the instrument and the data processing system for continued analysis after the image processing phase of the GS Run Processor. This file format contains all the necessary information to generate any pipeline output artifact on demand, except for Standard Flowgram Format (SFF) files. Therefore, the user only needs to store one CWF file and one SFF file per region to fully archive the experiment's processed data. Textual data in the CWF container will generally be stored as XML. Specifically, there are four main required XML streams: meta.xml, metrics.xml, history.xml, sequ
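Because the CWF container is described above as a ZIP file with a single-level hierarchy, its streams can be listed and read individually with any standard ZIP reader, without unpacking the whole file. The Python sketch below uses the standard zipfile module; it is a hedged illustration, not a replacement for libcwf, and the function names are hypothetical. The demonstration builds a tiny stand-in archive in memory instead of using a real CWF file.

```python
import io
import zipfile


def list_streams(cwf):
    """Return {stream name: uncompressed size} for a CWF container.

    `cwf` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(cwf) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


def read_stream(cwf, name):
    """Read one named stream as bytes, leaving the container packed."""
    with zipfile.ZipFile(cwf) as archive:
        return archive.read(name)


# Stand-in container with one XML stream (not real CWF content).
container = io.BytesIO()
with zipfile.ZipFile(container, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("metrics.xml", "<Metrics/>")

streams = list_streams(container)
metrics_bytes = read_stream(container, "metrics.xml")
```

Reading single streams this way matches the advice that each stream is read as needed, since every stream is named and compressed separately.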
71. tputs and main processing steps. They are performed in succession in the order indicated; the SFF files output by the signal processing step of the GS Run Processor application are used as input to the data analysis applications (see Table 2).

For a description of the data processing pipeline options, see section 2.2. For a full description of the GS Junior Sequencer or GS Sequencer application, see Part A of this manual (Section 2 in the GS Junior System version or Section 3 in the GS FLX System version), and for the GS Run Processor application, see Part B, Section 1.

The data analysis applications use the fully processed and trimmed read basecalls of a sequencing Run (or of a pool of Runs) to produce initial alignments to the reference sequence (or read-to-read overlaps, for the GS De Novo Assembler); then they use a combination of nucleotide and flowgram information for consensus calling of the contigs and determination of quality values for the contig sequences. Table 2 lists the specific outputs of the 3 data analysis applications, as well as the individual functions carried out by each one. The final system output choices are the following:
1. The GS De Novo Assembler application generates a consensus sequence of the whole DNA sample by assembling the reads into contigs (de novo shotgun assembly). An option allows the use of one or more sequencing Runs performed on a Paired End library (any type, or even a combination of Paired End library ty
72. uencer
Output: Raw images
Main processing steps:
• Image acquisition and storage to disk

GS Run Processor, image processing step
Input: Raw images
Output: Composite Wells Format (CWF) files
Main processing steps:
• Subtract background and normalize the images at the pixel level
• Find the active wells on the PicoTiterPlate device
• Extract the raw signals for each flow in each active well
• Write the resulting flow signals into composite wells format (CWF) files

GS Run Processor, signal processing step
Input: Composite Wells Format (CWF) files
Output: Corrected CWF files, and SFF files containing read basecalls and per-base quality scores
Main processing steps:
• Filter out lower-signal ghost wells (Amplicon sequencing pipeline only)
• Correct for crosstalk between neighboring wells
• Correct for known out-of-phase errors (incomplete extension and carry forward)
• Correct for signal droop and perform residual background subtraction
• Filter out any residual ghost wells (Amplicon sequencing pipeline only)
• Filter (pass or fail) the processed reads based on signal quality
• Trim read ends for low quality and primer sequence
• Update the CWF files with the fully processed data
• Generate Standard Flowgram Format (SFF) files containing the basecalled read sequences and per-base quality scores

Table 1: The 3 main early components of data handling, from data acquisition through data processing, in the 454 Sequencing System, with their inputs, ou
73. under the control, respectively, of the GS Junior Sequencer or the GS Sequencer software application (this is the only application that has two different implementations under the two systems). The raw data consists of a series of digital images captured by the camera. The images are a representation of the surface of the PicoTiterPlate device over which the sequencing reactions are taking place, and each image corresponds to one nucleotide flow over that surface, as defined by the Run script. If the sample DNA fragment present in a given well of the PicoTiterPlate device is extended during a nucleotide flow, light is emitted from the well and captured on the image corresponding to that flow. Furthermore, the amount of light emitted is proportional to the number of nucleotides extended. Knowledge of the nucleotide flowed while each image is being captured (from the Run script), of the location on the PicoTiterPlate device where light is being emitted (coordinates of each pixel on the images), and of the amount of light emitted during each flow (brightness of the pixels in the corresponding images) allows the software to identify PicoTiterPlate wells that contain a DNA library fragment and determine the sequence of the DNA fragments present in each well. This determination occurs during the data processing phase; data processing is carried out by the GS Run Processor application and encompasses all the steps required to go from raw image data to base-called r
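To make the proportionality between light signal and nucleotides extended concrete, here is a minimal sketch in Python that turns a list of per-flow signals for one well into a base sequence by rounding each signal to the nearest whole number of incorporated nucleotides. It is a simplified illustration, not the GS Run Processor algorithm: it assumes the signals are already normalized so that one incorporated nucleotide gives a value near 1.0, it assumes a cyclic flow order of T, A, C, G (the actual order is defined by the Run script), and it ignores the crosstalk, phase, and droop corrections that the real pipeline applies. The function name flowgram_to_bases is hypothetical.

```python
def flowgram_to_bases(signals, flow_order="TACG"):
    """Convert normalized per-flow signals from one well into a base string.

    ASSUMPTIONS: signals are normalized so that ~1.0 means one nucleotide
    incorporated, and nucleotides are flowed cyclically in `flow_order`.
    """
    bases = []
    for index, signal in enumerate(signals):
        nucleotide = flow_order[index % len(flow_order)]
        # Round to the nearest whole homopolymer length; negative noise
        # around zero is clamped to "no incorporation".
        run_length = max(0, int(signal + 0.5))
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Demo with made-up signal values for two flow cycles (T, A, C, G, T, A, C, G).
demo_signals = [1.02, 0.05, 0.97, 2.10, 0.03, 1.95, 0.0, 1.1]
print(flowgram_to_bases(demo_signals))  # prints "TCGGAAG"
```

Simple rounding becomes less reliable as homopolymer runs get longer, because the relative gap between adjacent signal levels shrinks; this is one reason the data analysis applications also use flowgram information, and not only basecalls, when calling consensus.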
