        CASAVA v1.8.2 User Guide (15011196) - Support
         Contents
1.  Installing CASAVA  116
    Appendix B Using Parallelization  119
        Make Utilities  120
    Appendix C Reference Files CASAVA  123
        Introduction  124
        ELAND Reference Files  125
        Variant Detection and Counting Reference Files  127
        Getting Reference Files  128
    Appendix D Algorithm Descriptions  131
        Introduction  132
        ELANDv2 and ELANDv2e  133
        Variant Detection  141
        readBases Counting Method  158
    Appendix E Qseq Conversion  159
        Introduction  160
        Qseq Converter Input Files  161
        Running Qseq Converter  163
        Qseq Converter Parameters  164
        Qseq Converter Output Data  165
    Appendix F Export to SAM Conversion
    Technical Assistance  179

    Part # 15011196 Rev D
2.  Reference Files CASAVA

Introduction

CASAVA needs a number of special reference files to run an analysis, especially for RNA sequencing.

This chapter describes the reference files that are needed to run ELANDv2e and CASAVA variant detection, and provides instructions for generating these files for other species and builds. As of CASAVA 1.8, ELAND squashes genome files automatically when it starts.

Genome sequence files for the most commonly used model organisms are available through iGenome (see Getting Reference Files on page 128).

ELAND Reference Files

ELAND needs the following file to perform an alignment:
  Unsquashed genome sequence files. As of CASAVA 1.8, ELAND squashes genome files automatically when it starts.

In addition, eland_rna needs two types of files to analyze RNA sequencing data:
  Abundant sequences files: mitochondrial DNA, ribosomal region sequences, 5S RNA (optional), and other contaminants.
  refFlat.txt.gz file (UCSC type) or seq_gene.md.gz file (NCBI type).

Reference Genome

CASAVA uses a reference genome in FASTA format. Both single sequence FASTA and mult
3.  If eland_pair analysis has been specified for one or more lanes, then two Expanded Lane Results Summaries are produced, one for each read. All lanes for which analysis has been specified are represented in the Read 1 table, but only those for which eland_pair analysis has been specified contribute statistics to the Read 2 table.

Per Tile Statistics

Below the Expanded Sample Summary is a link to a file containing per-tile statistics. The displayed metrics are similar to the expanded Lane 1, Read 1 tables in the CASAVA 1.8 configureAlignment summary files.

IVC Plots

Next is a link to IVC plots. The IVC.htm file (Intensity versus Cycle) contains plots that display lane averages for samples.

  All: This is the lane average of the data displayed in All.htm. It plots each channel (A, C, G, T) separately as a different colored line. Means are calculated over all clusters, regardless of base calling. If all clusters are T, then channels A, C, and G will be zero. If all bases are present in the sample at 25% of total and a well-balanced matrix is used for analysis, the graph will display all channels with similar intensities. If intensities are not similar, the results could indicate either poor cross-talk correction or poor absolute intensity balance between each channel.
  Called: This plot is similar to All, except means are calculated for each channel using clusters that the base caller has called in that channel. If all b
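The difference between the All and Called averages described above can be illustrated with a minimal sketch for a single cycle. The cluster representation used here (a dict with a "call" and a per-channel "intensity" mapping) is a hypothetical structure chosen for illustration; it is not a CASAVA file format.

```python
CHANNELS = "ACGT"


def all_means(clusters):
    """Mean intensity per channel over all clusters, regardless of base call."""
    n = len(clusters)
    return {ch: sum(c["intensity"][ch] for c in clusters) / n for ch in CHANNELS}


def called_means(clusters):
    """Mean intensity per channel, using only clusters called in that channel."""
    means = {}
    for ch in CHANNELS:
        called = [c["intensity"][ch] for c in clusters if c["call"] == ch]
        # A channel with no called clusters is reported as 0.0 in this sketch.
        means[ch] = sum(called) / len(called) if called else 0.0
    return means


# Hypothetical example data: two clusters called T, one called A.
example = [
    {"call": "T", "intensity": {"A": 0, "C": 0, "G": 0, "T": 100}},
    {"call": "T", "intensity": {"A": 0, "C": 0, "G": 0, "T": 200}},
    {"call": "A", "intensity": {"A": 300, "C": 0, "G": 0, "T": 0}},
]
```

With this data, the All mean for channel T is diluted by the cluster called A, while the Called mean for T uses only the two T clusters.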
4.  Standard flow cell-level variables:
      USE_BASES y  y
      CHROM_NAME_VALIDATION off
      ANALYSIS eland_rna
      ELAND_FASTQ_FILES_PER_PROCESS 2

    Flow cell-level ELAND_GENOME variable set for all data sets with Reference HumanNCBI37ELAND:
      REFERENCE HumanNCBI37ELAND ELAND_GENOME /home/user/genomes/archive/UCSChg18/fasta

    Flow cell-level SAMTOOLS_GENOME variable set for all data sets with Reference AMPLICONS180111JRB_ELAND:
      REFERENCE AMPLICONS180111JRB_ELAND SAMTOOLS_GENOME /home/user/genomes/AMPLICONS180111JRB/AMPLICONS180111JRB.fa

    Flow cell-level SAMTOOLS_GENOME variable set for all data sets with Reference TSC1_ELAND:
      REFERENCE TSC1_ELAND SAMTOOLS_GENOME /illumina/user/TSC1/TSC1.fa

    Overrides global ANALYSIS with eland_extended if the reference is TSC1_ELAND:
      REFERENCE TSC1_ELAND ANALYSIS eland_extended

    If the reference is unknown (default for Undetermined barcode data sets), sets the analysis to none. Only affects lanes 1, 2, 3, and 4:
      1234:REFERENCE unknown ANALYSIS none

    Alternative way of ensuring Undetermined barcode data sets do not get aligned. Only affects lanes 5, 6, 7, and 8:
      5678:BARCODE Undetermined ANALYSIS none

Specific Scenarios

A number of scenarios are written out below, assuming SampleSheet.csv has two projects, idxProj and noIdxProj.

Analyze only data for idxProj, not noIdxProj. Disable analysis by default:
      ANALYSIS none
Then the following analysis specifications only affect sample sheet entries that hav
5.  CASAVA is the part of Illumina's sequencing analysis software that performs alignment of a sequencing run to a reference genome and subsequent variant analysis and read counting. The basic pieces of functionality of Illumina's sequencing analysis cascade are described below.

Analysis of Sequencing Data

After the sequencing platform generates the sequencing images, the data are analyzed in five steps: image analysis, base calling, bcl conversion, sequence alignment, and variant analysis and counting. CASAVA performs the bcl conversion, sequence alignment, and variant analysis and counting steps, and demultiplexes multiplexed samples during the bcl conversion step.

1 Image analysis: Uses the raw images to locate clusters, and outputs the cluster intensity, X-Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling. Image analysis is performed by the instrument control software.

2 Base calling: Uses cluster intensities and noise estimates to output the sequence of bases read from each cluster, a confidence level for each base, and whether the read passes filtering. Base calling is performed by the instrument control software's Real Time Analysis (RTA) or the Off-Line Basecaller (OLB).

3 Bcl conversion: Converts *.bcl files into *.fastq.gz files (compressed FASTQ files) in CASAVA. Multiplexed samples are demultiplexed during this step.

4 Sequence alignment: Aligns samples to 
6.  No. | Label | Description
1 | seq_name | Reference sequence label
2 | pos | Sequence position of the site/snp
3 | bcalls_used | Basecalls used to make the genotype call for this site
4 | bcalls_filt | Basecalls mapped to the site but filtered out before genotype calling
5 | ref | Reference base
6 | Q(snp) | A Q-value expressing the probability of the homozygous reference genotype, subject to the expected rate of haplotype difference as expressed by the (Watterson) theta parameter (see New Variant Calling Parameter Theta on page 150)
7 | max_gt | The most likely genotype (subject to theta, as above)
8 | Q(max_gt) | A Q-value expressing the probability that the genotype is not the most likely genotype above (subject to theta)
9 | max_gt|poly_site | The most likely genotype assuming this site is polymorphic with an expected allele frequency of 0.5 (theta is still used to calculate the probability of a third allele, i.e. the chance of observing two non-reference alleles)
10 | Q(max_gt|poly_site) | A Q-value expressing the probability that the genotype is not the most likely genotype above, assuming this site is polymorphic
11 | A_used | "A" basecalls used
12 | C_used | "C" basecalls used
13 | G_used | "G" basecalls used
14 | T_used | "T" basecalls used

Indels.txt Files

Indels for each chromosome are summarized within each chromosome directory in a file called indels.txt. This file contains indels which have been called in each reference sequence by the small variant caller, and fil
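The 14-column layout listed in the table above can be read with a short script. This is a minimal sketch under stated assumptions: that each data line is tab-delimited with exactly these columns in this order, and that the count and Q-value columns hold integers. The function names and the sample line are hypothetical; the default cutoff of 20 mirrors the summary threshold the guide gives for Q(snp).

```python
SNP_COLUMNS = [
    "seq_name", "pos", "bcalls_used", "bcalls_filt", "ref",
    "Q(snp)", "max_gt", "Q(max_gt)", "max_gt|poly_site",
    "Q(max_gt|poly_site)", "A_used", "C_used", "G_used", "T_used",
]

# Columns assumed (for this sketch) to contain integers.
INT_COLUMNS = {
    "pos", "bcalls_used", "bcalls_filt", "Q(snp)", "Q(max_gt)",
    "Q(max_gt|poly_site)", "A_used", "C_used", "G_used", "T_used",
}


def parse_snp_line(line):
    """Turn one tab-delimited line into a dict keyed by column label."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(SNP_COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(SNP_COLUMNS), len(fields)))
    record = dict(zip(SNP_COLUMNS, fields))
    for name in INT_COLUMNS:
        record[name] = int(record[name])
    return record


def passes_summary_threshold(record, min_q_snp=20):
    """True if the call would be kept at the given Q(snp) cutoff."""
    return record["Q(snp)"] >= min_q_snp


# Hypothetical line, made up for illustration only.
sample_line = "chr1\t1000\t30\t2\tA\t85\tAG\t60\tAG\t90\t14\t0\t16\t0\n"
```

Keeping the column labels in one list makes it easy to adjust the sketch if a given build's files differ from the layout assumed here.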
7.  configureAlignment Parameters Detailed Description

configureAlignment can be run in various analysis modes. Customize your analysis by specifying variables, parameters, and options.

ANALYSIS Variables

Set the ANALYSIS variable to define the type of analysis you want to perform for each lane. The analysis modes include default, eland_extended, eland_pair, eland_rna, and none. You can mix and match analyses between lanes.

Table 5 ANALYSIS Variables

Variable | Alignment Program | Application | Description
ANALYSIS eland_extended | ELANDv2 | Single reads | Aligns single-read data against a target using ELANDv2e alignments. Works well with reads > 32 bases. Each alignment is given a confidence value based on its base quality scores. A single file of sorted alignments is produced for each lane. For a detailed description, see configureAlignment Input Files on page 48.
ANALYSIS eland_pair | ELANDv2 | Paired reads | Aligns paired-end reads against a target using ELANDv2 alignments. A single-read alignment is done for each half of the pair, and then the best-scoring alignments are compared to find the best paired-read alignment. For a detailed description, see Using ANALYSIS eland_pair on page 69.
ANALYSIS eland_rna | ELANDv2 | Single reads | Aligns each read against a large referenc
8.  The most scalable with the highest performance. They have a very high bandwidth and support many simultaneous clients, but are complex to manage and significantly more expensive.

Server Configurations

You can use either a single multi-processor, multi-core computer running Linux, or a cluster of Linux servers with a head node. CASAVA can take advantage of clustered and multi-processing servers.

  Single multi-processor, multi-core server: Simple but not scalable. It can only analyze data from one sequencing platform (or two, depending on power and your turnaround requirements).
  Linux cluster: Highly scalable and capable of running multiple jobs simultaneously. It requires one server as a management node and a minimum number of computational nodes to be as efficient as a standalone server. By adding computational nodes, the cluster can service more instruments.

NOTE
We test our software with SGE; other cluster configurations (like LSF or PBS) are not recommended.

Analysis Computer

Illumina supports running CASAVA only on Linux operating systems. It may be possible to run CASAVA on other 64-bit Unix variants, if all of the prerequisites described in this section are met.

Illumina recommends the IlluminaCompute data processing solution for CASAVA. IlluminaCompute is available as a multi-tier option, with the volume of instrument data output per week determining the recommended Tier level. For more information, con
9.  This tells configureAlignment it needs to perform eland_rna, and communicates the locations of the genome, splice junction, and contaminant files.

The following table describes the parameters for ANALYSIS eland_rna.

Table 10 Parameters for ANALYSIS eland_rna

Parameter | Description
ELAND_GENOME | Must point to the reference genome, just as for a standard ELANDv2e analysis.
ELAND_RNA_GENOME_ANNOTATION | Must point to the refFlat.txt.gz file (gzip compressed) or seq_gene.md.gz file (gzip compressed).
ELAND_RNA_GENOME_CONTAM | Must point to the files of ultra-abundant sequences (generally ribosomal and mitochondrial). Any read that hits to these is ignored.

Considerations When Running eland_rna

When running eland_rna, bear in mind the following points:

The above parameters may be specified on a lane-by-lane basis in the usual fashion. For example, to do lanes one, two, and four, enter the following:
      124:ANALYSIS eland_rna
      124:ELAND_GENOME /data/Genome/ELAND/hg18/
      124:ELAND_RNA_GENOME_ANNOTATION /data/Genome/ELAND_RNA/Human/refFlat.txt.gz
      124:ELAND_RNA_GENOME_CONTAM /data/Genome/ELAND_RNA/Human/MT_Ribo_Filter/

The output file export.txt.gz has the same format as those generated by eland_extended; for a description, see Export.txt.gz on page 79. The existing code "RM" (repeat masked) denotes all reads that hit to abundant sequences or with any other unresolvable ambiguity.
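The lane-by-lane form used in the example above (a run of lane digits in front of a variable) can be expanded with a few lines of Python. This is a minimal sketch, assuming the lane prefix is separated from the variable by a colon, that a line without a prefix applies to all eight lanes, and that the variable name is separated from its value by a space. The function name is hypothetical and this is not how configureAlignment itself parses its configuration.

```python
ALL_LANES = list(range(1, 9))  # assumption: a flow cell with lanes 1-8


def parse_config_line(line):
    """Split a line such as '124:ANALYSIS eland_rna' into (lanes, variable, value)."""
    line = line.strip()
    lanes = list(ALL_LANES)
    head, sep, rest = line.partition(":")
    # Only treat the text before the first colon as a lane prefix if it is all digits.
    if sep and head.isdigit():
        lanes = sorted({int(ch) for ch in head})
        line = rest.strip()
    variable, _, value = line.partition(" ")
    return lanes, variable, value.strip()
```

For instance, `parse_config_line("124:ANALYSIS eland_rna")` yields lanes 1, 2, and 4 with variable `ANALYSIS` and value `eland_rna`, while a line with no prefix is attributed to every lane.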
11.  Targeted Resequencing

Since targeted resequencing only sequences part of a genome, we recommend using the option --variantsNoCovCutoff to turn off high-coverage filtration of SNPs and indels.

Examples

The CASAVA installation provides examples of common use cases, such as:
  E. coli Single End
  E. coli Paired End
  RNA sequencing
The details of these examples are available on the configureBuild.pl help page.

Go to the CASAVA installation directory, and type:

      configureBuild.pl

The examples are listed at the bottom of the help page.

Variant Detection and Counting Output Files

Once the post-alignment build is complete, all relevant information is listed in the build directory, such as:
  Build summary html pages: The build summary html pages are located in the buildDir/html folder, and provide access to run information and graphs of important statistics.
  Variant calls and counts: The CASAVA build contains sequence, SNP, indels, and (for RNA sequencing) counts information, and is located in buildDir/Parsed_DATE.
  Computer-readable statistics: Computer-readable statistics are located in buildDir/stats.
  Configuration files: CASAVA configuration files are located in buildDir/conf.

These files are described below.

Build Directory

An outline of the CASAVA build directory is shown below.
12.  In the case of overlapping indels, max_gtype refers to the most likely copy number of the indel. Note that indel calls where ref is the most likely genotype will be reported. These correspond to indels with very low Q(indel) values.

Phred-scaled quality score of the most probable indel genotype, which refers to the probability that the genotype of the indel is not that given as "max_gtype". The Q-values given only reflect those error conditions which can be represented in the indel calling model, which is not comprehensive. See also Quality Scores on page 148.

Except for right-side breakpoints, this field reports the depth of the position preceding the left-most indel breakpoint. For right-side breakpoints this is the depth of the position following the breakpoint.

Number of reads strongly supporting either the reference path or an alternate indel path.

Number of reads strongly supporting the indel path.

Number of reads intersecting the indel, but not strongly supporting either the reference or any one indel path.

The smallest repeating sequence unit within the inserted or deleted sequence. For breakpoints this field is set to the value "N/A".

Number of times the repeat_unit sequence is contiguously repeated starting from the indel start position in the reference case.

Number of times the repeat_unit sequence is contiguously repeated starting from the indel start position in the indel case.
13.  Quality score Q(A) | Error probability P(~A)
10 | 0.1
20 | 0.01
30 | 0.001

Quality Scores Encoding

Quality scores are encoded into a compact form in FASTQ files which uses only one byte per quality value. In this encoding, the quality score is represented as the character with an ASCII code equal to its value + 33, as of CASAVA 1.8. The following table demonstrates the relationship between the encoding character, the character's ASCII code, and the quality score represented.

WARNING
Quality score encoding schemes in previous versions of CASAVA used an Illumina-specific offset value of 64.

Table 1 ASCII Characters Encoding Q-scores 0-40

Symbol | ASCII Code | Q-Score || Symbol | ASCII Code | Q-Score || Symbol | ASCII Code | Q-Score
! | 33 | 0  || / | 47 | 14 || = | 61 | 28
" | 34 | 1  || 0 | 48 | 15 || > | 62 | 29
# | 35 | 2  || 1 | 49 | 16 || ? | 63 | 30
$ | 36 | 3  || 2 | 50 | 17 || @ | 64 | 31
% | 37 | 4  || 3 | 51 | 18 || A | 65 | 32
& | 38 | 5  || 4 | 52 | 19 || B | 66 | 33
' | 39 | 6  || 5 | 53 | 20 || C | 67 | 34
( | 40 | 7  || 6 | 54 | 21 || D | 68 | 35
) | 41 | 8  || 7 | 55 | 22 || E | 69 | 36
* | 42 | 9  || 8 | 56 | 23 || F | 70 | 37
+ | 43 | 10 || 9 | 57 | 24 || G | 71 | 38
, | 44 | 11 || : | 58 | 25 || H | 72 | 39
- | 45 | 12 || ; | 59 | 26 || I | 73 | 40
. | 46 | 13 || < | 60 | 27 ||

Read Segment Quality Control Metric

A number of factors can cause the quality of base calls to be low at the end of a read. For example, phasing artifacts can degrade signal quality in som
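The encoding described under Quality Scores Encoding above can be sketched in a few lines of Python. The offset of 33 and the legacy offset of 64 come from the text; the relation Q = -10 log10(P) is the standard Phred definition and agrees with the 10/20/30 rows of the quality score table. The function names are illustrative only.

```python
import math

PHRED_OFFSET = 33  # offset as of CASAVA 1.8; earlier versions used 64


def decode_quality(qual_string, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of integer Q-scores."""
    return [ord(ch) - offset for ch in qual_string]


def encode_quality(scores, offset=PHRED_OFFSET):
    """Convert a list of integer Q-scores into a FASTQ quality string."""
    return "".join(chr(q + offset) for q in scores)


def error_probability(q):
    """Error probability implied by a Phred-scaled Q-score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def quality_score(p):
    """Phred-scaled Q-score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)
```

Passing `offset=64` to `decode_quality` handles quality strings written with the older Illumina-specific encoding.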
14.  those reads aligned to an alternate alignment by the variant caller. The BAM filename is sorted.realigned.bam:
      Project_Dir/Parsed_NN-NN-NN/c1/bam/realigned/sorted.realigned.bam

Statistics for coverage, as well as snp and indel calls for all reference sequences, are found in the "stats" directory:
      Project_Dir/stats/coverage.summary.txt
      Project_Dir/stats/snps.summary.txt
      Project_Dir/stats/indels.summary.txt

A summary of the same information is also available on the following html pages:
      Project_Dir/html/coverage.html
      Project_Dir/html/snps.html
      Project_Dir/html/indels.html

To summarize the snps and indels in the stats and html directories above, quality thresholds are used to select a subset of snps and indels for summary reporting. The default thresholds are Q(snp) >= 20 and Q(indel) >= 20. These values may be changed using the options --variantsSummaryMinQsnp and --variantsSummaryMinQindel.

snps.txt and sites.txt Files

The snps.txt files contain the SNP calls sorted by position, while the sites.txt files provide depth and single-position genotype call scores for every mapped site. There is one snps.txt file for each chromosome, stored in the chromosome-specific directory under the Parsed_dd-mm-yy directory. The snps.txt and sites.txt files are tab-delimited text files that contain the same columns, which are the following 
15.  variant analysis 2
variant detection 7, 88
    configuring multiple runs 100
    examples 101
    input files 93
    options 97, 99-100, 153-154
    output files 102
    running 96
Variant Detection and Counting 7

W

What's New 9

Technical Assistance

For technical assistance, contact Illumina Customer Support.

Table 29 Illumina General Contact Information

Illumina Website | http://www.illumina.com
Email | techsupport@illumina.com

Table 30 Illumina Customer Support Telephone Numbers

Region | Contact Number
North America | 1 800 809 4566
Austria | 0800 296575
Belgium | 0800 81102
Denmark | 80882346
Finland | 0800 918363
France | 0800 911850
Germany | 0800 180 8994
Ireland | 1 800 812949
Italy | 800 874909
Netherlands | 0800 0223859
Norway | 800 16836
Spain | 900 812168
Sweden | 020790181
Switzerland | 0800 563118
United Kingdom | 0800 917 0041
Other countries | +44 1799 534000

MSDSs

Material safety data sheets (MSDSs) are available on the Illumina website at http://www.illumina.com/msds.

Product Documentation

You can obtain PDFs of additional product documentation from the Illumina website. Go to http://www.illumina.com/support and select a product. To download documentation, you will be asked to log in to MyIllumina. After you log in, you can view or save the PDF. To register for a MyIllumina account, please visit https://my.illumina.com/Account/Register.
16.  which is the minimum. Example: --verbose=1

Option | Application | Description
--version | SE, PE | Prints version information. Example: --version
-w, --workflow | SE, PE | Instead of running CASAVA, generates the workflow definition file (tasks.DATE.txt). Example: -w
-wa, --workflowAuto | SE, PE | Generates the workflow definition file and runs it. See --jobsLimit. Example: --workflowAuto
--workflowFile=FILE | SE, PE | Overrides workflow file name. Default is tasks.<date>.txt. Example: --workflowFile=FILENAME.txt

Table 18 Global Analysis Options for Variant Detection and Counting

Option | Application | Description
--QVCutoff=NUMBER | PE | Sets the paired-end alignment score threshold to NUMBER (default 90). Example: --QVCutoff=60
--QVCutoffSingle=NUMBER | SE, PE | Sets the single-read alignment score threshold to NUMBER (default 10). Example: --QVCutoffSingle=60
--read NUMBER | PE | Limits input to the specified read only. Forces single-ended analysis on one read of a double-ended data set. Example: --read 1
--singleScoreForPE VALUE | PE | Sets the variant caller to filter reads with single score below QVCutoffSingle in PE mode (YES or NO). Default NO. Example: --singleScoreForPE YES
--toNMScore=NUMBER | SE, PE |
--ignoreUnanchored | PE |

Options for Target sort

--sortKeepAllReads | SE, PE | Generate an archive BAM file. Keep all purity filtered, duplicate, and unmapped reads in the build. T
17.  Requirements and Software Installation

Installing CASAVA

Starting with CASAVA 1.8, CASAVA must be built outside of the source directory.

NOTE
For more information on the installation procedure, see the INSTALL file in the src directory of the CASAVA 1.8.2 distribution.

The installation procedure is as follows:

1 The Boost library 1.44.0 is bundled in the CASAVA distribution and will be automatically built when necessary. If you want to use a preinstalled Boost library, declare the BOOST_ROOT bash variable by typing the following at the command prompt prior to running the CASAVA configure script:
      export BOOST_ROOT=/path/to/compiled/boost/directory/boost_1_44_0

2 Download CASAVA v1.8 and copy it into a temporary directory, such as /tmp (you will not need to keep it once the installation is done).

3 Untar CASAVA v1.8 using the following commands:
      cd /tmp
      tar xvjf CASAVA_1.8.2.tar.bz2

4 Prepare to build CASAVA:
      mkdir CASAVA_1.8.2-build
      cd CASAVA_1.8.2-build

5 Prepare the CASAVA installation directory:
      mkdir /illumina/software/CASAVA_1.8.2

6 Configure CASAVA so that it is first built and then installed where you want (in this example, in /illumina/software/CASAVA_1.8.2):
      ../CASAVA_1.8.2/src/configure --prefix=/illumina/software/CASAVA_1.8.2

7 Build CASAVA:
      make

8 Finally, install it:
      make install

NOTE
For more information on the configuration options      CASAVA_1.8.2/src/c
18.  Column | Description
Yield | The sum of all bases in clusters that passed filtering for the entire project.
% PF | The percentage of clusters that passed filtering.
% of Lane | Percentage of reads in the sample compared to total number of reads in that lane.
% Perfect Index Reads | Percentage of index reads in this sample which perfectly matched the given index.
% One Mismatch Reads (Index) | Percentage of index reads in this sample which had 1 mismatch to given index.
% of >= Q30 Bases | Yield of bases with Q30 or higher from clusters passing filter divided by total yield of clusters passing filter.
Mean Quality Score | The total sum of quality scores of clusters passing filter divided by total yield of clusters passing filter.
Recipe | Recipe used during sequencing.
Operator | Name or ID of the operator.
Directory | Full path to the directory.

Below the sample information are links to the IVC plots.

Finding Demultiplexed Samples

The key to finding the location of demultiplexed data is looking at the Demultiplex_Stats.htm file in the BaseCalls_Stats directory. The Directory column indicates the project and sample output directory. The FASTQ files within the directory contain the index and lane as part of the name. Alternatively, the location can be inferred from the project name and the sample ID, as described in FASTQ Files on page 39.
19.     error rates
    file naming 79
    lane averages 76
    proportion of reads
    tile by tile
ANALYSIS variables 61

B

BARCODE 57
base calling 2
BaseCalls directory 27
bcl files 28

C

CASAVA
    build 105
    build directory 102
    build web page 104
    installing 116
    variant detection and counting 88
CASAVA software 5, 7, 88
clocs files 20
clusters passing filters 17
clusters per tile 17
Compressed FASTQ 5
config.txt file 54, 57, 64
config.xml 30
configureAlignment.pl script 46
Configuring GERALD 46
configuring multiple runs 100
contaminants 70
control files 29
Count.txt files 110
counting 2, 7, 88
    configuring multiple runs 100
    examples 101
    options 97, 99-100, 153-154
    output files 102
    running 96
customer support 179

D

demultiplexing 6, 26
    example 32
    options 33
DNA sequencing
    large genome
    small genome
documentation

E

ELAND
    analysis modes
eland_extended
ELAND_MAX_MATCHES
eland_pair
eland_rna
ELAND_SEED_LENGTH1
ELAND_SEED_LENGTH2
ELAND_SET_SIZE
ELAND_standalone.pl script
ELANDv2
email reporting
Error.htm file

F

FASTQ files
FASTQ generation
filter files
first cycle intensity

G

gapped alignment
GERALD
GERALD.pl script

H

help, reporting problems
help, technical

I

image analysis
indexing
intensity curves
IVC.htm file

K

KAGU_PAIR_PARAMS
KAGU_PARAMS

L

locs files
20.  Qseq Conversion

Introduction

As of CASAVA 1.8, configureAlignment uses FASTQ files as input. If you have *_qseq.txt files that you want to analyze using CASAVA 1.8, use the Qseq Converter, which converts *_qseq.txt files into FASTQ files.

The script has the following features:
  Creates a makefile to convert a directory of *_qseq.txt files to a directory tree of compressed FASTQ files following CASAVA 1.8 filename and directory structure conventions.
  If detected, configuration data used by configureAlignment are also transferred to the output directory.
  This script will not configure demultiplexing. The input directory must contain *_qseq.txt files which are either non-demultiplexed or already demultiplexed by another utility.

This appendix provides instructions to run the Qseq Converter.

Qseq Converter Input Files

The Qseq Converter needs the following input files:
  A BaseCalls directory with *_qseq.txt files. The Qseq Converter is specifically designed to convert *_qseq.txt files produced by OLB. It expects the *_qseq.txt files to follow the OLB naming conventions:
      s_<lane>_<read>_<tile>_qseq.txt
  With:
      <lane>: the lane number on the flow cell (1-8)
      <read>: the read number (1 or 2)
      <tile>: the tile number, left-padded with "0" to 4 digits
  For example: s_1_1_0001_qseq.txt

These files have the following format:

Field | Description
Ma
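The OLB naming convention described above (lane 1-8, read 1 or 2, tile left-padded to 4 digits) can be checked with a small regular expression. This is a minimal sketch, assuming underscores separate the fields as in s_1_1_0001_qseq.txt; the function name is hypothetical and is not part of the Qseq Converter.

```python
import re

# s_<lane>_<read>_<tile>_qseq.txt with lane 1-8, read 1 or 2, tile as 4 digits
QSEQ_NAME = re.compile(
    r"^s_(?P<lane>[1-8])_(?P<read>[12])_(?P<tile>\d{4})_qseq\.txt$"
)


def parse_qseq_name(filename):
    """Return lane, read, and tile as integers, or None if the name does not match."""
    match = QSEQ_NAME.match(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

A helper like this can be used to sanity-check a BaseCalls directory before conversion, for example by listing any *_qseq.txt files whose names do not parse.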
21.  FASTA files should not be squashed for CASAVA.
PATH to a single samtools-style reference file.

Table 17 Behavioral Options for Variant Detection and Counting

Option | Application | Description
-a, --applicationType TYPE | SE, PE | Type of analysis (DNA, RNA); default is DNA. Example: -a RNA
-f, --force | SE, PE | Ignore errors from previous CASAVA execution. Example: -f
-h, --help [TARGET] | SE, PE | Prints on-screen usage guide. If TARGET is specified, prints the usage guide for the corresponding plugin target. Example: --help bam
--jobsLimit | SE, PE | Limits the number of parallel jobs. Defaults: -1 (unlimited) for --sgeAuto, 1 for --workflowAuto. Do not set it to the maximum number of processors, as this might cause the terminal to become unresponsive.
--postRunCmd=CMDLINE | SE, PE | Post-run commands can be launched after CASAVA completes by including the --postRunCmd option, followed by the commands to be launched.
-sa, --sgeAuto | SE, PE | Generates the workflow definition file and runs it on SGE (use with --sgeQueue).
--sgeQsubFlags | SE, PE | Extra parameters to be passed to SGE qsub by the taskServer.pl.
--sgeQueue | SE, PE | SGE queue name, used with --sgeAuto or --workflow (e.g. all.q).
--targets LIST | SE, PE | Space-separated list of targets to run (see Targets on page 96). Default is all. Example: --targets sort bam
--tempDir | SE, PE | Overrides the default path for local temporary files.
--verbose=NUMBER | SE, PE | Sets the verbose level; default is 0
22.  Gigabit recommended, or other data transfer mechanism.

A suitably large holding area for the analysis output (1 TB per run). As there will almost certainly be some overlap between copying, analysis, and possible reanalysis, 2-3 TB is an absolute minimum.

You need to consider which parts of the data you want to store long term and what storage infrastructure you want to provide. CASAVA provides the option to perform lossless data compression.

Storage Configurations

You can configure your analysis server with either local storage or external network storage.

Local server storage can be internal to the server, or Direct Attached Storage (DAS), which is a separate chassis attached to the server.
  Internal: Simple but not scalable. Results data must be moved off to network storage at some point to make room for subsequent runs.
  DAS: External chassis that is scalable, since more than one DAS can be connected to the server. The server is an application server running CASAVA and a file server providing access to results and receiving incoming raw data files.

External network storage is either Network Attached Storage (NAS) or Storage Area Network (SAN). NAS and SAN are functionally equivalent, but SAN is larger, with higher performance, more connections, and more management options.
  NAS: External chassis connected via Ethernet to the server, instrument PC, and other clients on the network. NAS devices are scalable and highly optimized.
  SAN
23.  Page

The Barcode_Lane_Summary.htm file provides similar metrics as the Sample Summary page, with the following differences:
  The results are displayed for each barcoded sample in a lane, instead of for samples.
  Tables are named accordingly; the equivalents for the Sample Results Summary and Expanded Sample Summary are named Barcode Lane Results Summary and Expanded Barcode Lane Summary.
  The Barcode Lane Summary page contains a Barcode Lane Summary, described below.

For a description, see the equivalent section in the Sample Summary Page description (Barcode Lane Summary on page 75).

Flow Cell Summary

For each run a FlowCellSummary_FCID.htm file is produced, which contains the Project Summaries and Sample Results Summaries of all projects. This provides an overview of the most relevant metrics for the entire run. It is located in the Aligned folder.

For a description of Project Summaries and Sample Results Summaries, see Sample Summary Page on page 74.

Analysis Results

The output files for each lane of a flow cell are named using the format _export.txt.gz. For paired-read analysis, there are two parallel output files, one for each read. The files are named using the format:

<sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_export.txt.gz

The files are found in the Aligned/Project_ID/Sample_ID folder of a finished analysis run.

Export.txt.gz

The sta
24.  VALIDATION off

WARNING: You may run into problems with downstream analysis if you disable chromosome name validation.

NOTE: If ELAND finds two alignments with identical alignment scores, ELAND will pick the first alignment (in the single-end case) or combination of alignments (in the paired-end case) that exhibits the highest observed alignment quality. These are the alignments that make it into the export files, which only contain the best alignment for each read. In practice, post alignment, CASAVA ignores these reads because of the low alignment qualities. Using a reference with lexicographic chromosome names (like chr1) will yield slightly different results compared to a reference with numerical chromosome names (like 1) for these reads, since the hits are sorted in a different way.

Reference Sequence Blocks

For reasons of efficiency, ELAND treats the reference sequence as being in "blocks" of 16 MB, of which there can be at most 240. This limits the total length of DNA that ELAND can match against in a single run. In a single ELAND run you can match against:
  One file of at most 240 x 16 = 3840 MB
  239 files, each up to 16 MB in size
  Something in between, such as 24 files of up to 160 MB each. (The NCBI human genome will fit.)

Abundant Sequences Files (eland_rna)

eland_rna uses these files to mask hits to abundant or contaminant sequenc
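The block limit above can be checked with a little arithmetic before starting a run. The following is a minimal sketch, not part of CASAVA: it assumes each reference file starts on a fresh 16 MB block and uses hypothetical helper names (blocks_needed, fits_in_one_run). The guide quotes 239 rather than 240 files of 16 MB, which suggests some per-file overhead that this sketch does not model.

```python
import math

BLOCK_MB = 16      # block size stated in the guide
MAX_BLOCKS = 240   # maximum number of blocks stated in the guide


def blocks_needed(file_sizes_mb):
    """Blocks used, assuming every file starts on a fresh 16 MB block."""
    return sum(math.ceil(size / BLOCK_MB) for size in file_sizes_mb)


def fits_in_one_run(file_sizes_mb):
    """True if the files fit within the 240-block limit under that assumption."""
    return blocks_needed(file_sizes_mb) <= MAX_BLOCKS


if __name__ == "__main__":
    # 24 files of 160 MB each use exactly 24 * 10 = 240 blocks.
    print(blocks_needed([160] * 24), fits_in_one_run([160] * 24))
```

Treat a result close to the limit as a warning rather than a guarantee, since the exact packing rules are not described in this excerpt.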
25.  conversion for read 1 is complete.

For instructions, see Starting Alignment for Read 1 on page 64.

configureAlignment Output

configureAlignment output is a flat text file called _export.txt.gz, containing each read and information about its alignment to the reference. In addition, configureAlignment produces statistics and diagnostic plots that can be used to assess data quality. These are presented in the form of html pages found in the Aligned output folder.

As a result of running the configureAlignment.pl script, a new directory is created in the run folder. This directory is named using the format Aligned. If you want to rerun the analysis and change parameters, you can rerun configureAlignment with new parameters if you specify a new alignment directory (OUT_DIR).

CASAVA 1.8 also contains a script that converts *_export.txt files to SAM files (see Introduction on page 168 and SAM Format on page 169).

Alignment Algorithms

CASAVA provides the alignment algorithm Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). ELAND is very fast and should be used to match a large number of reads against the reference genome.

ELAND has been improved a number of times:

CASAVA 1.6 introduced a new version of ELAND, ELANDv2. The most important improvements of ELANDv2 are its ability to perform multiseed and gapped alignments.

As of CASAVA 1.8 a new version of ELANDv2 is available, ELANDv2e. The most importan
26.  diverse applications, the CASAVA variant caller does not filter out low-confidence calls and thus prints all sites where Q(snp) is greater than zero to the snps.txt file. Summary statistics for SNPs are generated for a subset of higher-confidence SNPs: by default, any SNP with Q(snp) of 20 or greater is summarized in CASAVA's reports. Note that for calls with a very low Q(snp) score, it is possible that the most likely genotype will be that of the homozygous reference, e.g. max_gt will be "CC" for a position with a reference value of "C". This can be interpreted to mean that there is a non-trivial probability of a heterozygous SNP existing at this site, but that the homozygous reference genotype is still more likely than that of any non-reference variant.

Indel Caller Reporting

Indels for each chromosome are summarized within each chromosome directory in a file called indels.txt. This file contains indels which have been called in each chromosomal bin segment using the small variant caller from CASAVA's callSmallVariants module.

These indel calls have been filtered to remove those calls which are found at a depth greater than a certain multiple of the mean chromosomal depth. By default this multiple is set to 3. The purpose of this filtration is to remove indel calls in regions close to centromeres and other high copy number regions.

Three categories of indels are reported:
  Insertions
  Deletions
  Breakpoints
Breakpoint calls correspon
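The two thresholds described above (Q(snp) of 20 or greater for summary statistics, and a depth cap of 3 times the mean chromosomal depth) can be illustrated as follows. This is a minimal sketch only: the record layout (dictionaries with "q_snp" and "depth" keys) and the function names are assumptions for illustration and do not reflect the actual snps.txt or indels.txt columns or CASAVA's implementation.

```python
def snps_for_summary(snps, min_q_snp=20):
    """Keep SNP records whose Q(snp) meets the summary threshold (default 20)."""
    return [snp for snp in snps if snp["q_snp"] >= min_q_snp]


def filter_high_depth(calls, mean_chrom_depth, multiple=3.0):
    """Drop calls whose depth is greater than multiple * mean chromosomal depth."""
    limit = multiple * mean_chrom_depth
    return [call for call in calls if call["depth"] <= limit]


if __name__ == "__main__":
    indels = [{"pos": 100, "depth": 30}, {"pos": 200, "depth": 95}]
    # With a mean depth of 30 the cap is 90, so only the first call is kept.
    print(filter_high_depth(indels, mean_chrom_depth=30))
```

A call sitting exactly at the cap is kept here, because the text only removes calls at a depth greater than the multiple.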
27.  five stages:
1 Compute clusterings of non-aligned "orphan" reads.
2 Compute clusterings of anomalous read pairs, with an insert size that is anomalously large (possible deletion) or small (possible insertion).
3 Combine clusters that appear to correspond to the same event.
4 Assemble them into contigs.
5 Align the contigs back to the genome, using the positions of associated "singleton" reads to narrow the search to a couple of thousand bp or so.

Figure 25 assembleIndels Algorithm (panels: 1 Cluster Orphan Reads; 2 Cluster Anomalous Reads, insert too long; 3 Merge Clusters from Same Event; 4 Assemble Cluster into Contig; 5 Align Contig to Reference, showing a potential new deletion)

assembleIndels Components

The assembleIndels module contains the following components.

IndelFinder

The IndelFinder component takes a sorted bam file from a CASAVA build and extracts:
  Any reads containing gapped alignments
  An
28.  for assembleIndels

Option (Application): Description
--indelsSpReadThresholdIndels=NUMBER (PE): Spanning read score threshold. The higher the single-read alignment score before realignment, the more unlikely it is to see this pattern of mismatches given the read's quality values. Default threshold value is 25. Drop this value to add more reads into the indel-finding process, at the possible expense of introducing noise. For an alignment with no mismatches this option should be set at zero.
--indelsPrasThreshold=NUMBER (PE): Paired-read alignment score threshold. If a read has a paired-read alignment score of at least this, then it is used to update the base quality stats for that sample prep. Default is calculated based off the data.
--indelsAlignScoreThresh=NUMBER (PE): If an alignment score for a read exceeds this threshold after realignment, then the output file is updated to incorporate this new alignment. Otherwise the read's entry remains as per the input file. Default value is 120. A low value will cause some reads to be wrongly placed, albeit within a small interval.
--indelsSdFlankWeight=NUMBER (PE): Number of standard deviations to use when defining the genomic interval to align the read to (default: 1).
--indelsMinGroupSize=NUMBER (PE): Only output clusters if they contain at least this many r
--indelsSpReadThresholdClusters=NUMBER (PE)
--indelsMinCoverage=NUMBER (PE)
29.  in the table below.

Option: Description
--read1=FILENAME: Read 1 export file (mandatory). File may be gzipped with .gz extension.
--read2=FILENAME: Read 2 export file. File may be gzipped with .gz extension.
--nofilter: Include reads that failed the basecaller purity filter.
--qlogodds: Assume export file(s) use logodds quality values as reported by Pipeline prior to 1.3.
--version: Prints version information.
--help: Prints on-screen usage guide.

Example

An example of illumina_export2sam.pl use is as follows:

/path-to-CASAVA/bin/illumina_export2sam.pl --read1=NA10831_ATCACG_L001_R1_001_export.txt.gz --read2=NA10831_ATCACG_L001_R2_001_export.txt.gz > s_2_converted.sam

This will write an output file s_2_converted.sam that contains the paired-end reads from s_2_1_export.txt and s_2_2_export.txt.

Glossary

Bayesian model
A Bayesian model provides a means to update a prior hypothesis based on evidence. As an example, in a Bayesian genotype model we may have a prior hypothesis that our sample genotype matches that from a reference sample with probability q. After accounting for evidence of the sample genotype in the form of sequencing reads which are inconsistent with the reference genotype, our hypothesis is updated such that the probability of the sample genotype matching the reference has been reduced to a value less than q.

De Bruijn graph
A De Bru
30.  indel genotype calling. Whenever an indel larger than this size is nominated by a de novo assembly contig, it is handled as two independent breakpoints. Note that increasing this value should lead to an approximately linear increase in variant caller memory consumption. The default value is 300 for paired-end builds and 50 for single-end builds.

Example: --variantsMaxIndelSize 200

readBases Counting Method

This method is for exon and gene counts. Before counting, CASAVA converts the alignments to splice junctions into two shorter genomic alignments. Then CASAVA counts the number of bases (not the number of reads) that belong to exons and genes. Bases within both original genomic reads and shorter genomic reads derived from spliced alignments participate in the exon and gene counts.

NOTE: Junction counts (in reads, not bases) are provided for convenience. Because alignments to the junctions are converted to genomic reads before the counting, bases within reads aligned to a splice junction are counted only once for exon and gene counts.

For splice junctions, counts are provided as the number of reads that cover the junction point. The number of bases that fall into the exonic regions of each gene is summed to obtain gene-level counts. The normalized values are calculated as RPKM (Reads Per KiloBase per Million of mapped reads). Since the base counts rather than read counts are used, th
31.  of run folders places the names in chronological order.

2 The second field specifies the name of the sequencing machine. It may consist of any combination of upper or lower case letters, digits, or hyphens, but may not contain any other characters (especially not an underscore). It is assumed that the sequencing platform is synonymous with the PC controlling it, and that the names assigned to the instruments are unique across the sequencing facility.

3 The third field is a four-digit counter specifying the experiment ID on that instrument. Each instrument should be capable of supplying a series of consecutively numbered experiment IDs (incremental unique index) from the onboard sample tracking database or a LIMS.

NOTE: It is desirable to keep Experiment IDs (or Sample IDs) and instrument names unique within any given enterprise. You should establish a convention under which each machine is able to allocate run folder names independently of other machines to avoid naming conflicts.

A run folder named 070108_instrument1_0147 indicates experiment number 147, run on instrument 1, on the 8th of Jan 2007. While the date and instrument name specify a unique run folder for any number of instruments, the addition of an experiment ID ensures both uniqueness and the ability to relate the contents of the run folder back to a laboratory notebook or LIMS.

Additional information is captured in the run folder name in fields separated by an underscore from t
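The naming convention above lends itself to a simple parser. The sketch below is illustrative only: it assumes the first field is a six-digit YYMMDD date (inferred from the 070108 example, since the description of that field is cut off here), and the function name and returned keys are hypothetical.

```python
import re
from datetime import datetime

# date (YYMMDD, assumed) _ instrument (letters, digits, hyphens) _ 4-digit experiment ID
RUN_FOLDER = re.compile(r"(\d{6})_([A-Za-z0-9-]+)_(\d{4})(?:_(.*))?")


def parse_run_folder(name):
    """Split a run folder name into date, instrument, experiment ID and extra fields."""
    match = RUN_FOLDER.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid run folder name: {name!r}")
    date_text, instrument, experiment, extra = match.groups()
    return {
        "date": datetime.strptime(date_text, "%y%m%d").date(),
        "instrument": instrument,
        "experiment_id": int(experiment),
        "extra": extra.split("_") if extra else [],
    }


if __name__ == "__main__":
    print(parse_run_folder("070108_instrument1_0147"))
```

Because underscores separate the fields, an instrument name containing an underscore is rejected, which matches the restriction stated in the text.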
32.  produced, consisting of all sites with Q(snp) > 0.

8 A final filtration step is taken to remove potentially spurious SNP calls near the centromeres and within high copy number regions. This is done by calculating the mean used depth for each chromosome, and filtering out all SNP calls which occur at a used depth which is greater than 3 times this chromosomal mean.

Variant Detection Q-Scores

Quality Scores

A quality score (or Q-score) expresses an error probability. In particular, it serves as a convenient and compact way to communicate very small error probabilities.

Given an assertion, A, the probability that A is not true, P(~A), is expressed by a quality score, Q(A), according to the relationship:

Q(A) = -10 log10(P(~A))

where P(~A) is the estimated probability of an assertion A being wrong.

The relationship between the quality score and error probability is demonstrated with the following table:

Quality score, Q(A)    Error probability, P(~A)
10                     0.1
20                     0.01
30                     0.001

Variant Genotypes

In the context of resequencing a diploid individual, a genotype for a single site or indel indicates the two alleles that are present.

The set of diploid site genotypes considered by the CASAVA v1.8 model for SNPs are: AA, CC, GG, TT, AC, AG, AT, CG, CT, GT. For example, given a site in the genome with a reference base of C, the homozygous reference genotype is CC. A prediction of a SNP at that site is an assertion that th
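The quality score relationship above, Q(A) = -10 log10(P(~A)), can be evaluated in both directions with a few lines of code. This is a minimal sketch with hypothetical function names; it simply reproduces the rows of the table (10 for 0.1, 20 for 0.01, 30 for 0.001).

```python
import math


def q_from_error(error_probability):
    """Quality score for a given probability that the assertion is wrong."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_from_q(q_score):
    """Error probability implied by a quality score (inverse of q_from_error)."""
    return 10 ** (-q_score / 10)


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(p, round(q_from_error(p), 6))
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability.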
33.  sort and bam modules, instead.

This section describes the usage of the illumina_export2sam.pl script. The script is located in CASAVA's bin directory and is an update to the SAMtools script export2sam.pl, redistributed in CASAVA under the MIT license (see http://sourceforge.net/projects/samtools/develop).

NOTE: Use CASAVA's illumina_export2sam.pl script instead of the SAMtools script. The illumina_export2sam.pl script has a number of updates that are important for proper conversion of ELANDv2e alignments. See the script header for a full list of these updates.

SAM Format

The Sequence Alignment/Map (SAM) format is a generic format for storing large nucleotide sequence alignments. SAM files have a .sam extension, and consist of one header section and one alignment section. The whole header section can be absent, but keeping the header is recommended.

This section provides the information relevant for the SAM files generated by CASAVA; a detailed description of the generic SAM format is available from samtools.sourceforge.net.

To generate a SAM file, see Introduction on page 168.

Header Section

The Illumina SAM files start with @PG, which indicates that the first line is a header line (@) of the program type (PG). The line is TAB delimited and each data field has an explicit field tag, which is represented using two ASCII characters, as described below.

Tag   Description
ID    Program name
VN    Program ver
34.  the analysis is done, review the analysis for each sample. See Demultiplex_Stats File on page 42.

Example Bcl Conversion and Demultiplexing

An example of a demultiplexing run is as follows.

1 Enter:
/path-to-CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_dir> --output-dir <Unaligned> --sample-sheet <input_dir>/SampleSheet.csv

2 Go to the <Unaligned> folder.

3 Run:
nohup make -j 3

Step one will produce a set of directories in the Unaligned directory. Reads with an unresolved or erroneous index are placed in the Undetermined_indices directory.

Options for Bcl Conversion and Demultiplexing

The options for demultiplexing are described below.

Option: Description
--fastq-cluster-count: Maximum number of clusters per output FASTQ file. Do not go over 16000000, since this is the maximum number of reads we recommend for one ELAND process. Specify 0 to ensure creation of a single FASTQ file. Defaults to 4000000.
--input-dir: Path to a BaseCalls directory. Defaults to current dir.
--output-dir: Path to demultiplexed output. Defaults to <run_folder>/Unaligned. Note that there can be only one Unaligned directory by default. If you want multiple Unaligned directories, you will h
--positions-dir
--positions-format
--filter-dir
--intensities-dir
-s, --sample-sheet
--tiles
--use-bases-mask
35.  to identify any issues which may be specific to a certain lane, or group of tiles.
  Cluster Density Box Plots: These plots show the raw cluster densities per lane, and the clusters passing filter.

NOTE: Many of the run quality metrics are depicted as box plots. In these plots, the red line shows the median, the box delimits the middle 50% of the data (interquartile range), and the error bars indicate the sample minimum and maximum.

The sections below describe a number of examples of good runs and bad runs.

Excellent Quality Metrics

The figure below shows a screen shot from SAV displaying a run with excellent quality metrics. Note the trend of high Q-scores (% >Q30) across each cycle (left side) and the cumulative distribution of % >Q30 among the reads (right side).

Figure 3 SAV Screenshot Showing Excellent Quality Metrics (panels: Data By Cycle, % >Q30, Lane 4, Both Surfaces; QScore Distribution, Lane 4, Both Surfaces, All Cycles)

Low Diversity Samples

The figure below shows a screen shot from SAV displaying the percent base per cycle for a low diversity sample, which might result from sequencing a small number of PCR artifacts.

Figure 4 Low Diversity Samples (panel: Data By Cycle, % Base, Lane 5, Both Surfaces, All Bases)

In contrast, the figure below shows the percent base per cycle g
36.  ungapped_alignment mismatches gapped alignment mismatches gapped alignment. If the ratio for a given alignment exceeds a certain value (set to 3:1 by default), we insert a gap.

If either of the two conditions is not satisfied, we return an ungapped alignment as the result.

ELANDv2e Alignment Improvements

CASAVA 1.8 features ELANDv2e. This updated alignment program includes the following new features:
  Better repeat resolution
  A new orphan aligner
  Shorter run times with a new version of alignmentResolver

Figure 21 ELANDv2e Workflow (CASAVA v1.8 improvements; stages shown in the figure)
  Finding seed hits: Stage 0, single seed; Stage 1, overlapping seeds (multiple seeds). Increases sensitivity and resolves repeats.
  Gapped alignment: Extract 5 bases (marked in orange) on either side of a hit. Perform a banded global alignment to account for indels.
  Resolving orphans: Increases alignment and improves indel finding. If one read anchors the read pair, do a local realignment of the other read in the vicinity of the anchored read. In the illustration, read 2 has multiple mappings (shown in red) and is realigned locally using read 1 (green) as an anchor.
  Scoring alignments: Estimate insert size distribution from uniquely aligning reads and score reads. Score read pairs according to mismatches to the re
37.  value of 100 means that 100 consecutive bases match the reference.
  Mismatched bases are indicated by a base (ACGTN), where the letter indicates the reference base.
  Insertions and deletions start with a ^ character and are closed with a $ character. A number indicates an insertion in the read of that size; a base (or number of bases) indicates the sequence of the reference that was deleted in the read.
For example, the string 30^1$28G means the following:
  30: 30 bases matching reference
  ^1$: one-base insertion in read
  28: 28 bases matching reference
  G: reference base G is mismatched in read

Tag: Value Field
XC: Provides read status information normally conveyed in the chromosome field of the export.txt file for unmapped reads. Specifically, "XC:Z:QC" is used to mark an ELAND QC failure read, "XC:Z:RM" is used to mark an ELAND repeat mask read, and "XC:Z:CONTROL" is used to mark a control read. No optional field is added to reads which are marked as no match ("NM") in the export file; it is understood that this is the default status of an unmapped read.

Usage

For export to SAM conversion, enter the following:

/path-to-CASAVA/bin/illumina_export2sam.pl --read1=FILENAME [options] > outputfile.sam

Make sure to specify an output file, or else the output gets written to the screen.

The options are described
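The match descriptor string explained above can be tokenized with a small regular expression. The sketch below is an illustration rather than CASAVA code: it assumes that insertions and deletions are opened with ^ and closed with $, that bases are upper-case A, C, G, T or N, and it uses a hypothetical function name.

```python
import re

# One token per alternative: match run, mismatch, insertion (^N$), deletion (^BASES$).
TOKEN = re.compile(r"(\d+)|([ACGTN])|\^(\d+)\$|\^([ACGTN]+)\$")


def parse_match_descriptor(descriptor):
    """Return a list of (event, value) tuples describing the alignment."""
    events = []
    position = 0
    while position < len(descriptor):
        match = TOKEN.match(descriptor, position)
        if match is None:
            raise ValueError(f"unexpected text at offset {position}: {descriptor[position:]!r}")
        run, mismatch, insertion, deletion = match.groups()
        if run is not None:
            events.append(("match", int(run)))           # bases matching the reference
        elif mismatch is not None:
            events.append(("mismatch", mismatch))        # reference base at the mismatch
        elif insertion is not None:
            events.append(("insertion", int(insertion)))  # inserted bases in the read
        else:
            events.append(("deletion", deletion))        # reference bases missing from the read
        position = match.end()
    return events


if __name__ == "__main__":
    print(parse_match_descriptor("30^1$28G"))
```

For the example string, the parser yields a 30-base match, a one-base insertion, a 28-base match, and a mismatch against reference base G, in that order.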
38.  want all reads in a FASTQ file, use the --with-failed-reads option.

Control Values

The tenth column (<control number>) is zero if the read is not identified as a control. If the read is identified as a control, the number is greater than zero, and the value specifies what kind of control it is. The value is the decimal representation of a bit-wise encoding scheme, with bit 0 having a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on.

The bits are used as follows:
  Bit 0: always empty (0)
  Bit 1: was the read identified as a control?
  Bit 2: was the match ambiguous?
  Bit 3: did the read match the phiX tag?
  Bit 4: did the read align to match the phiX tag?
  Bit 5: did the read match the control index sequence?
  Bits 6, 7: reserved for future use
  Bits 8..15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)

Quality Scores

A quality score (or Q-score) expresses an error probability. In particular, it serves as a convenient and compact way to communicate very small error probabilities.

Given an assertion, A, the probability that A is not true, P(~A), is expressed by a quality score, Q(A), according to the relationship:

Q(A) = -10 log10(P(~A))

where P(~A) is the estimated probability of an assertion A being wrong.

The relationship between the quality score and error probability is demonstrated with the following table:
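The bit-wise control value described above can be unpacked with shifts and masks. This is a minimal sketch based only on the bit layout listed in the text; the flag names and the function name are hypothetical labels chosen for illustration.

```python
# Bit positions taken from the list in the text; the names are illustrative labels.
CONTROL_FLAGS = {
    1: "identified_as_control",
    2: "match_ambiguous",
    3: "matched_phix_tag",
    4: "aligned_to_phix_tag",
    5: "matched_control_index",
}


def decode_control_value(value):
    """Return (flags, report_key) for the decimal control number of a read."""
    flags = {name: bool((value >> bit) & 1) for bit, name in CONTROL_FLAGS.items()}
    report_key = (value >> 8) & 0xFF  # bits 8 to 15 hold the report key
    return flags, report_key


if __name__ == "__main__":
    # 2 (bit 1) + 8 (bit 3) + 3 * 256 (report key 3) = 778
    print(decode_control_value(778))
```

A value of zero decodes to all flags unset, which corresponds to a read that is not identified as a control.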
39.  you are using for alignment, and are available from iGenome for the most common model organisms (Getting Reference Files on page 128).

Running Variant Detection and Counting

The major use cases for running CASAVA variant detection and counting are listed below.

Set additional options to define the type of analysis you want to perform for each project. The options are listed in the next section.

Major Use Cases

SNP and Indel Calling
To run CASAVA with callSmallVariants and assembleIndels, enter:
/path-to-CASAVA/bin/configureBuild.pl [options]

SNP and Indel Calling Without Large Indel Assembly
To run CASAVA with callSmallVariants, but without assembleIndels, enter:
/path-to-CASAVA/bin/configureBuild.pl --targets all noassembleIndels --variantsSkipContigs [options]

SNP and Indel Calling, Single-End Build
To run CASAVA with callSmallVariants for a single-end build, enter:
/path-to-CASAVA/bin/configureBuild.pl [options]

RNA Sequencing
To run CASAVA for RNA Sequencing, enter:
/path-to-CASAVA/bin/configureRnaBuild.pl [options]

Variant Detection and Counting

Other Use Cases

Help
To get the CASAVA Help for callSmallVariants, enter:
/path-to-CASAVA/bin/configureBuild.pl --help callSmallVariants

Rerun callSmallVariants
In any pre-existing build in which the sort module was previously completed (and the assembleIndels module for a paired-end build), small variant calling may be
40.  you will find the config.xml file that records any information specific to the generation of the subfolders. This contains a tag-value list describing the cycle-image folders used to generate each folder of intensity and sequence files.

In the BaseCalls folder there is another config.xml file containing the meta-information about the base caller runs.

Adapter Sequences File

The adapter sequences FASTA file contains the Illumina adapter sequences, and needs to be provided if the option --adapter-masking is used. FASTA files for various Illumina adapters are available from the Illumina website (through iCom).

Bcl Conversion and Demultiplexing

Generating the Sample Sheet

The user-generated sample sheet (SampleSheet.csv file) describes the samples and projects in each lane, including the indexes used. The sample sheet should be located in the BaseCalls directory of the run folder. You can create, open, and edit the sample sheet in Excel.

The sample sheet contains the following columns:

Column Header   Description
FCID            Flow cell ID
Lane            Positive integer, indicating the lane number (1-8)
SampleID        ID of the sample
SampleRef       The reference used for alignment for the sample
Index           Index sequences. Multiple index reads are separated by a hyphen (for example, ACCAGTAA-GGACATGA).
Description     Description of the sample
Control         Y indicates this lane is a control lane, N means sample
Recipe          Recipe used during sequencing
Operator        Name or ID of the
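A sample sheet with the columns listed above can be read with a standard CSV parser. The sketch below is illustrative only: it assumes a header row using exactly the column names shown in this excerpt (the full list is cut off here, so a real sheet may contain more columns), and the function name and example values are hypothetical.

```python
import csv
import io


def read_sample_sheet(text):
    """Parse sample sheet text into a list of dictionaries, one per row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row = dict(row)
        row["Lane"] = int(row["Lane"])
        # Multiple index reads are separated by a hyphen.
        row["Index"] = row["Index"].split("-") if row["Index"] else []
        # Y marks a control lane, N a sample lane.
        row["Control"] = row["Control"].strip().upper() == "Y"
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = (
        "FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator\n"
        "FC1234,1,sample1,human,ACCAGTAA-GGACATGA,test sample,N,R1,operator1\n"
    )
    for entry in read_sample_sheet(example):
        print(entry["Lane"], entry["SampleID"], entry["Index"], entry["Control"])
```

Splitting the Index column on the hyphen gives one entry per index read, which is convenient when checking index lengths before demultiplexing.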
41. Qseq Conversion

Qseq Converter Parameters

The Qseq Converter parameters that can be entered are listed below.

Parameter: Description
--input-dir DIRECTORY: Path to _qseq.txt directory. No default.
--output-dir DIRECTORY: Path to root of CASAVA 1.8 unaligned directory structure. Directory will be created if it does not exist. Default: <input-dir>/QseqToFastq/Unaligned
--fastq-cluster-count INTEGER: Maximum number of fastq records per fastq file. Default: 4 000
--config-file FILENAME: Specify the Bustard config file to be copied to the fastq directory. Default: <input-dir>/config.xml
--flowcell-id STRING: Use the specified string as the flow cell id. Default value is parsed from the config file.

Qseq Converter Output Data

The Qseq Converter generates the following output:
  gzipped FASTQ files in the directory structure configureAlignment expects (configureAlignment Input Files on page 48).
  If found, Qseq Converter copies the basecalling config.xml to the root of the FASTQ directory structure and renames it DemultiplexedBustardConfig.xml, which is the file expected by configureAlignment.
  Qseq Converter also creates a default sample sheet in the destination directory.
  IVC.htm and corresponding plots are in the same directory where the qseq files are.

NOTE: configureAlignment in CASAVA 1.8 will fail if you try to run it a
42. Parameter: Description
r1: Runs Bcl conversion for read 1. Can be started once the last read has started sequencing.
POST_RUN_COMMAND_R1: A Makefile variable that can be specified either on the make command line or as an environment variable to specify the post-run commands after completion of read one, if needed. Typical use would be triggering the alignment of read 1.
POST_RUN_COMMAND: A Makefile variable that can be specified on the make command line to specify the post-run commands after completion of the run.
KEEP_INTERMEDIARY: The option KEEP_INTERMEDIARY tells CASAVA not to delete the intermediary files in the Temp dir after Bcl conversion is complete. Usage: KEEP_INTERMEDIARY:=yes

NOTE: If you specify one of the more specific workflows and then run a more general one, only the difference will get processed. For instance:
make -j N r1
followed by:
make -j N
will do read 1 in the first step, and read 2 in the second one.

Starting Bcl Conversion for Read 1

If you want to start Bcl to FASTQ conversion before completion of the run, use the makefile target r1 at any time after the last read has started (for multiplexed runs, this is after completion of the indexing read).

1 Enter the following command to create a makefile for Bcl conversion:
/path-to-CASAVA/bin/configureBclToFastq.pl [options]

2 Move into the newly created Unaligned folder specified by --output-dir.

3 Type the "make r1" command:
make -j 8
43. Using ANALYSIS eland_pair

Based heavily on ANALYSIS eland_extended, ANALYSIS eland_pair allows the analysis of a paired-read run using ELANDv2e alignments. As part of the analysis, it will:
  Align both read 1 and read 2 to the reference genome
  Determine the insert size distribution of the sample
  Use the insert size distribution to resolve repeats and ambiguities

The export.txt.gz files are meant to contain all information necessary for downstream processing of the alignment data. Other files produced that may be useful in some circumstances are:
  s_N_1_eland_extended.txt, s_N_2_eland_extended.txt: these contain the candidate alignments for each read 1 and read 2. The software chooses from these possibilities in attempting to pick the best alignment of the read pair.

For a detailed description of the export.txt files, see Text Based Analysis Results.

Multiseed, Gapped, Repeat, Orphan Alignment

ANALYSIS eland_pair performs the following alignment features implemented in ELANDv2 and ELANDv2e:
  By default performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately.
  Uses a gapped alignment method to extend each candidate alignment to the full length that allows for gaps (indels) of up to 10 bases.
  Aligns reads in repeat regions using two new modes, semi-repeat resolution and full-repeat resolution. Full-repeat resolution is more sensitive and places more reads in repeat regions, but will r
44. 18 Sequence Chromosomes  PROJECT Project1 ANALYSIS eland pair  PROJECT Projectl USE BASES y n y n    Assignment by SAMPLE    If you just want to align the samples from your sample named Samplel  generate the  following config txt file   ELAND GENOME  lt GenomesFolder gt  iGenomes Homo _  sapiens UCSC hg18 Sequence Chromosomes  SAMPLE Samplel ANALYSIS eland pair  SAMPLE Samplel USE BASES y n y n    Assignment by REFERENCE    If you want to align the samples assigned to a human reference in the sample sheet   generate the following config txt file   ELAND GENOME  lt GenomesFolder gt  iGenomes Homo    sapiens UCSC hg18 Sequence Chromosomes  REFERENCE human ANALYSIS eland pair  REFERENCE human USE BASES y n y n    Parti 15011196 Rev D    The requirements and options for the configureAlignment configuration file are  described in configure Alignment Configuration File on page 54     Full Size Example    A full sized example of a config txt is shown below   123456  ANALYSIS eland pair  78  ANALYSIS eland rna  ELAND GENOME  data pipeline in genomes human hg19 fasta     123456  USE BASES Y n Y n   ei USE BAGES YOOn  n     REFERENCE human ELAND GENOME  data pipeline in genomes human hgl9  fasta    REFERENCE human ELAND RNA GENOME ANNOTATION  data pipeline  in genomes human humanrefflat refFlat txt gz   REFERENCE human ELAND RNA GENOME CONTAM  data pipeline  in genomes human contams fasta     REFERENCE phix ANALYSIS eland pair  REFERENCE phix ELAND GENOME  data pipeline in genomes phi 
45. 2xN 11: The bits are used as follows (where N is the cluster index):
- Bit 0: always empty (0)
- Bit 1: was the read identified as a control?
- Bit 2: was the match ambiguous?
- Bit 3: did the read match the phiX tag?
- Bit 4: did the read align to match the phiX tag?
- Bit 5: did the read match the control index sequence?
- Bits 6, 7: reserved for future use
- Bits 8..15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)

Position Files

The BCL to FASTQ converter can use different types of position files and will expect a type based on the version of RTA used:
- *.locs: the locs files can be found in the Intensities directory.
- *.clocs: the clocs files are compressed versions of locs files and can be found in the Intensities directory.
- *_pos.txt: the pos files can be found in the Intensities directory.

The *_pos.txt files are text files with 2 columns and a number of rows equal to the number of clusters. The first column is the X coordinate and the second column is the Y coordinate. Each line has a <cr><lf> at the end.

RunInfo.xml File

The top-level Run Folder contains a RunInfo.xml file. The file RunInfo.xml (normally generated by SCS/HCS) identifies the boundaries of the reads (including index reads).

The XML tags in the RunInfo.xml file are self-explanatory.

config.xml Files

In the Intensities folder
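The control-value bit layout listed earlier on this page can be unpacked with a few masks and shifts. The following is a minimal Python sketch, not part of CASAVA: the names `ControlFlags` and `decode_control_value` are hypothetical, and it assumes that bit 0 is the least significant bit of a 16-bit unsigned value, which the guide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ControlFlags:
    """Decoded view of one 16-bit control value (hypothetical helper type)."""
    is_control: bool             # bit 1
    ambiguous_match: bool        # bit 2
    matched_phix_tag: bool       # bit 3
    aligned_phix: bool           # bit 4
    matched_control_index: bool  # bit 5
    report_key: int              # bits 8..15


def decode_control_value(value: int) -> ControlFlags:
    """Split a 16-bit control value into its flags.

    Assumption: bit 0 is the least significant bit. Bits 0, 6 and 7 are
    ignored because the text describes them as empty or reserved.
    """
    if not 0 <= value <= 0xFFFF:
        raise ValueError("control value must fit in 16 bits")
    return ControlFlags(
        is_control=bool(value & (1 << 1)),
        ambiguous_match=bool(value & (1 << 2)),
        matched_phix_tag=bool(value & (1 << 3)),
        aligned_phix=bool(value & (1 << 4)),
        matched_control_index=bool(value & (1 << 5)),
        report_key=(value >> 8) & 0xFF,
    )
```

For example, `decode_control_value((5 << 8) | 0b10)` describes a read flagged as a control whose report key is 5.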
46. Figure 17 Coverage Graph in Home.html (CASAVA 1.8.0a1 101019 PE DNA Seq analysis; panels: mean depth at known sites and fraction of known sites mapped, plotted for each reference sequence, c1.fa through cY.fa)
48. A Human seq_gene.md.gz

--seqGeneMdGroupLabel (Application: SE)
The group label specifies which assembly to use in the seq_gene file, and is found in column 13 of the file (seq_gene files can hold entries for multiple assemblies). Required for RNA counting when you use the annotation seqGeneMd file from NCBI.
Example: --seqGeneMdGroupLabel="GRCh37.p2-Primary Assembly"

Options for Target bam

The options described below are used to specify analysis for target bam.

Table 21 Analysis Options for bam

Option | Application | Description
--bamChangeChromLabels OFF|NOFA|UCSC | SE, PE | Change chromosome labels in the bam plugin output. The available behaviors are:
    OFF: Use unmodified CASAVA chromosome labels (default behavior).
    NOFA: Remove any ".fa" suffix found on each chromosome label. For example, "c11.fa" is changed to "c11".
    UCSC: Remove any ".fa" suffix found on each chromosome label and attempt to map the result to the corresponding UCSC human chromosome label. For example, "c11.fa" is changed to "chr11".
--bamSkipRefSeq | SE, PE | Do not generate a reference sequence file with each bam file. The default behavior can be restored with --no-bamSkipRefSeq.

Configuring Multiple Runs

To add multiple runs, you can modify the run configuration file (run.conf.xml):
a Go to Human/conf/run.conf.xml (see Run.conf.xml on page 93).
b Add the additional entries to the run.conf.xml file.
c Then run the configuration again by executing:
configureBuild.pl -p Human
50. 3 CASAVA v1.8: Align unmapped reads using overlapping seeds.
4 Report seeds that hit a non-repetitive sequence.

Orphan Alignment

ELANDv2e performs orphan alignment by identifying read pairs for which only one of the reads aligns. ELANDv2e tries to align the other read in a defined window (by default 450 bp). If the number of mismatches is <10% of the read length, ELANDv2e reports the alignment.

Figure 23 Orphan Alignment
1 Identify orphan read pairs. One read maps well (green); the other has multiple mappings (red).
2 With the mapped read as anchor, generate a 450 bp window.
3 Do local realignment of the unmapped read within the window.

Alignment Performance Improvements

The multiple component updates in CASAVA were designed to improve overall alignment performance. To assess the performance change, alignment percentage, mismatch rates, and CPU run times were compared for three different configurations: CASAVA v1.7, CASAVA v1.8 with semi-repeat resolution, and CASAVA v1.8 with full-repeat resolution. The data set consisted of three lanes of HiSeq data from a single sample sequenced with TruSeq v3 chemistry. The analysis was performed o
51. Figure 15 CASAVA Build Directory

ProjectDir: project directory
  Parsed_xx-xx-xx: current build directory (final files are here)
    notMapped: non-mapping reads (only in archival builds)
    c1.fa: build chromosome directory
      chromosome bin directory
        file with genotype calls
        snps.txt: file with SNPs
        indels.txt: file with indels
      Chromosome_exon_count.txt: file with exon counts (RNA Sequencing only)
      Chromosome_splice_count.txt: file with splice counts (RNA Sequencing only)
      Chromosome_gene_count.txt: file with gene counts (RNA Sequencing only)
      bam: chromosome bam directory
        sorted.bam: file with sorted sequence reads
  conf: configuration directory
  stats: directory with stats text reports
  html: directory with html reports
  genome: genome directory
    bam: bam directory
      file with whole genome in BAM format (.bam)

The most important folders for downstream analysis are listed below:
- Html Folder
  The html folder contains the build summary html pages (see Build Html Page on page 104), which provide access to run information and graphs of important statistics.
- Parsed_xx-xx-xx folder
  The Parsed_xx-xx-xx folder contains most of the sequencing information, such as sorted alignments, SNP and indel calls, and, for RNA Sequencing, gene counts, exon counts, and splice junction counts (see CASAVA Build on page 105). This information is organized in chromosome folders named c1 or c2, for exampl
52. List of Tables  vii

Chapter 1 Overview  1
  Introduction  2
  CASAVA Features  5
  What's New  9
  Frequently Asked Questions  10
Chapter 2 Interpretation of Run Quality  11
  Introduction  12
  Quality Tables and Graphs  13
  Summary Tab  17
Chapter 3 Bcl Conversion and Demultiplexing  19
  Introduction  20
  Bcl Conversion Input Files  26
  Running Bcl Conversion and Demultiplexing  32
  Bcl Conversion Output Folder  37
Chapter 4 Sequence Alignment  45
  Introduction  46
  configureAlignment Input Files  48
  Running configureAlignment  53
  configureAlignment Output Files  73
  Running ELAND as a Standalone Program  65
Chapter 5 Variant Detection and Counting  8
  Introduction  88
  RI AA  91
  Variant Detection Input Files  93
  Running Variant Detection and Counting  96
  Variant Detection and Counting Output Files  102
Appendix A Requirements and Software Installation  111
  Hardware and Software Requirements
53. FOR RESEARCH USE ONLY

ILLUMINA PROPRIETARY
Part # 15011196 Rev D
December 2011

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS
54. Summary Tab

Another tab in the status.htm page or SAV that you should examine is the Summary tab. The key parameters are listed in the following sections, along with conditions, possible causes for those conditions, and suggested actions to correct the condition.

Clusters

This column contains the average number of clusters per tile detected in the first cycle images.

Condition | Possible Cause | Suggested Action
Fewer clusters than expected | | Reanalyze with new default offsets in OLB. You will need *.cif files for that.
  Few bright clusters | Problem with cluster formation on the flow cell |
  Blurred images | Poor focus or dirty flow cell surface |
  Lots of clusters visible | Cluster density or size is too great to distinguish individual objects |
More clusters than expected | |
  Too many clusters | Problem with cluster formation on the flow cell |
  Very large clusters | Double counting |

Average First Cycle Intensity

Generally, brighter is better, but this result is instrument and sample dependent.

Condition | Possible Cause
Low intensity | Problem with cluster formation or poor focus

Percentage of First Cycle Intensity Remaining After 20 Cycles of Sequencing

Generally, the higher, the better. The intensity remaining can be sample dependent.

Condition | Possible Cause | Suggested Action
Low value | A correct measure of rapid signal decay deduced | Check expe
55. ELAND treats the reference sequence as being in "blocks" of 16 MB, of which there can be at most 240. This limits the total length of DNA that ELAND can match against in a single run. In a single ELAND run you can match against:
- One file of at most 240 x 16 MB (given as 3824 MB)
- 239 files, each up to 16 MB in size
- Something in between, such as 24 files of up to 160 MB each (the NCBI human genome will fit)

Additional eland_rna Input Files

The following additional files are needed for eland_rna:
- refFlat.txt.gz or seq_gene.md.gz file: as of CASAVA 1.7, eland_rna uses the refFlat.txt.gz or seq_gene.md.gz file to generate the splice junction set automatically. The refFlat.txt.gz file is available from UCSC, while the seq_gene.md.gz file is from NCBI. They should be provided gzip compressed, and should be from the same build as the reference files you are using for alignment. This removes the need to provide separate splice junction sets as in earlier versions of CASAVA. The parameter to use for either one is ELAND_RNA_GENOME_ANNOTATION.

  WARNING
  Do not change the names of the refFlat.txt.gz or seq_gene.md.gz file. CASAVA uses the name to determine the type of file.

- A set of contaminant sequences for the genome, typically the mitochondrial and ribosomal sequences. These must be in single FASTA format. The parameter to use to direct to the contaminant seque
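The 16 MB block limit described at the top of this page lends itself to a quick pre-flight check before starting an alignment. The following is a minimal Python sketch, not an ELAND tool: the function names are hypothetical, and it assumes that each reference file occupies a whole number of 16 MB blocks, which matches the "24 files of up to 160 MB each" example but is not stated outright. The limit defaults to 240 blocks; since the page also mentions 239 files, a stricter limit can be passed in.

```python
import math

BLOCK_MB = 16     # block size stated in the text
MAX_BLOCKS = 240  # maximum number of blocks stated in the text


def blocks_needed(file_sizes_mb):
    """Count the 16 MB blocks used, assuming each file is rounded up to whole blocks."""
    return sum(math.ceil(size / BLOCK_MB) for size in file_sizes_mb)


def fits_in_one_run(file_sizes_mb, max_blocks=MAX_BLOCKS):
    """Return True if the reference files stay within the block limit."""
    return blocks_needed(file_sizes_mb) <= max_blocks


# 24 files of 160 MB use 10 blocks each, 240 in total, so they fit.
print(fits_in_one_run([160] * 24))
# One more such file pushes the total to 250 blocks.
print(fits_in_one_run([160] * 25))
```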
56. ND process; needed to ensure that the memory usage stays below 2 GB. The optimal value is such that there are approximately 10 to 13 million lines (reads) in one set.
Only available for ANALYSIS eland_extended, ANALYSIS eland_pair, and ANALYSIS eland_rna.
See ELAND_FASTQ_FILES_PER_PROCESS on page 65 for more information. Default value is 3.
ANALYSIS Variables on page 61

WARNING
Default for USE_BASES is Y*n, which means perform a single-read alignment and ignore the last base. If running ANALYSIS eland_pair, make sure to specify the USE_BASES option for two reads, for example USE_BASES Y*n,Y*n.

Optional Parameters

Table 3 configureAlignment Configuration File Optional Parameters

Parameter | Definition
SINGLESEED | If SINGLESEED is set to --singleseed, ELANDv2e aligns only in singleseed mode. Only available for ANALYSIS eland_extended and ANALYSIS eland_pair, for which multiseed alignment is default. See ELANDv2 Algorithm Description on page 133 for more information.
UNGAPPED | If UNGAPPED is set to --ungapped, ELANDv2e aligns only in ungapped mode. See ELANDv2 Algorithm Description on page 133 for more information.
INCREASED_SENSITIVITY | If you specify INCREASED_SENSITIVITY --sensitive, ELANDv2e aligns in full-repeat mode. Semi-repeat resolution alignment is default. You can also use INCREASED_SENSITIVITY=--sensitive on the command line. See Repeat Resolution on page 137 for more information.
57. Note that as a consequence of the candidate indel discovery process, indels can be called using either gapped alignments or Grouper contig alignments as input, and the evidence from these two sources will be combined if both are available. Typically, gapped alignments can be used to efficiently identify relatively small indels (roughly 1-10 bases in length), whereas local contig assembly can efficiently identify much larger indels. The greatest indel sensitivity can be achieved by generating candidate indels from both of these sources.

The parameters described for candidate indel filtration above are configurable as described in the CASAVA User Guide. Accepting too many candidate indels increases runtime and can lead to occasional spurious indel calls or poorly realigned reads in noisy regions of the genome.

Realignment and Indel Calling

For the second stage of indel calling, the variant caller realigns all intersecting reads to each candidate indel, in addition to aligning the read to the reference and any alternate indel candidates at the same site. It is common for reads that intersect the indel location to support the indel and reference alignments equally well, so the model is designed in such a way that these reads do not affect the genotype call.

The relative likelihoods of all alignments for each read are used to assign probabilities to each of three possible indel genotypes: homozygous, heterozygous, or not present. The result of this calcu
58. OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S).

FOR RESEARCH USE ONLY

2009-2011 Illumina, Inc. All rights reserved.

Illumina, illuminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, SeqMonitor, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

Revision History

Part #   | Revision | Date          | Description of Change
15011196 | D        | December 2011 | Updates in FASTQ file control column description
15011196 | C        | October 2011  | Supports dual indexing and adapter masking for CASAVA v1.8.2
15011196 | B        | May 2011      | Supports CASAVA v1.8
15011196 | A        | March 2010    |
1509919  | A        | November 2009 |

Table of Contents

Revision History  iii
Table of Contents  v
List of Tables
59. Bcl Conversion and Demultiplexing

Option | Description
--use-bases-mask (continued) | The read masks are separated by commas ",". The format for dual indexing is as follows: --use-bases-mask Y*,I*,I*,Y* or variations thereof as specified above. If this option is not specified, the mask will be determined from the RunInfo.xml file in the run directory. If it cannot do this, you will have to supply the --use-bases-mask.
--no-eamss | Disable the masking of the quality values with the Read Segment Quality control metric filter.
--mismatches | Comma-delimited list of number of mismatches allowed for each read (for example: 1,1). If a single value is provided, all index reads will allow the same number of mismatches. Default is 0.
--flowcell-id | Use the specified string as the flowcell id (default value is parsed from the config file).
--ignore-missing-stats | Fill in with zeros when *.stats files are missing.
--ignore-missing-bcl | Interpret missing *.bcl files as no call.
--ignore-missing-control | Interpret missing control files as not-set control bits.
--with-failed-reads | Include failed reads into the FASTQ files (by default, only reads passing filter are included).
--adapter-sequence | Path to a FASTA adapter sequence file. If there are two adapter sequences specified in the FASTA file, the second adapter will be used to mask re
--man |
-h, --help |
60. RALD.

Running ELAND as a standalone program does not perform all of the various steps that are included during a configureAlignment run. The most important differences are:
- ELAND standalone does not generate many of the statistics.
- ELAND standalone is not massively parallel like configureAlignment.

If you require any or all of the above, it is best to create a modified config file to align to a different genome, and rerun configureAlignment. For more information, see Running configureAlignment on page 53.

FASTQ Format

Any FASTQ file will be supported, but the CASAVA FASTQ file format is optimal for populating the appropriate fields. The format is (note the space between y-pos and read number):

@<instrument name>:<run ID>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>

The elements are described below:

Element | Requirements | Description
@ | @ | Each sequence identifier line starts with @
<instrument name> | Characters allowed: a-z, A-Z, 0-9 | Instrument name
<run ID> | Characters allowed: a-z, A-Z, 0-9 | Run ID
<flowcell ID> | Characters allowed: a-z, A-Z, 0-9 | Flowcell ID
<lane> | Numerical | Lane number
<tile> | Numerical | Tile number
<x-pos> | Numerical | X coordinate of cluster
<y-pos> | Numerical | Y coordinate of cluster
<read number> | Numerical | Is usually 1 or 2 for paired-end
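The sequence identifier layout described above can be split into its fields with ordinary string operations. The following is a minimal Python sketch, not a CASAVA utility: `ReadId` and `parse_read_id` are hypothetical names, the colon separators follow the CASAVA 1.8 FASTQ convention (the separators are not legible in this copy of the page), and treating `Y` in the filter field as "filtered" is an assumption.

```python
from typing import NamedTuple


class ReadId(NamedTuple):
    """Fields of one FASTQ sequence identifier line (hypothetical helper type)."""
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool
    control_number: int
    barcode: str


def parse_read_id(line: str) -> ReadId:
    """Parse '@inst:run:flowcell:lane:tile:x:y read:filtered:control:barcode'."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("sequence identifier must start with '@'")
    # The single space separates the cluster location from the read metadata.
    location, sep, meta = line[1:].partition(" ")
    if not sep:
        raise ValueError("expected a space between y-pos and read number")
    loc = location.split(":")
    info = meta.strip().split(":")
    if len(loc) != 7 or len(info) != 4:
        raise ValueError("unexpected number of fields")
    return ReadId(
        instrument=loc[0],
        run_id=loc[1],
        flowcell_id=loc[2],
        lane=int(loc[3]),
        tile=int(loc[4]),
        x_pos=int(loc[5]),
        y_pos=int(loc[6]),
        read_number=int(info[0]),
        is_filtered=(info[1] == "Y"),  # assumption: 'Y' marks a filtered read
        control_number=int(info[2]),
        barcode=info[3],
    )
```

A call such as `parse_read_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")` returns a record with lane 2, read number 1, and barcode `ATCACG`; the identifier itself is an illustrative value, not one taken from this guide.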
62. NOTE
eland_rna does not support paired-end cDNA reads yet.

Prerequisites

Four sets of data files are needed:
- A genome sequence file.
- Fasta files of all chromosomes for on-the-fly splice junction generation.
- refFlat.txt.gz (from UCSC) or seq_gene.md.gz file (from NCBI): as of CASAVA 1.7, eland_rna uses the refFlat.txt.gz or seq_gene.md.gz file to generate the splice junction set automatically. These files come from the following sources:
  - The refFlat.txt.gz file is available from UCSC.
  - The seq_gene.md.gz file is available from NCBI.
  They should be provided gzip compressed, and should be from the same build as the reference files you are using for alignment. This removes the need to provide separate splice junction sets as in previous versions of CASAVA.
- A set of contaminant sequences for the genome: typically the mitochondrial and ribosomal sequences.

Description of the eland_rna Algorithm

The algorithm aligns the reads to each of three targets:
- Contaminants
- Genome
- Splice junctions (alignments need to span the splice junction)

Then a script decides which of the alignments is most likely for each read. The following steps are taken in order:

1 If a read aligns to the contaminants, then the read is discarded. It is marked in the export file as "RM" (for repeat masked).
2 If the read aligns to the genome and/or splice junctions:
  - If there is a unique alignment to the genome or splice junctions, then that alignment is printed.
  - If there are multiple
63. List of Tables

Table 1 ASCII Characters Encoding Q-scores 0-40  41
Table 2 GERALD Configuration File Core Parameters  54
Table 3 configureAlignment Configuration File Optional Parameters  55
Table 4 configureAlignment Configuration File Paired End Analysis Options  56
Table 5 ANALYSIS Variables  61
Table 6 USE_BASES Options  62
Table 7 Parameters for KAGU_PAIR_PARAMS and KAGU_PARAMS  65
Table 8 Parameters for KAGU_PAIR_PARAMS Only  65
Table 9 Parameters for ANALYSIS eland_extended  68
Table 10 Parameters for ANALYSIS eland_rna  1
Table 11 Intermediate Output File Descriptions  82
Table 12 Intermediate Output File Formats  82
Table 13 Required Parameters for ELAND_standalone.pl  85
Table 14 Options for ELAND_standalone.pl  85
Table 15 Targets for Variant Detection and Counting  97
Table 16 Major File Options for Variant Detection and Counting  98
Table 17 Behavioral Options for Variant Detection and Counting  98
Table 18 Global Analysis Options for Variant Detection and Countin
64. Note that for a read to strongly support either the reference or the indel alignment, it must overlap an indel breakpoint by at least 6 bases, and the probability of the read's alignment following either the reference or the indel path must be at least 0.999.

Count.txt Files

There are three different types of count.txt files, for exon, gene, or splice junction:
- Chromosome_exon_count.txt: The _exon_count.txt provides counts for the number of times a particular exon has been detected in a sample.
- Chromosome_genes_count.txt: The _genes_count.txt provides counts for the number of times a particular gene has been detected in a sample.
- Chromosome_splice_count.txt: The _splice_count.txt provides counts for the number of reads that align over a particular splice junction.

_count.txt files are generated by RNA Sequencing, sorted by position, and there is one of each type per chromosome (for example, c19_exon_count.txt). The _count.txt files are stored in the chromosome-specific directory under the Parsed_dd-mm-yy directory, and contain the following columns:

1 Chromosome, starting with a c: The chromosome on which the exon resides. "cM" indicates a mitochondrial DNA alignment.
2 Start: The start of the gene.
3 End: The end of the gene.
4 Genes: The gene symbol.
5 Normalized count: RPKM = 10^9 x raw count / (feature length x number of mapped bases)
6 Raw count: sum of coverages for each bas
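The normalized count in the column list above can be reproduced by hand from the raw count. The following is a minimal Python sketch with a hypothetical function name. It assumes the usual RPKM scaling factor of 10^9 (per kilobase, per million), since the exponent is not legible in this copy of the page, and it takes the raw count as the sum of per-base coverages, as the column description says.

```python
def rpkm(raw_count: float, feature_length: int, mapped_bases: int) -> float:
    """Normalized count = 1e9 * raw count / (feature length * number of mapped bases).

    raw_count      sum of coverages over the bases of the feature
    feature_length length of the exon, gene, or splice junction in bases
    mapped_bases   total number of mapped bases in the sample
    """
    if feature_length <= 0 or mapped_bases <= 0:
        raise ValueError("feature_length and mapped_bases must be positive")
    return 1e9 * raw_count / (feature_length * mapped_bases)


# A 1,000-base exon with a summed coverage of 2,000 in a sample
# with 2,000,000 mapped bases.
print(rpkm(2000, 1000, 2_000_000))
```

Because the raw count and the denominator are both expressed in bases, the value depends only on average coverage of the feature relative to total mapped bases, not on read length.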
65. CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name.

Direct CASAVA to a folder containing the FASTA files using the option --refSequences PATH for variant detection and counting.

Multi-Sequence FASTA Files

As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single genome (SAM-compliant, unsquashed) file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in the header for each sequence.

Direct CASAVA to the multi-sequence FASTA file using the option --samtoolsRefFile FILE for variant detection and counting.

WARNING
GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files.

Chromosome Naming Restrictions

CASAVA does not accept the following characters in the chromosome name:
NG TJ ERA FEE ee

refFlat.txt.gz or seq_gene.md.gz File

CASAVA 1.8 generates the non-overlapping exon coordinates set automatically using the refFlat.txt.gz file (from UCSC) or seq_gene.md.gz file (from NCBI). They should be from the same build as the reference files you are using for alignment, and are available from iGenome for the most common model organisms (Getting Reference Fil
67. Illumina
Headquartered in San Diego, California, U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
techsupport@illumina.com
www.illumina.com
68. a reference sequence using the compressed FASTQ files.

5 Variant analysis and counting: Calls Single Nucleotide Polymorphisms (SNPs) and indels, and performs read counting (for RNA sequencing).

After variant analysis and counting are finished, the results can be viewed and analyzed further in the GenomeStudio software, or the result files can be analyzed using third-party software.

Figure 1 Sequencing Data Analysis Workflow
[Figure: analysis steps (generating sequencing images; performing image analysis and base calling; FASTQ generation and demultiplexing; aligning; detecting variants and counting; viewing results) mapped to analysis systems (HiSeq, Genome Analyzer, HiScanSQ; Real Time Analysis (RTA); CASAVA; GenomeStudio). Results are analysis files (txt, html) plus visualization and analysis.]

Default Analysis Workflow

Several analysis software products can be used for the analysis cascade. The default workflow uses these software products:

HiSeq Control Software (HCS) and Real Time Analysis (RTA), or Genome Analyzer's Sequencing Control Software (SCS) and RTA. The instrument computer running this software performs the following in real time:
  Image analysis
  Base calling

CASAVA 1.8.2, running on a Linux analysis server, performs:
  Bcl conversion and demultiplexing
  Off-line sequence alignment
  SNP calling and indel detection, read counting (for RNA sequencing)

NOTE: As of 1.8, CASAVA uses .bcl as pri
69. ach read, resulting in the length of each read being set to the number of sequencing cycles associated with it minus one. The two reads do not need to be of the same length.

USE_BASES nY*,nY*
  Ignore the first base of each read and perform a paired-read alignment, resulting in the length of each read being set to the number of sequencing cycles associated with it minus one. The two reads do not need to be of the same length.

USE_BASES nY*
  Ignore the first base and perform a single-read alignment.

USE_BASES n*,Y*n
  Ignore the first read and perform a single-read alignment with the second read, ignoring the last base.

USE_BASES Y*n,n*
  Perform a single-read alignment with the first read, ignoring the last base, and ignore the second read.

ELAND_FASTQ_FILES_PER_PROCESS

CASAVA requires a minimum of 2 GB RAM per core. The parameter ELAND_FASTQ_FILES_PER_PROCESS (optional) in the configureAlignment config.txt specifies the maximum number of FASTQ files aligned by each ELAND process, to limit the per-core memory consumption.

NOTE: ELAND_FASTQ_FILES_PER_PROCESS supersedes the ELAND_SET_SIZE parameter used in CASAVA 1.7 and earlier.

The optimal value leads to approximately 10 to 13 million clusters in one set. Since the FASTQ file size (in reads) is determined by the Bcl conversion option --fastq-cluster-count, while the maximum number of files per process is determined by ELAND_FASTQ_FILES_PER_PROCESS, the product of these op
70. ad 2. Else, the same adapter will be used for all reads.
Default: None (no masking)

Print a manual page for this command.

Produce help message and exit.

Examples:
  Use first 50 bases for first read (Y50)
  Ignore the next (n)
  Use 6 bases for index (I6)
  Ignore next (n)
  Use 50 bases for second read (Y50)
  Ignore next (n)
  --no-eamss
  --mismatches 1
  --flowcell-id flow_cell_id
  --ignore-missing-stats
  --ignore-missing-bcl
  --ignore-missing-control
  --with-failed-reads
  --adapter-sequence <adapter dir>/adapter.fa
  --man

Makefile Options for Bcl Conversion and Demultiplexing

The options for make usage in demultiplexing analysis are described below.

Parameter   Description

nohup       Use the Unix nohup command to redirect the standard output and keep the "make" process running even if your terminal is interrupted or if you log out. The standard output will be saved in a nohup.out file and stored in the location where you are executing the makefile.

              nohup make -j n &

            The optional "&" tells the system to run the analysis in the background, leaving you free to enter more commands.

            We suggest always running nohup to help troubleshooting if issues arise.

-j N        The -j option specifies the extent of parallelization, with the options depending on the setup of your computer or computing cluster.

            For a description of parallelization, see Using Parallelization on page 119.
71. ads passing filter aligned uniquely to the splice junctions.

8 genomeUsable: number of reads passing filter aligned uniquely to the genome.

NOTE: Sum of spliceUsable and genomeUsable is equal to Usable.

9 In the last rows, numbers are provided for the number of passing-filter reads aligned to each reference sequence file within the AbundantSequences directory. The names are derived from the fasta headers (up to the first space) used to list each reference in the multifasta abundant sequences file. If you want more descriptive names, like ribosomal, E. coli, or phiX, you should modify the fasta headers in the abundant sequences file.

NOTE: Difference between repeatMasked and sum of all abundant sequences gives the number of reads that do not have unique alignments.

contam_export.txt.gz

Contains unique alignments to sequences in the CONTAM directory, in the export format (see Export.txt.gz on page 79).

Intermediate Output Data Files

Intermediate output files are found in the Aligned folder and contain data used to build the more meaningful results files described in Pipeline Analysis Output on page 43.

CAUTION: Do not use the intermediate files as input for custom scripts. These files may not be generated anymore in future CASAVA versions.

The files are named using one of the following formats:
s_N_TTTT_name.txt, where N is the lane number, T 
72. aired read sample prep.

Unlike these short-insert pairs that have a predominance in opposite and inwardly facing read pairs (R: > R1 R2 <), the large-insert mate pair libraries expect to produce a predominance in opposite and outwardly facing read pairs (R: < R2 R1 >). High frequencies of paired reads having the same orientation (F: > R2 R1 > or F: > R1 R2 >) may be indicative of a sample preparation problem, or evidence of an adapter read-through problem found when the read lengths are long relative to the library insert size.

Insert Size Statistics: Statistics are derived from the insert sizes of those pairs in which both reads were individually uniquely aligned and have the predominant relative orientation. First, the median is determined. Then, a standard deviation value is determined independently for those values below the median and those above it. The lower and upper thresholds for acceptable insert sizes are then defined as three of the relevant standard deviations below and above the median, respectively.

Insert Statistics (% of individually uniquely alignable pairs): This table shows the number of inserts, out of those used to calculate insert size statistics, considered acceptable in size and of those falling outside the thresholds displayed in the Insert Size Statistics table. The percentages are relative to the original number of pairs in which both reads were individually uniquely aligned.

Barcode Lane Summary
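The Insert Size Statistics rule described above (median first, then separate standard deviations for values below and above the median, then thresholds three deviations out) can be sketched as follows. This is a minimal illustration, not CASAVA's implementation: the function name is hypothetical, and measuring each deviation about the median with a population-style formula is an assumption, since the guide does not state the exact formula.

```python
import statistics


def insert_size_thresholds(sizes, n_sd=3.0):
    """Return (median, lower_threshold, upper_threshold) for insert sizes.

    Assumption: each one-sided deviation is the root-mean-square distance
    from the median of the values on that side.
    """
    med = statistics.median(sizes)
    below = [s for s in sizes if s < med]
    above = [s for s in sizes if s > med]

    def sd_about_median(values):
        # One-sided spread; 0.0 when no values fall on that side.
        if not values:
            return 0.0
        return (sum((v - med) ** 2 for v in values) / len(values)) ** 0.5

    lower = med - n_sd * sd_about_median(below)
    upper = med + n_sd * sd_about_median(above)
    return med, lower, upper


# Illustrative data only: a right-skewed set of insert sizes.
median, low, high = insert_size_thresholds([190, 195, 200, 205, 220])
```

Computing the two deviations separately lets a skewed insert size distribution get a wider acceptance window on its long tail than on its short one.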
73. ally at high read lengths.

High, but constant mismatch rates from cycle 1.

Possible Cause:
  Bubbles
  Rapid focus fluctuations
  Dirty flow cell surface
  Low intensity at start
  High decay rate
  High phasing or prephasing
  Adapter read-through
  Genomic contamination

Running ELAND as a Standalone Program

You can run ELAND without the rest of configureAlignment as a post-analysis step. ELAND can be run as a standalone program for the following reasons:
  To test the effect of different filter parameters
  To test alignment targets
  To test applications that read export files

To run ELAND as a standalone program, use the script Path-to-CASAVA1.8/bin/ELAND_standalone.pl:

  Path-to-CASAVA1.8/bin/ELAND_standalone.pl -if read1.fastq -if read2.fastq -ref /lustre/data01/Mondas_software/Genomes/E_coli_ELAND

Table 13 Required Parameters for ELAND_standalone.pl

Option                                 Short Form   Description
--input-file <input file>              -if          Specify at least one file for single reads and two files for paired reads (mandatory).
--ref-sequences <path to genome dir>   -ref         Full path of a genome directory (mandatory).

Table 14 Options for ELAND_standalone.pl

Option                    Short Form   Description
--bam                                  Enables BAM output.
--base-quality <value>    -bq          Assumes all bases have this quality when in fasta mode (default is set to 30).
--copy-references         -cr          Copies the references to the output directory. Use this option if your reference sequ
74. alysis.

22 Filtering: Did the read pass filtering? N = No, Y = Yes.

Additional configureAlignment Output Files

s_N_TTTT_rescore.txt

The (txt) score and rescore files are produced by tile. The corresponding XML summaries are by lane. Various breakdowns of base mismatches within aligned reads (e.g. by cycle, called base and reference base), along with associated statistics. Tabular text format, header data included.

rnaqc.txt

The output file rnaqc.txt in the Aligned folder provides the following information on alignment distribution for eland_rna:

1 totalClusters: number of total clusters.

2 PFClusters: number of clusters passing purity filter.

3 Usable: number of reads passing filter and aligned uniquely to the genome plus splice junction.

4 QC: number of reads passing filter that were not aligned due to too many bases not called (QC in the 11th field of the export file).

5 noMatch: number of reads passing filter that did not match anything, including repeat masked; these reads have the NM label in the 11th field of the export file.

6 repeatMasked: number of reads passing filter that were masked by eland_rna (RM label in the 11th field of the export file). These are reads mapping to abundant sequences and reads that do not have unique alignments to the genome or splice junctions.

NOTE: Sum of Usable, QC, noMatch, and repeatMasked reads is equal to the number of reads reported in PFClusters.

7 spliceUsable: number of re
75. ample, you would not want to use a read with 10 mismatches for SNP calling, even if it is the only candidate found. The same applies for a read of poor base quality.

Gapped Alignment Scoring

Given a read, ELANDv2e determines positions in the genome to which substrings of the read (seeds of length 32 bp) match with at most two errors. We then grab x additional bases before and after the hit position (default value for x is 5) to account for potential gaps in the alignment phase.

We then compute a global alignment between the read and the reference, which means that the entire read is aligned to the reference. We are using affine gap penalties: opening a gap is more expensive than prolonging an existing gap. The alignment algorithm is furthermore banded, i.e. we restrict ourselves to a maximal length of an expected insertion/deletion (this value is set to 10).

Conditions for Opening a Gap

ELANDv2 tries to be conservative about when to open a gap. There are two main conditions that have to be satisfied to open a gap:

1 A gap corrects at least five mismatches downstream; this means that the difference in the number of mismatches between the ungapped and the gapped alignment is at least five.

2 We set the number of mismatches in the gapped and ungapped alignment in relation to each other. The reason is that we want to distinguish gaps that improve noisy ungapped alignments from real small insertions/deletions. To this end, we define the _noise ratio_ as    mismatches
76. ane, and not demultiplexed. Each directory can be independently analyzed (alignment, variant analysis, and counting) with CASAVA and contains the files necessary for alignment, variant analysis, and counting with CASAVA.

NOTE: Some of the files needed for the alignment are at the top level of the Unaligned directory.

At the same time, CASAVA also separates multiplexed samples (demultiplexing). Multiplexed sequencing allows you to run multiple individual samples in one lane. The samples are identified by index sequences that were attached to the template during sample prep. The multiplexed samples are assigned to projects and samples based on the sample sheet, and stored in corresponding project and sample directories as described above. At this stage, adapter masking may also be performed. With this feature, CASAVA will check whether a read has proceeded past the genomic insert and into adapter sequence. If adapter sequence is detected, the corresponding basecalls will be changed to N in the resultant FASTQ file.

WARNING: The CASAVA 1.8 directory organization differs considerably from the directory organization used in CASAVA 1.7.

NOTE: You cannot start Bcl conversion, demultiplexing, and alignment in one step using CASAVA.

Bcl Conversion/Demultiplexing Directory Structure

Bcl conversion and demultiplexing is done in a single step, and generates a new directory in the Run folder called Unaligned, which contains all of the dem
77. archival and non-archival versions of the build.

Fast creation of whole-genome BAM files: After the sort module has completed, 30-40x whole-genome BAM files can now be created and indexed in approximately 1 hour.

Spliced alignments are now represented in BAM using the same format as TopHat, allowing visualization of splice junctions in IGV.

Archival Build

Archival builds, turned on with the option --sortKeepAllReads, include all reads given as input to the build in their entirety:

Purity-filtered and duplicate reads are stored in the primary BAM files with the appropriate bit settings to identify them. These will be ignored by variant calling and RNA read counting.

To handle various types of unmapped reads, the CASAVA 1.7 "NMNM" directory has been renamed as "notMapped". Reads within this directory are classified into separate BAM files for the following categories: noMatch, qcFail, nonUnique, repeatMasked, mixed.

In any situation where reads were trimmed in CASAVA 1.7, they are now soft-clipped. In some cases, where a read would be removed in non-archival mode due to some anomalous condition, that read is now marked as unmapped and stored in the build instead. Note that the small variant caller is designed to preserve any soft-clip regions from an input read, though it may expand them as part of local realignment.

NOTE: This is independent of the bam files produced by the target bam which aggregates all reads into a single BAM file with 
78. ases are present in the sample at 25% with pure signal (zero intensity in the non-called channels), the Called intensity will be four times that of All, as the intensities will only be averaged over 25% of the clusters. For impure clusters, the called intensity will be less than four times that of All.

The Called intensities are independent of base representation, so a well-balanced matrix will display all channels with similar intensities.

Base Calls: The percentage of each base called as a function of cycle. Ideally, this should be constant for a genomic sample, reflecting the base representation of the sample. In practice, later cycles often show some bases more than others. As the signal decays, some bases may start to fall into the noise while others still rise above it. Matrix adjustments may help to optimize data.

% All and % Called: Exactly the same as All and Called, but expressed as a percentage of the total intensities. These plots make it easier to see changes in relative intensities between channels as a function of cycle by removing any intensity decay.

All Intensity Plots

The link to the All.htm file gives a representation of the mean matrix-adjusted intensity of clusters plotted as a function of cycle. It plots each channel (A, C, G, T) separately as a different colored line. Means are calculated over all clusters, regardless of base calling.

If all clusters are T, channels A, C, and G will be at zero. If all bases are present in th
79. at analysis.

Standard configureAlignment Analysis

The standard way to run configureAlignment is to set the parameters in a configuration file, create a makefile, and start the analysis with the "make" command.

1 Edit the configureAlignment configuration file as described in configureAlignment Configuration File on page 54.

2 Check the analysis by running the configureAlignment.pl command without --make:

    path-to-CASAVA/bin/configureAlignment.pl config.txt --EXPT_DIR path-to-Unaligned-folder

3 Enter the configureAlignment.pl command, but now with --make. This creates the makefile for sequence alignment:

    path-to-CASAVA/bin/configureAlignment.pl config.txt --EXPT_DIR path-to-Unaligned-folder --make

4 Move into the newly created Aligned folder under the Run folder (see configureAlignment Output Files on page 73). Type the "make" command for basic analysis:

    make

NOTE: You may prefer to use the parallelization option as follows:

    make -j 3 all

The extent of the parallelization depends on the setup of your computer or computing cluster.

For a description of parallelization, see Using Parallelization on page 119.

5 After the analysis is done, review the analysis:
  a View the analysis results of your run. See Analysis Summary on page 74 and Analysis Results on page 79.
  b Interpret the run quality. See Interpretation of configureAlignment Ru
80. ated value containing the set of column labels in the following data segment.

The data segment contains one entry per line, where each line is a set of tab-delimited columns. Wherever appropriate, columns for sequence name and position number are included such that the files are tabix-compatible.

The following files are generated by CASAVA variant detection and counting:

Depth and single-position genotype call scores for every mapped site in the reference genome are saved in each bin directory in the gzipped file sites.txt.gz:
  Project_Dir/Parsed_NN-NN-NN/c1/0000/sites.txt.gz
Note that this output can be omitted with the --variantsNoSitesFiles option.

The SNPs for each reference sequence are aggregated and filtered according to the --variantsSnpCovCutoff setting and summarized in the chromosome-level file snps.txt:
  Project_Dir/Parsed_NN-NN-NN/c1/snps.txt

The indels for each reference sequence are aggregated and filtered according to the --variantsIndelCovCutoff setting and summarized in the chromosome-level file indels.txt:
  Project_Dir/Parsed_NN-NN-NN/c1/indels.txt

If any SNPs and indels are removed by the high-depth filter, they can be found in their corresponding bin directory as:
  Project_Dir/Parsed_NN-NN-NN/c1/0000/snps.removed.txt
  Project_Dir/Parsed_NN-NN-NN/c1/0000/indels.removed.txt

When the --variantsWriteRealigned option is selected, there will also be a BAM file written to each reference sequence realigned bam directory containing only
81. ave to use this option to generate a different output directory.

Path to a directory containing positions files. Default depends on the RTA version that is detected.

Format of the input cluster positions information. Options:
  .locs
  .clocs
  _pos.txt
Defaults to .clocs.

Path to a directory containing filter files. Default depends on the RTA version that is detected.

Path to a valid Intensities directory. Defaults to parent of base_calls_dir.

Path to sample sheet file. Defaults to <input_dir>/SampleSheet.csv.

The --tiles option takes a comma-separated list of regular expressions to match against the expected "s_<lane>_<tile>" pattern, where <lane> is the lane number (1-8) and <tile> is the 4-digit tile number (left-padded with 0s).

The --use-bases-mask string specifies how to use each cycle:
  An "n" means ignore the cycle.
  A "Y" (or "y") means use the cycle.
  An "I" means use the cycle for the index read.
  A number means that the previous character is repeated that many times.

Examples:
  --fastq-cluster-count 6000000
  --input-dir <BaseCalls_dir>
  --output-dir <run_folder>/Unaligned
  --positions-dir <positions_dir>
  --positions-format .locs
  --filter-dir <filter_dir>
  --intensities-dir <intensities_dir>
  --sample-sheet <input_dir>/SampleSheet.csv
  --tiles s_[2468]_[0-9][0-9][02468]5,s_1_0001
  --use-bases-mask vyo
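To make the --use-bases-mask rules above concrete, the sketch below expands a mask into one character per cycle. It is an illustration only, not CASAVA code: the function name is hypothetical, the comma as a separator between reads is an assumption, and wildcard forms are deliberately not handled.

```python
import re

# One mask element: a cycle character optionally followed by a repeat count.
_ELEMENT = re.compile(r"([nNyYiI])(\d*)")


def expand_bases_mask(mask):
    """Expand e.g. 'Y50n,I6n,Y50n' into a list of per-cycle strings, one per read.

    Assumption: reads are separated by commas within the mask string.
    """
    expanded_reads = []
    for part in mask.split(","):
        # Reject anything that is not a sequence of <char><optional count>.
        if not re.fullmatch(r"(?:[nNyYiI]\d*)+", part):
            raise ValueError(f"unsupported mask element: {part!r}")
        cycles = "".join(
            char * (int(count) if count else 1)
            for char, count in _ELEMENT.findall(part)
        )
        expanded_reads.append(cycles)
    return expanded_reads


# Illustrative mask: 50 used cycles plus one ignored, a 6-cycle index plus one
# ignored, then a second read of the same shape.
reads = expand_bases_mask("Y50n,I6n,Y50n")
```

Expanding the mask this way makes it easy to check that the number of characters per read matches the number of cycles actually sequenced.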
82. be called a SNP, and acts to reduce the rate of any false positive SNP predictions made by the model. For this reason the genomic prior is used to calculate the genotype probability distribution used for Q(snp) and Q(max_gt).

Polymorphic Prior

When considering a subset of sites from a genome that are known to be polymorphic in a population, there is a much different prior expectation of the genotype distribution than in the scenario described in the previous section for all sites in the genome. A principal difference in this scenario is that the expectation that each site will be homozygous for the reference allele is much lower. These sites also need to be examined to distinguish strong evidence for the homozygous reference genotype from a site where no observations have been made. The polymorphic prior is used to compute the polymorphic site genotype quality score, Q(max_gt|poly_site): the probability that the true genotype is not the highest scoring, if this site is known to be polymorphic.

New Variant Calling Parameter: Theta

The parameter theta as used in the variant calling model refers to the expected proportion of differing sites between two chromosomes sampled from the population.

For site genotyping, it is set by default to 1/1000, a value appropriate for human re-sequencing. Raising this value (to e.g. 1/100) would have the effect of increasing the prior expectation of a non-reference genotype and increase Q(snp) values.

The param
83. bout the genotype distribution at the site before sequencing.

The CASAVA 1.8 SNP caller expresses this notion of prior expectation based on a reference sequence using its "genomic" prior distribution, which is used to calculate Q(snp) and Q(max_gt). A specialized "polymorphic" prior distribution is also used to compute Q(max_gt|poly_site), which is applicable to sites where there is a greater prior expectation of polymorphism, such as a set of sites from dbSNP.

Genomic Prior

When resequencing an individual from a given population, there is a strong prior expectation that a randomly selected site in the sample assembly will be homozygous for an allele at the same locus in a reference chromosome from the same population. This expectation of similarity to a reference sequence in most portions of the genome is referred to below as the "genomic" prior for the model. For example, suppose that on average 1 in 1000 sites in a sample chromosome are expected to differ from a reference chromosome. If the reference at a particular site is A, then the Q-score for the reference genotype AA will be approximately 30 in the absence of any sample observations. Because of this prior, the most likely genotype would still be AA even after observing a single non-reference basecall of modest quality. Thus, the genomic prior has the effect of increasing the evidence required for a site to
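The "approximately 30" figure in the genomic prior example follows from the usual Phred-style relation between a Q-score and the probability that a call is wrong. The sketch below assumes that relation, Q = -10 log10(P(error)), which the text implies but does not state; the function name is hypothetical and this is not the CASAVA genotype model itself.

```python
import math


def phred_q(p_error):
    """Phred-scaled quality for a given probability that the call is wrong."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


# With no observations, the chance that a site is not homozygous reference is
# roughly the expected proportion of differing sites (1 in 1000 in the example).
q_reference_prior = phred_q(1 / 1000)   # about 30

# A larger expected proportion of differing sites (1 in 100) lowers the prior
# confidence in the reference genotype.
q_reference_looser = phred_q(1 / 100)   # about 20
```

Under this reading, a single non-reference basecall of modest quality does not carry enough evidence to overturn a prior worth about 30 Q-score units.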
84. by or combined with target sort.
This BAM file is independent of the archival bam file, which can be produced using the option --sortKeepAllReads (see Archival Build on page 90).

gsIndex   Pre-compute GenomeStudio linear index for all reads in the build.

If you run a target other than the default target (all), make sure to read the help written for the target. This will help you identify any dependencies for the target you want to run.

Target help can be accessed by typing:

    Path-to-CASAVA/bin/configureBuild.pl --help <target>

NOTE: Prefixing any target name with "no" will exclude it from the targets list. Example:

    path-to-CASAVA/bin/configureBuild.pl --targets all noassembleIndels --variantsSkipContigs [options]

Target callSmallVariants Usage

The callSmallVariants module is designed to use the results of the assembleIndels module if available, so a new paired-end build could be run with the following minimum set of targets:

    --targets sort assembleIndels callSmallVariants

If assembleIndels (Grouper) cannot be run, an alternative workflow is:

    --targets sort callSmallVariants --variantsSkipContigs

NOTE: To have the plugin provide a BAM file containing all reads which have had their alignments altered during realignment, add the following to the configuration command line:

    --variantsWriteRealigned

These reads will appear in the file "sorted.realigned.bam" in the chromosome realigned bam directory.

The primary option
85. [Plot residue: fraction of known sites mapped, per chromosome reference file]

name    sites       known sites   bases mapped at known sites   mean depth at known sites   fraction of known sites mapped
c1.fa   247249719   224999719     7772224483                    34.54326                    0.96684
c2.fa   242951149   237709794     8638053015                    36.33865                    0.97455
c3.fa   199501827   194704822     7248243403                    37.22683                    0.98373
c4.fa   191273063   187297063     7003263067                    37.39121                    0.97914
c5.fa   180857866   177702766     6499205556                    36.57346                    0.97129
c6.fa   170899992   167273991     6235259537                    37.27573                    0.98115
c7.fa   158821424   154952424     5351528045                    34.53659                    0.96809
cX.fa   154913754   151058754     2615296049                    17.31310                    0.94708

CASAVA Build

The CASAVA build, containing sequence, SNP, indels, and (for RNA Sequencing) counts information, is located in the buildDir/Parsed_xx-xx-xx folder.

Sorted.bam Files

The sorted.bam file is a binary file that contains sorted sequence alignments. There is one sorted.bam file for each chromosome, stored in the "bam" subdirectory under each chromosome-specific directory.

BAM Format

The Binary Alignment/Map (BAM) file is the binary equivalent of SAM files and is compressed in the BGZF format. Each BAM file is much smaller than its SAM equivalent, yet it can be easily converted to SAM, e.g. with samtools using  samtoo
86. ccur, the model does not strictly report a genotype, but rather the max_gt call reflects the copy number for each of the two indel alleles, and the probability of that copy number. Each indel allele of the two overlapping indels is reported on a separate line by the model. Due to the approximate nature of this model and the independent evaluation of each overlapping indel allele, it is possible that the most likely copy number for each allele could conflict (e.g. max_gt will not be het for both indel alleles); in the rare cases where this occurs, the associated Q(max_gt) scores are typically very low.

Calling SNPs

Once the indels are called, and the reads are re-aligned to take into account the discovered indels, site genotyping and SNP calling is conducted using the following steps:

6 Given the set of filtered and realigned reads, the variant caller next runs certain types of filtration on base calls within these reads:

First, any contiguous trailing sequence of "N" base calls is effectively treated as trimmed off of the ends of reads for the purpose of genotyping and depth calculation.

Next, the mismatch density filter is run on all reads to mask out sections of the read having an unexpectedly high number of disagreements with the reference. The current default mismatch density filter behavior is as follows: Base calls are ignored where more than 2 mismatches to th
87. ce
S   Soft clip on the read (clipped sequence present in <seq>)
H   Hard clip on the read (clipped sequence NOT present in <seq>)
P   Padding (silent deletion from the padded reference sequence)

For example, the CIGAR string 30M1I69M means 30 bases aligning to the reference (30M), 1 base insert (1I), and 69 bases aligning (69M).

Optional Fields

Optional fields are in the format <TAG>:<VTYPE>:<VALUE>, for example XD:Z:73T26.

Each tag is encoded in two alphanumeric characters and appears only once for an alignment. Illumina SAM files may use some or all of the following optional fields:

Tag   Description
SM    ELAND single-read alignment score
AS    ELAND paired-read alignment score
XD    String for mismatching positions
XC    Provides information to distinguish different unmapped read types

The <VTYPE> describes the value type in the optional field. Valid types in SAM are described in the following table:

Type   Description
A      Printable character
i      Signed 32-bit integer
f      Single-precision float number
Z      Printable string
H      Hex string (high nybble first)

The <VALUE> field format is defined by the tag:

Tag   Value Field
SM    The <VALUE> field contains the ELAND single-read alignment score.
AS    The <VALUE> field contains the ELAND paired-read alignment score.
XD    The <VALUE> field contains the string for mismatching positions.
      Matching bases are numbered. For example, a
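The CIGAR example above (30M1I69M) can be decoded mechanically. The following sketch is an illustration with hypothetical function names, not part of CASAVA; it assumes the classic SAM operation letters M, I, D, N, S, H, and P, of which only some appear in the surrounding text.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    # Re-assemble to make sure nothing in the input was skipped by the regex.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def read_length(ops):
    """Bases present in the stored read: matches, inserts, and soft clips."""
    return sum(length for length, op in ops if op in "MIS")


def reference_span(ops):
    """Reference positions covered: matches, deletions, and skipped regions."""
    return sum(length for length, op in ops if op in "MDN")


ops = parse_cigar("30M1I69M")
```

For the example string, the read contributes 100 bases (30 + 1 + 69) while only 99 reference positions are covered, because the inserted base has no counterpart in the reference.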
88. chine name    Identifier of the sequencer
Run number      Number to identify the run on the sequencer
Lane number     Positive integer (currently 1-8)
Tile number     Positive integer
X               X coordinate of the spot. Integer.
                As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, plus 1000, and then rounded.
Y               Y coordinate of the spot. Integer.
                As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, plus 1000, and then rounded.
Index           Index sequence or 0. For no indexing, or for a file that has not been demultiplexed yet, this field should have a value of 0.
Read Number     1 for single reads; 1 or 2 for paired ends or multiplexed single reads; 1, 2, or 3 for multiplexed paired ends.
Sequence        Called sequence of read.
Quality         The calibrated quality string.
Filter          Did the read pass filtering? 0 = No, 1 = Yes.

The Qseq Converter also looks for files that configureAlignment needs, and transfers them to its output directory. These files are:
  Config.xml file in the Basecalls folder.

Requirements to Run configureAlignment

To run CASAVA 1.8 configureAlignment on FASTQ files generated by the Qseq Converter, the following is required:
The input  _qseq 
89. chromosome re-labeling (see Targets on page 96).

This is independent of the archival bam file, which can be produced using the option --sortKeepAllReads (see Archival Build on page 90).

Methods

CASAVA uses a number of methods to efficiently assemble indel candidates, call SNPs and indels, and provide counts. This section explains the methods.

Variant Detection

Post-alignment CASAVA performs variant detection using two modules:

The assembleIndels module (Grouper) detects candidate indels using singleton/orphan and anomalous read pairs. The assembleIndels module works well for detecting larger indels. The candidate indels detected by the assembleIndels module are passed on to the small variant caller for consolidation and genotyping.

The callSmallVariants module genotypes and provides quality scores for SNPs and indels. Indels can be called from candidate indel evidence provided by both ELAND gapped read alignments (for smaller indels) and from the assembleIndels module (for larger indels).

For each SNP or indel call, the probability of both the called genotype and any non-reference genotype is provided as a quality score (Q-score). Reads are re-aligned around candidate indels to improve the quality of SNP calls and site coverage summaries.

The callSmallVariants module also generates files which summarize the depth and genotype probabilities for every site in the genome. As a final step it produces tables and html for
90. clusters with base call T (double)
Byte 76: Number of clusters with base call A (integer)
Byte 80: Number of clusters with base call C (integer)
Byte 84: Number of clusters with base call G (integer)
Byte 88: Number of clusters with base call T (integer)
Byte 92: Number of clusters with base call X (integer)
Byte 96: Number of clusters with intensity for A (integer)
Byte 100: Number of clusters with intensity for C (integer)
Byte 104: Number of clusters with intensity for G (integer)
Byte 108: Number of clusters with intensity for T (integer)

Filter Files

The filter files can be found in the BaseCalls directory.

The *.filter files are binary files containing filter results; the format is described below.

Bytes 0-3: Zero value (for backwards compatibility)
Bytes 4-7: Filter format version number
Bytes 8-11: Number of clusters
Bytes 12-(N+11), where N is the cluster number: Unsigned 8-bit integer; bit 0 is pass or failed filter

Control Files

The control files can be found in the BaseCalls directory:
<run directory>/Data/Intensities/BaseCalls/L00<lane>

They are named as follows:
s_<lane>_<tile>.control

The *.control files are binary files containing control results; the format is described below.

Bytes 0-3: Zero value (for backwards compatibility)
Bytes 4-7: Format version number
Bytes 8-11: Number of clusters
Bytes 12
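The filter file layout described above (a 12-byte header followed by one byte per cluster) can be read with a few lines of code. This is a minimal sketch, not Illumina's implementation: it assumes the three header values are unsigned 32-bit little-endian integers (the excerpt gives only their byte ranges) and that a set bit 0 means the cluster passed the filter. The function name `parse_filter_bytes` is hypothetical.

```python
import struct
from typing import List, Tuple


def parse_filter_bytes(data: bytes) -> Tuple[int, List[bool]]:
    """Return (format version, per-cluster pass flags) from filter file bytes."""
    # Assumption: header fields are unsigned 32-bit little-endian integers.
    zero, version, n_clusters = struct.unpack_from("<III", data, 0)
    if zero != 0:
        raise ValueError("bytes 0-3 should be zero")
    flags = data[12:12 + n_clusters]
    if len(flags) != n_clusters:
        raise ValueError("file is shorter than the header claims")
    # Assumption: bit 0 set means the cluster passed the filter.
    return version, [bool(b & 1) for b in flags]
```

To use it on a real file, read the whole file in binary mode (`open(path, "rb").read()`) and pass the bytes in.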
91. completes a major portion of the post-alignment analysis pipeline. The first module ("sort") bins aligned reads into separate regions of the reference genome, sorts these reads, optionally removes PCR duplicates (for paired-end reads), and finally converts these reads into BAM format. In a paired-end analysis the next module ("assembleIndels") is used to search for clusters of poorly aligned and anomalous reads. These clusters of reads are de novo assembled into contigs, which are aligned back to the reference to produce candidate indels. Subsequently, the "callSmallVariants" module uses the sorted BAM files and the candidate indels predicted by the assembleIndels module to perform local read realignment and genotype SNPs and indels under a diploid model. In an RNA-Seq build the "rnaCounts" module will also be run to calculate gene and exon counts. Other optional modules can be added to the build process to perform additional functions.

CASAVA automatically generates a range of statistics, such as mean depth and percentage chromosome coverage, to enable comparison with previous builds or other individuals. Moreover, CASAVA provides expression levels for exons, genes, and splice junctions in the RNA Sequencing analysis.

Use Cases

The application has three basic use cases:
• DNA Sequencing for large genomes
• DNA Sequencing for small genomes (data sets)
• RNA Sequencing

All types of analysis take export files from configureAlignment as input and pro
92. converter. The path should always be to the Unaligned directory, even when the run only contains one project. For a description of the run folder, see Bcl Conversion Output Folder on page 37.

Parameter: Definition

USE_BASES nY*n
Ignore the first and last base of the read.
The USE_BASES string contains a character for each cycle:
• If the character is "Y", the cycle is used for alignment.
• If the character is "n", the cycle is ignored.
• Wild cards (*) are expanded to the full length of the read.
USE_BASES should not be used for masking custom index cycles; use the --use-bases-mask option (Options for Bcl Conversion and Demultiplexing on page 33).
Default is USE_BASES Y*n, which means perform a single-read alignment and ignore the last base. For a detailed description of USE_BASES syntax, see USE_BASES Option on page 62.

ELAND_GENOME /home/user/Genomes/Eland/BAC_plus_vector
Specify the single FASTA files that you want to use as genome reference for alignment with ELANDv2e.

SAMTOOLS_GENOME
Direct CASAVA to a multi-sequence FASTA reference file.

ANALYSIS eland_extended
Specify the type of alignment that should be performed. Available options are:
• ANALYSIS eland_extended
• ANALYSIS eland_pair
• ANALYSIS eland_rna
• ANALYSIS none
The default is ANALYSIS none. See ANALYSIS Variables on page 61 for more information.

ELAND_FASTQ_FILES_PER_PROCESS N
The maximum number of files analyzed by each ELA
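The wild-card rule for USE_BASES described above can be illustrated with a short sketch. It covers only the three symbols named in this excerpt ("Y", "n", and "*") for a single read, and it assumes that "*" repeats the character immediately before it until the string reaches the read length. The function name `expand_use_bases` is hypothetical, and the complete syntax (see USE_BASES Option on page 62) may allow forms that this sketch rejects.

```python
def expand_use_bases(mask: str, read_length: int) -> str:
    """Expand a single-read USE_BASES mask such as 'Y*n' to one character per cycle."""
    if any(ch not in "Yn*" for ch in mask):
        raise ValueError("only 'Y', 'n' and '*' are handled in this sketch")
    if mask.count("*") > 1:
        raise ValueError("at most one wild card is handled in this sketch")
    if "*" not in mask:
        if len(mask) != read_length:
            raise ValueError("mask length does not match read length")
        return mask
    star = mask.index("*")
    if star == 0:
        raise ValueError("'*' must follow a 'Y' or 'n'")
    # Assumption: '*' repeats the preceding character to fill the read.
    fill = read_length - (len(mask) - 1)
    if fill < 0:
        raise ValueError("mask is longer than the read")
    return mask[:star] + mask[star - 1] * fill + mask[star + 1:]
```

For a 5-cycle read, the default mask `Y*n` expands to `YYYYn` (use the first four cycles, ignore the last), and `nY*n` expands to `nYYYn`.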
93. cribes how to perform Bcl conversion and demultiplexing in CASAVA 1.8.

Usage of configureBclToFastq.pl

The standard way to run Bcl conversion and demultiplexing is to first create the necessary Makefiles, which configure the run. Then you run make on the generated files, which executes the calculations.

1 Enter the following command to create a makefile for demultiplexing:
  /path-to-CASAVA/bin/configureBclToFastq.pl [options]
  NOTE: The options have changed significantly between CASAVA 1.7 and 1.8. See Options for Bcl Conversion and Demultiplexing on page 33.

2 Move into the newly created Unaligned folder specified by --output-dir.

3 Type the "make" command. Suggestions for "make" usage, depending on your workflow, are listed below.

  Make Usage: Workflow
  nohup make -j N: Bcl conversion and demultiplexing (default)
  nohup make -j N r1: Bcl conversion and demultiplexing for read 1

  The -j option specifies the extent of parallelization, with the options depending on the setup of your computer or computing cluster (see Using Parallelization on page 119).
  The Unix nohup command redirects the standard output and keeps the "make" process running even if your terminal is interrupted or if you log out.
  See Makefile Options for Bcl Conversion and Demultiplexing on page 34 for explanation of the options.
  NOTE: The ALIGN option, which kicked off configureAlignment after demultiplexing was done in CASAVA 1.7, is no longer available.

4 After
94. d dependent upon the false positive tolerance of a user's workflow. For this reason summary statistics about the called SNPs are created at a higher (average application) threshold, which can be set using this option (default is 20).
Example: --variantsSummaryMinQsnp 25

Table 28 Indel Options for callSmallVariants

Option (Application): Description

--variantsIndelTheta FLOAT (SE, PE)
The frequency with which indels are expected between two unrelated haplotypes (default is 0.0001). See New Variant Calling Parameter: Theta on page 150 for more explanation.
Example: --variantsIndelTheta 0.0002

--variantsIndelCovCutoff FLOAT (SE, PE)
Indels are filtered out of the final output if the local sequence depth is greater than this value times the mean chromosomal depth. The sequence depth of the indel is approximated by the depth of the site 5' of the indel (default 3.0).
The filter may be disabled for targeted resequencing or other applications by setting this value to -1 (or any negative number).
Example: --variantsIndelCovCutoff 4

--variantsCanIndelMin INTEGER (SE, PE)
Unless an indel is observed in at least this many gapped or assembleIndels reads, the indel cannot become a candidate for realignment and genotype calling (default 3).
Example: --variantsCanIndelMin 4

--variantsCanIndelMinFrac FLOAT (SE, PE)
Unless an indel is observed in at least this fraction of intersecting reads, the indel cannot become a candidate for realignment and genotyp
95. d have a value of 0.

Read number: 1 for single reads; 1 or 2 for paired ends or multiplexed single reads; 1, 2, or 3 for multiplexed paired ends.

Called sequence of read.

Quality string: In symbolic ASCII format (ASCII character code = quality value + 64).

Match chromosome: Name of chromosome match OR code indicating why no match resulted:
• ee:ss:dd: too many hits, where ee is the number of exact hits, ss is the number of hits with a single mismatch, and dd is the number of hits with a double mismatch
• NM: no match
• QC: QC failure
• RM: repeat masked (for example, match against abundant sequences)

Match Contig: Gives the contig name if there is a match and the match chromosome is split into contigs. Blank if no match found.

Match Position: Always with respect to forward strand, numbering starts at 1. Blank if no match found.

Match Strand: "F" for forward, "R" for reverse. Blank if no match found.

Match Descriptor: Concise description of alignment. Blank if no match found.
• A numeral denotes a run of matching bases.
• A letter denotes substitution of a nucleotide. For a 35 base read, "35" denotes an exact match and "32C2" denotes substitution of a "C" at the 33rd position.
• The escape sequence "^..$" represents an indel. An integer in the indel escape sequence (e.g. "10^2$18") indicates an insertion relative to reference of the specified size. A sequence in the indel escape sequence (e.g. "10^AG$20") indicates a deletion relative to reference
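The quality string rule given above (ASCII character code = quality value + 64) can be inverted to recover numeric quality values. The following is a minimal sketch; the function name `decode_quality` is hypothetical.

```python
from typing import List


def decode_quality(quality: str, offset: int = 64) -> List[int]:
    """Convert a symbolic quality string to numeric quality values.

    Uses the rule stated in the text: character code = quality value + 64.
    """
    return [ord(ch) - offset for ch in quality]
```

For example, the character "h" has ASCII code 104 and therefore encodes a quality value of 40, while "@" (code 64) encodes 0.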
96. d to one of the two "ends" of a large indel or structural variant, for which the complete variant is either unknown or cannot be represented by the small variant caller. Breakpoint events are reported as either left or right breakpoints.

A left breakpoint corresponds to a haplotype which can be mapped on the left side of the breakpoint location but not on the right. A right breakpoint indicates that a haplotype can be mapped to the right of the breakpoint location but not to the left. If a simple insertion or deletion were represented as two breakpoint calls, then they would occur on the forward strand as a left breakpoint followed by a right breakpoint.

The figure below illustrates how two breakpoint calls could potentially be called corresponding to a large insertion in the sample relative to a population reference.

Figure 26 Left and Right Breakpoints
[Figure: the sample contains an insertion, relative to the reference, that is hard to resolve with the small variant caller. When the sequenced reads are mapped to the reference, the reads overlapping the breakpoints are reported as a left breakpoint and a right breakpoint.]

Advanced Options for Variant Detection

This section lists advanced options for variant detection, which will help you fine-tune the variant calling.

Global Analysis Options

The options described below are global options
97. dary analysis.

To assess a run, you can use either the RTA-based output or the Sequence Analysis Viewer (SAV). The SAV is an Illumina software package available on the Illumina website (iCom), and can be used to view the performance metrics of a sequencing run. You can download it as part of the HiSeq Control Software (HCS) package:

1 Log on to icom.illumina.com.
2 Click on Downloads.
3 Search for SAV.
4 Click on the HCS Software link.
5 Download the Installers zip file.
6 Extract the SAV x.x.xx.x msi file from the zip file.
7 Run the SAV x.x.xx.x msi file and follow the installation instructions.

In general, using a PhiX or other balanced, suitable control sample (such as human genomic DNA sequencing) as a guide helps when interpreting these graphs.

Quality Tables and Graphs

Before beginning an analysis run, you should check the following tables and graphs in Status.htm or SAV:
• Run Info: You can view basic information on the run's configuration, read length, and control specifications on the Run Info tab of the Status.htm output or the Summary tab of the SAV window.
• Data by Cycle: These graphs help you examine intensities, focus metrics (FWHM), percent base, qscores, error rates, and other metrics per cycle and per lane. You can identify sample properties or instrument-related events that affect the data.
• Data by Tile Charts: These graphics show run metrics by cycle and by lane and tile. These can be used
98. detailed below for the options --variantsSnpCovCutoff and --variantsIndelCovCutoff. This setting is recommended for targeted resequencing and RNA-Seq. Note it is already set by default for RNA-Seq.
Example: --variantsNoCovCutoff

Table 27 SNP Options for callSmallVariants

Option (Application): Description

--variantsSnpTheta FLOAT (SE, PE)
The frequency with which single base differences are expected between two unrelated haplotypes (default is 0.001).
Example: --variantsSnpTheta 0.002

--variantsSnpCovCutoffAll (SE, PE)
By default the mean chromosomal depth filter is based on "used depth", the number of basecalls used by the SNP caller after filtration, calculated from all known sites (non-N) in the reference sequence. When this option is set, the threshold and the filtration use the full depth at all known sites in the reference sequence.
Example: --variantsSnpCovCutoffAll

--variantsSnpCovCutoff FLOAT (SE, PE)
SNPs are filtered out of the final output if the depth of reads used for that site is greater than this value times the mean chromosomal used depth (default 3.0).
The filter may be disabled for targeted resequencing or other applications by setting this value to -1 (or any negative number).
Example: --variantsSnpCovCutoff 4

--variantsMDFilterCount INTEGER (SE, PE)
The mismatch density filter removes all basecalls from consideration during SNP calling where greater than "variantsMDFilterCount" mismatches to the reference occur o
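The coverage cutoff rule described above for --variantsSnpCovCutoff can be expressed as a one-line comparison. The sketch below only illustrates the rule as worded in the table; it is not CASAVA's code, and the function and argument names are hypothetical.

```python
def passes_depth_filter(site_depth: float,
                        mean_chrom_depth: float,
                        cutoff: float = 3.0) -> bool:
    """Return True if a call at this site survives the depth filter.

    A call is removed when its depth is greater than `cutoff` times the
    mean chromosomal depth. A negative cutoff disables the filter.
    """
    if cutoff < 0:
        return True
    return site_depth <= cutoff * mean_chrom_depth
```

With the default cutoff of 3.0 and a mean chromosomal depth of 30, a site with depth 100 is filtered out (100 > 90), whereas the same site is kept when the cutoff is set to -1.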
99. digit set number   export gz

Alignment Algorithms

CASAVA provides the alignment algorithm Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). ELAND is very fast and should be used to match a large number of reads against the reference genome.

ELAND has been improved a number of times:
• CASAVA 1.6 introduced a new version of ELAND, ELANDv2. The most important improvements of ELANDv2 are its ability to perform multiseed and gapped alignments.
• As of CASAVA 1.8 a new version of ELANDv2 is available, ELANDv2e. The most important improvements of ELANDv2e are improved repeat resolution and implementation of orphan alignment.

A short description of these improvements is provided below; more information about ELANDv2 is available in Algorithm Descriptions on page 131.

Multiseed and Gapped Alignment

ELANDv2e performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately. After this, ELANDv2e extends each candidate alignment to the full length of the read, using a gapped alignment method that allows for gaps (indels) of up to 10 bases. ELANDv2e then picks the best alignment based on alignment scores.

Repeat Resolution

ELANDv2e aligns reads in repeat regions using two new modes: semi-repeat resolution and full repeat resolution. Both modes take repetitive hits into account for the multiseed pass of ELAND. Full repeat resolution is more sensitive and places more reads in repeat regions, but
100. duce SNP and indel calls, but note that gapped alignments are required for indel calls in RNA and single-ended DNA builds. In addition, RNA Sequencing analysis provides counts for exons, genes, and splice junctions.

DNA Sequencing Analysis for Large Genomes

DNA Sequencing whole genome analysis can be used for large genomes and high coverage (like the human genome at 30x coverage), and for both single-read and paired-end runs. CASAVA can take the large numbers of aligned single-read or paired-end sequences from multiple experiments, arrange them into a genome build, and describe differences from the reference sequence.

For big data sets (30x coverage human genome), the process can take between 5 hours and several days, depending on available infrastructure.

NOTE: Large projects like human genome resequencing require high performance computer clusters; see Hardware and Software Requirements on page 112.

DNA Sequencing Analysis for Small Genomes

DNA Sequencing for small genomes, such as whole genome sequencing of bacteria or targeted resequencing, is very similar to DNA Sequencing for large genomes, with the only difference being that it may process data from one lane or less. Thus a single computer is enough to make the build.

RNA Sequencing Analysis

RNA Sequencing analysis supports whole transcriptome sequencing projects. In addition to SNP and indel calls there are a few more data types produced: Ex
101. duces tables and html formatted reports of SNP and indel calls.

assembleIndels Module Improvements

The major changes for the assembleIndels module (Grouper) are:
• assembleIndels uses an additional method to identify indels. It finds read pairs that map anomalously (for example, with unexpected insert size), and identifies potential indels.
• assembleIndels merges indel calls detected through anomalous read pairs with those identified through singleton orphan reads, and combines clusters that appear to correspond to the same event.

Figure 24 Changes to the assembleIndels Workflow
[Figure: workflow steps, with CASAVA v1.8 improvements marked.
IndelFinder: Extract reads (gapped alignments, alignment worse than expected, singleton shadow reads, anomalous read pairs). Improvement: introduced anomalous read pair method.
AlignCandidates: Localized alignment of extracted reads. If better, replace previous alignment.
ClusterFinder: Cluster reads.
ClusterMerger: Merge clusters from different types. Improvement: introduced ClusterMerger module.
SmallAssembler: Assemble clusters in contigs. Update alignment details for assembled reads.
AlignContig: Align contigs to genome.]

assembleIndels Algorithm

The assembleIndels module (Grouper) runs only during paired-read DNA CASAVA builds. In CASAVA v1.8, it uses orphan reads and anomalous read pairs to detect indels.

Grouper detects indels in
102. e

Stats Folder

The stats folder contains statistical information in computer-readable form, such as the runs summary XML file, which shows which lanes from which run were aggregated and called for a CASAVA build.

Conf Folder

The conf folder contains information about the configuration of the project, such as the project.conf file.

Build Html Page

The build html page is located in buildDir/html. When you open the file Home.html, you will find a list of all runs, and a link to statistics.

Figure 16 Build Html Page
[Figure: screenshot of the build html page. For each run, a table lists each lane and read with the columns Name, Clusters (PF), % PF Clusters, % Align (PF), % Error Rate (PF), Yield, Read Length, and Status. Callout A marks the Report Menu link.]

The Report Menu link on the build html page will lead you to graphs and tables for important statistics:
• Coverage
• Duplicates
• Indels statistics
• SNPs statistics
103. e found in the file snps.removed.txt in each chromosomal bin directory.

The SNP caller implemented in this module employs a probabilistic model which ultimately produces probability distributions over all diploid genotypes for each site in the genome. The primary values summarized from these distributions are a set of quality scores:
• Q(snp): The value of Q(snp) expresses the probability that the genotype at this site is not the homozygous reference state.
• Q(max_gt): The value of Q(max_gt) expresses the probability of the most likely genotype state at this site, reported as the value "max_gt". Note that the value Q(max_gt) corresponds to a value referred to as "consensus quality" in SNP calling methods such as samtools pileup.
• Q(max_gt|poly_site): One additional score is provided by the SNP caller which can be used to look at sites for which there is a strong expectation that the site is polymorphic. This value is Q(max_gt|poly_site), which expresses the probability of the most likely genotype state at the site, assuming the site is polymorphic. This state is separately reported as the value "max_gt|poly_site". This genotype value and quality score provide greater sensitivity when looking at, for example, a particular set of polymorphic sites from dbSNP. This value should not be used to evaluate the genotype for every position in the genome, as this would result in a high number of false positive SNP predictions.

To accommodate
104. e sample at a rate of 25% and a well-balanced matrix is used for analysis, the graph will display all channels with similar intensities. If intensities are not similar, the results could indicate either poor cross-talk correction or poor absolute intensity balance among the channels.

A genome rich in GC content may not provide a balanced matrix for accurate cross-talk correction and absolute intensity balance.

Mismatch Graphs

The Mismatch Graphs link leads to a file with graphs of error rates on a flow cell. The red bar shows the percentage of bases at each cycle that are wrong, as calculated based on alignment to the reference sequence. Issues such as focus or fluidics problems manifest themselves as spikes in the graph.

ELANDv2e is capable of aligning against large genomes, such as human, in reasonable time. However, it allows only two errors per seed. This means that error rates based on ELANDv2e alignments are underestimated.

Mismatch Curves

The Mismatch Curves link leads to a file with graphs of the proportion of reads in a tile that have 0, 1, 2, 3, or 4 errors by the time they get to a given cycle.

Additional Paired Statistics

For samples for which eland_pair analysis was performed, there is a table called Additional Paired Statistics. This table provides statistics about the alignment outcomes of the two reads individually and as a pair: the
105. Parameter: Definition

OUT_DIR
Path to configureAlignment output. The path must be to a directory not already present. Defaults to <run_folder>/Aligned.
Note that there can be only one Aligned directory by default. If you want multiple Aligned directories, you will have to use this option to generate a different output directory.

DATASET_POST_RUN_COMMAND /yourPath/yourCommand yourArgs
Allows user-defined scripts to be run after all configureAlignment targets have been built. Invoked per barcode/lane for multiplexed samples, per lane for non-multiplexed samples. See also Using DATASET_POST_RUN_COMMAND on page 66.

EMAIL_LIST user@example.com user2@example.com
EMAIL_SERVER mailserver
EMAIL_DOMAIN example.com
Send a notification to the user at the end of an analysis run. For more information on email notification, see Setting Up Email Reporting on page 116.

WEB_DIR_ROOT file://server.example.com/share
Include hyperlinks with a specific prefix to the run folder.

NUM_LEADING_DIRS_TO_STRIP
Specifies the number of directories to strip from the start of the full run folder path before prepending the WEB_DIR_ROOT.

ELAND_RNA_GENOME_ANNOTATION
ELAND_RNA_GENE_MD_GROUP_LABEL
KAGU_PARAMS

ELAND_RNA_GENOME_CONTAM
Points to the folder containing a set of contaminant sequences for the genome, typically the mitochondrial and ribosomal sequences. The files must be in si
106. e RPKM for exons and genes is calculated slightly differently than RPKM for splice junctions.

The normalized values for genes and exons are counted as follows:

Exons/genes RPKM = 10^9 x Cb / (Nb x L)

With:
• RPKM: Reads Per Kilobase of exon model per Million mapped reads
• Cb: the number of bases that fall on the feature
• Nb: total number of mapped bases in the experiment
• L: the length of the feature in base pairs

The normalized values for splice junctions are counted as follows:

Splice junctions RPKM = 10^9 x Cr / (Nr x L)

With:
• Cr: the number of reads that cover the junction point
• Nr: total number of mapped reads in the experiment
• L: the length of the feature in base pairs

Only the reads with alignment score >= QVCutoffSingle are considered.

Exons that have overlapping exons from other genes on the forward or reverse strand are excluded from counting and are also not included when computing the total gene length.

Reference

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5:585-7.

Qseq Conversion

Introduction    160
Qseq Converter Input Files    161
Running Qseq Converter    163
Qseq Converter Parameters    164
Qseq Converter Output Data    165
107. e calling (default 0.02).
Example: --variantsCanIndelMinFrac 0.01

--variantsSmallCanIndelMinFrac FLOAT (SE, PE)
In addition to the above filter for all indels, for indels of size 4 or less, unless the indel is observed in at least this fraction of intersecting reads, the indel cannot become a candidate for realignment and genotype calling (default 0.1).
Example: --variantsSmallCanIndelMinFrac 0.2

--variantsIndelErrorRate FLOAT (SE, PE)
Set the indel error rate used in the indel genotype caller to a constant value of f (0 < f < 1). The default indel error rate is taken from an empirical function accounting for homopolymer length and indel type (i.e. insertion or deletion). This flag overrides the default behavior with a constant error rate for all indels.
Example: --variantsIndelErrorRate 0.5

--variantsSummaryMinQindel INTEGER (SE, PE)
The indels.txt files contain all positions where Q(indel) > 0; however, it is expected that only a higher Q(indel) subset of these will be used, dependent upon the false positive tolerance of a user's workflow. For this reason summary statistics about the called indels are created at a higher (average application) threshold, which can be set using this option (default is 20).
Example: --variantsSummaryMinQindel 25

--variantsMaxIndelSize INTEGER (SE, PE)
Sets the maximum indel size for realignment and
108. e genome, splice junctions, and contaminants using ELANDv2e. For more information on eland_rna, see Using ANALYSIS eland_rna on page 70.

ANALYSIS none (Application: Any)
Omits the indicated lane from the analysis. Setting the parameter 8:ANALYSIS none ignores lane 8.

WARNING: Default for USE_BASES is Y*n, which means perform a single-read alignment and ignore the last base. If running ANALYSIS eland_pair, make sure to specify the USE_BASES option for two reads (for example Y*n,Y*n).

USE_BASES Option

The USE_BASES option identifies which bases of a full read produced by a sequencing run should be used for the alignment analysis. A fully expanded USE_BASES value is a string with one character per sequencing cycle, but more compact formats can be used, as described in USE_BASES Option on page 62. Each character in the string identifies whether the corresponding cycle should be aligned. The following notation is used:

• A lower case "n" means ignore the cycle.
  NOTE: Prephasing correction cannot be applied to the last base, since you need to know the next base in the sequence. Thus there will be a minor error increase at the last base. Ignoring the last base from the sequence analysis can reduce alignment errors somewhat. For this reason Illumina recommends that if "n" bases of sequence are desired, "n+1" cycles should be run.
• An upper case "Y" means use the cycle for the alignment.
• A comma "," denotes a
109. e>.bcl

The *.bcl files are binary base call files with the format described below.

Bytes: Description (Data type)
Bytes 0-3: Number N of cluster (unsigned 32-bit little-endian integer)
Bytes 4-(N+3), where N is the cluster index: Bits 0-1 are the bases, respectively [A, C, G, T] for [0, 1, 2, 3]; bits 2-7 are shifted by two bits and contain the quality score. All bits "0" in a byte is reserved for no-call. (unsigned 8-bit integer)

Stats Files

The stats files can be found in the BaseCalls directory:
<RunDirectory>/Data/Intensities/BaseCalls/L00<lane>/C<cycle>.1

They are named as follows:
s_<lane>_<tile>.stats

The stats file is a binary file containing base calling statistics; the content is described below. The data is for clusters passing filter only.

Start: Description (Data type)
Byte 0: Cycle number (integer)
Byte 4: Average Cycle Intensity (double)
Byte 12: Average intensity for A over all clusters with intensity for A (double)
Byte 20: Average intensity for C over all clusters with intensity for C (double)
Byte 28: Average intensity for G over all clusters with intensity for G (double)
Byte 36: Average intensity for T over all clusters with intensity for T (double)
Byte 44: Average intensity for A over clusters with base call A (double)
Byte 52: Average intensity for C over clusters with base call C (double)
Byte 60: Average intensity for G over clusters with base call G (double)
Byte 68: Average intensity for T over
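The bcl byte layout described above (a 32-bit little-endian cluster count, then one byte per cluster holding the base in bits 0-1 and the quality in bits 2-7) can be decoded as follows. This is a minimal sketch based only on that description; the function name `decode_bcl` and the use of "N" with quality 0 to represent a no-call are assumptions made for illustration.

```python
import struct
from typing import List, Tuple

BASES = "ACGT"  # codes 0, 1, 2, 3 in bits 0-1


def decode_bcl(data: bytes) -> List[Tuple[str, int]]:
    """Return a (base, quality) pair for every cluster in a bcl byte string."""
    (n_clusters,) = struct.unpack_from("<I", data, 0)
    body = data[4:4 + n_clusters]
    if len(body) != n_clusters:
        raise ValueError("file is shorter than the header claims")
    calls = []
    for byte in body:
        if byte == 0:
            # An all-zero byte is reserved for no-call ("N" is our own label).
            calls.append(("N", 0))
        else:
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls
```

A byte with value 122 (binary 01111010) decodes to base code 2 ("G") with quality 30.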
110. e idxProj as their project field:

  PROJECT idxProj ANALYSIS eland_pair
  PROJECT idxProj USE_BASES Y*n,Y*n
  PROJECT idxProj ELAND_GENOME x/y/z/G1

Align only PhiX of idxProj (assuming there are 2 references for idxProj: hum and PhiX)

Disable analysis by default so that anything not explicitly described is not analysed:

  ANALYSIS none

Disable analysis for noIdxProj. This will take priority over REFERENCE-scope attributes below:

  PROJECT noIdxProj ANALYSIS none

Set REFERENCE-scope variables so that when the data belongs to PhiX, they have an effect. noIdxProj will not be analysed, as PROJECT scope has higher priority:

  REFERENCE phix ANALYSIS eland_pair
  REFERENCE phix USE_BASES Y*n,Y*n
  REFERENCE phix ELAND_GENOME x/y/z/GP

Align only human for Lane 2 (assuming 2 references for idxProj: human, PhiX)

Disable analysis by default so that anything not explicitly described is not analysed:

  ANALYSIS none

Notice that everything below is set only for lane 2, so the rest of the data has ANALYSIS none from above.

Disable analysis for noIdxProj. This will take priority over REFERENCE-scope attributes below:

  2: PROJECT noIdxProj ANALYSIS none

Set REFERENCE-scope variables so that when the data belongs to the human reference, they have an effect. noIdxProj will not be analysed, as PROJECT scope has higher priority:

  2: REFERENCE hum ANALYSIS eland_pair
  2: REFERENCE hum USE_BASES Y*n,Y*n
  2: REFERENCE hum ELAND_GENOME x/y/z/GH

Samples Without Index

Unless otherw
111. e pairs, expressed as a percentage of the total number of non-orphaned clusters passing filters, must exceed a certain number (set as decimal, for example --muf 0.1). Otherwise, no pairing is attempted and the two reads are effectively treated as two sets of single reads.
• By default, this threshold is set to 0.
• For some applications it may be useful to switch off the pairing completely by specifying --muf 1.0.

Minimum percentage of Consistent Fragments (set as decimal, for example --mcf 0.6). Of the unique pairs, the vast majority should have the same orientation with respect to each other. If they don't, it is indicative of the following problems:
• Sample prep
• A reference sequence is extremely diverged from the sample data
In such cases, no pairing is attempted and the two reads are effectively treated as two sets of single reads.
By default, the threshold for this parameter is set to 0.7.

Minimum Fragment Alignment Quality. For each cluster, all possible pairings of alignments between the two reads are compared. This is the score of the best one. Since we are considering the two reads as one fragment, both reads in a cluster get the same paired-read alignment score.

The alignment score is nominally on a Phred scale. However, it is probably not safe to assume the calibration is perfect. Nevertheless, it is a good discriminator between good and bad alignments. The score must exceed this threshold to go in the export.txt.gz fi
112. e predicted genotype is not CC.

The CASAVA 1.8 model for indels comprises three possible indel genotypes: homozygous, heterozygous, or not present.

NOTE: It is possible to have high confidence that the genotype is not the reference without having high confidence in exactly what the genotype is at the site. In this situation there is strong evidence of a SNP, but the exact genotype at the site is less certain.

Q(snp)
The SNP caller's site genotyping methods take a set of base calls and associated qualities for each site, and produce a probability distribution over the 10 diploid genotype states (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT). Given this probability distribution, the value Q(snp) is a Q-score expressing the probability that the site genotype is not that of the homozygous reference.

NOTE: The diploid genotypes are printed out as two-letter codes representing two unphased, single-base alleles. For each heterozygous genotype the two alleles are provided in alphabetical order (e.g. CT will be used and not TC).

For example, if the reference base is C, and the probability of the reference genotype CC is 0.001, the value for Q(snp) is 30, reflecting a relatively high confidence that at least one non-reference allele exists at this site.

Prior Probabilities and Quality Scores
An important component of the SNP calling model is the prior probability distribution over diploid genotypes. The prior distribution expresses the information available a
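The Q(snp) example in the chunk above (a reference genotype probability of 0.001 giving a score of 30) follows from the standard Phred transform. The sketch below assumes that standard transform; the function names and the example distribution are hypothetical and are not taken from CASAVA.

```python
import math


def phred_from_probability(p):
    """Standard Phred scaling: Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def q_snp(genotype_probs, reference_base):
    """Phred-scale the probability of the homozygous reference genotype.

    genotype_probs maps two-letter genotype codes (e.g. "CC", "CT") to
    probabilities; only the reference entry is used here.
    """
    return phred_from_probability(genotype_probs[reference_base * 2])


if __name__ == "__main__":
    # Hypothetical distribution for a site whose reference base is C.
    probs = {"CC": 0.001, "CT": 0.9, "TT": 0.099}
    print(round(q_snp(probs, "C")))
```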
113. e provides the location of the configureAlignment data (export.txt.gz) for each flow cell run, and describes the properties of each flow cell run. There is one run section for each flow cell run, one set section for each Aligned folder in each flow cell run, and one lane section for each lane in each set. The run.conf.xml file can be provided (created) by the user, or CASAVA will generate it automatically based on command line options. The run.conf.xml file should be placed in buildDirectory/conf.

Pair.xml
The pair.xml file provides information about pair distribution in the configureAlignment output (only for paired-end sequencing). Pair.xml is required for paired samples to be treated as paired. You do not need to point to it specifically, since it should have been placed in the Aligned Project Sample folder for your sample by configureAlignment.

Genomesize.xml
The genomesize.xml file contains names of reference genomes, and is required for variant detection. You do not need to point to it specifically, since it should have been placed in the Aligned Project Sample folder for your sample by configureAlignment.

Reference Genome
CASAVA uses a reference genome in FASTA format. Both single-sequence FASTA and multi-sequence FASTA genome files are supported.

Genome sequence files for most commonly used model organisms are available through iG
114. e reads, and the affected portions of these reads have high error rates and unreliable base calls. Typically, the increase in phasing causes quality scores to be low in these regions, and thus these unreliable bases are scored correctly.

However, the occurrence of phasing artifacts may not always correlate with segments of high miscall rates and biased base calls, and therefore these low-quality segments are not always reliably detected by our current quality scoring methods. We therefore mark all reads that end in a segment of low quality, even though not all marked portions of reads will be equally error prone.

The read segment quality control metric identifies segments at the end of reads that may have low quality and unreliable quality scores. If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter   in Illumina's text-based encoding of quality scores), while the rest of the quality values within the read remain unchanged. We flag these regions specifically because the initially assigned quality scores do not reliably predict the true sequencing error rate. This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses.

This is not a read-level filter: the occurrence of consecutive Q2 values in a read does not indicate that the read itself i
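A downstream tool can honour the Q2 indicator described in the chunk above by locating the trailing run of Q2 values and ignoring those bases. The sketch below is an assumption-laden illustration, not part of CASAVA: it works on already-decoded numeric quality values, so it does not depend on which text encoding is in use, and the function name is hypothetical.

```python
def split_q2_tail(quals):
    """Split a read's quality values at the trailing run of Q2 values.

    Returns (usable_length, masked_length). Only a run of Q2 values at
    the very end of the read is treated as the marked segment; isolated
    Q2 values elsewhere in the read are left alone.
    """
    end = len(quals)
    while end > 0 and quals[end - 1] == 2:
        end -= 1
    return end, len(quals) - end


if __name__ == "__main__":
    quals = [30, 32, 2, 28, 2, 2, 2]
    usable, masked = split_q2_tail(quals)
    print(usable, masked)
```

Because this is not a read-level filter, the usable prefix of the read is kept rather than discarding the whole read.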
115. e reference sequence occur within 20 bases of the call. Note that this filter treats each insertion or deletion as a single mismatch.
If the call occurs within the first or last 20 bases of a read, then the mismatch limit is applied to the 41-base window at the corresponding end of the read.
The mismatch limit is applied to the entire read when the read length is 41 or shorter.
All bases marked by the mismatch density filter, together with any 'N' base calls which remain after the end-trimming step, are filtered out by the variant caller. These filtered base calls are not used for site genotyping but appear in the filtered base call counts in the variant caller's output for each site.

7 All remaining base calls are used for site genotyping. The genotyping method heuristically adjusts the joint error probability calculated from multiple observations of the same allele on each strand of the genome to account for the possibility of error dependencies between these observations. The method accomplishes this by treating the highest quality base call of each allele from each strand as independent observations, leaving their associated base call quality scores unmodified. However, subsequent base calls for each allele and strand have their qualities adjusted to increase the joint error probability of that allele above the error expected from independent base call observations.

After running the site genotyper on all positions, a set of unfiltered SNP sites is 
116. e within the feature

NOTE: For overlapping genes with different gene names, only the non-overlapping portions for each gene participate in count generation.

Exon counts are the sum of base coverages from genomic and spliced reads. Gene counts are therefore the sum of exon counts. Junction counts (in reads) are provided for historical reasons and for alternative splicing analysis.

An example of a chromosome genes count txt file opened in Excel is shown below.

Figure 18 Chromosome genes count txt File Opened in Excel
[Figure: spreadsheet listing genomic start and end coordinates with gene names]

Requirements and Software Installation
Hardware and Software Requirements 112
Installing CASAVA 116

Hardware and Software Requirements

Network Infrastructure
The large data volumes generated and moved when running CASAVA mean that you will need the following:
1 A high-throughput ethernet connection 1
117. eads    Spanning read score threshold. This is calculated in exactly the same way as --indelsSpReadThresholdIndels. However, it is used in the opposite way: here the point is to find reads with few or no mismatches, which are presumed to arise from repeats and not from indels, and exclude them from the clustering process.
Minimum coverage to extend contig (default 3).

Option    Application    Description
--indelsMinContext=NUMBER    PE    Demand at least x exact matching bases either side of variant (default is 6). The idea here is to ensure that an indel has a minimum number of exactly matching bases on either side. Setting this to zero might be good for finding reads which align to breakpoints.
--indelsSaveTempFiles    PE    Add this flag to save intermediate output files from each stage of the indel assembly process.

Options for Target callSmallVariants
The options described below are used to specify analysis for target callSmallVariants.

Table 24 Workflow Options for callSmallVariants
Option    Application
--variantsSkipContigs    PE
--variantsNoSitesFiles    SE, PE
--variantsNoReadTrim    SE, PE
--variantsWriteRealigned    SE, PE

Table 25 Read Mapping Options for callSmallVariants
Option    Application
--variantsIncludeAnomalous    PE
--variantsIncludeSingleton    PE
--variantsSEMapScoreRescue    PE

Description
By default information from the assembleIndels module is used (and required) in pair
118. ed end DNA Sequencing analysis. This option disables use of indel contigs during variant calling, and only uses gapped alignment to find indels.
Example: --variantsSkipContigs
Do not write out the sites.txt.gz files.
Example: --variantsNoSitesFiles
By default, the ends of reads can be trimmed if the alignment path through an indel is ambiguous. This option disables read trimming and chooses the ungapped sequence alignment for any ambiguous read segment. Note that this can trigger spurious SNP calls near indels.
Example: --variantsNoReadTrim
Write only those reads which have been realigned to a BAM file (sorted.realigned.bam) for each reference sequence.
Example: --variantsWriteRealigned

Description
Include paired-end reads which have anomalous insert size or orientation. Note that --variantsSEMapScoreRescue must also be specified because ELAND gives anomalous reads a PE mapping score of zero.
Include paired-end reads which have unmapped mate reads. Note that --variantsSEMapScoreRescue must also be specified because ELAND gives singleton reads a PE mapping score of zero.
Include reads if they have an SE mapping score equal to or above that set by the --QVCutoffSingle option, even if the read pair fails the PE mapping score threshold.

Table 26 SNP and Indel Options for callSmallVariants
Option    Application    Description
--variantsNoCovCutoff    SE, PE    Disables the SNP and indel coverage filters 
119. ence directory is write protected
--force    Forces existing output files to be overwritten.
--input-type <input format>    -it    Type of input file: FASTQ, FASTA, export, or qseq.
--log <path to log>    -l    The path to the log file. Default: ELAND_standalone.log
--output-directory <output dir>    -od    The output directory.
--output-prefix <prefix>    -op    Produces a set of output files with a prefix of this value (default value is "reanalysis").
--kagu-options <options>    -ko    Indicates paired-read analysis parameters to pass to alignmentResolver (e.g. -ko "-c" enables circular reference sequence support). Multiple arguments must be contained in quotation marks.
--remove-temps    -rt    Removes all files except exports, BAM files, and log files upon successful completion.
--seed-length <value>    -sl    Length of read substring (seed) used for ELAND alignment (defaults to the lower of read length and 32). Use twice for paired-end data sets.
--use-bases <value>    -ub    Expanded mask to apply to the FASTQ file (two values if paired analysis). Defaults to Y*n.
--help    -h    Shows help text.

NOTE: The orphan aligner is always enabled when performing paired-end analysis with ELAND_standalone, just like configureAlignment.

The orphan aligner is always enabled when performing paired-end analysis with ELAND_standalone, just like GE
120. end (eland_extended), paired-end (eland_pair), and single-end RNA (eland_rna) analysis. The default behavior of configureAlignment.pl is to perform a multi-seeded, gapped alignment. This allows for the identification of small indels (≤ 10 nt) during alignment (a gap of up to 10 bases can be opened during seed extension).
DNA: The eland_extended and eland_pair analysis modes can be used to align reads to a genome. The types of experiments supported include genome resequencing, exome capture, targeted capture, and ChIP-Seq data.
Methylation: There is currently no support for aligning Bisulfite-Seq data with ELAND.
RNA: eland_rna will align transcriptome data. Transcript data is limited to single reads that cross at most one splice junction. eland_rna cannot align paired-end data. For paired-end transcriptome data, it is recommended that a third-party tool such as Bowtie/TopHat be used.

Variant Analysis
Variant analysis and RNA counting are controlled by the configureBuild.pl script. The script can be used to describe the following types of variation:
Site genotypes and SNPs: Homozygous and heterozygous single nucleotide variants (SNPs) are called using a Bayesian site genotyping model, which takes into account base calls, quality scores, and alignment scores of the reads at the given position.
Indels: Indels are called using a two-stage process. First, contigs are assembled from poorly aligned anomalous reads and aligned back to the 
121. enome (Getting Reference Files on page 128).

Single Sequence FASTA Files
CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name.
Direct CASAVA to a folder containing the FASTA files using the option --refSequences=PATH for variant detection and counting.

Multi Sequence FASTA Files
As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single genome (SAM-compliant, unsquashed) file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in the header for each sequence.
Direct CASAVA to the multi-sequence FASTA file using the option --samtoolsRefFile=FILE for variant detection and counting.

WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files.

Chromosome Naming Restrictions
CASAVA does not accept the following characters in the chromosome name    F A LAP ere eee ep ED EE

refFlat.txt.gz or seq_gene.md.gz File
CASAVA 1.8 generates the non-overlapping exon coordinates set automatically using the refFlat.txt.gz file (from UCSC) or seq_gene.md.gz file (from NCBI). They should be from the same build as the reference files
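The two naming rules in the chunk above (file name for single-sequence FASTA, first word of each header for multi-sequence FASTA) can be illustrated with a short Python sketch. The function names are hypothetical, and whether CASAVA keeps or strips the file extension is an assumption here: the sketch keeps the file name unchanged.

```python
import io
import os


def names_from_multi_fasta(handle):
    """Collect the first word of every '>' header line."""
    names = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def name_from_single_fasta(path):
    """Use the file name itself as the chromosome name (assumption:
    the extension is kept, e.g. 'chr20.fa')."""
    return os.path.basename(path)


if __name__ == "__main__":
    fasta = io.StringIO(">chr1 Homo sapiens chromosome 1\nACGT\n>chrM\nAC\n")
    print(names_from_multi_fasta(fasta))
    print(name_from_single_fasta("/genomes/hg18/chr20.fa"))
```

Listing the names this way before an analysis run is a cheap check that headers carry the intended chromosome name as their first word.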
122. ent Input Files 48
Running configureAlignment 53
configureAlignment Output Files 73
Running ELAND as a Standalone Program 85

Introduction
The CASAVA module configureAlignment performs sequence alignments. This chapter describes running configureAlignment, parameters, analysis variables, configuration file options, and ELANDv2e alignments.

NOTE: For installation instructions, see Requirements and Software Installation on page 111.

Configuring configureAlignment
You can define configureAlignment analysis parameters in a configuration file or in the command line. Command line arguments take precedence over parameters set in the configuration file. For a full description of analysis parameters and variables, see configureAlignment Parameters Detailed Description on page 61.

configureAlignment uses multiple analysis parameters. Therefore, it is recommended to include the parameters in a configuration file and provide that file as input to configureAlignment.

configureAlignment and Align As You Go
Bcl conversion supports alignment of the first read of a paired-end run before completion of the run (align as you go). You can kick off alignment for read 1 using the target r1 when running make at any time after Bcl
123. ented in a range of locations, the caller attempts to report it in the left-most position possible.

String summarizing the indel type. One of:
• nI: Insertion of length n (e.g. 10I is a 10-base insertion)
• nD: Deletion of length n (e.g. 10D is a 10-base deletion)
• BP_LEFT: Left-side breakpoint
• BP_RIGHT: Right-side breakpoint

Segment of the reference sequence 5' of the indel event. For right-side breakpoints this field is set to the value "N/A".

Equal-length sequences corresponding to the reference and indel alleles which span the indel event. The character "-" indicates a gap sequence of the reference or the indel allele.

Segment of the reference sequence 3' of the indel event. For left-side breakpoints this field is set to the value "N/A".

Phred-scaled quality score of the indel, which refers to the probability that this indel does not exist at the given position. The Q-values given only reflect those error conditions which can be represented in the indel calling model, which is not comprehensive. See also Quality Scores on page 148. By default the variant caller reports all indels with Q(indel) > 0.

Most probable indel genotype. The indel genotype categories are as follows:
hom refers to a homozygous indel.
het refers to a heterozygous indel.
ref refers to no indel at this position.

Note that these do not refer to true genotypes where indels overlap, because the model is not capable of jointly calling overlapping indels
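The indel type strings listed in the chunk above are simple to parse. The following Python sketch is an illustration under assumptions: the breakpoint tokens are assumed to be spelled with underscores (BP_LEFT, BP_RIGHT), and the function name and returned dictionary layout are hypothetical rather than anything CASAVA defines.

```python
import re

# n followed by I (insertion) or D (deletion), e.g. "10I" or "3D".
_INDEL_RE = re.compile(r"^(\d+)([ID])$")


def parse_indel_type(token):
    """Turn an indel type string into a small dictionary."""
    if token == "BP_LEFT":
        return {"kind": "breakpoint", "side": "left"}
    if token == "BP_RIGHT":
        return {"kind": "breakpoint", "side": "right"}
    match = _INDEL_RE.match(token)
    if match is None:
        raise ValueError("unrecognized indel type: %r" % token)
    kind = "insertion" if match.group(2) == "I" else "deletion"
    return {"kind": kind, "length": int(match.group(1))}


if __name__ == "__main__":
    for token in ("10I", "10D", "BP_LEFT"):
        print(token, parse_indel_type(token))
```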
124. er of reads against a genome.

As of CASAVA 1.6 a new version of ELAND is available, ELANDv2. The most important improvement of ELANDv2 is its ability to perform multiseed and gapped alignments. As a consequence, ELANDv2 handles indels and mismatches better.

CASAVA 1.8 also contains a new version of ELAND (ELANDv2e), with an orphan aligner, repeat resolution, and performance enhancements.

Input and Output Files
For a detailed description of the input and output files for ELANDv2e, see configureAlignment Input Files on page 48 and configureAlignment Output Files on page 73.

ELANDv2 Algorithm Description

Multiseed and Gapped Alignment
ELANDv2 introduces multiseed and gapped alignments:
Multiseed alignment works by aligning the first seed of 32 bases and consecutive seeds separately.
Gapped alignment extends each candidate alignment to the full length of the read, using a gapped alignment method that allows for gaps up to 10 bases.
A "match descriptor" string in the output file (see Output File Formats on page 1 encodes which bases in the read matched the genome and which were mismatches, and reports the gaps using the escape sequence (see Export.txt.gz on page 79).

The differences between gapped and ungapped alignments, and singleseed and multiseed alignments, are illustrated below.

Figure 19 Ungapped Versus Gapped Alignment

Ungapped Alignmen
125. erate the Aligned analysis folder, and subsequently run the analysis.

Rerunning the Analysis
The config.txt file used to generate an analysis is copied to the analysis folder so it can be used by configureAlignment if a reanalysis of the same data is required.

Parallelization Switch
If your system supports automatic load sharing to multiple CPUs, you can parallelize the analysis run to <n> different processes by using the "make" utility parallelization switch:
make all -j n
For more information on parallelization, see Using Parallelization on page 119.

Nohup Command
You should use the Unix nohup command to redirect the standard output and keep the "make" process running even if your terminal is interrupted or if you log out. The standard output will be saved in a nohup.out file and stored in the location where you are executing the makefile. nohup.out can be used by Illumina Technical Support for troubleshooting should problems arise.
nohup make all -j n &
The optional "&" tells the system to run the analysis in the background, leaving you free to enter more commands.

Starting Alignment for Read 1
If you want to start alignment before completion of the run, use the makefile target r1. This can be started once Bcl conversion for read 1 has finished (Starting Bcl Conversion for Read 1 on page 35).

Set up a regular configureAlignment analysis, but run make using the r1 target, for example:
nohup make -j 16 
126. es. The files can be derived from the following sources:
Mitochondrial DNA
Ribosomal repeat region sequences
5S RNA (optional)
Other contaminants (for example phiX, if phiX spikes are used)

eland_rna squashes the provided FASTA files at the start automatically, similar to the genome sequence files.

refFlat.txt.gz or seq_gene.md.gz File (eland_rna)
As of CASAVA 1.7, eland_rna uses the refFlat.txt.gz or seq_gene.md.gz file to generate the splice junction set automatically. The refFlat.txt.gz file comes from UCSC, while the seq_gene.md.gz file comes from NCBI; both are available through iGenomes. They should be provided gzip compressed, and should be from the same build as the reference files you are using for alignment. This negates the need to provide separate splice junction sets as in previous versions of CASAVA.

Variant Detection and Counting Reference Files
CASAVA variant detection and counting needs two types of files to analyze RNA Sequencing data:
Genome sequence files
refFlat.txt.gz or seq_gene.md.gz File (RNA-Seq)

NOTE: CASAVA for DNA sequencing only needs the genome sequence files.

Reference Genome
CASAVA uses a reference genome in FASTA format. Both single-sequence FASTA and multi-sequence FASTA genome files are supported.

Genome sequence files for most commonly used model organisms are available through iGenome (Getting Reference Files on page 128).

Single Sequence FASTA Files
CASA
127. es on page 128).

Getting Reference Files
To run CASAVA, you will need to download genome and other reference files. You can use iGenome for most commonly used model organisms. This is explained in this section.

Illumina Provided Genomes
Illumina provides a number of commonly used genomes at ftp.illumina.com along with a reference annotation:
Arabidopsis_thaliana, Bos_taurus, Caenorhabditis_elegans, Canis_familiaris, Drosophila_melanogaster, Equus_caballus, Escherichia_coli_K_12_DH10B, Escherichia_coli_K_12_MG1655, Gallus_gallus, Homo_sapiens, Mus_musculus, Mycobacterium_tuberculosis_H37RV, Pan_troglodytes, PhiX, Rattus_norvegicus, Saccharomyces_cerevisiae, Sus_scrofa
You can log in using the following credentials:
Username: igenome
Password: G3nom3s4u

For example, download the FASTA, annotation, and bowtie index files for the human hg18 genome with the following command:
wget --ftp-user=igenome --ftp-password=G3nom3s4u ftp://ftp.illumina.com/Homo_sapiens/UCSC/hg18/Homo_sapiens_UCSC_hg18.tar.gz

Unpack the tar file:
tar xvzf Homo_sapiens_UCSC_hg18.tar.gz

Unpacking will make its own folder: Homo_sapiens/UCSC/hg18

Abundant Sequence Files (RNA-Seq)
Process the abundant sequence files the following way:
1 Generate a folder for abundant sequences.
2 Collect FASTA files for abundant sequences in the abundant sequences folder, fo
128. esult in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY.
Performs orphan alignment by identifying read pairs for which only one of the reads aligns. ELANDv2e then tries to align the other read in a defined window (by default 450 bp).

Configuring a Paired Read Analysis
The alignments of the two reads that provide input to the pairing process may be varied by setting ELAND_SEED_LENGTH and ELAND_MAX_MATCHES. Both parameters may be set lane by lane, but the same values will apply to each of the two reads in a lane.

The paired-read analysis may be configured by passing options to alignmentResolver. This is done by setting a parameter KAGU_PAIR_PARAMS in the configureAlignment configuration file. For additional information, see KAGU_PAIR_PARAMS and KAGU_PARAMS on page 65.

KAGU_PAIR_PARAMS can be specified lane by lane. All of the options must be specified on a single line and space separated, as in the following example:
8:KAGU_PAIR_PARAMS --circular --muf 0

Using ANALYSIS eland_rna
eland_rna is the ELAND module built specifically for RNA Sequencing, and is required to provide the input files for CASAVA. eland_rna delivers the following information:
Read alignments to the genome
Read alignments to splice junctions
Read alignments to contaminants
NO
129. eter theta for indels is set to a default value of 1/10,000.

Q(indel)
Once the candidate indels are identified, the variant caller realigns all intersecting reads to each candidate indel, in addition to aligning the read to the reference and any alternate indel candidates at the same site. The relative likelihoods of all alignments for each read are used to assign probabilities to each of three possible indel genotypes: homozygous, heterozygous, or not present.

The associated quality score Q(indel) expresses the probability that the non-reference indel allele referred to in the indel call exists in the sample as either a heterozygous or homozygous variant (analogous to Q(snp)).

SNP Caller Reporting
The SNP caller reports the following files:
snps.txt: SNPs for each chromosome are summarized within each chromosome directory in a file called snps.txt. This file contains SNPs which have been called by CASAVA's callSmallVariants module.
sites.txt.gz: As part of the SNP calling process, the variant caller also outputs information on coverage and consensus genotype for every mapped site in the genome. These results are found in each chromosome bin directory in a gzip-compressed file called sites.txt.gz.
snps.removed.txt: As a final noise filtration step, the SNP calls in the snps.txt files have been filtered to remove SNP calls in regions close to centromeres and other high copy number regions. The SNP calls filtered out by this procedure can b
130. exed samples need to be processed.
The FASTQ files for both multiplexed and non-multiplexed samples are organized using the Project and Sample concepts, as governed by the sample sheet.
configureAlignment uses the sample sheet to identify projects and samples, and the sample organization as described in the sample sheet should always match the actual Unaligned folder organization.

As a result of these changes, configureAlignment expects the following input files:
An Unaligned directory with fastq.gz files (even for cases where only one project exists).
A config.txt file, which specifies the analysis.
A base calling config.xml file (DemultiplexedBustardConfig.xml).
A FASTA reference genome for alignments.
For RNA applications, additional files are required.

This section explains these files. For file locations, see the figure below. Note that the reference files may be located in a different location, depending on your CASAVA installation.

Figure 11 Locations of configureAlignment Input Files
[Figure: directory tree showing the run folder with data from RTA or OLB (Intensities and BaseCalls, by lane and cycle), the Unaligned file structure generated by Bcl conversion and demultiplexing, and the iGenomes reference files (Homo_sapiens, UCSC, hg18, Sequence, Chromosomes; FASTA and contaminants)]
131. fault. If you want multiple Aligned directories, you will have to use the option OUT_DIR to generate a different output directory.

Analysis Summary
The results of an analysis are summarized as web pages that enable a large number of graphs to be viewed as thumbnail images. This section is intended to help you interpret the various graphs that appear in an analysis directory.

For each project, a Sample_Summary.htm file is produced, which contains comprehensive results and performance measures of your analysis run for a project per sample. It is located in the Aligned project folder and provides an overview of quality metrics for a project with links to more detailed information in the form of pages of graphs.
For each project, a Barcode_Lane_Summary.htm file is produced, which contains comprehensive results and performance measures of your analysis run for a project per barcode and lane. It is located in the Aligned project folder and provides an overview of quality metrics for a project with links to more detailed information in the form of pages of graphs.
For each run, a FlowCellSummary.htm file is produced, which contains comprehensive results and performance measures of your entire analysis run across all projects. It is located in the Aligned folder.

Sample_Summary Page
For each sample, a Sample_Summary.htm file and Barcode_Lane_Summary.htm file is produced, which contains comprehensive results and performance measures of your a
132. ference sequence name. This contains the export.txt.gz file match chromosome value, and if the export.txt.gz file "Match contig" field is not empty, the SAM RNAME field will be appended with a     character followed by the match contig name. See Export.txt.gz on page 79.
MAPQ    Mapping quality. Phred-scaled posterior probability that the mapping position of this read is incorrect.
CIGAR   Extended CIGAR string. For a description, see Extended CIGAR Format on page 170.
MRNM    Mate Reference sequence NaMe; "=" if the same as <RNAME>.
MPOS    1-based leftmost mate position of the clipped sequence.
ISIZE   Inferred insert size.
SEQ     query SEQuence; "=" for a match to the reference; n/N/. for ambiguity; cases are not maintained.
QUAL    query QUALity; ASCII-33 gives the Phred base quality.
TAG     TAG for an optional field. For a description, see Optional Fields on page 171.
VTYPE   Value TYPE for an optional field. For a description, see Optional Fields on page 171.
VALUE   Match <VTYPE> for an optional field. For a description, see Optional Fields on page 171.

Bitwise Flag Values
The FLAG field in the alignment section is a bitwise flag. The meaning of predefined bits is shown in the following table.

Hexadecimal Value   Decimal Value   Description
0x0001              1               The read is paired in sequencing, no matter whether it is mapped in a pair
0x0002              2               The read is mapped in a proper pair (depends on the protocol, n
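The FLAG and QUAL conventions in the chunk above can be decoded with a few lines of Python. This is a sketch, not CASAVA code: only bits 0x0001 and 0x0002 appear in this excerpt, so the remaining bit meanings are taken from the public SAM specification, and the short labels and function names are hypothetical.

```python
# Bits 0x0001 and 0x0002 are described in this excerpt; the other
# entries follow the SAM specification. Labels are illustrative.
SAM_FLAG_BITS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse_strand",
    0x0020: "mate_reverse_strand",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
}


def decode_flag(flag):
    """List the labels of all known bits set in a SAM FLAG value."""
    return [name for bit, name in sorted(SAM_FLAG_BITS.items()) if flag & bit]


def decode_qual(qual):
    """Convert a QUAL string to Phred values (ASCII code minus 33)."""
    return [ord(char) - 33 for char in qual]


if __name__ == "__main__":
    print(decode_flag(99))
    print(decode_qual("I#5"))
```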
133. ferences and their insert size
chr20.fa F 14812275 922 108M
chr20.fa R 14812492 922 108M
chr5.fa F 99771317 966 108M
chr5.fa R 99771540 966 108M

Repeat Resolution
ELANDv2e aligns reads in repeat regions using two new modes: semi-repeat resolution and full repeat resolution. Both modes take repetitive hits into account for the multiseed pass of ELAND. Full repeat resolution is more sensitive and places more reads in repeat regions, but will result in longer run time.

By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY.

Figure 22 Changes between CASAVA 1.7 and 1.8 in multiseed ELAND alignment
[Figure panels: 1. Align the first 32 bp of reads. 2. Identify reads that do not align or hit a repetitive sequence. 3. CASAVA v1.7: Align unmapped reads using multiple seeds.]
134. for each read which intersect or include a candidate indel.
Select the most likely read realignment for subsequent site counting and genotyping.
Further filter individual basecalls based on mismatch density or ambiguity (N).
Use all remaining base calls to predict site genotypes and SNPs.
Filter to remove SNP and indel calls near the centromeres and within high copy number regions.

readBases Counting Method

As of version 1.6, CASAVA uses the readBases counting method. This method is used for exon and gene counts, and counts the number of bases that belong to each feature. Both reads that map to the genome and reads that map to splice junctions contribute to the exon base coverage value.

NOTE: Before counting, CASAVA splits alignments to a splice junction into two shorter genomic reads.

Counts for splice junctions are provided for convenience and correspond to the number of reads that cover the junction point. Bases within reads aligned to the junction are counted only once in the exon counts. The number of bases that fall into the exonic regions of each gene is summed to obtain gene-level counts, normalized according to feature size, and expressed as RPKM (Reads Per Kilobase per Million of mapped reads).

Exons that overlap exons from other genes on the forward or reverse strand are excluded from counting and are also not included when computing the total gene length.

Variant Caller and Counting Detailed Description

For a detailed description 
135. g                    99  Analysis Options forsort                   eee renee reen rreneee 99  Analysis Options for rnaCounts      2 2 2    2 22 a 100  Analysis Options for Da EER EE IE bm Nka DUDA KA BANAAG ND KE kag anG UU Ge 100  Global Analysis Options for Variant Detection and Counting                  152  Options for assemblelndels                                                  153  Workflow Options for callSmallVariants                                         154  Read Mapping Options for callSmallVariants                                  154  SNP and Indel Options for callSmallVariants                               155  SNP Options for callSmallVariants                                            155  Inde  Options for callSsmallVariants                                               156  Ilumina General Contact Information                                             179  Ilumina Customer Support Telephone Numbers                             179    Part   15011196 Rev D    Overview     gir ere   dle g a EE ee RE ec see ee eee ae et EI 2  CASAVA Features            iii 5  What s NeW    Ss eee Sec AA a a a anaa LLL oDDD a a SG GA 9  Frequently Asked Questions                     ee eee cece cece ccc c cece GE EE GEE 10    P    E S ZIM AA rer  Say  gt        ry   ATEEGGERTENTeGas TEER  Mad       CASAVA v1 8 2 User Guide 1    ffi      JOIACUD    Overview    Introduction    This user guide documents CASAVA 1 8 2  short for  Consensus Assessment of  Sequence And Y Ariation
136. he candidate indel contigs produced by assembleIndels. The procedure is outlined below.

Read in read alignments and candidate indel contigs. Filter out read alignments based on quality checks, paired-end anomalies, or ELAND alignment score. Filter out contig alignments containing adjacent insertion/deletion events.
Consolidate indel evidence from read and contig alignments to produce a set of candidate indels.
Perform local read realignment using candidate indels.
Call indels based on the set of alignments for each read which intersect or include a candidate indel.
Select the most likely read realignment for subsequent site counting and genotyping.
Further filter individual basecalls based on mismatch density or ambiguity (N).
Use all remaining base calls to predict site genotypes and SNPs.
Filter to remove SNP and indel calls near the centromeres and within high copy number regions.

Read Filtering

The variant caller performs an initial read filtering to remove reads from both SNP and indel calling based on the following criteria:
Any reads marked as failing primary analysis quality checks (e.g. failing the purity filter) or marked as PCR or optical duplicates are removed from consideration.
For paired-end reads, any reads which are not marked as being part of a "proper pair" are removed from consideration. This is intended to remove any reads from chimeric pairs, with unmapped mates or with an anomalous pair inser
137. he first three fields. For example, you may want to capture the flow cell number in the run folder name as follows: YYMMDD_machinename_XXXX_FCYYY.

NOTE: When publishing the data to a public database, it is desirable to extend the exclusivity globally, for instance by prefixing each machine with the identity of the sequencing center.

BaseCalls Directory

Demultiplexing requires a BaseCalls directory as generated by RTA or OLB (Off-Line Basecaller), which contains the binary base call files (*.bcl files).

NOTE: As of 1.8, CASAVA no longer uses *_qseq.txt files as input.

The BCL to FASTQ converter needs the following input files from the BaseCalls directory:
*.bcl files.
*.stats files.
*.filter files.
*.control files.
*.clocs, *.locs, or *_pos.txt files. The BCL to FASTQ converter determines which type of position file it looks for based on the RTA version that was used to generate them.
RunInfo.xml file. The RunInfo.xml file is at the top level of the run folder.
config.xml file.

RTA is configured to copy these files off the instrument computer to the BaseCalls directory on the analysis server. The files are described below.

Bcl Files

The *.bcl files can be found in the BaseCalls directory:
<run directory>/Data/Intensities/BaseCalls/L<lane>/C<cycle>.1

They are named as follows: s_<lane>_<til
138. hese reads will be ignored during variant calling.
Example: --sortKeepAllReads

Minimum SE alignment score to put a read to NM. Default: -1 (-1 means the option is turned off).

Ignore unanchored read pairs in indel assembly and variant calling. Unanchored read pairs have a single-read alignment score of 0 for both reads.
Example: --ignoreUnanchored

The options described below are used to specify analysis for target sort.

Table 19 Analysis Options for sort

Option                     Application   Description
--rmDup=YES|NO             PE            Turn PCR duplicate marking/removal on or off for paired-end reads (default YES).
--sortBufferSize=INTEGER   SE, PE        Buffer size used by the read sorting process, in megabytes (default 1984).
--sortKeepAllReads         SE, PE        Run the sort module in archival mode instead of the default filtered mode. See Archival Build on page 90. Example: --sortKeepAllReads

Options for Target rnaCounts

The options described below are used to specify analysis for target rnaCounts.

Table 20 Analysis Options for rnaCounts

Option            Application   Description
--refFlatFile     SE            Name and location of the UCSC refFlat.txt.gz file. The file must be gz-compressed. Example: --refFlatFile=/data/Genome/ELAND_RNA/Human/refFlat.txt.gz
--seqGeneMdFile   SE            Name and location of the NCBI seq_gene.md.gz file. Example: --seqGeneMdFile=/data/Genome/ELAND_RN
139. human8  human  TGACCA  myTest N  32 7 CB  test PRC2    Examples below illustrate use of DATASET POST RUN COMMAND     DATASET_POST_RUN_COMMAND limited to a PROJECT  Following config file for PROJECT selection     ANALYSIS    Nong    PROJECT testEPK l ANALYSIS elana excended   PROJECT tesCPRCI USE BASES YN   PROJECT testPRC1 ELAND GENOME   illumina scratch iGenomes PhiX Illumina RTA Seguence Sguashed     PhiX  Illumina RTA    PROJECT testPRC1 DATASET POST RUN COMMAND echo   project     S  sample   gt  gt  out DPRC txt    testPRC1  testPRC1  testPRCl  testPRC1  testPRC1  testPRC1  testPRC1  testPRC1    humanl  humanl  humanl  humanl  humanl  humanl  humanl    S  barcode     will generate out DPRC txt in  Aligned folder   phixl TAGES 1    CGATGT  CGATGT  CGATGT  TGACCA  TGACCA  GCCAAT  SCGGCAAT    N PF Ol W OY OO JN    DATASET POST RUN COMMAND limited to a LANE  Following config file for LANE selection     ANALYSIS none   1 ANALYSIS eland extended   IsUSE  BASE  FR   1 ELAND GENOME   illumina scratch iGenomes PhiX Illumina RTA Seguence Sguashed   PhiX Illumina RTA   1 DATASET POST RUN COMMAND echo   project    sample   S  barcode    lane   gt  gt  out DPRC txt    will generate following out DPRC txt in  Aligned folder     testPRC1 phixl TTAGGC 1  Undetermined indices lanel Undetermined 1  testPRC1 humanl GCCAAT 1  testPRC2 humanl CTTGTA 1    POST RUN COMMAND    You can also run the workflow wide POST RUN COMMAND from the make    command lane  for example     CASAVA v1 8 2 User Gu
140. i sequence FASTA genome files are supported.

Genome sequence files for most commonly used model organisms are available through iGenome (Getting Reference Files on page 128).

NOTE: As of CASAVA 1.8, you no longer need to squash the reference genome.

Single-Sequence FASTA Files

CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name.

Direct CASAVA to a folder containing the FASTA files using the option ELAND_GENOME for configureAlignment.

Multi-Sequence FASTA Files

As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single genome (SAM-compliant, unsquashed) file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in the header for each sequence.

Direct CASAVA to a multi-sequence FASTA file using the option SAMTOOLS_GENOME for configureAlignment.

WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files.

Chromosome Naming Restrictions

CASAVA does not accept the following characters in the FASTA chromosome name header: TAN LA EAT ET TG UA

This validation can be disabled in configureAlignment using the following option: CHROM NAME
141. ide

make all POST_RUN_COMMAND="echo everything is done"

Using ANALYSIS eland_extended

ANALYSIS eland_extended is an improved version of the ANALYSIS eland mode that existed in Pipeline and is now deprecated. ANALYSIS eland could align reads longer than 32 bases but demanded that the first 32 bases of the read have a unique best match in the genome. The position of this match is used as a "seed" to extend the match along the full length of the read. ANALYSIS eland_extended removes the uniqueness restriction by considering multiple 32-base matches and extending them.

Multiseed, Gapped, Repeat Alignment

ANALYSIS eland_extended provides the following alignment features implemented in ELANDv2 and ELANDv2e:

By default, performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately.
Uses a gapped alignment method to extend each candidate alignment to the full read length, allowing for gaps (indels) of up to 10 bases.
Aligns reads in repeat regions using two new modes, semi-repeat resolution and full-repeat resolution. Full-repeat resolution is more sensitive and places more reads in repeat regions, but results in a longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full-repeat resolution can be turned on with the option INCREASED_SENSITIVITY.

Configuring ANALYSIS eland_extended

There are three parameters that affect the output of the al
142. ignment: ELAND_SEED_LENGTH1, ELAND_SEED_LENGTH2, and ELAND_MAX_MATCHES. All three parameters can be specified lane by lane.

The following table describes the parameters for ANALYSIS eland_extended.

Table 9 Parameters for ANALYSIS eland_extended

Parameter: ELAND_SEED_LENGTH1, ELAND_SEED_LENGTH2
Description: By default, the first 32 bases of the read are used as a "seed" alignment. Setting ELAND_SEED_LENGTH1 to 25 will use 25 bases in read 1 instead of the maximum of 32 for the initial seed alignment. This should increase the sensitivity, since two errors per 25 bases is less stringent than two errors per 32 bases.
A read is more likely to be repetitive at the 25-base level than at the 32-base level, so a decrease in ELAND_SEED_LENGTH should probably be used in conjunction with an increase in ELAND_MAX_MATCHES. Setting this to very low values will drastically slow down the alignment and will probably result in many poor-confidence alignments.

Parameter: ELAND_MAX_MATCHES
Description: By default, ANALYSIS eland_extended will consider at most ten alignments of each read. ELAND_MAX_MATCHES allows the maximum number of alignments considered per read to be varied between 1 and 255.

Both ANALYSIS eland_extended and ANALYSIS eland_pair produce export files that contain all read, quality value, and alignment information for the analysis.

For a detailed description of the export.txt.gz files, see Text-Based Analysis Results on page 51.
143. ijn graph of m symbols is a graph representing overlaps between sequences. De Bruijn graphs are used for de novo assembly of short read sequences into a genome.

deprecated
Deprecated refers to software features that are superseded and should be avoided. Although deprecated features remain in the current version, their use may raise warning messages, and deprecated features may be removed in the future. Features are deprecated (rather than being removed) in order to provide backward compatibility and give programmers who have used the feature time to bring their code into compliance with the new standard.

K

k-mer hashing
Hashing refers to the use of subfragments of a particular read to find matching pieces of DNA in a hash table. k-mer refers to the size of the fragment used for hashing.

O

orphan reads
An orphan read is the unaligned part of paired reads for which only one read aligned. Identical to shadow read.

S

shadow read
A shadow read is the unaligned part of paired reads for which only one read aligned. Identical to orphan read.

singleton
A singleton is the aligned part of paired reads for which only one read aligned.

W

wrapper script
A wrapper script is a script whose main purpose is to call a second function in a computer program with little or no additional computation.

Index

A
abundant sequences 70
All.htm file 17
analysis output
144. [Figure: configureAlignment input files. The Unaligned directory contains Project_A and Project_B folders with per-sample *.fastq.gz files, an Undetermined_Indices folder with per-lane *.fastq.gz files, the Basecall_Stats_FC folder, the SampleSheet.csv file, DemultiplexConfig.xml, and Demultiplexed BustardSummary.xml. Reference inputs are annotation (genes) files and FASTA genome files.]

Sequence Files

configureAlignment needs an Unaligned directory as generated by the BCL to FASTQ converter, which contains the gzipped sequence files (*.fastq.gz files).

NOTE: As of CASAVA 1.8, configureAlignment uses FASTQ input files instead of *_qseq.txt files.

For a description of the FASTQ files, see FASTQ Files on page 39.

Configuration File

The configureAlignment configuration file, generally named config.txt, specifies what analysis should be done for each lane. The requirements and options for the configureAlignment configuration file are described in configureAlignment Configuration File on page 54.

Sample Sheet

The SampleSheet.csv file describes the samples and projects in each lane, including the indexes used. It is derived from the user-generated sample sheet that is required for bcl conversion and demultiplexing. The sample sheet should be located in the Unaligned directory of the run folder.

The sample sheet has to match the directory structure created during the bcl conversion a
145. illumina

CASAVA v1.8.2
User Guide
146. is the default value if no port number is specified. The utility nmap, if installed, may help you identify which port on a server is hosting an SMTP service.

5  Test your email reporting by entering the following from the machine where you are running configureAlignment:
telnet yourserver yourPortNumber
If you do not get a friendly message, then email reporting will not work.
You can run runReport.pl directly in test mode by entering:
runReport.pl --test yourserver 25 yourdomain.com anything your.name@yourdomain.com
You should receive a test email. If you do not, the transcript it generates should identify the problem.

NOTE: The optional email reporting feature depends on how your SMTP servers are set up locally. Email reporting is not required to run configureAlignment to a successful completion.

Using Parallelization

"Make" Utilities

Parallelization is built around the ability of the standard "make" utility to execute in parallel across multiple processes on the same computer. configureAlignment also provides a series of checkpoints and hooks that enable you to customize the parallelization for your computing setup. See Customizing Parallelization on page 121 for details.

Standard "Make"

The standard "make" utility has many limitations, but i
147. is the tile Number   lt sample name      lt barcode sequence gt _L lt lane gt _R lt read number gt   lt 0 padded 3 digit  set number name txt       Table 11 Intermediate Output File Descriptions    Output File configureAlignment   Description  Analysis Mode      eland extended txt ANALYSIS eland    Contains the corrected alignment positions and the full  extended alignment descriptions for 232 base reads  This file is not    eland extended txt   ANALYSIS eland  purity filtered   pair      extended contam txt   ANALYSIS eland rna   Alignments to the ELAND RNA GENOME CONTAM     extended splice txt   ANALYSIS eland rna   Alignments to the splice junctions     B2    Table 12 Intermediate Output File Formats    Output File Format  s N TTTT align txt   Deprecated sequence alignment format   s NTTIT Space separated text values   realign txt 1  Sequence  s NITIT 2  Best score  PIA 3  Number of hits at that score  4  The following columns only appear if hits equal 1  a single  unique  match   5  Target pos  6  Strand  7  Target sequence  8  Next best score    Parti 15011196 Rev D    Interpretation of configureAlignment Run Quality    After the analysis of a run is complete  you need to interpret the data in the report  summary and various graphical outputs  This section describes a standard  systematic  way to examine your data     The starting point is to know what a standard run of acceptable quality looks like  This  is something of a moving target and is dependent on individual in
148. ise specified in the sample sheet, samples without index will end up in the project folder Undetermined_indices, and in a sample folder named after the lane (e.g. Sample_lane1).

If you want to specify an analysis for these samples without index other than the global analysis, you can use the identifiers PROJECT Undetermined_indices or SAMPLE lane1.

NOTE: Normally you would want to use
PROJECT Undetermined_indices ANALYSIS none
or
REFERENCE unknown ANALYSIS none
to avoid wasting CPU time on the Undetermined_indices data, which is often of poor quality.

Config.txt Examples

The configureAlignment configuration file, generally named config.txt, specifies what analysis should be done for each lane. Some examples for DNA sequencing analysis are shown below.

Assignment by Lane

If you want to:
Use as reference single FASTA files from human genome build hg18 in your <GenomesFolder>
Align paired-end data from lanes 1, 2, and 3
Use all bases except the last one for both reads

Generate the following config.txt file:
ELAND_GENOME <GenomesFolder>/iGenomes/Homo_sapiens/UCSC/hg18/Sequence/Chromosomes
123:ANALYSIS eland_pair
123:USE_BASES Y*n,Y*n

Assignment by PROJECT

If you instead want to align the samples from your project named Project1, generate the following config.txt file:
ELAND_GENOME <GenomesFolder>/iGenomes/Homo_sapiens/UCSC/hg
149. ismatches.

The alignment score of a read is computed from the p-values of the candidate alignments. The candidate with the highest p-value is the best candidate, and its alignment score is its p-value as a fraction of the sum of the p-values of all the candidates. This is also known as a Bayes' Theorem inversion. The alignment score is expressed on the Phred scale, i.e. Q20 corresponds to a 1% chance of the alignment being wrong, Q30 to 0.1%, etc.

For example, if there are two candidates for a read with p-values 0.9 and 0.3, the alignment score calculation would be as follows:

0.9 / (0.9 + 0.3) = 0.75 (chance of highest scoring alignment being right)
1 - 0.75 = 0.25 (chance of highest scoring alignment being wrong)
Expressed on the Phred scale:
Alignment score = -10 log10(0.25) = 6

NOTE: The alignment score of a read and the p-values of the candidate alignments for the read are not the same. The former is computed from the latter.

Rest-of-Genome Correction

If only one candidate alignment is found, the scoring scheme above would give an infinite Phred score. MAQ deals with this by giving such cases an arbitrary high score of 255. ELANDv2e uses a constant known as the "rest-of-genome correction" that depends on the average base quality of the read, the read length, and the size of the genome. This gives a scoring scheme with the following properties:
Single-candidate alignments for longer reads will score more highly than single-candidate alignments of sho
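The worked alignment score example above (p-values 0.9 and 0.3) can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic as described in the text, not ELAND's implementation; the function name is hypothetical, and the single-candidate case is deliberately rejected because the rest-of-genome correction is not modeled here.

```python
import math


def alignment_score(p_values):
    """Phred-scaled alignment score from candidate-alignment p-values.

    The best candidate's share of the summed p-values is treated as the
    probability that it is the correct alignment.
    """
    total = sum(p_values)
    if total <= 0:
        raise ValueError("p-values must sum to a positive number")
    p_right = max(p_values) / total
    p_wrong = 1.0 - p_right
    if p_wrong <= 0:
        # Happens when only one candidate carries weight; the guide handles
        # this with the rest-of-genome correction, which is not modeled here.
        raise ValueError("score is unbounded without a rest-of-genome correction")
    return -10.0 * math.log10(p_wrong)


if __name__ == "__main__":
    # Two candidates with p-values 0.9 and 0.3, as in the example above.
    print(round(alignment_score([0.9, 0.3])))
```

The unrounded value is about 6.02, which the guide reports as 6.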
150. l   human   ATCACG   desc1   N   R1   name   Proj1
12345AAXX   1   sample2   human   CGATGT   desc2   N   R1   name   Proj1
12345AAXX   2   sample3   rat     TTAGGC   desc3   N   R1   name   Proj2
12345AAXX   2   sample4   mouse   TGACCA   desc4   N   R1   name   Proj3

then this will initiate an eland_pair analysis for all human samples (sample1 and sample2), and use the global analysis eland_rna for all other samples (sample3 and sample4). This allows you to set the analysis, reference genome, and all other ELAND parameters project by project, reference by reference, sample by sample, or barcode by barcode.

Combining Specificity

It is also possible to combine specific analyses, as in this example:
12:REFERENCE human ANALYSIS eland_pair

which tells configureAlignment to perform eland_pair analysis on the human reference samples from lanes 1 and 2.

Priority

If multiple specific settings conflict, configureAlignment uses the following order of priority:
1 PROJECT
2 REFERENCE
3 SAMPLE
4 BARCODE
5 Lane
6 Global settings

This means PROJECT settings override any other settings, while REFERENCE settings can only be overruled by PROJECT settings, and so on.

WARNING: An attribute cannot be set for more than one scope at a time. In other words, the following is not allowed:
PROJECT test BARCODE ACGT ANALYSIS eland_extended

Additional Examples

Some more examples are listed below
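As one further illustration, the order of priority described above can be sketched in Python. This only models the documented ordering (PROJECT first, global settings last); the function name, the scope labels as dictionary keys, and the data layout are assumptions for illustration and do not reflect how configureAlignment stores its configuration.

```python
# Scopes from highest to lowest priority, as listed in the Priority section.
PRIORITY = ["PROJECT", "REFERENCE", "SAMPLE", "BARCODE", "LANE", "GLOBAL"]


def resolve_setting(values_by_scope):
    """Pick the value from the highest-priority scope that defines one.

    values_by_scope maps a scope name to the value set at that scope for
    one dataset, e.g. {"GLOBAL": "eland_rna", "REFERENCE": "eland_pair"}.
    """
    for scope in PRIORITY:
        if scope in values_by_scope:
            return scope, values_by_scope[scope]
    raise KeyError("no scope defines this setting")


if __name__ == "__main__":
    # A human sample: the REFERENCE-level analysis overrides the global one.
    print(resolve_setting({"GLOBAL": "eland_rna", "REFERENCE": "eland_pair"}))
    # A rat sample: only the global analysis applies.
    print(resolve_setting({"GLOBAL": "eland_rna"}))
```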
151. l jobs belonging to the step have finished. Finally, hooks are provided upon completion of the step to issue user-defined external commands.

Parallelization Limitations

The analysis works on a per-file basis, so the maximum degree of parallelization achievable is equal to the total number of files generated during demultiplexing.

However, some parts of configureAlignment operate on a per-lane basis, and a few parts on a per-run basis, which means that scaling will cease to be linear at some stage for more than 8-way parallelization. The ELAND_FASTQ_FILES_PER_PROCESS parameter affects the maximum level of parallelism available for ELAND. If all sequence information is stored in a total of 64 files, a value of 32 will lead to 2 processes, 8 to 8 processes, 4 to 16 processes, etc. These numbers of processes are doubled for paired-end runs.

Memory Limitations

CASAVA requires a minimum of 2 GB RAM per core. The parameter ELAND_FASTQ_FILES_PER_PROCESS in the configureAlignment config.txt specifies the maximum number of tiles aligned by each ELAND process. The optimal value is such that there are approximately 10 to 13 million lines (reads) in one set. For additional information, see Sequence Alignment on page 45.

Reference Files CASAVA

Introduction 124
ELAND Reference Files 125
Variant Detection and Counting Reference Files
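The process counts quoted under Parallelization Limitations (64 files giving 2, 8, or 16 processes) follow from a simple division, sketched below in Python. The function name is hypothetical, and rounding up when the file count is not an exact multiple is an assumption; the guide only gives examples that divide evenly.

```python
import math


def eland_process_count(total_files, files_per_process, paired_end=False):
    """Estimate the number of ELAND processes for a run.

    files_per_process plays the role of ELAND_FASTQ_FILES_PER_PROCESS.
    Rounding up for uneven divisions is an assumption, not documented behavior.
    """
    if total_files < 0 or files_per_process < 1:
        raise ValueError("total_files must be >= 0 and files_per_process >= 1")
    processes = math.ceil(total_files / files_per_process)
    # The guide states that the numbers are doubled for paired-end runs.
    return processes * 2 if paired_end else processes


if __name__ == "__main__":
    for value in (32, 8, 4):
        print(value, eland_process_count(64, value))
```

A smaller ELAND_FASTQ_FILES_PER_PROCESS value therefore raises the available parallelism, at the cost of more, smaller jobs.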
152. l show  ELANDv2   from other as early cycle error rates  If error rates remain fairly constant  genetic with cycle  then the    correct    genome has probably  material sequenced correctly  Non smooth error rate plots or IVC    resulting in an   plots indicate the presence of specific tags or sequences   inability to  align data    Percentage Mismatch Rate of Clusters Passing Filters    This value should be as low as possible  but it is very dependent on read length  If there  is a sudden rise beyond cycle 32  then it is likely that ELANDv2e has effectively filtered   out many clusters with more than two errors  thus suppressing the true error rate up to   this point  The percentage aligning will also be low     IVC htm    For a detailed description of the plots found in the IVC htm file  see IVC Plots on page  76     CASAVA v1 8 2 User Guide o 3    Saoi J  nd ng  JusWubi yainbijuoo    Sequence Alignment    84    Condition Possible Cause    Intensity curves are not smooth  Called intensities are not equal    Cycle to cycle focus or fluidics problems  Poor fluidics or poorly blocked flow cell          Called    may be     5  out without If from cycle 1  initial matrix estimate may  major problems  also be in error    All htm and Mismatch htm    The results in both files should show consistency from tile to tile down a lane and from    lane to lane  if the results are from the same sample     Condition  Tile variability    Rising mismatch rates     Rates will always rise eventu
153. lation is reported as two quality scores:
The first of these scores, Q(indel), expresses the probability that the indel is present in the sample as either a heterozygous or homozygous variant.
The second score, Q(max_gt), expresses the probability that the most probable indel genotype (reported as the value max_gt) is correct.

To accommodate diverse applications, the CASAVA variant caller does not filter out low-confidence calls and thus prints out all indels with a Q(indel) value of 1 or greater. Summary statistics for indels are generated for a subset of higher-confidence indels: by default, any indel with Q(indel) of 20 or greater is summarized in CASAVA's reports. Note that for calls with a very low Q(indel) score, it is possible that the most likely genotype will be "ref", indicating that the indel is not present. This should be interpreted to mean that there is a non-trivial probability of the indel existing as a heterozygous variant at this site, but that the indel is more likely to be absent from the sample than present.

The predicted Q-scores reflect only those error conditions which are represented in the genotype calling model, which is not comprehensive. The model accounts for basecalling error, diploid chromosome sampling, a spurious indel rate, and an approximation of read mapping error. However, note that artifactual indel signatures could still arise due to complex overlapping variants, atypical sample prepa
154. latter including relative orientation and  separation  insert size  of partner read alignments     If the criteria for paired alignment are not met  the subset of tables reporting paired  alignment results are replaced with the statement     Paired alignment not performed      When this happens  CASAVA builds for these paired reads cannot be performed  without first rerunning configureAlignment pl and adjusting parameters such as     min percent unique pairs and   min percent consistent pairs to  produce acceptable paired data and summaries      The following sections are displayed in Additional Paired Statistics   Relative Orientation Statistics    The relative orientation of a pair is the orientation  of read 2 relative to the orientation of read 1  based on the definition that the read 1  orientation is forward  The relative orientation is defined as positive if the read 2  position is greater than the read 1 position   These statistics are given only for those pairs in which both reads were individually  uniquely aligned  since these are the reads used to determine the predominant  relative orientation  Other orientations are considered anomalous and are filtered  out   The symbols used in the column headings are intended as a visual reminder of the  definitions of the four possible relative orientations  In the example below  the  nominal orientation is correctly computed as the two reads    pointing to    each  other  as expected for the standard Illumina short insert p
155. le     The default value is 4     Using DATASET POST RUN COMMAND    DATASET POST RUN COMMAND will be invoked at completion of DATASET  alignment  and may be constructed of a single or multiple shell calls  for latter   separated by semicolon 5   Following variables  derived from SampleSheet  will be  available  please use brackets properly     project    sample    barcode    lane      Assuming we use the following SampleSheet   FCID  Lane  S   ampleID  SampleRef  Index  Description  Control Recipe  Operator  Project   B80 9UWABXX 1TILE DMX 1 humanl  human  GCCAAT  myTest N  32 7 CB  testPRC1  B80 9UWABXX 1TILE DMX 1 humanl  human  CTTGTA  myTest N  32 7 CB  testPRC2  B80 9UWABXX 1TILE DMX  1  phixl phix  TTAGGC  myTest N  32 7 CB  testPRC1   B80 9UWABXX 1TILE DMX  2  humanl  human  GCCAAT  myTest N  32 7 CB  testPRCl  B80 9UWABXX 1TILE DMX  2  phix2  phix  TTAGGC  myTest N  32 7 CB  test PRC2    66 Parti 15011196 Rev D    B809UWABXX 1TILE DMX  3  humanl  human  TGACCA  myTest N 3247 CB  testPRC1    B80 9UWABXX 1TILE DMX  4 phix4 phix  TTAGGC  myTest N  32 7 CB  testPRC3    B80 9UWABXX 1TILE DMX 5 humanl  human  TGACCA  myTest N  32 7 CB  testPRC1  B80 9UWABXX 1TILE DMX 5 human5  human  GCCAAT  myTest N  32 7 CB  testPRC2  B80 9UWABXX 1TILE DMX  6  humanl  human  CGATGT  myTest N  32 7 CB  testPRC1  B80 9UWABXX 1TILE DMX  7  humanl  human  CGATGT myTest N 3247 CB  testPRC1  B80 9UWABXX 1TILE DMX  8  humanl  human  CGATGT  myTest N  32 7 CB  testPRC1  B80 9UWABXX 1TILE DMX  8  
156. ls view -h file.bam

When a BAM file is created for each chromosome, these files are placed in the bam directory immediately under the Parsed chromosome directory. For example, the BAM file for chromosome 1 in a human build would be located here:
Project_Dir/Parsed_NN-NN-NN/c1.fa/bam

When one BAM file is created for the entire genome (using the target bam), it can be found in:
Project_Dir/genome/bam

A set of auxiliary files is created with the whole genome BAM file to facilitate use in downstream packages such as SAMtools or the Broad IGV. These files are:
sorted.bam: the BAM file itself
sorted.bam.bai: index of the BAM file
sorted.bam.fa.gz: gzipped FASTA file containing the reference sequence(s)

For a description of the BAM format, see samtools.sourceforge.net.

The format of SAM files is described in SAM Format on page 169. BAM files are the binary equivalent of SAM files, and Illumina's BAM convention has the following features:
The new private optional tag "XC" has been added to provide read status information normally conveyed in the chromosome field of the export.txt file for unmapped reads. Specifically, "XC:Z:QC" is used to mark an ELAND QC failure read, "XC:Z:RM" is used to mark an ELAND repeat mask read, and "XC:Z:CONTROL" is used to mark a control read. No optional field is added to reads which are marked as no match ("NM") in the export file; it is under
157. KEEP_INTERMEDIARY Option

The option KEEP_INTERMEDIARY tells CASAVA not to delete the intermediary alignment files in the alignment Temp dir after alignment is complete. This is a make option, and needs to be used when you run make. For example:
nohup make ewd y PAIN q genexpr j 3532 all KEEP_INTERMEDIARY yes &

configureAlignment Output Files

The configureAlignment output files contain run information, statistical analysis, sequence information, and alignment information. They are described below.

Figure 12 Run Folder after configureAlignment Analysis
[Figure: the run folder <ExperimentName> (YYMMDD_machinename_XXXX_FC) holds the input files from RTA or OLB (Intensities, BaseCalls), the Unaligned file structure generated by Bcl conversion and demultiplexing (project and sample directories with fastq.gz files, Basecall_Stats_FC, Undetermined_Indices), and the file structure generated by alignment (flow cell Summary.htm, project and sample directories with export.txt.gz files, Summary_Stats_FC, Barcode_Lane_Summary.htm).]

NOTE
There can be only one Aligned directory by de
158. mary input, and does not support the _qseq.txt format. For _qseq.txt files, use an older version of CASAVA, or convert the _qseq.txt format as described in Qseq Conversion on page 159.

Overview

Supporting Software

There are a number of software applications that support CASAVA:
The Off-Line Basecaller (OLB) is an alternative to the on-instrument base calling by RTA.
The Analysis Visual Controller (AVC) provides a GUI for running CASAVA, and is especially convenient for users not proficient with running applications through the Linux command line.
GenomeStudio contains modules for viewing the data analysis results in the genomic context. These modules are the GenomeStudio ChIP Sequencing Module, DNA Sequencing Module, and RNA Sequencing Module.
The Sequencing Analysis Viewer (SAV) allows you to view primary analysis metrics from the sequencing instrument.

To download these applications and their documentation, go to http://www.illumina.com or https://icom.illumina.com.

NOTE
If you do not have an Illumina customer account, register as a new user. It may take up to three business days for initial review of the application.

CASAVA Features

The CASAVA 1.8 package processes sequencing reads provided by RTA or OLB.

CASAVA can generate the following data:
Sample specific reads from multiplexed flow cells
Aligned reads
SNP calls
Indel calls
Expres
159. mation.
-cwd tells the job to run in the current directory.
-v PATH passes the job the path to the executables needed for CASAVA.
-- tells the job to pass everything after the -- to the make command.
-j tells qmake how many tasks to run at the same time. qmake will then submit this number of tasks to the SGE queue. As tasks finish, more tasks will be submitted. The number after the -j should be adjusted depending on the size of the system and the number of users sharing it.

This method uses resources efficiently, but job monitoring and management is harder. If you need to kill a job, you have to kill each of these tasks individually.

Slots Dedicated Upfront

The second method uses a parallel environment where a number of slots are dedicated upfront and the tasks are run on these slots.

1 Move into the output folder.

2 Create a script file which contains the following:
qmake -cwd -v PATH -inherit -- all

3 Submit the jobs to the SGE:
qsub -cwd -v PATH -pe make 32 <script file>

In addition to the options described above, this method uses the following option:
-pe make says to run in the parallel environment; make is the default one. The number after the word make says how many slots the job needs to run. If you set this number too high, you may have to wait a long time for them all to become free. The job will never run if you set it to more slots than you have on your system.
The more slots you use the quicker y
160. matted reports of SNP and indel calls.

assembleIndels Algorithm

The assembleIndels module (Grouper) runs only during paired read DNA CASAVA builds. In CASAVA v1.8, it uses orphan reads and anomalous read pairs to detect indels.

Grouper detects indels in five stages:
1 Compute clusterings of non-aligned "orphan reads".
2 Compute clusterings of anomalous read pairs, with an insert size that is anomalously large (possible deletion) or small (possible insertion).
3 Combine clusters that appear to correspond to the same event.
4 Assemble them into contigs.
5 Align the contigs back to the genome, using the positions of associated "singleton" reads to narrow the search to a couple of thousand bp or so.

Variant Caller Methods

The callSmallVariants module calls SNPs and small indels from the sorted alignment files (sorted.bam) and optionally also from the candidate indel contigs produced by assembleIndels. The procedure is outlined below:
Read in read alignments and candidate indel contigs. Filter out read alignments based on quality checks (paired end anomalies or ELAND alignment score). Filter out contig alignments containing adjacent insertion/deletion events.
Consolidate indel evidence from read and contig alignments to produce a set of candidate indels.
Perform local read realignment using candidate indels.
Call indels based on the set of alignments 
161. n Input Files 26
Running Bcl Conversion and Demultiplexing 32
Bcl Conversion Output Folder 37

Bcl Conversion and Demultiplexing

Introduction

As of CASAVA 1.8, configureAlignment uses FASTQ files as input. Since Illumina sequencing instruments generate .bcl files as primary sequencing output, CASAVA contains a BCL to FASTQ converter that combines these per-cycle .bcl files from a run and translates them into FASTQ files.

NOTE
As of 1.8, CASAVA uses .bcl as primary input, and does not support the _qseq.txt format. For _qseq.txt files, use an older version of CASAVA, or convert the _qseq.txt format as described in Qseq Conversion on page 159.

CASAVA 1.8 can start with bcl conversion and alignment as soon as the first read has been sequenced completely.

In addition to generating FASTQ files, CASAVA uses a user-created sample sheet to divide the run output into projects and samples, and stores these in separate directories. If no sample sheet is provided, all samples will be put in the Undetermined_Indices directory by l
162. n Quality on page 83.

configureAlignment Configuration File

This section describes the features and parameters of the configureAlignment configuration text file.

The configureAlignment configuration file specifies the analysis for each lane, sample, project, reference, or index (barcode). The configureAlignment configuration file is a text file, and the path to the file should be the first argument after the configureAlignment.pl command. configureAlignment translates the analysis in the configuration file into a makefile. The makefile specifies exactly what commands will be executed to carry out the requested analysis.

As part of the creation of the Aligned output folder, the configureAlignment configuration file is copied to the Aligned output folder using the filename config.txt. Some sites use standard configuration files, which may be stored in a central repository.

Config File Parameter List

The following tables list the parameters that can be specified in a configureAlignment configuration file.

The section configureAlignment Parameters Detailed Description on page 61 provides a detailed description of these parameters.

Core Parameters

Table 2 GERALD Configuration File Core Parameters

Parameter: EXPT_DIR /data/110113_ILMN-1_0217_FC1234/Unaligned
Definition: Provide the path to the experiment (demultiplexed) directory in the run folder, if not specified on the command line. Usually the output folder from the BCL to FASTQ  
163. n a read within a window of 1+2*variantsMDFilterFlank positions encompassing the current position. The default value for --variantsMDFilterCount is 2 and for --variantsMDFilterFlank is 20. Set either value to less than 0 to disable the filter.
Example: --variantsMDFilterCount 3

Option: --variantsMDFilterFlank INTEGER
Application: SE, PE
Description: The mismatch density filter removes all basecalls from consideration during SNP calling where greater than --variantsMDFilterCount mismatches to the reference occur on a read within a window of 1+2*variantsMDFilterFlank positions encompassing the current position. The default value for --variantsMDFilterCount is 2 and for --variantsMDFilterFlank is 20. Set either value to less than 0 to disable the filter.
Example: --variantsMDFilterFlank 25

Option: --variantsIndependentErrorModel
Application: SE, PE
Description: This switch turns off all error dependency terms in the SNP calling model, resulting in a simpler model where each basecall at a site is treated as an independent observation.
Example: --variantsIndependentErrorModel

Option: --variantsMinQbasecall INTEGER
Application: SE, PE
Description: The minimum basecall quality used for SNP calling (default is 0).
Example: --variantsMinQbasecall 10

Option: --variantsSummaryMinQsnp INTEGER
Application: SE, PE
Description: The snps.txt files contain all positions where Q(snp) > 0; however, it is expected that only a higher Q(snp) subset of these will be use
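The mismatch density filter described in this entry can be illustrated with a short Python sketch. This is not CASAVA code: the function name `md_filter_mask`, the boolean-list input, and the choice of a window centered on the current position and clipped at the read ends are assumptions made for illustration. Only the defaults (2 and 20), the window size of 1+2*variantsMDFilterFlank, and the "less than 0 disables the filter" rule come from the text.

```python
from typing import List


def md_filter_mask(is_mismatch: List[bool], count: int = 2, flank: int = 20) -> List[bool]:
    """Return one flag per read position: True if the basecall is kept.

    A basecall is dropped when more than `count` mismatches fall inside the
    window of 1 + 2*flank positions around it (assumed centered, clipped to
    the read). A negative `count` or `flank` disables the filter.
    """
    n = len(is_mismatch)
    if count < 0 or flank < 0:
        return [True] * n

    # Prefix sums give the number of mismatches in any window in O(1).
    prefix = [0]
    for mismatch in is_mismatch:
        prefix.append(prefix[-1] + int(mismatch))

    keep = []
    for pos in range(n):
        start = max(0, pos - flank)
        end = min(n, pos + flank + 1)  # exclusive upper bound
        mismatches_in_window = prefix[end] - prefix[start]
        keep.append(mismatches_in_window <= count)
    return keep


# Hypothetical read of 10 bases with mismatches at the first three positions.
read = [True, True, True] + [False] * 7
print(md_filter_mask(read, count=2, flank=1))
```

With the default flank of 20, every window covers this whole 10-base read, so its three mismatches would cause all basecalls in the read to be dropped.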
164. n an iCompute cluster with j = 32; metrics are for reads passing filtering.

Alignment and Mismatch Rates

         v1.7                  v1.8, semi-repeat     v1.8, full-repeat
         % Align  % Mismatch   % Align  % Mismatch   % Align  % Mismatch
Read 1   84.56    0.70         88.29    0.72         90.17    0.73
Read 2   81.92    1.39         85.81    1.44         87.56    1.44

CASAVA v1.8 aligns a higher percentage of reads, with full-repeat alignment performing best. This higher alignment rate results from the improved ability to align in more challenging repeat regions. Remarkably, even with more reads aligned in repeat regions, mismatch rates are still very similar.

CPU Run Time Comparison

                                 v1.7          v1.8, semi-repeat   v1.8, full-repeat
                                 (CPU hours)   (CPU hours)         (CPU hours)
ELAND                            523.28        518.40              855.40
orphanAligner                    N/A           54.17               31.20
PickBestPair/alignmentResolver   200.77        14.67               14.97
produceAlignStats                21.65         12.43               14.55
Other Processes                  25.99         0.17                0.20
Total                            771.69        599.85              916.33

While CASAVA v1.8 provides the highest percentage of aligned reads, this level of performance does require additional computational time (Table 2). For the ELAND step, v1.8 full-repeat resolution takes considerably longer to run than semi-repeat resolution (855 hours versus 520 hours). Therefore, researchers should weigh the higher alignment performance against the slower run time when selecting the type of analysis best suited for their project.

Other algori
165. n the index sequence. This is done the following way for each cluster:

1 Get the raw index for each index read from the bcl file.
2 Identify the appropriate directory for the index based on the sample sheet.
3 Optional: Detect and correct up to one error on the barcode, and identify the appropriate directory. If there are multiple index reads, detect and correct up to one error in each index read.
4 Optional: Detect the presence of adapter sequence at the end of the read. If adapter sequence is detected, mask the corresponding basecalls with N.
5 For each read:
a Write the index sequence into the index field.
b Append the read to the appropriate new FASTQ file in the selected directory.
6 If the index cannot be identified, the data is written into the Undetermined_indices directory, unless the sample sheet specifies a project and sample for reads without index.

Updating Statistics and Reporting

The sample demultiplexer updates the following files:
Generates statistics. While splitting the FASTQ files, CASAVA recalculates the base calling analysis statistics that were computed during base calling for the unsplit lanes. These files (Demultiplex_Stats.htm and IVC.htm) are stored in the Unaligned/Basecall_Stats_FCID folder.
Regenerates the analysis plots for each multiplexed sample.
Updates config.xml for each multiplexed sample.
Copies raw matrix and phasing files.
Updates the sample sheet. The sample demultiplexer strips all the non-relevant indexes fr
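Steps 2, 3, and 6 of the splitting procedure in this entry (look up the index, optionally correct one error, otherwise fall back to Undetermined_indices) can be sketched in Python. This is an illustrative sketch, not the CASAVA implementation: the function names, the use of Hamming distance for "one error", and the rule that an index one mismatch away from several sample sheet entries stays undetermined are all assumptions. The index sequences are taken from sample sheets shown in this guide; the directory strings are placeholders.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_directory(raw_index: str,
                     sheet: Dict[str, str],
                     correct_one_error: bool = True) -> Optional[str]:
    """Map a raw index read to an output directory.

    `sheet` maps each expected index sequence to its directory.
    Returns None when the read belongs in Undetermined_indices.
    """
    # Step 2: exact match against the sample sheet.
    if raw_index in sheet:
        return sheet[raw_index]

    # Step 3 (optional): accept a single mismatch if it is unambiguous.
    if correct_one_error:
        hits = [directory for index, directory in sheet.items()
                if len(index) == len(raw_index) and hamming(index, raw_index) == 1]
        if len(hits) == 1:
            return hits[0]

    # Step 6: the index cannot be identified.
    return None


# Placeholder directories for three indexes in one lane.
lane_sheet = {
    "GCCAAT": "Project_testPRC1/Sample_human1",
    "CTTGTA": "Project_testPRC2/Sample_human1",
    "TTAGGC": "Project_testPRC1/Sample_phix1",
}
print(assign_directory("GCCAAA", lane_sheet))  # one mismatch from GCCAAT
print(assign_directory("AAAAAA", lane_sheet))  # no index within one mismatch
```

Refusing to guess when two indexes are equally close is a conservative design choice: a read left undetermined can be recovered later, while a read assigned to the wrong sample cannot.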
166. nalysis run for a project. It is located in the Aligned/Project_<ProjectName>/Summary_Stats folder and provides an overview of quality metrics for a run per sample, with links to more detailed information in the form of pages of graphs.

The metrics are described below.

Project Summary

The Project Summary contains general project information:
Project Name
Machine name
Run Folder: full path to the run folder
Flow Cell ID
Platform: instrument type
Control Software and version
Primary Analysis software and version
Secondary Analysis software and version

Project Results Summary

This table displays a summary of project-wide performance statistics for the run:
Clusters: Original number of detected clusters.
Clusters (PF): Number of clusters that passed quality filtering.
Yield: The sum of all bases (in Mb) in clusters that passed filtering for the entire project.

Barcode Lane Summary

The Barcode Lane Summary records information about the barcoded samples in each flow cell lane and the analysis that has been specified for them:
Barcode Lane: The identity of the barcoded sample in a lane. The identity has the following format: <SampleName> <Barcode> <Lane>
Sample: Sample name.
Barcode: The sequence of the barcode (index).
Lane
Species: The reference sequence against which the sample was aligned. Depending on the analysis mode, this may be the name of a folder containing one or more se
167. nce files is ELAND_RNA_GENOME_CONTAM.

Running configureAlignment

When running configureAlignment, two concepts are important to understand: the configureAlignment configuration file that specifies the analysis, and the make utility that manages the analysis.

configureAlignment Configuration File

configureAlignment uses a text-based configuration file containing all parameters required for alignment, visualization, and filtering. These parameters specify the type of analysis to perform, which bases to use for alignment, and the reference files for a sequence alignment. Analysis can be specified by lane, index (barcode), sample, reference, or project.

Make Utility

configureAlignment is a collection of Perl scripts and C++ executables, and is managed by the "make" utility. The "make" utility is commonly used to build executables from source code and is designed to model dependency trees by specifying dependency rules for files. These dependencies are stored in a file called a makefile. The configureAlignment.pl script is used to generate a makefile config containing variable definitions, which uses static makefiles as required. These static makefiles, including the main Makefile, have fixed content and so can be included in the distribution and do not have to be regenerated for every run.

When running configureAlignment, the configureAlignment configuration file specifies the analysis, and the make utility manages th
168. nd demultiplexing. If you need to change the sample sheet, it is best to rerun the bcl conversion and demultiplexing.

DemultiplexedBustardConfig.xml File

The base calling configuration file (DemultiplexedBustardConfig.xml) in the demultiplexed directory includes the start and end cycles of each read. The DemultiplexedBustardConfig.xml file is derived from the Config.xml file generated during base calling, but renamed and moved by the BCL to FASTQ converter.

Reference Genome

CASAVA uses a reference genome in FASTA format. Both single-sequence FASTA and multi-sequence FASTA genome files are supported.

Genome sequence files for most commonly used model organisms are available through iGenome (Getting Reference Files on page 128).

NOTE
As of CASAVA 1.8, you do not need to squash the reference genome anymore.

Single-Sequence FASTA Files

CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name.

Direct CASAVA to a folder containing the FASTA files using the option ELAND_GENOME for configureAlignment.

Multi-Sequence FASTA Files

As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single genome (SAM-compliant, unsquashed) file, for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in 
169. nd the DemultiplexedBustardConfig.xml is not there.

Export to SAM Conversion

Introduction 168
SAM Format 169

Introduction

CASAVA 1.8 provides two SAM/BAM conversion pathways:

Running the post-alignment "sort" and "bam" modules (see Targets on page 96). Running the post-alignment sort and bam modules together offers sorting, PCR duplicate removal, indexing, and automatic chromosome renaming options, and by default it will write out a reference sequence file with chromosome labels that have been synchronized to the labels used in the BAM file. If the sort module is run in "archival" mode, the BAM file created will contain all of the reads provided in the export.txt.gz files given as input.

The illumina_export2sam.pl script. The illumina_export2sam.pl script provides basic conversion from export to SAM format without sorting, duplicate removal, conversion to BAM format, or indexing. This script is intended to be used as one component in a custom post-alignment pipeline. Users desiring a "turn-key" BAM creation method (e.g. to rapidly view reads in IGV) are encouraged to use the post alignment
170. ndard naming format for _export.txt.gz files is <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_export.txt.gz, as in NA10831_ATCACG_L001_R1_001_export.txt.gz. The _export.txt.gz files are saved as compressed gzipped files. The content of the _export.txt.gz files is described below; not all fields are relevant to a single-read analysis.

NOTE
The old Illumina-specific transformation (ASCII offset of 64) will still be used in the export files, but export.txt.gz is meant to be an internal file format.

1 Machine: Parsed from run folder name.
2 Run Number: Parsed from run folder name.
3 Lane
4 Tile
5 X Coordinate of cluster: As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded.
6 Y Coordinate of cluster: As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded.
7 Index sequence or 0: For no indexing, or for a file that has not been demultiplexed yet, this field shoul
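The naming format given at the start of this entry lends itself to a small parser. The following Python sketch is illustrative only: the function name `parse_export_name` and the regular expression are assumptions derived from the stated pattern and from the example name NA10831_ATCACG_L001_R1_001_export.txt.gz, and it assumes the barcode field itself contains no underscore.

```python
import re
from typing import Dict, Optional, Union

# <sample name>_<barcode sequence>_L<lane>_R<read number>_<3-digit set number>_export.txt.gz
EXPORT_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d+)_R(?P<read>\d+)"
    r"_(?P<set>\d{3})_export\.txt\.gz$"
)


def parse_export_name(filename: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split an export file name into its fields, or return None if it does not fit."""
    match = EXPORT_NAME.match(filename)
    if match is None:
        return None
    return {
        "sample": match.group("sample"),    # may itself contain underscores
        "barcode": match.group("barcode"),
        "lane": int(match.group("lane")),   # "001" -> 1
        "read": int(match.group("read")),
        "set": int(match.group("set")),
    }


print(parse_export_name("NA10831_ATCACG_L001_R1_001_export.txt.gz"))
```

Because the sample name is matched greedily and the remaining fields are anchored to the end of the name, sample names that contain underscores are still parsed correctly.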
171. ngle FASTA format.

Path to transcripts mapping to the genome (refFlat.txt.gz or seq_gene.md.gz). See also Using ANALYSIS eland_rna on page 70.

The group label above specifies which assembly to use in the seq_gene file, and is found in column 13 of the file. seq_gene files can hold entries for multiple assemblies.
Example: ELAND_RNA_GENE_MD_GROUP_LABEL GRCh37.p2-Primary Assembly

KAGU_PARAMS passes options to the alignmentResolver through the configureAlignment configuration file. For additional information, see KAGU_PAIR_PARAMS and KAGU_PARAMS on page 65.

Paired End Analysis Options

Table 4 configureAlignment Configuration File Paired End Analysis Options

Parameter: ANALYSIS eland_pair
Definition: Use the paired-end alignment mode of ELANDv2e to align paired reads against a target.

Parameter: USE_BASES
Definition: Use all bases of the first read and ignore the first and last base of the second read.

Parameter: 6:USE_BASES
Definition: Ignore the first base on both the first and second read of lane 6, use 25 bases each, and ignore any other bases, for lane 6 only.

Parameter: KAGU_PAIR_PARAMS
Definition: KAGU_PAIR_PARAMS passes options for paired-end runs to the alignmentResolver through the configureAlignment configuration file. For additional information, see KAGU_PAIR_PARAMS and KAGU_PARAMS on page 65.

For more information on USE_BASES syntax, see USE_BASES Option on page 62.

Specifying Analysis

Analysis can be specified by project, reference, sam
172. ns

Frequently asked questions are available online. Go to http://www.illumina.com/FAQs, and click on Software.

Reporting Problems

When reporting an issue, it is critical to capture all the output and error messages produced by a run. This is done by redirecting the output using "nohup" or the facilities of a cluster management system. For an explanation of "nohup", see Nohup Command on page 24.

Provide a description of the error / bug / feature, along with the following information, if available.

Demultiplexing / Bcl Conversion:
The configureBclToFastq.pl command line
nohup.out from the make execution
SampleSheet.csv
support.txt file in the Unaligned folder

Alignment:
The configureAlignment.pl command line
nohup.out from the make execution
config.txt
support.txt file in the Aligned folder

Variant Detection / Counting:
The command line
CASAVA log
conf/project.conf

Interpretation of Run Quality

Introduction 12
Quality Tables and Graphs 13

Introduction

Before beginning a secondary analysis, you should assess the performance metrics of a sequencing run. This can help reveal any issues which may affect the secon
173. Bcl Conversion and Demultiplexing

Sample Sheet

The sample sheet (SampleSheet.csv file) tells the software how to assign reads to samples, and samples to projects. The sample sheet specifies for every index in every lane which sample and which project it belongs to. Lanes with samples that were not indexed can also be assigned to samples and projects using the sample sheet. Projects can consist of multiple samples, and samples can consist of multiple lanes and multiple indexes.

The sample sheet contains the following columns:

Column          Description
FCID            Flow cell ID
Lane            Positive integer, indicating the lane number (1-8)
SampleID        ID of the sample
SampleRef       The name of the reference
Index           Index sequence(s)
Description     Description of the sample
Control         Y indicates this lane is a control lane, N means sample
Recipe          Recipe used during sequencing
Operator        Name or ID of the operator
SampleProject   The project the sample belongs to

NOTE
The column SampleProject is new in CASAVA 1.8 and links samples to projects.

Every project in the sample sheet is linked to a corresponding project directory. Each sample belonging to that project is linked to a corresponding sample directory within that project directory. Reads are stored in the FASTQ files located in the projec
174. o any alternate indels at the same site. The relative probabilities of these alignments for each read are used to call the indel's genotype and calculate the associated quality score.

Candidate Indel Search

For the first stage of indel calling, candidate indels are identified from two sources of evidence:
The first source is small indels already present in the input reads in the form of gapped alignments.
The second source is alignments of locally assembled contigs to the reference, provided by the assembleIndels module.

Every indel present in a conventional read alignment or assembleIndels contig is stored in a pool of potential indels.

Support for each of these potential indels is measured as the number of read alignments which contain the indel. These alignments may be from the primary alignment or from reads used by Grouper to assemble each contig. If the number of reads supporting a potential indel is less than 3 or less than 2% of the total depth at the indel site, the indel cannot become a candidate. Additionally, for short indels (of length 4 or less), if the number of supporting reads is less than 10% of the total depth, the indel cannot become a candidate. These cases are retained as "private" indels in the read alignments which support them. All other potential indels become candidate indels, subject to realignment and indel calling.
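The support thresholds above can be written as a single predicate. The Python sketch below is illustrative and is not taken from CASAVA: the function name and argument names are assumptions, while the three thresholds (fewer than 3 reads, less than 2% of depth, and less than 10% of depth for indels of length 4 or less) are the ones stated in the text. Integer arithmetic is used to avoid floating-point comparisons at the boundaries.

```python
def is_candidate_indel(supporting_reads: int, total_depth: int, indel_length: int) -> bool:
    """Return True if a potential indel has enough support to become a candidate.

    Potential indels that fail this test stay "private" to the read
    alignments that contain them.
    """
    # Fewer than 3 supporting reads: never a candidate.
    if supporting_reads < 3:
        return False
    # Less than 2% of the total depth at the indel site.
    if supporting_reads * 100 < 2 * total_depth:
        return False
    # Short indels (length 4 or less) must also reach 10% of the depth.
    if indel_length <= 4 and supporting_reads * 100 < 10 * total_depth:
        return False
    return True


# Hypothetical sites: (supporting reads, total depth, indel length)
for site in [(2, 10, 1), (3, 20, 2), (3, 100, 2), (3, 100, 8), (3, 200, 8)]:
    print(site, is_candidate_indel(*site))
```

For example, three supporting reads at depth 100 are enough for an 8-base indel (3% of depth) but not for a 2-base indel, which needs at least 10 reads at that depth.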
175. of the variant caller algorithm, see Variant Detection on page 141.

Variant Detection Input Files

The variant detection and counting input files come from the configureAlignment module, using the following eland modules:
eland_extended (configureAlignment Input Files on page 48) for single-read DNA sequencing projects
eland_pair (Using ANALYSIS eland_pair on page 69) for paired-end DNA sequencing projects
eland_rna (Using ANALYSIS eland_rna on page 70) for single-read RNA sequencing projects (paired-end RNA sequencing projects are not supported)

The configureAlignment input files for CASAVA variant detection can be found in the Aligned directory of the run folder, and are described below.

In addition, CASAVA variant detection and counting uses annotation files (genome sequence files and the refFlat.txt.gz or seq_gene.md.gz file).

Figure 14 CASAVA Input Files
[Figure: the run folder <ExperimentName> (YYMMDD_machinename_XXXX_FC) with the Aligned directory, which holds project and sample directories with export.txt files, the config.xml file, and pair.xml files (only for paired read alignments); and the genome directory, which holds for each species the genome FASTA files and the genomesize.xml file.]

export.txt.gz Files

The export.txt.gz files contain the aligned sequence information from the configureAlignment module, and are required.

The export.txt.gz files are tab-delimited text files; for a detailed description, see Output File Formats.

run.conf.xml

The run.conf.xml fil
176. oject directory called Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane.
No multiplexed samples present, with sample sheet: Reads are placed within the directory structure directed by the sample sheet, based on the lane information.
No multiplexed samples present, without sample sheet: Reads are placed in a project directory named after the flow cell, and sample directories based on the lane number.

Bcl Conversion As You Go

Bcl conversion supports alignment of the first read of a paired-end run before completion of the run (align as you go). You can kick off Bcl conversion for read 1 using the target r1 when running make at any time after the last read has started (for multiplexed runs, this is after completion of the indexing read). You can also start alignment using the target r1 when running make for configureAlignment, or you can use the POST_RUN_COMMAND_R1 variable to automatically start the alignment of read 1 at the end of the Bcl conversion.

For instructions, see Starting Bcl Conversion for Read 1 on page 35.

Demultiplexing Methods

Demultiplexing involves splitting the FASTQ files and updating the statistics and reporting files. This section describes these two steps.

Splitting FASTQ Files

The first step of demultiplexing in CASAVA is splitting the base call files, based o
177. om the original sample sheet and places the stripped-out version in the appropriate directory.
Creates the Demultiplex Stats FCID csv file in the Unaligned folder to indicate in which subdirectory each index has been written.

For a description of these files, see Bcl Conversion Output Folder on page 37.

Bcl Conversion Input Files

Demultiplexing needs a BaseCalls directory and a sample sheet to start a run. These files are described below. See also the image below.

NOTE
For installation instructions, see Requirements and Software Installation on page 111.

Figure 10 Bcl Conversion Input Files
[Figure: items shown are the run folder <ExperimentName> (YYMMDD_machinename_XXXX_FC), Data, Intensities (config.xml file, per-lane directories such as L001), BaseCalls (config.xml file, per-lane and per-cycle directories with .bcl and .stats files), the RunInfo.xml file, and the SampleSheet.csv file.]

Folder and File Naming

The top-level run folder name is generated using three fields to identify the <ExperimentName>, separated by underscores. For example:
YYMMDD_machinename_NNNN

You should not deviate from the run folder naming convention, as this may cause the software to stop.

1 The first field is a six-digit number specifying the date of the run. The YYMMDD ordering ensures that a numerical sort
178. on counts, splice junction counts and gene counts can be used to determine gene expression levels and expressed splice variants.

TIP
As long as a gapped alignment is performed, small indels (up to 10 nucleotides) can be called from RNA Sequencing builds.

NOTE
RNA Sequencing only supports single-read runs.

Post Alignment Workflow

The CASAVA workflow for variant detection and counting is illustrated below.

Figure 13 CASAVA Variant Detection and Counting Workflow (diagram: export.txt files, sort, chromosome BAM files, indelAssembler, smallVariantCaller, rnaCounts, SNP and indel txt files, counts and genotypes files, whole genome BAM files, stats reports, summary tables htm, post-sort BAM for third-party tools)

CASAVA has a number of changes in the way files are handled in the post-alignment workflow:

CASAVA 1.8 operates entirely on BAM files after the sort module in the post-alignment workflow has completed; sorted.txt files are no longer created or stored. This significantly reduces the build size: the combined changes in the new variant caller and BAM files for CASAVA reduce the human DNA Sequencing post-alignment build size by 75-80%.

Archival mode: CASAVA can be run so that all input reads are retained in the build in their entirety. Variant calling and RNA counting results are identical in  
179. onfigure --help

Setting Up Email Reporting

The script Gerald runReport.pl is called at the end of a run and sends you an email when a run successfully completes.

To use email notification, set up an SMTP server and set the following parameters in the configureAlignment configuration file. For additional information, see configureAlignment Configuration File on page 54.

1 Enter a space-separated list of the email addresses that should receive the run completion notification:
EMAIL_LIST your.name@domain.com that.name@domain.com

2 Indicate the path to the Aligned folder. The software assumes it can create a valid URL from the Aligned folder path by omitting a number of leading path elements, as specified by NUM_LEADING_DIRS_TO_STRIP (by default two), and prepending WEB_DIR_ROOT:
WEB_DIR_ROOT http://server/SHARE
For example, if the path is /mnt/yourDrive/folder/folder/Aligned and WEB_DIR_ROOT is http://server/SHARE, the software will write the links as http://server/SHARE/folder/Aligned/File.htm.

3 Identify your domain. Your SMTP server may refuse to accept emails from or send emails to addresses that do not end in @yourdomain.com:
EMAIL_DOMAIN yourdomain.com

4 Identify your IP address:
EMAIL_SERVER yourserver:25
where yourserver is the name or IP address of a mail server that will accept SMTP email requests from you and 25 is the port number of the SMTP service on that server. Generally this will be 25. This 
180. operator
SampleProject: The project the sample belongs to

You can generate it using Excel or another text-editing tool that allows .csv files to be saved. Enter the columns specified above for each sample, and save the Excel file in the .csv format. If the sample you want to specify does not have an index sequence, leave the Index field empty.

Illegal Characters

Project and sample names in the sample sheet cannot contain characters that are not allowed by some file systems. The characters not allowed are the space character and the following:    TUN KA NGA ERA UU LM

Multiple Index Reads

If multiple index reads were used, each sample must be associated with an index sequence for each index read. All index sequences are specified in the Index field. The individual index read sequences are separated with a hyphen character (-). For example, if a particular sample was associated with the sequence ACCAGTAA in the first index read, and the sequence GGACATGA in the second index read, the index entry would be ACCAGTAA-GGACATGA.

Samples Without Index

As of CASAVA 1.8, you can assign samples without index to projects, SampleIDs, or other identifiers by leaving the Index field empty.

Running Bcl Conversion and Demultiplexing

Bcl conversion and demultiplexing is performed by one script, configureBclToFastq.pl. This section des
181. or large projects, such as human genome resequencing, we recommend using highly distributed disk storage (like Lustre or Isilon).

The space requirements for ELAND temporary files inside the Aligned directory, as long as you stay at <13M reads per eland set size, are as follows:

Eight bytes per match. This should equate to less than 0.6 GB per process.
This is less than 5 GB for 8 ELAND processes.

If /tmp space is an issue, perform the following:
Increase space for /tmp.
Decrease ELAND_FASTQ_FILES_PER_PROCESS (see ELAND_FASTQ_FILES_PER_PROCESS on page 63). Setting the right value for ELAND_FASTQ_FILES_PER_PROCESS is very important, because a value that is too low may result in decreased performance.

Memory Requirements

CASAVA requires a minimum of 2 GB RAM per core. The parameter ELAND_FASTQ_FILES_PER_PROCESS in the configureAlignment config.txt specifies the maximum number of files aligned by each ELAND process. The optimal value is such that there are approximately 10 to 13 million lines (reads) in one set.

NOTE
Peak memory usage occurs during the ELANDv2e portion of configureAlignment.

Software Requirements

CASAVA has been primarily developed and tested on CentOS 5, Illumina's recommended and supported platform. It may be possible to install and run CASAVA on other 64-bit Linux distributions, particularly on similar dis
182. or the percentage of molecules in a cluster for which sequencing falls behind the current position (cycle) within a read.

% Prephasing: The estimated (specification is not recommended) value used for the percentage of molecules in a cluster for which sequencing jumps ahead of the current position (cycle) within a read.

% Mismatch Rate (raw): The percentage of called bases in aligned reads from all detected clusters that do not match the reference.

% PF Clusters: The percentage of clusters that passed filtering.

Cycle 2-4 Av Int (PF): The intensity averaged over cycles 2, 3, and 4 for clusters that passed filtering.

Cycle 2-10 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 2-10 (derived from a best fit straight line for log intensity versus cycle number).

Cycle 10-20 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 10-20 (derived from a best fit straight line for log intensity versus cycle number).

% Align (PF): The percentage of reads passing filter that were uniquely aligned to the reference.

% Mismatch Rate (PF): The percentage of called bases in aligned reads passing filter that do not match the reference.

% Q30 bases (PF): Yield of bases with Q30 or higher from clusters passing filter divided by total yield of clusters passing filter.

Mean Quality Score (PF): The total sum of quality scores of clusters passing filter divided by total yield of clusters passing filter
183. ormally inferred during alignment
0x0004  4     The query sequence itself is unmapped
0x0008  8     The mate is unmapped
0x0010  16    Strand of the query (0 for forward; 1 for reverse strand)
0x0020  32    Strand of the mate
0x0040  64    The read is the first read in a pair
0x0080  128   The read is the second read in a pair
0x0100  256   The alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200  512   The read fails platform/vendor quality checks
0x0400  1024  The read is either a PCR duplicate or an optical duplicate

NOTE
The bitwise flag means that if multiple conditions are true, the values are added, and only the total value is reported. For example, if a read is paired in sequencing (value 1), the mate is unmapped (value 8), and the read is the first read in a pair (value 64), a total of 1 + 8 + 64 = 73 is reported.

Extended CIGAR Format

A CIGAR string consists of a series of operation lengths plus the operations. The conventional CIGAR format allows for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format further allows four more operations, as is shown in the following table, to describe clipping, padding and splicing.

Operation   Description
M           Alignment match (can be a sequence match or mismatch)
I           Insertion to the reference
D           Deletion from the reference
N           Skipped region from the referen
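The bitwise flag arithmetic described in the note above (1 + 8 + 64 = 73) can be reversed to recover the individual conditions from a reported total. The following is a minimal Python sketch, not part of CASAVA: the helper name decode_flag is hypothetical, the descriptions are shortened from the flag table, and the wording for 0x0001 and 0x0002 is assumed from the SAM specification.

```python
# Flag bits from the SAM flag table. The 0x0001 and 0x0002 descriptions are
# assumptions based on the SAM specification, not quoted from this guide.
SAM_FLAGS = {
    0x0001: "read is paired in sequencing",
    0x0002: "read is mapped in a proper pair",
    0x0004: "query sequence is unmapped",
    0x0008: "mate is unmapped",
    0x0010: "query is on the reverse strand",
    0x0020: "mate is on the reverse strand",
    0x0040: "first read in a pair",
    0x0080: "second read in a pair",
    0x0100: "alignment is not primary",
    0x0200: "read fails platform/vendor quality checks",
    0x0400: "PCR or optical duplicate",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag value."""
    return [text for bit, text in SAM_FLAGS.items() if flag & bit]


if __name__ == "__main__":
    # The example from the note: 1 + 8 + 64 = 73.
    for condition in decode_flag(73):
        print(condition)
```

Because each condition occupies its own bit, adding the values and testing them with a bitwise AND are exact inverses, so no information is lost in the single reported number.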
184. our job will run (up to the parallelization limit), but the correct number to use depends on how big the system is, the number of other users, and the number of jobs you want to run at any one time.

This method can have some inefficiency if there are fewer tasks than slots at any point, but it allows easy job monitoring and management. If you need to kill your job, this is much easier with this method.

When you submit the job, the command will return the SGE job id. You can get information about the state of your job with qstat -j <job id> or by viewing it with qmon.

Customizing Parallelization

Many parts of configureAlignment are intrinsically parallelizable by lane or tile. However, some parts of configureAlignment cannot be parallelized completely. configureAlignment has a series of additional hooks and check points for customization.

configureAlignment can be divided into a series of steps with different levels of scalability, where synchronization "barriers" cause configureAlignment to wait for each of the tasks within a step to finish before going to the next step.

You can parallelize the steps at the run level (no parallelization), the lane level (up to eight jobs in parallel), and the tile level (up to thousands of jobs in parallel). Each step is initiated by a "make" target. After completion of each of these steps, configureAlignment produces a file or a series of files (at the lane/tile level) that determines whether al
185. ple, index, or lane, which is explained in this section.

Lane Specific Analysis

By adding the lane number(s) followed by a colon in front of an analysis option, you state that the analysis option is only for samples from that lane. The lane number is only valid for the configureAlignment settings on that same line.

For example, 567:ANALYSIS eland_extended tells configureAlignment that eland_extended should be run on samples from lanes 5, 6, and 7.

Sample Specific Analysis

The config.txt file has some keywords that enable you to specify analysis for project, reference, sample, or index: PROJECT, REFERENCE, SAMPLE, and BARCODE. These keywords refer to the SampleProject, SampleRef, SampleID, and Index specified in the samplesheet.csv file located in the Unaligned directory of the run folder.

Lines starting with PROJECT, REFERENCE, SAMPLE, and BARCODE override any default settings specified in the config.txt file, but only for those samples for which the SampleProject, SampleRef, SampleID, or Index matches the PROJECT, REFERENCE, SAMPLE, or BARCODE. The override is only valid for the configureAlignment settings on that same line.

Example Sample Specific Analysis

For example, if the config.txt file describes the following analysis:
ANALYSIS eland_rna
REFERENCE human ANALYSIS eland_pair
with the following sample sheet:

FCID  Lane  SampleID  SampleRef  Index  Description  Control  Recipe  Operator  SampleProject
12345AAXX  1  sample
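The lane prefix rule from the Lane Specific Analysis section above (lane digits, then a colon, then the setting) can be illustrated with a small parser. This is a sketch only: the function name split_lane_prefix is hypothetical, it is not how configureAlignment reads config.txt, and it assumes that each digit before the colon names one lane.

```python
def split_lane_prefix(line):
    """Split an optional lane prefix such as '567:' from a config.txt setting.

    Returns (lanes, setting). lanes is a sorted list of lane numbers, or None
    when the line has no lane prefix and therefore applies to every lane.
    """
    prefix, colon, rest = line.partition(":")
    if colon and prefix.isdigit():
        # Assumption: every digit in the prefix is a separate lane number.
        return sorted({int(digit) for digit in prefix}), rest.strip()
    return None, line.strip()


if __name__ == "__main__":
    print(split_lane_prefix("567:ANALYSIS eland_extended"))
    print(split_lane_prefix("ANALYSIS eland_rna"))
```

Checking that the prefix consists only of digits keeps settings whose values contain a colon, such as a server name with a port number, from being mistaken for lane-specific lines.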
186. possible alignments to the genome and splice junctions, then the read is marked as "RM" and discarded as above.

3 If there is no alignment to either the contaminants, the genome or the splice junctions, then the read is marked as "NM" (for "not matched").

Multiseed, Repeat Alignment

ANALYSIS eland_rna performs the following alignment features implemented in ELANDv2 and ELANDv2e:

By default performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately.
Aligns reads in repeat regions using two new modes: semi-repeat resolution and full-repeat resolution. Full-repeat resolution is more sensitive and places more reads in repeat regions, but will result in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full-repeat resolution can be turned on with the option INCREASED_SENSITIVITY.

Running an eland_rna Analysis

The configureAlignment configuration file specifies how the sequences from a flow cell are processed, which is described in configureAlignment Configuration File on page 54. The ANALYSIS parameter within the configureAlignment configuration file specifies what analysis to perform on the sequences; you will need to set up this parameter the following way (example shown):
ANALYSIS eland_rna
ELAND_GENOME /data/Genome/ELAND/hg18/
ELAND_RNA_GENOME_ANNOTATION /data/Genome/ELAND_RNA/Human/refFlat.txt.gz
ELAND_RNA_GENOME_CONTAM /data/Genome/ELAND_RNA/Human/MT_Ribo_Filter  
187. put for configureAlignment. The files are located in the Unaligned/Project_<ProjectName>/Sample_<SampleName> directories.

NOTE
Reads that were identified as sample prep controls in the control files are not saved in the FASTQ files.

Naming

Illumina FASTQ files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
For example, the following is a valid FASTQ file name:
NA10831_ATCACG_L002_R1_001.fastq.gz

In the case of non-multiplexed runs, <sample name> will be replaced with the lane numbers (lane1, lane2, ..., lane8) and <barcode sequence> will be replaced with "NoIndex".

Set Size

The FASTQ files are divided in files with the file size set by the --fastq-cluster-count command line option of configureBclToFastq.pl. The different files are distinguished by the 0-padded 3-digit set number.

If you need to generate one unique fastq gzipped file for use in a third-party tool, you can set the --fastq-cluster-count option to 0.

Compression

FASTQ files are saved compressed in the GNU zip format, an open source file compression program. This is indicated by the .gz file extension. CASAVA automatically unzips the files before using them.

Format

Each entry in a FASTQ file consists of four lines:
Sequence identifier
Sequence
Quality score identifier line (consisting of a +)
Quality 
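The naming scheme in the Naming section above can be checked with a short Python sketch that builds and parses such file names. The function names fastq_name and parse_fastq_name are hypothetical and not part of CASAVA; the sketch assumes the fields are joined by underscores, that the read number is a single digit, and that the barcode field itself contains no underscore.

```python
import re

# Pattern for <sample>_<barcode>_L<lane, 3 digits>_R<read>_<set, 3 digits>.fastq.gz
# Assumption: underscores separate the fields and the barcode has none itself.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>\d)_(?P<set>\d{3})\.fastq\.gz$"
)


def fastq_name(sample, barcode, lane, read, set_number):
    """Build a file name with the lane and set number 0-padded to 3 digits."""
    return f"{sample}_{barcode}_L{lane:03d}_R{read}_{set_number:03d}.fastq.gz"


def parse_fastq_name(name):
    """Return the fields of a FASTQ file name as a dict, or None if it does not match."""
    match = FASTQ_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("lane", "read", "set"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    name = fastq_name("NA10831", "ATCACG", lane=2, read=1, set_number=1)
    print(name)
    print(parse_fastq_name(name))
```

For a non-multiplexed run the same functions apply with the lane name as the sample field and NoIndex as the barcode field, for example fastq_name("lane1", "NoIndex", 1, 1, 1).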
188. quence files or the name of an individual file. The acceptable file formats also depend on the analysis mode.
Analysis Type: Contains the analysis mode for reads from this lane.
Length: The number of bases used per read (excluding any bases masked out using USE_BASES). Where multiple reads are produced per cluster and a distinction is maintained between them during analysis (as in eland_pair analysis of paired-end reads), their respective lengths will be listed.
Num Tiles: The number of tiles from the lane that are used in the analysis.
Genome Directory: Full path to the genome directory.

Sample Results Summary

This table displays basic data quality metrics for each sample, displayed on the Summary sample page:
Sample Yield: The sum of all bases (in Mb) in clusters that passed filtering for the sample.
Clusters (raw): The number of clusters detected by the image analysis module.
Clusters (PF): The number of detected clusters that meet the filtering criterion.
1st Cycle Int (PF): The average of the four intensities (one per channel or base type) measured at the first cycle averaged over filtered clusters.
% Intensity after 20 cycles (PF): The corresponding intensity statistic at cycle 20 as a percentage of that at the first cycle.
% PF Clusters: The percentage of clusters passing filtering.
% Align (PF): The percentage of reads passing filter that were uniquely aligned to the reference. For eland_rna it is number of PF reads aligned to 
189. r example, files containing these sequences:
The chrM.fa file from the genome folder.
A ribosomal sequences FASTA file. You will need to find it for your genome of interest, for example, from GenBank.
A 5S RNA FASTA file (optional). You will need to find it for your genome of interest, for example, from GenBank.
A contaminants file. You can use the same newcontam.fa file as for human, mouse or rat.
You do not need to have all of the files listed above, but you need at least one file for eland_rna to work properly. You can add other abundant sequences FASTA files if desired.

NOTE
Abundant sequence files need to be single FASTA files (no multi-FASTA allowed).

Algorithm Descriptions

Introduction
ELANDv2 and ELANDv2e
Variant Detection
readBases Counting Method

Introduction

This appendix explains the algorithms used in CASAVA for the following functions:
Alignment using ELAND
Indel detection and small variant genotyping
RNA sequencing counting methods

ELANDv2 and ELANDv2e

Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) is a very fast aligner and should be used to match a large numb
190. rFinder separately clusters each type of anomalous read. The resulting clusters are labeled in the output file as:
Shadow/SemiAligned (orphan/semi-aligned)
DeletionPair (insert size anomalously large)
InsertionPair (insert size anomalously small)

ClusterMerger: This stage combines clusters of different types above that appear to correspond to the same event. One anticipated case is that of two Shadow/SemiAligned clusters and a DeletionPair cluster corresponding to the same (medium or larger scale) deletion. The currently supported merging mechanism is the combination of clusters of different types that share reads. This is possible as a read may be detected as being both SemiAligned and one partner in an anomalously mapped read pair. Apart from its role in merging related clusters, this step also ensures that reads are not multiply represented in the subsequent assembleIndels stages and downstream analysis.

SmallAssembler: SmallAssembler takes the output of ClusterMerger and assembles clusters of reads into contigs. It uses an approach based on k-mer hashing and a de Bruijn graph. If a read is successfully assembled into a contig, the read's alignment details are updated to describe its position in the contig.

AlignContig: AlignContig does a dynamic programming alignment of contig to genome.

Variant Caller Methods

The callSmallVariants module calls SNPs and small indels from both the sorted alignment files (sorted.bam) and optionally also from t
191. raph for a more diverse sample. Note the low diversity for cycles 102-109; this was a multiplexed sample and these are the index read cycles, so this is normal.

Figure 5 Proper Diversity Samples (SAV screen shot: Data By Cycle, Lane 1, Both Surfaces)

Cluster Density

The figure below shows a screen shot from SAV displaying cluster densities for lanes 1-8 of a flow cell. The cluster density of lanes 7 and 8 is very low; if any of these lanes is set as the control lane for the run, you might need to repeat basecalling (using OLB) with a more successful control lane. Note that the raw cluster density for lane 1 is too high, resulting in a lower percentage of clusters passing filter (the green box), although the total number of clusters passing filter is still acceptable.

Figure 6 Cluster Density (SAV screen shot: Data By Lane)

Fluidics Leak

The figure below depicts a flow cell with spatial variability in intensity. Typically, we would expect intensity to be nearly even within each lane. This variability might indicate a fluidics issue such as a large volume of bubbles moving through the flow cell.

Figure 7 Fluidics Leak

Flowcell Chart  
192. ration, large scale structural variants or other phenomena not accounted for by the model. The Q score provided by the model should be interpreted with respect to these limitations.

Homopolymers

The indel calling model accounts for the probability of a spurious indel error as a function of homopolymer length and indel type. This spurious indel correction causes simple expansions and contractions of homopolymers to be predicted as less likely as homopolymer length increases. The spurious indel error probabilities are calculated from empirical observations. There is an option available in the small variant caller to replace these values with a single constant indel error probability to be used for all homopolymer lengths and indel types.

Overlapping Indels

Note that the model handles overlapping indels in an approximate fashion, by evaluating the probability of each indel allele compared to either the reference or any other indel allele at the same site. Thus it does not explicitly enumerate all possible pairs of alignment paths at the site to calculate the joint probability of the path for both haplotypes of a diploid sample; instead the method considers the current indel allele compared to all other possible alignment paths at the site.

This approximation effectively handles most simple overlapping indels, but will tend to undercall indels in regions with very high indel error rates. A consequence of this model is that where overlapping indels o
193. read boundary used for multiple reads
An asterisk (*) means "fill up the read as far as possible with the preceding character".
A number means that the previous character is repeated that many times.
Unspecified cycles are set to "n" by default. If USE_BASES is not specified at all, every cycle is used for the alignment.
Note that the symbol "I" for indexing is no longer accepted syntax for USE_BASES.

NOTE
Default is USE_BASES Y*n, which means perform a single-read alignment and ignore the last base. If running ANALYSIS eland_pair, make sure to specify the USE_BASES option for two reads (for example USE_BASES Y*n,Y*n).

The following table describes examples of USE_BASES options.

Table 6 USE_BASES Options

Option               Definition
USE_BASES nYYY       Ignore the first base and use bases 2-4.
USE_BASES Y30        Align the first 30 bases.
USE_BASES nY30       Ignore the first base and align the next 30 bases.
USE_BASES nY30n      Ignore the first base, align the next 30 bases, and ignore the last base.
USE_BASES nY*n       Ignore the first base, perform a single-read alignment, and ignore the last base. The length of read is automatically set to the number of sequencing cycles minus two.
USE_BASES Y*n        This means perform a single-read alignment and ignore the last base. Default for single-read alignment.
USE_BASES Y*n,Y*n    Perform a paired-read alignment but ignore the last base of e
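The USE_BASES rules above (a number repeats the preceding character, an asterisk fills the read as far as possible, and unspecified cycles become n) can be made concrete by expanding a mask into one character per cycle. The following Python sketch is illustrative only: expand_use_bases is a hypothetical helper, it handles a single read (no read boundary character), and it accepts only the Y and n characters shown in Table 6.

```python
import re


def expand_use_bases(mask, cycles):
    """Expand a single-read USE_BASES mask into one 'Y' or 'n' per cycle.

    Assumptions of this sketch: only 'Y' and 'n' are accepted, at most one
    asterisk is allowed, and the mask covers exactly one read.
    """
    tokens = re.findall(r"([Yn])(\*|\d+)?", mask)
    if "".join(char + suffix for char, suffix in tokens) != mask:
        raise ValueError(f"unsupported USE_BASES mask: {mask!r}")

    # Cycles claimed by everything except the asterisk token.
    fixed = sum(int(suffix) if suffix else 1
                for _, suffix in tokens if suffix != "*")
    stars = sum(1 for _, suffix in tokens if suffix == "*")
    if stars > 1 or fixed > cycles:
        raise ValueError(f"mask {mask!r} does not fit {cycles} cycles")

    fill = cycles - fixed  # what an asterisk expands to
    parts = []
    for char, suffix in tokens:
        if suffix == "*":
            parts.append(char * fill)
        elif suffix:
            parts.append(char * int(suffix))
        else:
            parts.append(char)
    expanded = "".join(parts)
    # Unspecified trailing cycles are set to 'n'.
    return expanded + "n" * (cycles - len(expanded))


if __name__ == "__main__":
    print(expand_use_bases("Y*n", 36))
    print(expand_use_bases("nY30n", 36))
```

With a 36-cycle read, nY*n expands to 34 used bases, which matches the "number of sequencing cycles minus two" stated in Table 6.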
194. reads or 1 for single-end reads. The field can support more than two reads.
<is filtered>       Y or N   Y if the read is filtered, N otherwise
<control number>    0        0 when none of the control bits are on (reserved for future use)
<barcode sequence>  ACGT     Represents the USE_BASES masked barcode sequence, empty otherwise

An example is shown below:
E 15951502 FC106 742459321000512 850 1  E18 CATENOG  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Variant Detection and Counting

Introduction
Methods
Variant Detection Input Files
Running Variant Detection and Counting
Variant Detection and Counting Output Files

Introduction

This chapter explains how to use CASAVA 1.8 to detect Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (indels), and count hits on transcripts for RNA sequencing.

CASAVA generates a CASAVA build, which is a post-sequencing analysis of data from reads aligned to a reference genome by configureAlignment.

The CASAVA build process is divided into several modules (or targets), each of which  
195. reference genome to produce indel candidates, and then the variant caller consolidates these candidates, performs local realignment, and genotypes the indel. Indels of up to 300 bases in length can be genotyped using this process. Small indels (up to 10 bases) can be detected directly from the gapped alignment.

RNA counting: The number of bases that fall into the exonic regions of each gene are summed to obtain gene level counts, normalized according to feature size, and expressed as RPKM (Reads Per Kilobase per Million of mapped reads). Only splice sites from known splice variants are reported, one at a time. If a read represents a new splice variant or spans multiple splice junctions, it will not be counted.

What's New

Important Changes in CASAVA 1.8.2

Bcl Conversion and Demultiplexing
Supports dual and single indices.
Supports adapter masking.
CASAVA 1.8.2 FASTQ files contain only reads that passed filtering. If you want all reads in a FASTQ file, use the --with-failed-reads option.

For more information, see the Release Notes for CASAVA 1.8.2, or the Changes file in CASAVAInstallationDirectory/share/CASAVA-1.8.2.

New Options

The new options for release 1.8.2 are listed below.

Bcl Conversion and Demultiplexing

For descriptions, see Options for Bcl Conversion and Demultiplexing on page 33.
--adapter-sequence
--with-failed-reads

Frequently Asked Questio
196. rerun using:
/path/to/CASAVA/bin/configureBuild.pl -od PROJECT_DIR --targets callSmallVariants

NOTE
We only support data sets originated from the same version of the software.

Generate BAM File with Altered Alignments

An advanced option useful for variant diagnosis is to create BAM files for those reads which had their alignments altered by the variant caller during local realignment. This may be done by adding the command --variantsWriteRealigned to any command line which runs the variant caller.

Targets

The targets that define CASAVA analysis are listed in the tables below.

Table 15 Targets for Variant Detection and Counting

Option              Description
all                 Run all pre-configured targets for the given analysis type (default), except for target bam.
sort                Bin reads and sort by position. Remove PCR duplicates for paired-end data.
assembleIndels      Search for candidate indels from paired-end reads via de novo assembly of contigs which are aligned back to the reference.
callSmallVariants   Call SNPs and indels from locally re-aligned reads. Candidate indels from the assembleIndels target can be used to improve indel results. See also Target callSmallVariants Usage on page 97.
rnaCounts           Calculate gene and exon counts in an RNA-Seq build.
bam                 Aggregate all reads into a single BAM file with chromosome re-labeling. This target is not part of target all, and is therefore not done by default. Must be preceded 
197. riment fluidics or from intensity plots temperature control
Problem with cycle 20, deduced from intensity plots: Check fluidics and focus for this cycle.
Exceptionally high value: Low first cycle intensity. Check first cycle focus.

Percentage of Clusters Passing Filters

To remove the least reliable data from the analysis, the raw data can be filtered to remove any clusters that have "too much" intensity corresponding to bases other than the called base. By default, the purity of the signal from each cluster is examined over the first 25 cycles and calculated as Chastity = Highest_Intensity / (Highest_Intensity + Next_Highest_Intensity) for each cycle. The new default filtering implemented at the base calling stage allows at most one cycle that is less than the Chastity threshold.

The higher the value, the better. This value is very dependent on cluster density, since the major cause of an impure signal in the early cycles is the presence of another cluster within a few micrometers.

Condition: Very few clusters passing filter
Possible Cause: Poor flow cell, perhaps unblocked DNA. Faint clusters. Out of focus.
Suggested Action: Some of the causes may be at a single cycle. If the problem is isolated to these early cycles, it is possible that this filtering throws away very good data. Base calling errors may be limited to affected cycles, and, as early cycles are fairly resistan
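The Chastity calculation and the "at most one failing cycle in the first 25" rule from the Percentage of Clusters Passing Filters section above can be written out as a short Python sketch. This is an illustration, not the base caller's implementation: the function names are hypothetical, the threshold is passed in by the caller because this section does not state its value (the 0.6 in the usage example is an assumed value), and intensities are assumed to be non-negative.

```python
def chastity(intensities):
    """Chastity for one cycle: highest / (highest + next highest) intensity."""
    highest, next_highest = sorted(intensities, reverse=True)[:2]
    total = highest + next_highest
    # Assumption of this sketch: a cycle with no signal gets chastity 0.
    return highest / total if total > 0 else 0.0


def passes_filter(cycles, threshold, n_cycles=25, max_failures=1):
    """Return True if at most max_failures of the first n_cycles fall below threshold.

    cycles is a list with one entry per cycle, each holding the four channel
    intensities of the cluster at that cycle.
    """
    failures = sum(1 for channel_values in cycles[:n_cycles]
                   if chastity(channel_values) < threshold)
    return failures <= max_failures


if __name__ == "__main__":
    clean_cycle = [100.0, 20.0, 10.0, 5.0]   # chastity = 100 / 120
    mixed_cycle = [50.0, 50.0, 1.0, 1.0]     # chastity = 0.5
    cluster = [mixed_cycle] + [clean_cycle] * 24
    # 0.6 is an assumed threshold used only for this illustration.
    print(passes_filter(cluster, threshold=0.6))
```

A cluster that overlaps a close neighbor produces many cycles like mixed_cycle, which is why the percentage of clusters passing filter drops as cluster density rises.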
198. r1

NOTE
The -j <n> command line option is supported to indicate up to <n> processes in parallel. However, for Bcl conversion the maximum level of parallelization is 8.

Starting Alignment

You can also start alignment before completion of the run using the target r1 when running make for configureAlignment.

See Starting Alignment for Read 1 on page 64.

Alternatively, you can use the POST_RUN_COMMAND_R1 variable to automatically start the alignment of read 1 at the end of the Bcl conversion. For example:
make -j 8 r1 POST_RUN_COMMAND_R1="cd ../Aligned ; make -j 16 r1"

Starting the Second Read

To start Bcl conversion of the second read, use the regular make command in the Unaligned folder. Perform the following:

1 Move into the Unaligned folder specified by --output-dir.

2 Type the regular "make" command:
make -j 8

3 After the analysis is done, review the analysis for each sample.
See Demultiplex_Stats File on page 42.

Bcl Conversion Output Folder

The Bcl Conversion output directory has the following characteristics:
The project and sample directory names are derived from the sample sheet.
The Demultiplex Stats file shows where the sample data are saved in the directory structure.
The Undetermined_indices directory contains the reads with an unresolved or erroneo
199. r1

NOTE
The -j <n> command line option is supported to indicate up to <n> processes in parallel.

Starting the Second Read

To start alignment of the second read, use the regular "make" command in the Aligned folder. Perform the following:

1 Move into the Aligned folder.

2 Type the regular "make" command:
make -j n

KAGU_PAIR_PARAMS and KAGU_PARAMS

The parameters KAGU_PARAMS (for all runs) and KAGU_PAIR_PARAMS (for paired-end runs) pass options to the alignmentResolver through the configureAlignment configuration file. For additional information, see configureAlignment Configuration File on page 54.

The parameters can be specified lane by lane. All of the options must be specified on a single line and space separated, as in the following examples:
StKAGU PRIR  PARAMS    Circular     mui 0    OT  8 KAGU PARAMS   mmag 4

The following tables describe the parameters.

Table 7 Parameters for KAGU_PAIR_PARAMS and KAGU_PARAMS

Parameter: --mmaq
Description: Minimum Mate Alignment Quality. Each read is given a single-read alignment score. This is identical to the alignment score from an eland_extended analysis. If a read has a zero paired-read alignment score, but a single-read alignment score that exceeds this threshold, its alignment will still go in the export.txt.gz files. If the alignments of the two reads can not be paired (resulting in a zero paired score) and only one of the reads ha
200. rter reads.
  Single-candidate alignments for better quality reads will score more highly than single-candidate alignments of lower quality reads.
  Single-candidate alignments to shorter genomes will score more highly than single-candidate alignments to longer genomes.

Unreported Unique Alignments

A line in an export file will only contain alignment information if the alignment score for that read exceeds a threshold. The primary purpose of this threshold is to retain only alignments that are markedly better than any other possible alignment for the read.

configureAlignment reduces alignment quality to a single confidence score. Read quality, the number of mismatches in the best alignment, and the presence of other candidate alignments all contribute to the calculation of that score. Therefore, changes in any of these three variables will affect whether the alignment passes the alignment quality threshold. So even if only a single candidate alignment has been found for a read, it may still fail the alignment quality threshold for one of two reasons, and not be reported in export.txt.gz:
  Low base quality values.
  Excessive number of mismatches in the candidate alignment. There will be at most 2 mismatches in the seed, but potentially there can be any number of mismatches in the remainder of the read.

For most applications, this is the right thing in both cases. For ex
201. s an alignment exceeding --min-single-read-alignment-score, the read pair is treated as a singleton. The alignment of the orphan read is unreliable enough to be ignored. The default value is 4.

Table 8 Parameters for KAGU_PAIR_PARAMS Only

Parameter: --circular
Description: This causes alignmentResolver to treat each chromosome as circular and not linear, enabling it to detect valid pairings that "wrap around" when the two alignments are mapped onto the linear representation of the chromosome.

Parameter: --circular=my_mitochondria_file.fa
Description: Treat alignments to my_mitochondria_file.fa as circular but other chromosomes as linear, as you might want to do when, for example, aligning to the whole human genome.

Parameter: mag bed
Description: Minimum percentage of Unique Fragments. A unique pair is defined as a read pair such that its constituent reads can each be aligned to a unique position in the genome without needing to make use of the fact that they are paired. alignmentResolver works in a two-pass fashion:
  1. On the first pass it looks for all clusters that pass the quality filter and have a unique alignment of each of their two reads, then uses this information to determine the nominal insert size distribution and the relative orientation of the two reads.
  2. On a second pass this information is used to resolve repeats and other ambiguous cases.
The number of uniqu
202. s that define CASAVA variant detection and counting analysis are listed in the tables below, with SE = single-end (single-read), PE = paired-end.

Advanced options for fine tuning the variant calling are listed in Advanced Options for Variant Detection on page 152.

NOTE: The option --outDir is mandatory for all analysis types. CASAVA will not run if this option is missing. CASAVA will only run without --inSampleDir if the build has already been configured with --inSampleDir before.

Global Options

The options described below are global options used to specify analysis across different targets.

Table 16 Major File Options for Variant Detection and Counting

Option: -id, --inSampleDir PATH
Application: SE, PE
Description: PATH to the aligned sample input directory.
Example: -id TestData/Aligned/Project_<SampleProject>/Sample_<SampleID>

Option: -od, --outDir PATH
Application: SE, PE
Description: PATH to the build sample output directory.
Example: -od /home/user_name/data/Project_01

Option: -ref, --refSequences PATH
Application: SE, PE
Description: PATH of the reference genome sequences. Default is buildDir/genomes.
Example: -ref /data/Genome/CASAVA/hg18

Option: --samtoolsRefFile FILE
Application: SE, PE
Description: The
203. s unreliable, but rather that only the base calls flagged with Q2 are unreliable. Note, however, that these regions are included in the Gerald error rate calculations for aligned reads. In typical sequencing runs, most reads are reliable over their entire length, and are not marked with Q2 indicators. Of the reads that are marked with the Q2 indicator, most are flagged only in the final few cycles.

Demultiplex_Stats File

The Demultiplex_Stats.htm file provides statistics about demultiplexing and shows where samples are saved in the directory structure. The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_FCID directory.

The file contains the sample information from the sample sheet, with added rows for reads that end up in the Undetermined_indices directory. If no sample sheet exists, CASAVA generates rows for each lane. The Demultiplex_Stats file has a number of additional columns that display demultiplexing statistics and show the directory the samples are saved in. The Demultiplex_Stats file contains the following fields:

Field        Description
Lane         Positive integer, indicating the lane number (1-8).
SampleID     ID of the sample.
SampleRef    The reference sequence for the sample.
Index        Index sequence.
Description  Description of the sample.
Control      Y indicates this lane is a control lane, N means sample.
Project      The project the sample belongs to.
# Reads      Number of reads (equals total number of lines in fastq files
204. score
Each sequence identifier (the line that precedes the sequence and describes it) needs to be in the following format:

  @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

The elements are described below.

Element            Requirements                                         Description
@                  @                                                    Each sequence identifier line starts with @.
<instrument>       Characters allowed: a-z, A-Z, 0-9 and underscore     Instrument ID.
<run number>       Numerical                                            Run number on instrument.
<flowcell ID>      Characters allowed: a-z, A-Z, 0-9
<lane>             Numerical                                            Lane number.
<tile>             Numerical                                            Tile number.
<x-pos>            Numerical                                            X coordinate of cluster.
<y-pos>            Numerical                                            Y coordinate of cluster.
<read>             Numerical                                            Read number. 1 can be single read or read 2 of paired-end.
<is filtered>      Y or N                                               Y if the read is filtered, N otherwise.
<control number>   Numerical                                            0 when none of the control bits are on, otherwise it is an even number. See below.
<index sequence>   ACTG                                                 Index sequence.

An example of a valid entry is as follows (note the space preceding the read number element):

  @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

NOTE: CASAVA 1.8.2 FASTQ files contain only reads that passed filtering. If you
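The identifier layout above can be checked mechanically. The following is a minimal Python sketch, not part of CASAVA, that parses one sequence identifier line. It assumes the leading `@` and the `:` separators of the CASAVA 1.8 FASTQ convention, and it assumes an empty index sequence is acceptable for non-multiplexed reads; the names `HEADER_RE`, `SequenceId`, and `parse_sequence_id` are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern for one sequence identifier line. The '@' prefix, the ':' separators
# and the tolerance for an empty index are assumptions of this sketch.
HEADER_RE = re.compile(
    r"^@(?P<instrument>[A-Za-z0-9_]+)"
    r":(?P<run_number>\d+)"
    r":(?P<flowcell_id>[A-Za-z0-9]+)"
    r":(?P<lane>\d+)"
    r":(?P<tile>\d+)"
    r":(?P<x_pos>\d+)"
    r":(?P<y_pos>\d+)"
    r" (?P<read>\d+)"
    r":(?P<is_filtered>[YN])"
    r":(?P<control_number>\d+)"
    r":(?P<index_sequence>[ACGT]*)$"
)


class SequenceId(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    index_sequence: str


def parse_sequence_id(line: str) -> SequenceId:
    """Split a sequence identifier line into typed fields, or raise ValueError."""
    match = HEADER_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a valid sequence identifier: {line!r}")
    g = match.groupdict()
    return SequenceId(
        instrument=g["instrument"],
        run_number=int(g["run_number"]),
        flowcell_id=g["flowcell_id"],
        lane=int(g["lane"]),
        tile=int(g["tile"]),
        x_pos=int(g["x_pos"]),
        y_pos=int(g["y_pos"]),
        read=int(g["read"]),
        is_filtered=(g["is_filtered"] == "Y"),
        control_number=int(g["control_number"]),
        index_sequence=g["index_sequence"],
    )
```

A call such as `parse_sequence_id("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")` yields a record whose `lane` is 2 and whose `index_sequence` is `ATCACG`; a line without the space before the read number is rejected.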
205. sent in CASAVA 1.7 has been integrated in the bcl conversion step.

Demultiplexing

Multiplexed sequencing allows you to run multiple samples per lane. The samples are identified by index sequences (barcodes) that are attached to the template during sample preparation. For TruSeq dual indexing, you can analyze up to 96 individual samples per lane, while TruSeq multiplexing with a single index allows up to 12 samples in one lane.

Multiplexed sequencing runs from SCS 2.4 and later versions set the index reads as separate reads. Sample demultiplexing in CASAVA creates several subdirectories to dispatch the data associated with the different barcodes. Each subdirectory has a structure similar to the original BaseCalls directory.

Aligning Reads

CASAVA performs sequence alignment using the configureAlignment module, which is a set of utilities supplied as source code and scripts.

The output data produced by configureAlignment are stored in a hierarchical folder structure called the run folder. The run folder includes all data folders generated from the sequencing platform and the data analysis software.

For the alignment step, the standard input files for reads are the compressed FASTQ files (<sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>.fastq.gz). The standard output files for reads are the export files (<sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit
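The FASTQ file naming scheme described above lends itself to a small parser. The sketch below is illustrative only: it assumes the fields are joined by underscores, that the lane is zero-padded to three digits (for example `L002`), that dual-index barcodes are joined by a hyphen, and that non-indexed samples use the literal `NoIndex`. None of these details is confirmed by the text above, and `FASTQ_NAME_RE` and `parse_fastq_name` are hypothetical names.

```python
import re

# File name pattern; underscore separators, 3-digit lane padding, hyphenated
# dual indexes and the 'NoIndex' placeholder are assumptions of this sketch.
FASTQ_NAME_RE = re.compile(
    r"^(?P<sample>.+)"
    r"_(?P<barcode>[ACGTN-]+|NoIndex)"
    r"_L(?P<lane>\d{3})"
    r"_R(?P<read>\d+)"
    r"_(?P<set_number>\d{3})"
    r"\.fastq\.gz$"
)


def parse_fastq_name(name: str) -> dict:
    """Return the fields encoded in a compressed FASTQ file name."""
    match = FASTQ_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected FASTQ file name: {name!r}")
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted naturally.
    for key in ("lane", "read", "set_number"):
        fields[key] = int(fields[key])
    return fields
```

Because the sample name is matched greedily and the remaining fields are anchored at the end of the name, sample names that themselves contain underscores are still split correctly.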
206. sion levels for exons, genes and splice junctions in the RNA Sequencing analysis.

In addition, CASAVA automatically generates a range of statistics, such as mean depth and percentage chromosome coverage, to enable comparison with previous builds or other samples.

CASAVA analyzes sequencing reads in three stages:
  FASTQ file generation and demultiplexing
  Alignment to a reference genome
  Variant detection and counting

These three stages are explained below.

Figure 2 CASAVA Workflow
[Figure: *.bcl files -> FASTQ Generation and Demultiplexing (convert *.bcl files into compressed FASTQ files; separate multiplexed sequence runs by index) -> Aligning (align to reference genome) -> Detecting Variants and Counting (build consensus sequence; call SNPs; detect indels and structural variants; count RNA reads) -> CASAVA output files (build for GenomeStudio)]

Bcl Conversion

CASAVA 1.8 uses *.bcl files as primary sequence input. The first step, bcl conversion, performs the following:
  Generates compressed FASTQ files that can be used by configureAlignment.
  Organizes the output in Project and Sample folders, based on the sample sheet (if provided).
  Demultiplexes samples into that same run folder organization, based on the sample sheet.

NOTE: The separate demultiplexing step pre
207. sion used to generate the file.
  CL: The configureBuild.pl command line used to execute or create the workflow for the SAM target.

An example of a header line is shown below:
  @PG ID:CASAVA VN:CASAVA-1.8.0 CL:/home/user1/CASAVA-20091209/bin/configureBuild.pl -p testBaseMiniBAM --targets bam

Alignment Section

The alignment section consists of multiple TAB-delimited lines, with each line describing an alignment. Each line is:
  <QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

An example of a line in an alignment section is shown below:
  HW1 EAS568 9096 2115 512 204 99 Cczz ta 14483804 29 76M6I118M 14484254 550
  AGAAATGTTCTAAAATTAAATTGTAGTGATGTCTGCACAACTTTGTAAGT
  TTATAAAAAATAATTGACTTGTACACTTAATATTAATGAGTTGTATGGCA
  HGFGHGHHGHHIHEGHHHEHHHFEECBBFBGFHHHEHHHEHHGHHHDHHD
  HEHDEFHH  CC C6HHHEED FFFHHHF HEHH  HHH   HGHHGBHFBD
  KDE TOO lo SMST 1511429

The format of each field is explained in the following table.

Field   Description
QNAME   Query pair name if paired, or Query name if unpaired. This consists of the following sequence:
        <Machine>_<Run number>:<Lane>:<Tile>:<X coordinate of cluster>:<Y coordinate of cluster>
FLAG    Bitwise flag. For a description, see Bitwise Flag Values on page 170.
RNAME   Re
208. stood that this is the default status of an unmapped read.
  Reads which cross a splice junction are annotated as a single record using the SAM CIGAR 'N' (SKIP) character. For example, a 75-base read spanning a 1000-base intron may have the CIGAR string:
    35M1000N40M
  Chromosome names cannot be changed in the chromosome BAM files on which CASAVA operates. The bam module may be used to create a whole-genome BAM file with translated chromosome labels. Note that whole-genome BAM files are now the only option for the bam module.

Converting Sam and Bam Files

If you want to convert Sam files into Bam files, enter the following:
  samtools view -b -h -S -o output.bam <in.sam>
Where:
  -b       Output in the BAM format.
  -h       Include the header in the output.
  -S       Input is in SAM. If @SQ header lines are absent, the '-t' option is required.
  -o FILE  Output file [stdout].

If you want to convert Bam files into Sam files, enter the following:
  samtools view -h -o output.sam <in.bam>
Where:
  -h       Include the header in the output.
  -o FILE  Output file [stdout].

For more information, see samtools.sourceforge.net.

Variant Detection Output Files

All variant caller output files are written in a text format composed of one header segment followed by one data segment.

All lines in the header segment begin with the '#' character. Header lines beginning with the sequence '#$' contain a key-value pair. The reserved key, COLUMNS, has an associ
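To make the spliced-read CIGAR example concrete, here is a minimal Python sketch that tokenizes a CIGAR string and sums the read length and the reference span. It follows the operator semantics of the SAM specification (M, I, S, = and X consume read bases; M, D, N, = and X consume reference bases); the function names are hypothetical and the code is not taken from CASAVA.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that consume read bases and reference bases (per the SAM spec).
CONSUMES_READ = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operator) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]
    # Re-assemble the tokens to make sure nothing in the input was skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def read_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar: str) -> int:
    """Number of reference positions covered, including skipped introns."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)
```

For `35M1000N40M`, the read length is 35 + 40 = 75 bases and the reference span is 35 + 1000 + 40 = 1075 positions, which matches the 75-base read across a 1000-base intron described above.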
209. struments, instrument configuration, genomic sample type, type of analysis, flow cell preparation, and the current state of the art. Therefore, the numbers shown in this section are for example only.

Summary Pages

After analysis is complete, check the FlowCellSummary_FCID.htm file, Sample_Summary.htm, and Barcode_Lane_Summary.htm files. These provide metrics per flow cell, sample, and barcode-lane, respectively. For a description of the tables found, see Flow Cell Summary on page 79, Sample Summary Page on page 74, and Barcode Lane Summary Page on page 78.

The key parameters that you should examine are listed in Summary Tab on page 17 and in the following sections.

Percentage of Clusters Passing Filters that Align Uniquely to the Reference Genome

Optimal value depends on the genome sequenced and the read length; the higher (up to 100% max), the better.

This result is genome specific and dependent on the completeness of the reference. A failure to align could be due to repeat or missing regions, or due to indels where sample and reference do not match.

Condition: Much lower than expected when using

Possible Cause: Fluidics or instrument problem
Suggested Action: Look for an intensity dip in IVC plots. If there is a problem and it occurs after a sufficiently useful read length, re-run ELANDv2e analysis using only the "good" cycles before the instrument problem.

Possible Cause: Contamination
Suggested Action: Align a few sample tiles. Genomic contamination wil
210. t to minor focus and fluidics problems, even the number of errors may be few. The filtering can always be set manually to some other values. Check before assuming all the data are poor.

Possible Cause:
  Bubbles in individual tiles
  Too many clusters
  Poor matrix
  A fluidics or sequencing failure
  Large clusters
  High phasing or prephasing

Percentage of Phasing and Prephasing

Ideally, these values should be as low as possible.

Condition: High phasing or prephasing

Possible Cause:
  Reagent issue: reagents have deteriorated
  Fluidics
  Poor flow cell
  Ambient temperature of system

Suggested Action:
  Check for leaks or bubbles in images, or early cycle discrepancies in intensity plots.
  Poor blocking can be evident as intensity in all channels from cycle 1.
  Check whether machine or facility temperature gets beyond recommended limits.

Standard Deviations

Many values have standard deviations associated with them. This can be the first indication as to the uniformity of the flow cell. If standard deviations are high, it indicates variability from tile to tile within a lane.

Condition: High standard deviations

Possible Cause: Check poor tiles:
  Bubbles
  Focus
  Dirty flow cell surface

Suggested Action: Look at the tile-by-tile statistics that appear below the flow cell-wide summary.

After reviewing the tables in Summary.htm, examine the thumbnails.

Bcl Conversion and Demultiplexing

Bcl Conversio
211. t Gapped Alignment
[Figure: reads spanning an indel. Single seed with ungapped extension: reads spanning the indel do not align properly. Single seed with gapped extension: reads spanning the indel align properly. Multiseed alignment: a first 32-base seed spanning the indel does not align properly and no extension is possible; the second 32-base seed aligns properly and is extended with gaps.]

Note that a read has to have at least one seed that matches with at most 2 mismatches, and for that seed no gaps are allowed. For the whole read we allow any number of gaps, as long as they correct at least five mismatches downstream.

Alignment Score Calculation

The base quality values and the positions of the mismatches in a candidate alignment are used to give a probability score (p-value) to each candidate. This is the probability that the candidate position in the genome aligned to would, if its bases were sequenced at error rates that correspond to the read's quality values, give rise to the observed read. This way the contribution of each base is weighted according to its quality.

NOTE: A consequence of this is that the best alignment does not necessarily have the least number of mismatches, although an exact match will always beat any alignment containing m
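The quality-weighted scoring idea can be illustrated with a toy model. The guide does not give the formula, so the Python sketch below makes two assumptions that are not from the source: a base with Phred quality Q has error probability 10^(-Q/10), and a miscalled base is equally likely to be any of the three other bases. The function names are hypothetical, and gaps are ignored.

```python
def phred_to_error(quality: int) -> float:
    """Error probability implied by a Phred quality value (assumed model)."""
    return 10 ** (-quality / 10)


def candidate_probability(read: str, reference: str, qualities: list) -> float:
    """Probability that sequencing `reference` would yield `read`.

    Assumed model: a matching base contributes (1 - e) and a mismatching base
    contributes e / 3, where e is the error probability of that base.
    """
    if not (len(read) == len(reference) == len(qualities)):
        raise ValueError("read, reference and qualities must have equal length")
    probability = 1.0
    for read_base, ref_base, quality in zip(read, reference, qualities):
        error = phred_to_error(quality)
        probability *= (1.0 - error) if read_base == ref_base else error / 3.0
    return probability


# A read whose last two bases have very low quality (Q2).
read = "ACGT"
qualities = [40, 40, 2, 2]

exact = candidate_probability(read, "ACGT", qualities)
two_low_quality_mismatches = candidate_probability(read, "ACTA", qualities)
one_high_quality_mismatch = candidate_probability(read, "TCGT", qualities)
```

Under these assumptions the candidate with two mismatches at Q2 positions scores higher than the candidate with a single mismatch at a Q40 position, while the exact match scores highest of the three. This mirrors the statement that the best alignment does not necessarily have the fewest mismatches.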
212. t and sample directories specified in the sample sheet, as illustrated below for the sample in line 4 of the sample sheet.

Figure 9 Relation between Sample Sheet and Directory Structure

FCID       Lane  SampleID  SampleRef  Index   Description  Control  Recipe       Operator  SampleProject
FC200DMAB  2     A         hg18       ATCACG  Example      N        PE indexing  FZ        A
FC200DMAB  2     B         hg18       CGATGT  Example      N        PE indexing  FZ        A
FC200DMAB  3     C         hg18       ATCACG  Example      N        PE indexing  FZ        B

<ExperimentName>
  YYMMDD_machinename_XXXX_FC
    Data
    Unaligned
      Project_A
        Sample_A    (*.fastq.gz files)
        Sample_B    (*.fastq.gz files)
      Project_B
        Sample_C    (*.fastq.gz files)

Bcl Conversion Demultiplexing Examples

Bcl conversion and demultiplexing support four scenarios:
  Multiplexed samples present, with sample sheet:
  Reads are placed within the directory structure specified by the sample sheet, based on the index and lane information. Reads for which the index sequence was ambiguous will be placed in a project directory called Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane.
  Multiplexed and non-multiplexed samples present, with sample sheet:
  Reads are placed within the directory structure specified by the sample sheet, based on the index and lane information. Reads containing ambiguous or no barcodes will be placed in a pr
213. t improvements of ELANDv2e are improved repeat resolution and implementation of orphan alignment.

A short description of these improvements is provided below; more information about ELANDv2 is available in Algorithm Descriptions on page 131.

ELANDv2

The most important improvements of ELANDv2 are the following:
  Handles indels and mismatches better by performing multiseed and gapped alignments.
  Enhanced match descriptor options to handle the gaps identified (see Export.txt.gz on page 79).
  Ability to split queries on a per-tile basis, which allows for much greater parallelization.

The hashing method in ELANDv2 has been optimized in CASAVA 1.7 for performance. This leads to a significant improvement in running times for the seed matching step of the alignment in CASAVA. More information about ELANDv2 is available on page 151.

ELANDv2e Alignment Improvements

CASAVA 1.8 features ELANDv2e. This updated alignment program includes the following new features:
  Better repeat resolution
  A new orphan aligner
  Shorter run times with a new version of alignmentResolver

configureAlignment Input Files

The folder structure and format of configureAlignment input has changed significantly in CASAVA 1.8. The major changes are as follows:
  configureAlignment uses FASTQ files as sequence input.
  Bcl conversion and demultiplexing are merged in one step, and both multiplexed and non-multipl
214. t is universally available and has a built-in parallelization switch (-j). For example, on a dual-processor, dual-core system, running "make -j 4" instead of "make" executes the configureAlignment run in parallel over four different processor cores, with an almost 4-fold decrease in analysis run time. On a system with more sockets or more cores per socket, "-j 8" or more may be advisable.

Distributed Make

There are several distributed versions of make for cluster systems. Frequently used versions include qmake from Sun Grid Engine (SGE).

To use qmake, a short wrapper script is required. See below for details.

There are known issues with the use of lsmake that prevent parts of CASAVA from running. Therefore, Illumina does not recommend using lsmake to run CASAVA.

NOTE: Distributed cluster computing may require significant system administration expertise. Illumina does not support external installations.

Using qmake

SGE has the utility qmake, which can run the tasks of a make across a cluster in parallel. There are two possible ways to run this.

Separate Jobs on Queuing System

The first generates each make task as a separate job run on the queuing system.

1 Move into the output folder.
2 Create a script file which contains the following:
  qmake -cwd -v PATH -- -j 32
3 Submit the jobs to the SGE:
  qsub -cwd -v PATH <script file>

The options convey the following infor
215. t size.
  Reads are filtered on ELAND alignment score. For paired-end reads the variant caller removes by default any read with a paired-end alignment score less than 90, and for single-end reads, those with a single-end alignment score less than 10 are removed.

Detecting Indels and Realigning

The variant caller proceeds with candidate indel discovery and generation of alternate read alignments based on these candidate indels.

As part of this re-alignment process the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNP caller. This alignment is selected to be within a certain threshold of the most likely of all alignments for a read, and any leading or trailing portions of the read with ambiguous support for 2 or more different alignments are marked as clipped. This representative alignment does not affect the indel caller: the indel calling process considers all alignments for each read without end clipping. For diagnostic purposes, the set of reads which have their alignments altered during re-alignment may be written out to a separate BAM file for each chromosome using the --variantsWriteRealigned flag.

Indel Caller

The indel caller finds indels using a two-stage process. In the first stage an indel must be identified as a candidate indel. In the second stage, after indel candidates have been identified, all intersecting reads are aligned to each indel, to the reference and t
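The default read filter described above reduces to a threshold comparison. The Python sketch below only restates the two default cutoffs from the text (90 for paired-end scores, 10 for single-end scores); the constant and function names are hypothetical, and how CASAVA exposes or overrides these thresholds is not shown here.

```python
# Default cutoffs as stated in the text; reads scoring below them are removed.
DEFAULT_MIN_PAIRED_END_SCORE = 90
DEFAULT_MIN_SINGLE_END_SCORE = 10


def keep_read(alignment_score: int, paired_end: bool) -> bool:
    """Return True if a read survives the default alignment score filter."""
    threshold = (
        DEFAULT_MIN_PAIRED_END_SCORE if paired_end else DEFAULT_MIN_SINGLE_END_SCORE
    )
    # "Less than the threshold" is removed, so a score equal to it is kept.
    return alignment_score >= threshold


def filter_reads(scores: list, paired_end: bool) -> list:
    """Keep only the scores that pass the filter, preserving their order."""
    return [score for score in scores if keep_read(score, paired_end)]
```

For example, `filter_reads([5, 10, 89, 90, 120], paired_end=True)` keeps `[90, 120]`, whereas the same scores treated as single-end keep everything except 5.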
216. tact Illumina Support.

For example, for a laboratory generating 200 GB of sequence per week, the Tier 1 IlluminaCompute solution is recommended, for which the specifications are listed below (non-IlluminaCompute systems satisfying these requirements are also fully supported):
  1 APC Netshelter 40U Rack with 1U KMM console
  3 Dell R610 Server (8 CPU cores, 48 GB RAM)
  3 Isilon IQ12000x storage modules
  1 Serial MGT Console 16
  2 Cisco 3750e switches

Sequence alignment takes somewhere between a few hours (using our fast short-read whole-genome alignment program ELAND) and days (using more traditional alignment programs).

CASAVA parallelization is built around the multi-processor facilities of the make utility and scales very well to beyond eight nodes. Substantial speed increases are expected for parallelization across several hundred CPUs. For a detailed description, see Using Parallelization on page 119.

Disk Space Requirements

When running CASAVA without keeping temporary data (removeTemps ON):
  Disk space needed while running: 3 x size of export files
  Disk space needed after running: 1.5 x size of export files

When running with all temporary files saved (removeTemps OFF):
  Disk space while running: 5 x size of export files

For example, to generate a build from one lane of E. coli data (1 GB) with --removeTemps ON, we recommend an additional 3 GB of disk space while running CASAVA and 1.5 GB for the final build directory.

NOTE: F
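The disk space rules above are simple multiples of the export file size, so they can be turned into a quick estimate. This Python sketch only encodes the three factors given in the text (3x and 1.5x with removeTemps ON, 5x with removeTemps OFF); the function name and return layout are hypothetical, and the final size with removeTemps OFF is left unset because the text does not state it.

```python
from typing import NamedTuple, Optional


class DiskEstimate(NamedTuple):
    while_running_gb: float
    after_running_gb: Optional[float]  # None when the text gives no figure


def estimate_disk_space(export_size_gb: float, remove_temps: bool = True) -> DiskEstimate:
    """Estimate disk space from the total size of the export files, in GB."""
    if export_size_gb < 0:
        raise ValueError("export size must not be negative")
    if remove_temps:
        # removeTemps ON: 3x while running, 1.5x for the final build.
        return DiskEstimate(3.0 * export_size_gb, 1.5 * export_size_gb)
    # removeTemps OFF: 5x while running; no final figure is given in the text.
    return DiskEstimate(5.0 * export_size_gb, None)
```

With 1 GB of export data and removeTemps ON, the estimate is 3 GB while running and 1.5 GB afterwards, which agrees with the E. coli example.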
217. tered to remove those indels which are found at a depth greater than a multiple of the mean chromosomal depth (3 times the mean chromosomal depth is used by default, which can be changed using the --variantsIndelCovCutoff option). This filter is designed to remove indel calls in regions close to centromeres and other high-depth regions likely to generate spurious calls.

NOTE: This filter is off for RNA variant calling, and we recommend turning it off for targeted resequencing.

The indels.txt file follows the general variant caller output file structure. The data segment of this file consists of 16 tab-delimited fields. The fields are described in the table below (note that all information is given with respect to the forward strand of the reference sequence).

No.  Label               Description
1    seq_name            Reference sequence label

No.  Label
2    pos
3    type
4    ref_upstream
5    ref/indel
6    ref_downstream
7    Q(indel)
8    max_gtype
9    Q(max_gtype)
10   depth
11   alt_reads
12   indel_reads
13   other_reads
14   repeat_unit
15   ref_repeat_count
16   indel_repeat_count

Description of pos: Except for right-side breakpoints, the reported start position of the indel is the first (left-most) reference position following the indel breakpoint. For right-side breakpoints the reported position is the right-most position preceding the breakpoint. Also note that wherever the same indel could be repres
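The depth filter described above can be sketched in a few lines. The Python example below assumes that each indel call carries a depth value and that the mean chromosomal depth is already known; the default multiple of 3 comes from the text, while the function names and the idea of passing `None` to switch the filter off are illustrative choices, not CASAVA behavior.

```python
from typing import Optional

DEFAULT_DEPTH_MULTIPLE = 3.0  # default multiple of the mean chromosomal depth


def passes_depth_filter(
    indel_depth: float,
    mean_chromosomal_depth: float,
    cutoff_multiple: Optional[float] = DEFAULT_DEPTH_MULTIPLE,
) -> bool:
    """Return True if an indel call is kept by the depth filter.

    A cutoff_multiple of None disables the filter (an assumption of this
    sketch, mirroring the cases where the text says the filter is off).
    """
    if cutoff_multiple is None:
        return True
    # Calls at a depth greater than the cutoff are removed.
    return indel_depth <= cutoff_multiple * mean_chromosomal_depth


def filter_indels(depths: list, mean_chromosomal_depth: float) -> list:
    """Keep the indel depths that pass the default filter."""
    return [d for d in depths if passes_depth_filter(d, mean_chromosomal_depth)]
```

With a mean chromosomal depth of 30, the default cutoff is 90, so an indel observed at depth 95 would be filtered out while one at depth 90 would be kept.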
218. the genome and splice junctions. Reads aligned to abundant sequences and masked by eland_rna do not participate in this number.
  Alignment Score (PF): The average filtered read alignment score (reads with multiple or no alignments effectively contribute scores of 0). For phiX spikes, the number of reads aligning to PhiX is small and therefore the reported alignment score (small number of aligned reads divided by total number of PF reads) is usually small.
  Mismatch Rate (PF): The percentage of called bases in aligned reads that do not match the reference.
  % >=Q30 bases (PF): Yield of bases with Q30 or higher from clusters passing filter divided by total yield of clusters passing filter.
  Mean Quality Score (PF): The total sum of quality scores of clusters passing filter divided by total yield of clusters passing filter.

If eland_pair analysis has been specified for one or more lanes, then two Lane Results Summaries are produced, one for each read. All lanes for which analysis has been specified are represented in the Read 1 table, but only those for which eland_pair analysis has been specified contribute statistics to the Read 2 table.

Expanded Sample Summary

This displays more detailed quality metrics for each sample.
  Clusters (raw): The number of clusters detected by the image analysis module.
  % Phasing: The estimated (or specified) value used f
219. the header for each sequence.

Direct CASAVA to a multi-sequence FASTA file using the option SAMTOOLS_GENOME for configureAlignment.

WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files.

Chromosome Naming Restrictions

CASAVA does not accept the following characters in the FASTA chromosome name header:
  F  J Tae KERE ER ty ee T

This validation can be disabled in configureAlignment using the following option:
  CHROM_NAME_VALIDATION off

WARNING: You may run into problems with downstream analysis if you disable chromosome name validation.

NOTE: If ELAND finds two alignments with identical alignment scores, ELAND will pick the first alignment (in the single-end case) or combination of alignments (in the paired-end case) that exhibit the highest observed alignment quality. These are the alignments that make it into the export files (which only contain the best alignment for each read). In practice, post-alignment CASAVA ignores these reads because of the low alignment qualities. Using a reference with lexicographic chromosome names (like chr1) will yield slightly different results compared to a reference with numerical chromosome names (like 1) for these reads, since the hits are sorted in a different way.

Reference Sequence Blocks

For reasons of efficiency, E
220. thms have been updated in CASAVA v1.8 to improve run times. The module alignmentResolver (previously called PickBestPair) has been rewritten, which has resulted in much faster run times for this step (200 hours for v1.7, versus 15 for v1.8).

The best analysis type therefore depends on the project: is a shorter run time more important, or the highest number of aligned reads?

Variant Detection

Post-alignment, CASAVA performs variant detection using two modules:
  The assembleIndels module (Grouper) detects candidate indels using singleton/orphan and anomalous read pairs. The assembleIndels module works well for detecting larger indels. The candidate indels detected by the assembleIndels module are passed on to the small variant caller for consolidation and genotyping.
  The callSmallVariants module genotypes and provides quality scores for SNPs and indels. Indels can be called from candidate indel evidence provided by both ELAND gapped read alignments (for smaller indels) and from the assembleIndels module (for larger indels).

For each SNP or indel call the probability of both the called genotype and any non-reference genotype is provided as a quality score (Q-score). Reads are re-aligned around candidate indels to improve the quality of SNP calls and site coverage summaries. The callSmallVariants module also generates files which summarize the depth and genotype probabilities for every site in the genome. As a final step it pro
221. tions should not exceed 16 million:
  ELAND_FASTQ_FILES_PER_PROCESS (value) x --fastq-cluster-count (value) <= 16 million

NOTE: The --fastq-cluster-count used during Bcl conversion can be found in Unaligned/Makefile.

See the table below for set size / cluster count combinations.

CAUTION: Setting the right value for ELAND_FASTQ_FILES_PER_PROCESS is very important. Too high a value may result in silent crashes due to excessive memory utilization, and should be avoided. Too low a value may result in decreased performance. Use is optional, and we generally recommend using default values.

--fastq-cluster-count  ELAND_FASTQ_FILES_PER_PROCESS  Reads per process  Comment
12,000,000             1                              12000000
6,000,000              2                              12000000
4,000,000              3                              12000000           Default values
3,000,000              4                              12000000
2,000,000              6                              12000000
1,000,000              12                             12000000

NOTE: Slight differences can be expected when using different combinations of --fastq-cluster-count and ELAND_FASTQ_FILES_PER_PROCESS.

Make Option

The --make option creates Aligned output directories and makefiles. Without the option, configureAlignment.pl will not create any directories and files and only operates in a diagnostic mode. You must specify this option to gen
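The 16 million limit above is a product constraint, which is easy to verify before starting a run. The Python sketch below multiplies the two settings and compares the result with the limit; the function names are hypothetical, and the listed combinations are the ones from the table.

```python
MAX_READS_PER_PROCESS = 16_000_000  # upper limit stated in the text


def reads_per_process(fastq_cluster_count: int, files_per_process: int) -> int:
    """Reads handled by one ELAND process for a given pair of settings."""
    if fastq_cluster_count <= 0 or files_per_process <= 0:
        raise ValueError("both settings must be positive integers")
    return fastq_cluster_count * files_per_process


def is_valid_combination(fastq_cluster_count: int, files_per_process: int) -> bool:
    """True if the product of the two settings stays within the limit."""
    return reads_per_process(fastq_cluster_count, files_per_process) <= MAX_READS_PER_PROCESS


# Combinations listed in the table: (--fastq-cluster-count, files per process).
TABLE_COMBINATIONS = [
    (12_000_000, 1),
    (6_000_000, 2),
    (4_000_000, 3),
    (3_000_000, 4),
    (2_000_000, 6),
    (1_000_000, 12),
]
```

Every combination in the table yields 12,000,000 reads per process, which stays under the limit, whereas a cluster count of 12,000,000 combined with 2 files per process would give 24,000,000 and exceed it.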
222. tributions such as RedHat and Fedora, or on other Unix variants, if all of the prerequisites described in this section are met.

The required software environment is described below.

CASAVA installation may not work properly with gcc versions 3.x. If you have gcc version 3.x, install gcc 4.0.0 or newer, up to and including version gcc 4.5.2, with the exception of gcc version 4.0.2, which is not supported.

Installation of CASAVA 1.8 now requires the Boost C++ library (version 1.44.0) and cmake (version 2.8.0 and above). These packages are included in the CASAVA installation package, and will automatically install during the configure stage if either package is not found in the user's environment.

The following software is required to run CASAVA 1.8; check whether it has been installed:

    GNU make (3.81 recommended)
    Perl (>= 5.8)
    Python (>= 2.3 and < 2.6)
    PyXML
    gnuplot (>= 3.7, 4.0 recommended)
    ImageMagick (>= 5.4.7)
    ghostscript
    libxslt
    libxslt-devel
    libxml2
    libxml2-devel
    libxml2-python
    ncurses
    ncurses-devel
    gcc (4.0.0 or newer up to and including version gcc 4.4.x, except 4.0.2), with c++
    libtiff
    libtiff-devel
    bzip2
    bzip2-devel
    zlib
    zlib-devel
    Perl modules:
        perl-XML-Dumper
        perl-XML-Grove
        perl-XML-LibXML
        perl-XML-LibXML-Common
        perl-XML-NamespaceSupport
        perl-XML-Parser
        perl-XML-SAX
        perl-XML-Simple
        perl-XML-Twig
        perldoc
223. txt files must be non-multiplexed or already demultiplexed into separate directories. If the converter finds reads 1, 2, and 3 from a multiplexed run, it will convert all three to FASTQ, but configureAlignment cannot run on these files.

A config.xml file must be found in the Qseq Basecalls folder, or the --config-file argument to the Qseq Converter must point to an equivalent file. The config.xml file must be copied to the FASTQ root folder and renamed DemultiplexedBustardConfig.xml.

NOTE: configureAlignment requires SampleSheet.csv and SampleSheet.xml files, but default versions of both files are created by the Qseq Converter.

Running Qseq Converter

To convert *_qseq.txt files, you need to run the configureQseqToFastq.pl script. This sets up the run by generating a makefile and metadata. Running make or qmake then converts the *_qseq.txt files into FASTQ files.

9 Enter the following command to create a makefile for sequence alignment with the desired compression option:

    /path-to-CASAVA/bin/configureQseqToFastq.pl --input-dir DIR [options]

10 Move into the newly created output folder. Type the "make" command for basic analysis:

    make

NOTE: You may prefer to use the parallelization option as follows:

    make -j 3 all

The extent of the parallelization depends on the setup of your computer or computing cluster.
224. ule to perform local read realignment and genotype SNPs and indels under a diploid model.

4 In an RNA-Seq build, the "rnaCounts" module will also be run to calculate gene and exon counts. Other optional modules can be added to the build process to perform additional functions.

For the variant discovery and counting step, the standard input file format for reads is the export format (<sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_export.gz). The standard output file format for reads is the BAM format. The sorted BAM files are stored in chromosome-specific directories under the output directory.

Use and properties of CASAVA's post-alignment modules are explained in Variant Detection and Counting on page 87. More information about the algorithms is available in Variant Detection on page 141.

Capabilities and Limitations

This section explains the capabilities and limitations of CASAVA when performing data analysis.

Demultiplexing

Demultiplexing is required for downstream analysis when a run is indexed. Demultiplexing processes the read data so that the reads are segregated and copied into separate directories, along with the indexing read or barcodes being parsed and removed.

Alignment

Alignment is controlled by the configureAlignment.pl wrapper script, which includes several analysis modes that initiate single 
225. ultiplexed, compressed FASTQ files. One level down from the Unaligned directory are the project directories, and within each project directory are the sample directories.

Reads with undetermined indices will be placed in the directory Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane.

NOTE: CASAVA 1.8 introduces samples and projects as the organizing principle, which differs from CASAVA 1.7, which organized output by lanes or index.

Figure 8 Typical Run Folder Structure after Bcl Conversion and Demultiplexing
[Figure: two panels, "Before Bcl Conversion" and "After Bcl Conversion", each showing the run folder <ExperimentName>/YYMMDD_machinename_XXXX_FC.]
226. us index.

If no sample sheet exists, CASAVA generates a project directory named after the flow cell, and sample directories for each lane.

Each directory is a valid base calls directory that can be used for subsequent alignment analysis in CASAVA.

NOTE: If the majority of reads end up in the "Undetermined_indices" folder, check the --use-bases-mask parameter syntax and the length of the index in the sample sheet. It may be that you need to set the --use-bases-mask option to the length of the index in the sample sheet + the character n to account for phasing. Note that you will not be able to see which indices have been placed in the "Undetermined_indices" folder.

[Figure: Unaligned directory structure under <ExperimentName>/YYMMDD_machinename_XXXX_FC, showing project directories, sample directories with fastq.gz files and a SampleSheet.csv file, Undetermined_indices, and Basecall_Stats_FC.]

NOTE: There can be only one Unaligned directory by default. If you want multiple Unaligned directories, you will have to use the option --output-dir to generate a different output directory.

FASTQ Files

As of 1.8, CASAVA converts *.bcl files into FASTQ files, and uses these FASTQ files as sequence in
227. used to specify analysis across different targets.

Table 22 Global Analysis Options for Variant Detection and Counting

    Option                     Application   Description
    --QVCutoff NUMBER          PE            Sets the paired-end alignment score threshold to NUMBER
                                             (default 90).
                                             Example: --QVCutoff 60
    --QVCutoffSingle NUMBER    SE, PE        Sets the single-read alignment score threshold to NUMBER
                                             (default 10).
                                             Example: --QVCutoffSingle 60
    --read NUMBER              PE            Limit input to the specified read only. Forces single-ended
                                             analysis on one read of a double-ended data set.
                                             Example: --read 1
    --singleScoreForPE VALUE   PE            Sets the variant caller to filter reads with single score below
                                             QVCutoffSingle in PE mode (YES | NO). Default NO.
                                             Example: --singleScoreForPE YES
    --sortKeepAllReads         SE, PE        Generate an archive BAM file. Keep all purity-filtered,
                                             duplicate, and unmapped reads in the build. These reads
                                             will be ignored during variant calling.
                                             Example: --sortKeepAllReads
    --toNMScore NUMBER         SE, PE        Minimum SE alignment score to put a read to NM. Default -1
                                             (-1 means the option is turned off).
    --ignoreUnanchored         PE            Ignore unanchored read pairs in indel assembly and variant
                                             calling. Unanchored read pairs have a single-read alignment
                                             score of 0 for both reads.
                                             Example: --ignoreUnanchored

Options for Target assembleIndels

The options described below are used to specify analysis for target assembleIndels.

Table 23 Options
228. will result in longer run time.

By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY.

Orphan Alignment

ELANDv2e performs orphan alignment by identifying read pairs for which only one of the reads aligns. ELANDv2e tries to align the other read in a defined window (by default 450 bp). If the number of mismatches is <10% of the read length, ELANDv2e reports the alignment.

Variant Detection and Counting

During variant detection and counting, CASAVA generates a CASAVA build, which is a post-sequencing analysis of data from reads aligned to a reference genome by configureAlignment.

The CASAVA build process is divided into several modules (or targets), each of which completes a major portion of the post-alignment analysis pipeline.

1 The first module ("sort") bins aligned reads into separate regions of the reference genome, sorts these reads by alignment position, optionally removes PCR duplicates (for paired-end reads), and finally converts these reads into BAM format.

2 In a paired-end analysis, the next module ("assembleIndels") is used to search for clusters of poorly aligned and anomalous reads. These clusters of reads are de novo assembled into contigs, which are aligned back to the reference to produce candidate indels.

3 Subsequently, the "callSmallVariants" module uses the sorted BAM files and the candidate indels predicted by the assembleIndels mod
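The acceptance rule from the Orphan Alignment section of snippet 228 can be written as a one-line test. This is an illustrative sketch only, not ELANDv2e code: the function name is hypothetical, and it assumes the threshold is a strict "fewer than 10% of the read length" comparison. It uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
# Hypothetical illustration of the orphan alignment acceptance rule.
# Assumption: an alignment is reported when mismatches < 10% of read length.
def accept_orphan_alignment(mismatches: int, read_length: int, max_percent: int = 10) -> bool:
    """Return True if the mismatch count is below max_percent of the read length."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if mismatches < 0:
        raise ValueError("mismatches cannot be negative")
    # Compare mismatches / read_length < max_percent / 100 without using floats.
    return mismatches * 100 < max_percent * read_length


if __name__ == "__main__":
    print(accept_orphan_alignment(9, 100))   # 9 mismatches in a 100-base read
    print(accept_orphan_alignment(10, 100))  # exactly 10% is not below the threshold
```

The 450 bp search window is not modeled here. The sketch covers only the mismatch test applied once a candidate alignment inside that window has been found.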
229. with the sequence given the deleted reference sequence.

Single Read Alignment Score: Alignment score of a single read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. -1 for orphan reads.

Paired Read Alignment Score: Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. Note that in single-ended analyses it is always blank.

Partner Chromosome: Name of the chromosome if the read is paired and its partner aligns to another chromosome.

Partner Contig:
    Not blank if read is paired and its partner aligns to another chromosome and that partner is split into contigs.
    Blank for single read analysis.

Partner Offset:
    If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner.
    If partner is an orphan read, this value is 0.
    If partner aligns to a different chromosome and/or contig, the number represents the absolute position of the partner.
    Blank for single read analysis unless the record belongs to a part of a spliced RNA read.

21 Partner Strand: To which strand did the partner of the paired read align? "F" for forward, "R" for reverse, "N" if no match found, blank for single read an
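The Partner Offset rules in snippet 229 amount to a small decision. The sketch below shows one way to resolve a partner's alignment position from already-parsed export fields. The function and parameter names are hypothetical, and it assumes that a non-blank Partner Chromosome or Partner Contig field signals that the partner lies on a different chromosome or contig.

```python
# Hypothetical helper for interpreting the Partner Offset field of an export record.
# Assumption: blank partner chromosome and contig fields mean the partner aligned
# to the same chromosome and contig as the read itself.
def partner_position(
    match_position: int,
    partner_offset: int,
    partner_chromosome: str = "",
    partner_contig: str = "",
) -> int:
    """Return the alignment position of the partner read."""
    if partner_chromosome or partner_contig:
        # Different chromosome and/or contig: the offset is an absolute position.
        return partner_offset
    # Same chromosome and contig: the offset is relative to the Match Position.
    return match_position + partner_offset


if __name__ == "__main__":
    print(partner_position(1000, 250))                             # relative offset
    print(partner_position(1000, 5000, partner_chromosome="chr2")) # absolute position
```

An offset of 0 with blank partner fields returns the read's own Match Position, which corresponds to the orphan case described in the field definition.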
230. y reads whose alignment is much worse than expected given its quality.

Any "orphan" reads not thought to be due just to poor base quality.

Reads from read pairs mapped anomalously. The expected relative orientation of read partners and the insert size statistics required to detect the anomalies are read per lane from the s_*_pair.xml files produced during the alignment phase by alignmentResolver. An anomalously large insert size is defined as 3 standard deviations above the median, an anomalously small one as 5 standard deviations below the median. Two types of anomalous mapping are used:
    Insert size anomalously large: Possible deletion
    Insert size anomalously small: Possible insertion

IndelFinder tries to exclude reads for which the bad or non-existent alignment is just a consequence of poor base quality.

AlignCandidates: The component AlignCandidates does a dynamic programming alignment of each orphan read, looking in the interval within which it is expected to sit. It takes the output of IndelFinder and does a localized alignment of each read. If this procedure finds an alignment for a read where none existed previously, or finds a better alignment than the existing one, then the previous alignment is replaced.

ClusterFinder: This takes the output of AlignCandidates (a list of orphan and badly aligning reads) and tries to group them in clusters of reads that are thought to have been caused by the same indel, based on genomic location. Cluste
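The insert size thresholds given in snippet 230 can be expressed as a simple classifier. This is a sketch under stated assumptions rather than the IndelFinder implementation: the function name and return labels are hypothetical, the median and standard deviation are taken as already-known inputs, and the comparisons are assumed to be strict.

```python
# Hypothetical classifier for the anomalous insert size rule.
# Assumptions: thresholds are strict, and the median and standard deviation
# have already been obtained from the per-lane pair statistics.
def classify_insert_size(
    insert_size: float,
    median: float,
    std_dev: float,
    large_sds: float = 3.0,
    small_sds: float = 5.0,
) -> str:
    """Label a read pair by comparing its insert size with the lane statistics."""
    if std_dev < 0:
        raise ValueError("std_dev cannot be negative")
    if insert_size > median + large_sds * std_dev:
        return "possible deletion"   # insert size anomalously large
    if insert_size < median - small_sds * std_dev:
        return "possible insertion"  # insert size anomalously small
    return "normal"


if __name__ == "__main__":
    # With a median of 300 and a standard deviation of 20, the normal range
    # runs from 200 (300 - 5 * 20) to 360 (300 + 3 * 20).
    print(classify_insert_size(400, 300, 20))
    print(classify_insert_size(150, 300, 20))
```

The thresholds are asymmetric: a pair needs to be only 3 standard deviations above the median to suggest a deletion, but 5 below it to suggest an insertion.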
    