
Trans-ABySS v1.3.2: User Manual

1. deletion, inversion, or duplication.

breakpoint: Junction breakpoint. Format: chrA coordinate1 chrB coordinate2.

genes: Format: GeneA GeneB, each marked with the relative orientation of the contig alignment to the gene strand, i.e. whether the contig aligns in the same direction as the gene strand or in the opposite direction.

alignment_params: Alignment details, mainly for debugging purposes. Format: TO, CO, CC, I1, I2, AF1, AF2, where
  TO = target overlap fraction: overlap(target_region1, target_region2) / total_target_region_length
  CO = contig overlap fraction: overlap(query_region1, query_region2) / total_query_region_length
  CC = contig coverage: (match_length1 + match_length2 - overlap) / query length
  I1 = percent identity of alignment 1
  I2 = percent identity of alignment 2
  AF1 = alignment fraction of alignment 1: match_length1 / query_length
  AF2 = alignment fraction of alignment 2

type: Can be gene_fusion (a gene resides in both genomic regions) or lsr (large-scale rearrangement: any event that is not a gene_fusion).

4.2 SNV/INDEL (snv_caller.py)

Output files: snv.txt, snv_filtered.txt, snv_filtered_novel.txt, snv_exons.txt, snv_exons_novel.txt, snv.gff, snv_filtered.gff, LOG

snv.txt: Unfiltered SNV/indel events captured by gapped contig alignments.
snv_filtered.txt: Filtered events: <event_reads> > min_reads_contigs (defa
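The alignment_params fractions that depend only on contig (query) coordinates can be illustrated with a short Python sketch. It assumes that CC is (match_length1 + match_length2 - overlap) / query_length and that AF2 mirrors AF1. The function name and the example numbers are hypothetical and are not taken from fusions.py.

```python
def alignment_params(match_length1, match_length2, overlap, query_length):
    """Return (CC, AF1, AF2) for a contig split into two alignments.

    Assumed definitions (a reading of the field descriptions, not TA code):
      CC  = (match_length1 + match_length2 - overlap) / query_length
      AF1 = match_length1 / query_length
      AF2 = match_length2 / query_length
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    cc = (match_length1 + match_length2 - overlap) / query_length
    af1 = match_length1 / query_length
    af2 = match_length2 / query_length
    return cc, af1, af2


# Hypothetical 1000 bp contig: two alignments of 600 bp and 450 bp
# that share 50 bp of the contig.
cc, af1, af2 = alignment_params(600, 450, 50, 1000)
print(cc, af1, af2)
```

A CC close to 1 would mean that the two alignments together account for nearly the whole contig.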
2. <type> = ins, <chr_end> = <chr_start>. If <type> = del, <chr_end> = last base of deletion.

length: Length (size) of event.
ref: Reference allele. If <type> = ins, <ref> = na.
alt: Alternative allele. If <type> = del, <alt> = na.
event_reads: Total number of reads spanning event, from reads-to-contig alignment.
contig_reads: Number of reads spanning event in contig, from reads-to-contig alignment.
genome_reads: Total number of reads spanning event, from reads-to-genome alignment.
gene: Gene in affected locus. Format: gene, transcript, intron/exon number, effect on open reading frame (see below). If the event spans more than 1 exon/intron, the output becomes geneA, transcriptA, intron/exon numberA, geneB, transcriptB, intron/exon numberB, effect on open reading frame.
repeat_length: Length of repeat in alternative allele, e.g. AAAA = 4, CAGCAG = 2.
ctg_strand: Query strand of alignment in relation to reference.
from_end: Distance (bases) from event to end of contig.
confirm_contig_region: Contig coordinate range (start, end) used for checking for event existence in reads-to-contig alignments.
within_simple_repeats: Overlap with simple repeats. The name of the tandem repeat is reported if overlap is True, e.g. TRF_SimpleTandemRepeat_CATC; a placeholder is reported if overlap is False.
repeatmasker: Overlap with RepeatMasker an
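The two repeat-length examples (AAAA = 4, CAGCAG = 2) are consistent with counting copies of the shortest repeating unit in the alternative allele. The Python sketch below implements that reading. It is an assumption about what the field means, not the snv_caller.py implementation, and the function name is hypothetical.

```python
def repeat_length(allele):
    """Count copies of the shortest unit that tiles the whole allele.

    Assumption: "length of repeat" means the number of unit copies,
    which matches the two examples given (AAAA = 4, CAGCAG = 2).
    """
    n = len(allele)
    for unit_len in range(1, n + 1):
        # A unit can only tile the allele if its length divides n.
        if n % unit_len == 0 and allele[:unit_len] * (n // unit_len) == allele:
            return n // unit_len
    return 0  # empty allele


print(repeat_length("AAAA"))    # 4
print(repeat_length("CAGCAG"))  # 2
```

An allele with no internal repetition, such as ACGT, yields 1 under this definition.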
3. Transcriptome Assemblies with ABySS

1.1 Installing ABySS

The input to Trans-ABySS (TA) is one or more ABySS (1.3.2 or above) assemblies. This section describes only one of the many ways to generate transcriptome assemblies with ABySS 1.3.2. ABySS can be compiled as described in its README. Should you run into any difficulties in compiling or running ABySS, please contact the ABySS Google Group: abyss-users@googlegroups.com

1.2 Choosing k-mer values for your assemblies

Transcriptome (RNAseq) samples are composed of transcripts with a wide range of expression levels. Because transcripts tend to be reconstructed with varying degrees of completeness at different k-mer values, TA generates assemblies over a range of k-mer values and then merges the different k assemblies into a single meta-assembly. The choice of k-mer sizes depends on the read length of the RNAseq library. We suggest the following k-mer sizes for the given read lengths:

read length (bp)    k-mer sizes                      total number of assemblies
50                  26, 28, 30, ..., 46, 48, 50      13
75                  38, 40, 42, ..., 70, 72, 74      19
100                 52, 54, 56, ..., 92, 94, 96      23

Please note that the above is just a guideline based on our experience with the given read lengths. The choice is a compromise between performance (level of reconstruction) and practicality (time, computing resources, etc.). Users can experiment with t
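The suggested k-mer ranges in the table step by 2 between the listed end points, which is what the assembly counts (13, 19, 23) imply. A small Python sketch can expand them into explicit lists. The dictionary and function names are hypothetical, and only the three read lengths from the table are covered.

```python
# End points of the suggested k-mer ranges, keyed by read length (bp).
SUGGESTED_K_RANGES = {50: (26, 50), 75: (38, 74), 100: (52, 96)}


def suggested_kmers(read_length):
    """Expand the suggested range for a read length into a list of k values."""
    start, end = SUGGESTED_K_RANGES[read_length]
    return list(range(start, end + 1, 2))  # even k values, inclusive of end


for read_length in sorted(SUGGESTED_K_RANGES):
    ks = suggested_kmers(read_length)
    print(read_length, len(ks), ks[0], ks[-1])
```

Each list length equals the "total number of assemblies" column, which is a quick consistency check on the table.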
4. contains a topdir attribute, which specifies the directory under which the output of each library belonging to the project is stored. The path for topdir must exist before running TA for a library under the particular project. Projects can have their own running parameters, which are applied to all libraries belonging to the same project. Here is an example project:

    brain_tumor
    model_matcher.py mem 15G
    model_matcher.py cmd TRACK GENOME 1 d o OUTDIR f PATH anchor LIB final fa r C PATH reads_to_contigs LIB contigs bam
    contact helloworld@email.com
    topdir working directory for this project
    reference hg19

The reference attribute specifies the name of the reference genome and is required if any analysis is to be performed on the library. If your project does not have a reference genome, define reference as none; in that case only stages 0, 2 and 3 can be run. Other attributes are only needed if the default settings need to be overridden for the project. The mem and cmd postfixes distinguish what to override for the project.

b. Setting up model_matcher.cfg

This file specifies the gene models that are used by the module model_matcher.py for contig-transcript mapping. The file is organized into sections, where each section represents a reference genome. The gene model files referenced here are expected to be present in the annotations folder. See Section 2.5 (annotations) for instructions to download ann
5. > min_read_pairs (default: 4) AND < max_read_pairs (default: 2000); (2) <spanning_reads> > min_span_reads (default: 2).

Local events (local.tsv): events where (1) alignment target regions overlap, or (2) alignment target regions overlap the same gene, or (3) transcripts mapped by target regions overlap.

LOG: Run log recording the command run and the parameters used.

Content of fusions_filtered.tsv

Field: Description
id: Event ID. Each line represents an event captured by an individual contig. Identical events are linked by the first number of <id>. Example: 2.1, 2.2, 2.3 represent the same event captured by 3 different contigs. Events are grouped by <rearrangement> and <breakpoint>.
contig: Contig ID.
contig_size: Size (length) of <contig>.
genomic_regions: The 2 genomic regions the contig aligns to. Format: chromosomeA start1 end1 chromosomeB start2 end2.
contig_regions: The corresponding contig coordinates of the 2 genomic regions. Format: start1 end1 start2 end2 (regions in the same order as genomic_regions).
strands: Relative orientation of the 2 alignments in relation to the genome.
read_pairs: Number of read pairs flanking the junction, from genome alignment of reads.
spanning_reads: Number of reads spanning the junction, from contig alignment of reads.
rearrangement: Underlying genome rearrangement deduced from relative contig alignment orientations. Can be translocation
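The read-support thresholds at the start of this excerpt (min_read_pairs 4, max_read_pairs 2000, min_span_reads 2) can be sketched as a simple predicate in Python. The function name is hypothetical. Whether the bounds are inclusive cannot be recovered from the text, so the sketch assumes inclusive bounds and says so in the code.

```python
def passes_fusion_filter(read_pairs, spanning_reads,
                         min_read_pairs=4, max_read_pairs=2000,
                         min_span_reads=2):
    """Return True if an event meets both read-support criteria.

    Assumption: bounds are treated as inclusive; the manual text does not
    make clear whether the comparisons are strict.
    """
    enough_pairs = min_read_pairs <= read_pairs <= max_read_pairs
    enough_spanning = spanning_reads >= min_span_reads
    return enough_pairs and enough_spanning


print(passes_fusion_filter(read_pairs=10, spanning_reads=3))    # True
print(passes_fusion_filter(read_pairs=3, spanning_reads=5))     # False
print(passes_fusion_filter(read_pairs=2500, spanning_reads=5))  # False
```

The upper bound on read pairs presumably guards against junctions in very highly covered or repetitive regions, although the manual does not state the rationale.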
6. novel_exon, novel_intron, novel_transcript, and novel_utr.
orf: Effect on open reading frame. See below.
spanning_reads: Number of reads spanning the novel junction, gathered from reads-to-contig alignments.
contig_coverage: Number of reads spanning the novel block. If the size of the novel block is small, <contig_coverage> will be equal to <spanning_reads>.
contig_neighbor: Number of reads spanning blocks/junctions immediately upstream and downstream. This is to inform relative expression levels.
read_support: "passed" if the following holds:
  skipped_exon, novel_intron: <spanning_reads> > minimum_spanning_reads
  AS5, AS3, AS53, novel_exon, novel_utr, retained_intron: <contig_coverage> > minimum_spanning_reads AND <contig_neighbor> relative to <contig_coverage> > maximum_coverage_differential
  novel_transcript: <spanning_reads> of each of the novel junctions > minimum_spanning_reads
filter: "passed" if read support passed and the following holds:
  AS5, AS3, AS53, novel_exon, novel_utr, novel_intron, novel_transcript: surrounding splice sequences are canonical splice sites
  retained_intron: <multi_3> is True

Contents of coverage.txt

Field: Description
transcript: Transcript name.
gene: Gene name.
total_coverage: Number of exonic bases covered by contig.
transcript_length: Length of <transcript>.
best_contig: Best contig mapped to <transcript> in terms of bases covered.
best_contig_coverage: Coverage of <transcript> by <best_cont
7. to BWA. http://bowtie-bio.sourceforge.net/index.shtml

Pysam, 0.1.2 or above (R): Samtools Python API for extracting read support in analysis modules. http://code.google.com/p/pysam/downloads/list
Samtools, 0.1.18 (R): View/create BAM files.
ABySS, 1.3.2 or above (R): Assembly and stage 0 of TA. http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/1.3.2
xa2multi.pl (R): Convert secondary alignments kept in the XA tag into individual records. http://sourceforge.net/projects/bio-bwa/files
Blat (R): Align contigs to reference genome. http://users.soe.ucsc.edu/~kent/src
GMAP (O): Align contigs to reference genome; an alternative to Blat. http://research-pub.gene.com/gmap
Python, 2.6 (R): http://www.python.org/getit/releases/2.6
Perl, 5.8 (R): For running Perl wrappers. http://www.perl.org/get.html
mqsub (R): Provided in bin in TA v1.3.2; used for cluster jobs.
Anchor, 0.3.1 or above (O): Can also be used for correcting erroneous SNVs and indels. http://www.bcgsc.ca/platform/bioinfo/software/anchor/releases/0.3.1

We recommend that users put the executables (if any exist) of the above software inside TA's bin directory and include the path of the bin directory in the PATH variable in the setup file.

TA stage 0 (FEM) requires two special-purpose modules from ABySS 1.3.2: abyss-filtergraph and abyss-junction. These two modules are compiled when ABySS is compiled, but they are not installed by default. Therefore, you must copy the executab
8. 1 3 2 merge lt LIBRARY gt contigs fa output files lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 merge cluster lt LIBRARY gt contigs output seq psl If you used the scrub and or anchor features of Anchor please make sure that you use the output FASTA file from ANCHOR for aligning contigs to the reference genome ii Filter the BLAT alignments and generate a UCSC custom track input files lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 merge cluster lt LIBRARY gt contigs output seq psl output files lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 tracks lt LIBRARY gt merge contigs best unique m90 blat psl 3 6 Stage 5 This step finds novel transcript splicing events input files lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 tracks lt LIBRARY gt merge contigs best unique m90 blat psl lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 reads_to_contigs lt LIBRARY gt contigs bam output files Please refer to Section 4 3 Novel Splicing model_matcher py 18 3 7 Stage 6 This step finds candidate gene fusions and large structural rearrangements input files lt TOPDIR gt lt LIBRARY gt Reads_to_genome bam lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 merge cluster lt LIBRARY gt contigs output seq psl lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 reads_to_contigs lt LIBRARY gt contigs bam lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 reads_to
9. IBRARY>-anchor.bam.bai

Should you want to use other features in Anchor (i.e. scrub and anchor), please modify the following script to suit your needs: <TA_DIR>/wrappers/setup.pl (sub finalize).

If you do not need Anchor, you may align your reads to contigs with BWA and generate a BAM file for the alignments. Make sure your output BAM file is called <LIBRARY>-anchor.bam and is placed in the anchor directory so that you can carry on with stage 3. Alternatively, you may align your reads to contigs with Bowtie and skip stage 3. You must, however, name your output BAM file <LIBRARY>-contigs.bam and put the BAM file and its index in the reads_to_contigs directory.

3.4 Stage 3

This step runs xa2multi.pl, a script from BWA that converts any records with XA tags to multiple lines of records, on the output BAM file from stage 2.

input files:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/<LIBRARY>-anchor.bam
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/<LIBRARY>-anchor.bam.bai
output files:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/reads_to_contigs/<LIBRARY>-contigs.bam
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/reads_to_contigs/<LIBRARY>-contigs.bam.bai

3.5 Stage 4

This step performs two tasks:

i. Aligns contigs in the meta-assembly from stage 0 to the reference genome with BLAT.

input files: <TOPDIR>/<LIBRARY>/Assembly/abyss
10. OPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-b.fa

ABySS single-end contigs (<LIBRARY>-3.fa, <LIBRARY>-4.fa, <LIBRARY>-5.fa) are extended with the following criteria:
  - Extends 1-in-1-out contigs which have read pair support between flanking contigs.
  - The number of read pairs required is the value of the n parameter from the ABySS assembly.
  - Excludes 1-in-1-out contigs used in the final stage of assembly.
ABySS indel bubbles (<LIBRARY>-indel.fa) are extended as long as there is no ambiguity in adjacency.

iii. Combining output from (i) and (ii)

input files:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-f.fa
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-j.fa
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-b.fa
output files:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-contigs.fa

iv. Merge assemblies from (iii)

input files:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-contigs.fa
output files:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/merge/<LIBRARY>-contigs.fa

The filtered and extended assemblies (iii) are merged into a single meta-assembly (iv), where a contig is removed if it has an exact full-length match to its sequence in an assembly of smaller k-mer size.

The four parts of FEM happen within a single cluster job. The cluster job is config
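The merge rule in step iv can be illustrated with a naive Python sketch. It assumes that "an exact full-length match" means the whole contig, or its reverse complement, occurs as a substring of some contig from a smaller-k assembly. Both that interpretation and the function names are assumptions; the actual TA merge step is not reproduced here, and this quadratic scan would be far too slow for real assemblies.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def merge_assemblies(assemblies):
    """Merge {k: [contig sequences]} into one list of (k, sequence).

    A contig is dropped if it (or its reverse complement) is fully
    contained in a contig from any assembly with a smaller k.
    """
    merged = []
    smaller_k_contigs = []
    for k in sorted(assemblies):
        current = assemblies[k]
        for seq in current:
            rc = reverse_complement(seq)
            if any(seq in s or rc in s for s in smaller_k_contigs):
                continue  # already represented at a smaller k
            merged.append((k, seq))
        smaller_k_contigs.extend(current)
    return merged


example = {26: ["ACGTACGTTT"], 28: ["CGTACG", "GGGGCCCCAA"]}
print(merge_assemblies(example))
```

In the example, the k28 contig CGTACG is contained in the k26 contig and is removed, while the other k28 contig is kept.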
11. RY>/
    Reads_to_genome/
    Assembly/
        current -> abyss-1.3.2
        abyss-1.3.2/
            fusions/
            k{k1}/ ... k{kn}/
            merge/
                cluster/
            novelty/
            reads_to_contigs/
            snv/
            source -> <ASSEMBLY>
            tracks/

b. Prepare reads for alignments to contigs

command: <TA_DIR>/wrappers/setup.pl <INPUT> get_reads cluster <HEAD_NODE>

This part uses Anchor's setup.py to set up the directory for running Anchor and to convert the reads files if necessary. You can skip this part and stage 2 completely if you want to align reads to contigs the way you prefer.

The directory and input files for Anchor are set up like so:

    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/
        reads/
        <LIBRARY>.in
        <LIBRARY>-contigs.fa -> merge/<LIBRARY>-contigs.fa

The anchor directory and its contents are created in this step. If the input reads are bam or fq.gz files, then the reads directory is not created under the anchor directory, and anchor/<LIBRARY>.in is a symlink to source/<LIBRARY>.in. Otherwise, cluster jobs are submitted to convert/compress the reads files to fq.gz with the ABySS utility abyss-tofastq.

If you did not follow our aforementioned method to generate your paired-end assemblies with ABySS, this part of stage 0 might not work for you. In this case, you must create anchor/<LIBRARY>.in yourself. The format of anchor/<LIBRARY>.in uses the same form
12. Trans-ABySS v1.3.2: User Manual
February 2012

Prepared by: Readman Chiu, Ka Ming Nip
Contact: rchiu@bcgsc.ca, kmnip@bcgsc.ca
On behalf of: Tony Raymond, Shaun Jackman, Karen Mungall, Inanc Birol

Canada's Michael Smith Genome Sciences Centre
BC Cancer Agency
Vancouver BC, Canada V5Z 4S6

Table of Contents
1 Generating Transcriptome Assemblies with ABySS
    1.1 Installing ABySS
    1.2 Choosing k-mer values for your assemblies
    1.3 Running ABySS
2 Installing Trans-ABySS
    2.1 bin
    2.2 setup
    2.3 configs
    2.4 input
    2.5 annotations
    2.6 analysis and utilities
    2.7 sample_output
3 Running Trans-ABySS
    3.1 Stage 0
    3.2 Stage 1
    3.3 Stage 2
    3.4 Stage 3
    3.5 Stage 4
    3.6 Stage 5
    3.7 Stage 6
    3.8 Stage 7
    3.9 Stage 8
4 Analysis Output
    4.1 Fusion (fusions.py)
    4.2 SNV/INDEL (snv_caller.py)
    4.3 Novel Splicing (model_matcher.py)
    4.4 Gene coverage
5 Miscellaneous
    5.1 Open Reading Frame Effect Descriptors
6 Technical Support

1 Generating
13. _contigs lt LIBRARY gt contigs bam bai output files Please refer to Section 4 1 Fusion fusions py 3 8 Stage 7 This step finds candidate single nucleotide variants insertions and deletions input files lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 merge cluster lt LIBRARY gt contigs output seq psl lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 merge cluster lt LIBRARY gt contigs input seq fa lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 reads_to_contigs lt LIBRARY gt contigs bam lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 reads_to_contigs lt LIBRARY gt contigs bam bai output files Please refer to Section 4 2 SNV INDEL snv_caller py 3 9 Stage 8 This step reports the coverage for each gene from the reference genome that was detected input files lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 novelty coverage txt lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 tracks lt LIBRARY gt merge contigs best unique m90 blat psl lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 reads_to_contigs lt LIBRARY gt contigs bam output file lt TOPDIR gt lt LIBRARY gt Assembly abyss 1 3 2 novelty gene_coverage txt 19 4 Analysis Output 4 1 Fusion fusions py Output Description fusions tsv fusions_filtered tsv local tsv LOG Unfiltered fusion events captured by split contig alignments Filtered events where 1 lt num_read_pairs gt
14. at described in Section 1.3 (Running ABySS), except that the reads must be in either bam or fq.gz files.

c. Filter assemblies, extend contigs, merge assemblies (FEM)

command: <TA_DIR>/wrappers/setup.pl <INPUT> fem cluster <HEAD_NODE>

TA takes the output of the multiple-k ABySS assemblies and performs FEM to generate a single meta-assembly for analysis.

i. Filter assemblies

input files:
    <ASSEMBLY>/k*/<LIBRARY>-contigs.fa
    <ASSEMBLY>/k*/<LIBRARY>-5.adj
    <ASSEMBLY>/k*/<LIBRARY>-5.path
output:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-f.fa

The input assembly (<LIBRARY>-contigs.fa) is filtered with the following criteria:
  - Removes all contigs less than 2k-1 in size.
  - Removes island contigs (contigs that have no neighbors in the ABySS adjacency graph) less than or equal to 150 bp in size.

ii. Extend contigs and indel bubbles

input files:
    <ASSEMBLY>/k*/<LIBRARY>-1.fa
    <ASSEMBLY>/k*/<LIBRARY>-2.adj
    <ASSEMBLY>/k*/<LIBRARY>-3.fa
    <ASSEMBLY>/k*/<LIBRARY>-4.fa
    <ASSEMBLY>/k*/<LIBRARY>-5.fa
    <ASSEMBLY>/k*/<LIBRARY>-5.dist
    <ASSEMBLY>/k*/<LIBRARY>-5.adj
    <ASSEMBLY>/k*/<LIBRARY>-5.path
    <ASSEMBLY>/k*/<LIBRARY>-indel.fa
output:
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/k*/<LIBRARY>-j.fa
    <T
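The two filtering criteria in part (i) can be sketched in Python as follows. This is a minimal illustration that assumes the length cutoff is 2k-1 and that adjacency is available as a plain dictionary; the function name and data layout are hypothetical and do not reflect how TA reads the ABySS .adj files.

```python
def filter_contigs(contigs, k, neighbors, island_max=150):
    """Apply the two FEM filter criteria to {contig_id: sequence}.

    neighbors maps contig_id -> collection of adjacent contig ids
    (a simplified stand-in for the ABySS adjacency graph).
    """
    min_len = 2 * k - 1  # assumed reading of the size cutoff
    kept = {}
    for cid, seq in contigs.items():
        if len(seq) < min_len:
            continue  # too short for this k
        is_island = not neighbors.get(cid)
        if is_island and len(seq) <= island_max:
            continue  # short contig with no neighbors in the graph
        kept[cid] = seq
    return kept


contigs = {"a": "A" * 50, "b": "C" * 100, "c": "G" * 100, "d": "T" * 200}
neighbors = {"c": ["d"]}
print(sorted(filter_contigs(contigs, k=26, neighbors=neighbors)))
```

With k = 26 the cutoff is 51 bp, so contig a is dropped for length, b is dropped as a short island, and c and d are kept.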
15. cluster for any practical performance. The cluster job shell scripts created by TA are intended for the Sun Grid Engine, version 6.2u5. Users might need to modify the relevant modules (i.e. wrappers/analyze.pl, wrappers/setup.pl, utilities/submitjobs.sh) for each stage of TA to fit their cluster environment. Experience in programming in Perl and Python and in submitting jobs to your cluster would be of great value. Please read Section 3 (Running Trans-ABySS) for more details on running TA.

To ensure seamless submission of cluster jobs, please set up automatic login to your cluster head node. Ask your system administrator for help, or search online for "SSH login without password".

2.3 configs

a. Setting up transcriptome.cfg

The configuration file transcriptome.cfg specifies how the different steps of the TA pipeline are run. It contains the following major sections:

commands: Contains the default command options for running each script.
memory: Contains the default memory request for cluster jobs.
genome: Contains the paths to your reference genomes on the cluster.
contact: Contains the default email address to contact if cluster jobs fail.

Projects and libraries

TA processes data on a per-library basis. Libraries are grouped under projects; each library must belong to a single project. Projects are configured as individual sections under the above major sections in transcriptome.cfg. Each project section
16. epresents an event captured by an individual contig. Identical events are linked by the first number of <id>. Example: 2.1, 2.2, 2.3 represent the same event captured by 3 different contigs. Events are grouped by <type> and <coord>.

type: Event type. Can be:
  AS3: novel 3' splice site
  AS5: novel 5' splice site
  AS53: novel 5' and 3' splice site on the same alignment block
  novel_exon: novel exon
  novel_intron: novel intron
  novel_transcript: novel transcript, when the contig cannot be mapped to any known transcript
  novel_utr: novel UTR, when novel alignment blocks exist beyond the annotated 5' and 3' exons of the mapped transcript
  retained_intron: retained intron
  skipped_exon: skipped exon
transcript: Transcript ID.
gene: Gene name.
exons: Exon number(s) relative to transcript strand, starting from 1.
align_blocks: Alignment block numbers, counted in ascending order of coordinate, starting from 1. Will be multiple values for skipped_exon, novel_intron, novel_utr and novel_transcript.
coord: Coordinate of novel block. Format: chromosome start end.
splice: Splice site sequence surrounding the novel junction, e.g. GT ag U2/U12, where U2/U12 is the name of the splice motif.
multi_3: Only applicable to retained_intron events. True if the size of the retained intron is a multiple of 3, i.e. retained open reading frame.
size: Size of novel block. Only applicable to AS53
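Because identical events share the first number of their id, rows can be regrouped by that prefix when post-processing the output. The Python sketch below assumes ids are written with a dot separator, as in 2.1, 2.2, 2.3. The function name is hypothetical.

```python
from collections import defaultdict


def group_events(event_ids):
    """Group event ids such as '2.1' by the number before the first dot."""
    groups = defaultdict(list)
    for event_id in event_ids:
        event_number = event_id.split(".", 1)[0]
        groups[event_number].append(event_id)
    return dict(groups)


print(group_events(["2.1", "2.2", "2.3", "5.1"]))
```

Each key then corresponds to one event, and the length of its list is the number of contigs that captured it.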
17. hat these files exist in each k directory:

    $name-1.fa
    $name-2.adj
    $name-3.dist
    $name-3.fa
    $name-4.fa (can be empty)
    $name-5.fa (can be empty)
    $name-5.adj
    $name-5.path
    $name-6.fa
    $name-contigs.fa (a symbolic link to $name-6.fa)
    $name-indel.fa

If there are no missing output files or error messages, you have generated the ABySS assemblies needed for TA.

If you decided to generate the assemblies using your own methods or pipelines, then you must rename any files and/or construct a directory that looks like so:

    assemblies_parent_directory/
        $name.in
        k26/
            $name-1.fa
            $name-2.adj
            $name-3.dist
            $name-3.fa
            $name-4.fa
            $name-5.fa
            $name-5.adj
            $name-5.path
            $name-6.fa
            $name-contigs.fa
            $name-indel.fa
        k28/
        ...
        k48/
        k50/

Note: $name is the name of your library.

2 Installing Trans-ABySS

Upon extracting the TA package, you should see the following directories/files:

    bin  setup  configs  input  annotations  utilities  analysis  sample_output

Note: From now on, the directory containing the above files/directories is denoted as <TA_DIR>.

2.1 bin

TA requires the following external software packages for various purposes. (R) = required, (O) = optional.

BWA, 0.5.9-r16 (R): Align reads to contigs. http://sourceforge.net/projects/bio-bwa/files
Bowtie (O): Align reads to contigs; an alternative
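The per-k file checklist above lends itself to an automated check before running TA. The Python sketch below assumes the ABySS naming pattern $name-<suffix> and k directories named k26, k28, and so on. The function and constant names are hypothetical and are not part of TA.

```python
from pathlib import Path

# Suffixes of the ABySS output files expected in every k directory.
REQUIRED_SUFFIXES = [
    "-1.fa", "-2.adj", "-3.dist", "-3.fa", "-4.fa", "-5.fa",
    "-5.adj", "-5.path", "-6.fa", "-contigs.fa", "-indel.fa",
]


def missing_abyss_outputs(parent_dir, name):
    """Return paths of expected files that are absent under parent_dir/k*/."""
    missing = []
    for kdir in sorted(Path(parent_dir).glob("k*")):
        if not kdir.is_dir():
            continue
        for suffix in REQUIRED_SUFFIXES:
            expected = kdir / (name + suffix)
            if not expected.exists():
                missing.append(str(expected))
    return missing


if __name__ == "__main__":
    # "assemblies" and "mylib" are placeholder values for illustration.
    for path in missing_abyss_outputs("assemblies", "mylib"):
        print("missing:", path)
```

An empty result only shows that the files are present; the assembly logs still need to be checked for errors separately.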
18. he output for each library
<LIBRARY>: name of the library (also known as $name in the previous section)
<ASSEMBLY>: path to the directory containing the ABySS multi-k-mer assemblies
<STAGE>: the stage number

Before running any scripts from TA, you need to set up your environment with this command:

    source <TA_DIR>/setup

To see the usage and available parameters for TA:

    <TA_DIR>/wrappers/trans-abyss.sh h

To run each stage on the cluster with the TA wrapper:

    <TA_DIR>/wrappers/trans-abyss.sh c <HEAD_NODE> i <INPUT> A <STAGE>

Suppose <INPUT> contains multiple libraries and you want to run TA on one particular library called <LIBRARY>. You can run the wrapper like so:

    <TA_DIR>/wrappers/trans-abyss.sh c <HEAD_NODE> i <INPUT> <STAGE> l <LIBRARY>

Now suppose you want to run TA on <N> libraries, starting with the library called <LIBRARY> in <INPUT>. You can run the wrapper like so:

    <TA_DIR>/wrappers/trans-abyss.sh c <HEAD_NODE> i <INPUT> <STAGE> s <LIBRARY> n <N>

3.1 Stage 0

Three tasks are performed in this step.

a. Set up the working directories for TA

command: <TA_DIR>/wrappers/setup.pl <INPUT> make_dir cluster <HEAD_NODE>

TA will set up the directories and symbolic links like so:

    <TOPDIR>/<LIBRA
19. heir own k-mer sizes to suit their sample characteristics and computing resources.

1.3 Running ABySS

Before running ABySS to generate the assemblies, set up a directory to store your assemblies for each k-mer size:

    mkdir $name

where $name is the library name. In this newly created directory, make a text file called $name.in that lists the paths to the input reads, one read file per line. The input reads can be bam, fq.gz, fastq.gz, Illumina export/qseq files, or any other format that ABySS can read. This text file will also be required in TA Stage 0. Please read Section 3.1 (Stage 0) for more details. The input reads may need to be processed if you want to run Anchor in stage 2.

Since you will be generating paired-end assemblies, the list of reads files in $name.in should be ordered like so:

    <first pair read file 1>
    <first pair read file 2>
    <second pair read file 1>
    <second pair read file 2>
    <third pair read file 1>
    <third pair read file 2>

In addition, you must name your reads files like so:

    <any string>_[123]_[eqc]<any string>

The pairing between _1 and _2, or between _1 and _3, indicates that the two reads files contain mate-pair reads. The strings _e, _q and _c denote that the reads files are Illumina export files, Illumina qseq files, or concatenated Illumina qseq files, respectively. This naming requirement does not apply if you do not want to use Anchor. Please read Sectio
20. ig>
nbr_contigs: Number of contigs mapped to <transcript>.
contigs: List of contigs mapped to <transcript>.
contig_coverage: Total coverage of <transcript> by <contigs>.

Sample line of mapping.txt:

    <A> matches <B> <C> <D> model <E> wt <F> in <G> blocks total_blocks <H> total_exons <I> <J> coord <K> score <L> events <M> coverage <N>

Field: Description
A: Contig ID.
B: Transcript name.
C: Gene name.
D: CODING or NONCODING of transcript.
E: Gene model initial specified in model_matcher.cfg, e.g. e = Ensembl, r = Refseq.
F: Weight of gene model in matching (first model's wt = number of models used, second model's wt = number of models used - 1, etc.).
G: Number of alignment blocks mapped to exons.
H: Total number of blocks in alignment.
I: Total number of exons in transcript.
J: partial_match or full_match. full_match if both edges of internal alignment blocks match and internal edges of outermost blocks match; partial_match otherwise.
K: Coordinate of alignment.
L: Score: number of edges matched (AS5 and AS3 junctions are considered matched).
M: Number of novel splicing events.
N: Coverage of transcript.

4.4 Gene coverage

Field: Description
gene: Gene name.
nreads: Total number of reads covering gene.
total_read_length: Sum of length of reads covering gene.
union_aligned_block_length: Total le
21. les for these two modules into either TA's bin directory, ABySS's installation directory, or anywhere accessible by TA.

2.2 setup

The purpose of the setup file is to define all the environment variables needed by TA. To ensure that all the dependent software is accessible, a typical TA job begins with the Unix command:

    source <TA_DIR>/setup

The setup file from the download package looks like so:

    export TRANSABYSS_VERSION=1.3.2
    export ANCHOR_DIR=directory where anchor_pipeline.py is accessible
    export TRANSABYSS_PATH=directory where TA is installed
    export PERL5LIB=perl5 libraries
    export PYTHONPATH=python path
    export ABYSSPATH=directory containing abyss executables
    export LD_LIBRARY_PATH=shared libraries
    export PATH=$TRANSABYSS_PATH/bin:$ABYSSPATH:$LD_LIBRARY_PATH:$PYTHONPATH:$PATH

Users must update the setup file with all the pertinent file paths before attempting to run TA. The following environment variables must be defined because they are referenced in the wrapper scripts:

    TRANSABYSS_VERSION
    TRANSABYSS_PATH (note that this is actually <TA_DIR>)
    ANCHOR_DIR

However, ANCHOR_DIR is not needed if you decide not to use the Anchor package.

After you have installed the required software and configured your setup file, you can check the paths with this command:

    sh <TA_DIR>/check-prereq.sh

A few notes on cluster use

Because of the sheer volume of transcriptome data, TA assumes the use of a
22. lit up the dbSNP annotation by chromosome (split_dbsnp.sh):

    split_dbsnp.sh <TA_DIR>/annotations/<genome>/snp1xx.txt <TA_DIR>

The user is expected to have the single reference genome sequence FASTA file available on the cluster for contig alignments. For example, the reference genome hg19 can be downloaded from:

    ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/special_requests

After that, put the path to the downloaded reference FASTA file in configs/transcriptome.cfg under genomes, i.e.:

    genomes
    hg19 path to your hg19 fasta_file here

A <genome>.2bit version of the same genome sequence is expected to be present in the genome folder for quick random access to the reference sequence. A <genome>.2bit file can be generated with the utility faToTwoBit, available from http://users.soe.ucsc.edu/~kent/src

2.6 analysis and utilities

These folders contain the analysis modules written in Python.

2.7 sample_output

We have provided sample output files for our sample library. We encourage users to run TA on the sample library. This is a good exercise for getting familiar with the process of setting up your project and running TA. In addition, this exercise may help you check whether the required software has been installed properly; otherwise, it is very unlikely that you will get the same output files. Before you begin, make sure you have installed and se
23. mor
    L00002 1.3.2 abyss assembly L00002 brain tumor
    L00003 1.3.2 abyss assembly L00003 brain tumor

It is important that the assembly directories are set up as described in Section 1.3 (Running ABySS).

2.5 annotations

Analysis modules of TA require comparisons to a reference genome and gene annotation files. TA organizes annotation files by genome under the annotations folder, for example:

    annotations/
        hg19/
            genome.2bit
            splice_motifs.fa (copied from annotations/shared)
            (rest of annotation files)
        shared/
            splice_motifs.txt (provided)

TA mainly uses the annotation files available from the UCSC genome browser (ftp://hgdownload.cse.ucsc.edu/goldenPath/<genome>/database) for this purpose. A list of required files (<genome>_annot.txt) and a downloading script (<genome>_annot.sh), available for the genomes hg18, hg19 and mm9, are provided in the annotations folder for executing the downloads and running the following processing steps. This is an example of how to use the provided shell script to get the hg19 annotation files:

    cd <TA_DIR>/annotations
    hg19_annot.sh hg19 hg19_annot.txt hg19 <TA_DIR>

where the first hg19 is the destination folder and the second hg19 is the name of the genome. The script uses wget for downloading.

Note that a snp1xx.txt.gz is included in the file lists of all genomes. This dbSNP file is used to annotate the SNV/indel events detected. To speed up this annotation process, the user needs to run this command to sp
24. ngth of union of alignment blocks mapped to gene.
normalized_coverage: <total_read_length> / <union_aligned_block_length>

5 Miscellaneous

5.1 Open Reading Frame Effect Descriptors

Throughout the output from TA, a standard nomenclature (used, for example, by the Human Genome Variation Society) denotes the effect of an event on a gene at the protein level. The following table describes the changes, each with an example notation and explanation.

Change          Example            Explanation
frameshift      A245Sfs            Alanine 245 becomes Serine, followed by a frameshift
deletion        V422_S431del       deletion from Valine 422 to Serine 431
insertion       Q484_I485insVA     insertion of Valine and Alanine in between Glutamine 484 and Isoleucine 485
indel           S293_Y294insKS     Serine 293 to Tyrosine 294 becomes Lysine and Serine
synon                              Synonymous (silent)
substitution    T327S              Threonine 327 to Serine

6 Technical Support

Please direct your bug reports, questions and suggestions to the Trans-ABySS Google Group: trans-abyss@googlegroups.com

You can also read and search existing discussions on the Google Group at http://groups.google.com/group/trans-abyss

End of User Manual
25. nning TA on any new libraries, please check:

1. Your ABySS multi-k-mer transcriptome assemblies have completed successfully. There should be no errors in the logs and all output files should be present. The directory structure and names of the output files of your assemblies are adjusted to be compatible for TA. Please refer to Section 1.3 (Running ABySS) for more details.
2. Your project is set up correctly in configs/transcriptome.cfg and the directory path for topdir exists. Note that topdir defines where TA will place its output for the project. Please refer to Section 2.3 (configs) for more details.
3. Your input file is set up correctly. In particular, check whether the library name and the path to the assemblies directory are correct. Please refer to Section 2.4 (input) for more details.

Figure 1 shows an overview of TA. The pipeline is divided into 9 stages (0 to 8). Each stage is described in this section.

[Figure 1: An overview of the Trans-ABySS pipeline. FEM: filter, extend, merge. R2G: reads-to-genome alignments. R2C: reads-to-contigs alignments. R2C multi-split: multi-mapping records to one alignment per record. C2G: contigs-to-genome alignments.]

Abbreviations used in this section:

    <HEAD_NODE>   name of the cluster head node
    <INPUT>       path to the input file
    <TOPDIR>      path to the project directory that holds t
26. notations. Type of repeat reported if overlap is True (e.g. AluSx, LTR47A), or a placeholder if overlap is False.
Overlap with segmental duplication. Chromosome and start coordinate of the segdup partner reported if overlap is True (e.g. chr1:17048246), or a placeholder if overlap is False.
If at least 1 supporting read is aligned in opposite orientation to the rest of the supporting reads. Can be true or false.
dbSNP entries if the event is already annotated in dbSNP (e.g. rs12028735, rs71510514).

4.3 Novel Splicing (model_matcher.py)

    Output                        Description
    events.txt                    Unfiltered novel splicing events (not observed in annotations specified in model_matcher.cfg)
    events_filtered.txt           Filtered events. See below for filtering criteria
    events_summary.txt            Tally of unfiltered events by <type>
    events_filtered_summary.txt   Tally of filtered events by <type>
    coverage.txt                  Transcript coverage
    mapping.txt                   Mapping of contig to annotated transcripts
    log.txt                       Detailed block-by-block mapping of alignments to exons
    events_reads                  Directory containing FASTA files of event-spanning reads. Format of file names: contig event_type chromosome start end fa
    events.bed                    Unfiltered events in bed format
    events_filtered.bed           Filtered events in bed format
    LOG                           Run log recording command run and parameters used

Contents of events_filtered.txt: Field Description id type transcript contig Contig ID transcript Event ID Each line r
27. ns 3.1 (Stage 0 b) and 3.3 (Stage 2) for more details. Here is an example of $name.in and the names of the reads files:

    /path/to/reads/s_5_1_concat_qseq.txt
    /path/to/reads/s_5_2_concat_qseq.txt
    /path/to/reads/s_6_1_concat_qseq.txt
    /path/to/reads/s_6_3_concat_qseq.txt

If your reads are stored in BAM files, then there are no restrictions on how the files are named.

This is a sample command to generate one paired-end assembly with ABySS:

    cd $name && mkdir k$k && cd k$k && exec abyss-pe E=0 n=5 v=-v k=$k name=$name in=`<$name.in` OVERLAP_OPTIONS=--no-scaffold SIMPLEGRAPH_OPTIONS=--no-scaffold MERGEPATHS_OPTIONS=--greedy mp

where $k is the k-mer size and $name is the library name. The assembly output files would be generated in the directory called k$k, which is strictly required for TA. You can vary the options n, E, c, s, etc. to generate your assemblies, but please keep the --no-scaffold option for OVERLAP_OPTIONS and SIMPLEGRAPH_OPTIONS because TA does not deal with scaffolds.

Please execute the above command (in a cluster job script if necessary) for all k-mer sizes in the same directory, so you would have one directory per k-mer size. For example, these are the directories and files created if the library $name has 50 bp reads:

    $name/
        k26/
        k28/
        ...
        k48/
        k50/
        $name.in

After the assemblies have finished, please check for error messages in the log files and make sure t
28. otation files. Each gene model is given a one-letter alias for quick referencing. For example, e represents the Ensembl gene model file. A comma-separated order field specifies the priority of the gene models when comparisons are made. Order is also used in breaking ties when the same contig can be mapped to genes from multiple models: an earlier model in the order is given precedence over later ones when a single transcript is assigned to a contig. Here is an example of the contents:

    hg19
    k knownGene_ref.txt
    e ensGene_ref.txt
    r refGene.txt
    a acembly_ref.txt
    order k,e,r,a

2.4 input

An input file is what initiates the TA pipeline. There are no restrictions on how to name an input file. As discussed, TA analysis is performed on a per-library basis; therefore each line in the input file represents a single library, and a single input file can contain multiple lines. Each line in an input file contains 4 space-separated columns:

    <LIBRARY> <ABYSS_VERSION> <ASSEMBLIES_DIR> <PROJECT_NAME>

where

    <LIBRARY>          is the library name
    <ABYSS_VERSION>    is the version number of ABySS used for the transcriptome assembly
    <ASSEMBLIES_DIR>   is the path to the directory containing the library's multi-k-mer assemblies
    <PROJECT_NAME>     is the project name, which has to be defined in transcriptome.cfg

An example input file: L00001 1.3.2 abyss assembly L00001 brain tu
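The four-column input format described in Section 2.4 can be checked before launching the pipeline. The following Python sketch is only an illustration of that format, not code from TA: the names LibraryEntry and parse_input_file, the library names, and the paths in the example are all hypothetical, and skipping blank lines is an assumption.

```python
from typing import List, NamedTuple


class LibraryEntry(NamedTuple):
    """One line of a TA input file (field names are hypothetical)."""
    library: str
    abyss_version: str
    assemblies_dir: str
    project_name: str


def parse_input_file(text: str) -> List[LibraryEntry]:
    """Split each non-blank line into the 4 space-separated columns."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 4:
            raise ValueError(
                f"line {lineno}: expected 4 columns, got {len(fields)}"
            )
        entries.append(LibraryEntry(*fields))
    return entries


# Hypothetical input file with two libraries in the same project.
example_text = """\
LIB0001 1.3.2 /hypothetical/assemblies/LIB0001 demo_project
LIB0002 1.3.2 /hypothetical/assemblies/LIB0002 demo_project
"""
entries = parse_input_file(example_text)
```

Because the columns are space-separated, a path or project name containing a space would be split into extra columns, which this sketch reports as an error rather than guessing.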
29. t up ABySS 1.3.2 and Trans-ABySS 1.3.2 properly. Please refer to Sections 1.1 (Installing ABySS), 2.1 (bin) and 2.5 (annotations) for details.

Step 1: Generate the multi-k-mer transcriptome assemblies with ABySS 1.3.2. Please refer to Section 1.3 (Running ABySS) for details. These are the input reads for the sample library:

    <TA_DIR>/sample_output/ABySS/SampleProject/abyss-1.3.2/sim0003/reads_1_export.fq
    <TA_DIR>/sample_output/ABySS/SampleProject/abyss-1.3.2/sim0003/reads_2_export.fq

We used k-mer sizes 62, 64, 66, 68, 70, 72, 74 and the ABySS settings described in Section 1.3. We called the sample library sim0003.

Step 2: Set up a new working directory for TA and put the path as topdir under SampleProject in <TA_DIR>/configs/transcriptome.cfg:

    SampleProject
    topdir /path/to/your/topdir/here
    reference hg19

Please refer to Section 2.3 (configs) for details.

Step 3: Set up the input file like so:

    sim0003 1.3.2 /path/to/your/abyss/assemblies/here SampleProject

Please refer to Section 2.4 (input) for details.

Step 4: Run TA from stages 0 and 2 to 8 as described in Section 3 (Running Trans-ABySS). After running stage 0, you can skip stage 1 and copy our JAGUAR BAM file and its index

    <TA_DIR>/sample_output/Trans-ABySS/SampleProject/sim0003/Reads_to_genome/output.jag.sorted.bam
    <TA_DIR>/sample_output/Trans-ABySS/SampleProject/sim0003/Reads_to_genome/output.jag.sorted.bam.bai

to your Reads_to_genome directory.

3 Running Trans-ABySS

Before ru
30. ult 3)
Filtered events not annotated in dbSNP.
Filtered non-synonymous events residing in gene exons.
Filtered non-synonymous events residing in gene exons not annotated in dbSNP.
Unfiltered events reported in gff format.
Filtered events reported in gff format.
Run log recording command run and parameters used.

Content of events filtered tsv

Fields: id, type, chr, chr_start, chr_end, ctg, ctg_len, ctg_start, ctg_end, len

    Field       Description
    id          Event ID. Each line represents an event captured by an individual contig. Identical events are linked by the first number of the id. Example: 2.1, 2.2, 2.3 represent the same event captured by 3 different contigs. Events are grouped by <type> and <alt>.
    type        Event type. Can be snv, ins, del, inv.
    chr         Chromosome number.
    chr_start   Chromosome start coordinate. If <type> = ins, <chr_start> = coordinate immediately upstream of insertion. If <type> = del, <chr_start> = first base of deletion.
    chr_end     Chromosome end coordinate. If <type> = ins, <chr_end> = <chr_start>. If <type> = del, <chr_end> = last base of deletion.
    ctg         Contig ID.
    ctg_len     Length of <ctg> that captures event.
    ctg_start   Contig start coordinate. If <type> = ins, <ctg_start> = coordinate immediately upstream of insertion. If <type> = del, <ctg_start> = first base of deletion.
    ctg_end     Contig end coordinate. If
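The id convention described for the first field, where the leading number links identical events captured by different contigs, can be used to collapse rows into events. The Python sketch below assumes ids are written as an event number and a per-contig index joined by a period (for example 2.1); the separator and the function name group_by_event are assumptions for illustration, not something TA provides.

```python
from collections import defaultdict


def group_by_event(ids, sep="."):
    """Group event ids by their leading event number.

    Assumption: an id looks like "<event><sep><index>", e.g. "2.1".
    """
    groups = defaultdict(list)
    for event_id in ids:
        event_number = event_id.split(sep, 1)[0]
        groups[event_number].append(event_id)
    return dict(groups)


# Hypothetical id column: event 2 is captured by 3 different contigs.
example_ids = ["1.1", "2.1", "2.2", "2.3", "3.1"]
groups = group_by_event(example_ids)
```

Counting the keys of the result gives the number of distinct events, while the length of each list gives the number of contigs supporting that event.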
31. ured to use 8 threads. To configure the job scripts to fit your cluster environment, please modify setup.pl (sub fem) and abyss-rmdups-iterative (function rmdups).

3.2 Stage 1

Mate-pair reads need to be aligned to a reference genome for finding evidence for fusion candidates. However, these alignments are not performed as part of TA. We use JAGUAR to align reads to the genome. For more information please read:

    http://www.bcgsc.ca/platform/bioinfo/software/jaguar
    http://www.bcgsc.ca/platform/bioinfo/docs/jaguar/Butterfield_JAGuaR_Nov2011.pdf

When you run this stage, you would see this message:

    Please put your code in <TA_DIR>/wrappers/setup.pl (sub copy_bam) for copying JAGUAR's BAM file to the Reads_to_genome directory.

As the message indicates, this stage does not do anything by default. Please modify setup.pl to suit your needs.

3.3 Stage 2

This step runs Anchor to align reads to the meta-assembly from stage 0.

Input files:

    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/<LIBRARY>.in
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/<LIBRARY>-contigs.fa

Remember from stage 0: <LIBRARY>.in is a text file that lists the paths to input reads files (fastq, fq.gz or bam), and <LIBRARY>-contigs.fa is a symlink to merge/<LIBRARY>-contigs.fa.

Output files:

    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/<LIBRARY>.anchor.bam
    <TOPDIR>/<LIBRARY>/Assembly/abyss-1.3.2/anchor/<L
