DNACloud: A Tool for Storing Big Data on DNA
Contents
1. 1 do
         Read the bytes string for the chunk.
         Convert the bytes string to a list of ASCII values.
         Convert the ASCII value list to a base-3 string.
         Convert the base-3 string to DNA bases using the previous chunk's last base.
         Concatenate this new DNA string with the original one.
         Store the last base of the DNA string.
      end for
   4: Read the bytes string for the last chunk.
      Convert the bytes string to a list of ASCII values.
      Convert this ASCII value list to a base-3 string.
      Convert the base-3 string to DNA bases using the previous chunk's last base.
      Concatenate the new DNA string with the original one.
   else if no. of chunks = 1 then
      Read the entire file and perform the conversion steps directly: to ASCII, then to base 3, then to DNA bases (trivial case).
   end if
   end if
   5: Convert the length of the final DNA string obtained to base 3 and add leading zeros until its length is 20.
   6: Add zeros between the DNA string and the base-3 string obtained in the previous step such that the total string length is divisible by 25.
   7: Convert the remaining base-3 string to DNA bases.

   Algorithm 2: Algorithm for generating DNA chunks
   Require: DNA string for the file, obtained from Algorithm 1
   Ensure: DNA chunks of length 117
   1: if file size < chunk size then
         Do not divide the file into chunks.
      else
         Divide the file into chunks, where no. of chunks = file size / chunk size + 1.
      end if
   2: if no. of chunks > 1 then
         Read the DNA string for chunk one.
         Divide the string into chunks of length 100 and add index info to these chunks. S
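Steps 5 and 6 of Algorithm 1 above can be illustrated with a minimal sketch. It assumes that the 20-trit length field is placed at the end of the string and that the zero padding sits directly before it; the function names `to_base3` and `length_suffix` are hypothetical and are not taken from the DNACloud source. The sketch works on base-3 digits only and leaves the final conversion to DNA bases (step 7) out.

```python
def to_base3(n: int) -> str:
    """Return the base-3 representation of a non-negative integer."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while n:
        n, r = divmod(n, 3)
        digits.append(str(r))
    return "".join(reversed(digits)) or "0"


def length_suffix(dna_len: int, width: int = 20, block: int = 25) -> str:
    """Build the base-3 suffix described in steps 5 and 6 (assumed layout).

    The length is written in base 3 and left-padded with zeros to `width`
    trits; extra zeros are then placed in front so that
    dna_len + len(suffix) is a multiple of `block`.
    """
    length_trits = to_base3(dna_len).rjust(width, "0")
    if len(length_trits) > width:
        raise ValueError("DNA string too long for the length field")
    padding = (-(dna_len + width)) % block
    return "0" * padding + length_trits


suffix = length_suffix(100)
print(len(suffix), suffix)
```

For a DNA string of 100 bases, the length field takes 20 trits and 5 padding zeros are needed, so the total of 125 is divisible by 25.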
2. "Organic data memory using the DNA approach," Communications of the ACM, vol. 46, no. 1, pp. 95-98, 2003.
[10] M. Arita and Y. Ohashi, "Secret signatures inside genomic DNA," Biotechnology Progress, vol. 20, no. 5, pp. 1605-1607, 2004.
[11] G. M. Skinner, K. Visscher, and M. Mansuripur, "Biocompatible writing of data into DNA," Journal of Bionanoscience, vol. 1, no. 1, pp. 17-21, 2007.
[12] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6102, pp. 1628-1628, 2012.
[13] G. M. Church and E. Regis, Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves. Basic Books, 2012.
[14] A. Driscoll and R. D. Sleator, "Synthetic DNA: the next generation of big data storage," Bioengineered, vol. 4, no. 3, pp. 123-125, 2013. PubMed Central: PMC3669150. DOI: 10.4161/bioe.24296. PubMed: 23514938.
[15] S. Greengard, "A new approach to information storage," Commun. ACM, vol. 56, no. 8, pp. 13-15, Aug. 2013. [Online]. Available: http://doi.acm.org/10.1145/2492007.2492013
[16] K. Swearingen, "How much information." [Online]. Available: chnm.gmu.edu/digitalhistory/links/pdf/preserving/8_Sa.pdf
[17] M. Hilbert, "How much information is there in the world?" [Online]. Available:
[18] R. Thomchick, "NSA (National Security Agency) or FBI (Federal Bureau of Investigation) will have one yottabyte." [Online]. Available: http www metaholic musings com 2013 03 20 brontobytes
[19] S. Higginbotham,
3. V. COMPARISON OF DATA STORAGE ON DNA
The software cannot encode or decode files beyond a certain size. Table II compares the file size limits for the different file types encoded by the software. At present, the software can decode at most 3486784400 bytes (3.4 GB) of DNA strings, and DNACloud can encode any file of up to 581130733.333 bytes (554 MB).

VI. CONCLUSION
Considering the current rate of data explosion, DNA storage becomes an indispensable data storage medium because of its low maintenance cost, high data density, eco-friendliness, and durability. However, the technology is still rudimentary, since the cost of sequencing and synthesizing DNA remains high. Because this cost is decreasing every day, we expect that research in encoding and decoding algorithms can make this technology available to the common man within the next few years. Thus DNACloud can be considered a potential tool to convert data files into DNA and vice versa. We plan to enhance the capability of the software to encode large data files by implementing better encoding and decoding techniques and error correction methods.

VII. SOFTWARE AVAILABILITY
The software source code, installers for Mac and Windows, user manual, product demo, and other related materials can be downloaded from http://www.guptalab.org/dnacloud.

VIII. ACKNOWLEDGEMENT
We would like to thank Thorsten
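The two size limits quoted in Section V can be checked with a short calculation. The text gives the numbers but not their derivation, so the reading below is an observation, not a statement from the authors: the decode limit equals the largest value a 20-trit length field can hold, and the encode limit equals that value divided by 6, the longer of the two Huffman code lengths mentioned in Remark 3. The constant names are hypothetical.

```python
# Assumption: the limits follow from the 20-trit length field of Algorithm 1
# and from a worst case of 6 trits per input byte (Huffman length 5 or 6).
LENGTH_FIELD_TRITS = 20
MAX_TRITS_PER_BYTE = 6

max_dna_length = 3 ** LENGTH_FIELD_TRITS - 1          # largest 20-trit value
max_file_bytes = max_dna_length / MAX_TRITS_PER_BYTE  # worst-case input size

print(max_dna_length)             # 3486784400
print(round(max_file_bytes, 3))   # 581130733.333
print(int(max_file_bytes / 2 ** 20), "MB (binary megabytes)")  # 554
```

Both values match the figures reported in the section, which suggests the limits come from the length field and are not a memory constraint.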
4. Weimann and Anand B. Pillai, whose open source libraries python-barcode and pytxt2pdf, respectively, are used in the software.

REFERENCES
[1] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E. M. LeProust, B. Sipos, and E. Birney, "Towards practical, high-capacity, low-maintenance information storage in synthesized DNA," Nature, 2013.
[2] G. Budman, "NSA might want some Backblaze pods." [Online]. Available: http://blog.backblaze.com/2009/11/12/nsa-might-want-some-backblaze-pods
[3] L. Dixita and M. K. Gupta, "Natural data storage on DNA: A review," 2013, preprint.
[4] J. Davis, "Microvenus," Art Journal, vol. 55, no. 1, pp. 70-74, 1996.
[5] E. Kac, "Genesis art of DNA," 1999. [Online]. Available: www.ekac.org/geninfo.html
[6] N. Yachie, K. Sekiyama, J. Sugahara, Y. Ohashi, and M. Tomita, "Alignment-based approach for durable data storage into living organisms," Biotechnology Progress, vol. 23, no. 2, pp. 501-505, 2007.
[7] N. G. Portney, Y. Wu, L. K. Quezada, S. Lonardi, and M. Ozkan, "Length-based encoding of binary data in DNA," Langmuir, vol. 24, no. 5, pp. 1613-1616, 2008.
[8] M. Ailenberg and O. D. Rotstein, "An improved Huffman coding method for archiving text, images, and music characters in DNA," Biotechniques, vol. 47, no. 3, p. 747, 2009.
[9] P. C. Wong, K.-k. Wong, and H. Foote,
5. it is | On what it can be stored | Remarks
Tera Byte (TB) | 1000 GB | 200000 photos | 1 TB hard disk | 400 Terabytes: National Climatic Data Center (NOAA) database
Peta Byte (PB) | 1000 TB | 3 years of EOS data (NASA's Earth Observing System) | 16 Backblaze storage pods racked in two datacenter | 200 Petabytes: all printed material
Exa Byte (EB) | 1000 PB | 2 Exabytes: total volume of information generated in 1999 | A city block of 4-storey datacentre | 5 Exabytes: all words ever spoken by humans
Zetta Byte (ZB) | 1000 EB | 1.9 zettabytes of information sent through broadcast technology like TV and GPS | 20 percent of Manhattan, New York | 5 Zetta Byte is equal to US NSA's Utah Data Center
Yotta Byte (YB) | 1000 ZB | 1 YB is the total government data the NSA (National Security Agency) | Volume of the States of Delaware and Rhode Island with million data centre | 1.3 zettabytes of traffic annually over the internet in 2016

Algorithm 1: Algorithm for generating DNA string
Require: File size and chunk size
Ensure: DNA string for the file
1: if file size < chunk size then
      Do not divide the file into chunks.
   else
      Divide the file into chunks, where no. of chunks = file size / chunk size + 1.
   end if
2: if no. of chunks > 1 then
      Read the bytes string from chunk one.
      Convert the string to a list of ASCII values.
      Convert the ASCII value list to a base-3 string.
      Convert the base-3 string to DNA bases and store the last base of the DNA string.
   3: for chunk number 1 to total number of chunks
6. the file is selected, click on the Decode option. To save the decoded file, give it a name and save it at a specific location. This will generate the original file that was encoded in DNA. The RESET button clears the selected file, i.e., it works like a clear button.

TABLE II. COMPARISON OF THE FILE FORMATS ENCODED BY DNACLOUD. DIFFERENT FILE TYPES WERE ENCODED AND DECODED USING DNACLOUD. FOR COST CALCULATIONS SEE
File_Type | Limit of file size that can be encoded | Encoded file size (Bytes) | Required amount of DNA | Cost of DNA (US $) | Memory required | DNA Chunks | Ref
Text | 581130733 bytes (ASCII characters) | 15902545 | 3.4 x 107 4gms | 197191.6 | 409 MB
Audio | 554 MB (around 50 songs) | 151391203 | 3.3 x 101 gms | 1877250.9 | 3896 MB
Video | 581130733.33 bytes (around 65 minutes) | 598292824 | 1.31 x 10 7 gms | 7418831.0 | 15400 MB
Image (HD) | 5.7 MB (100 HD images) | 23013231 | 5.051 x 10 77 gms | 285364.06 | 24 MB

C. Storage Estimator
As mentioned above, it has two estimators. To estimate the memory required, the user can select File > Estimator > Memory Required. This takes the data file to be encoded as input, and the Calculate button estimates the values mentioned above. For the second estimator, the user can select the File > Estimator > Biochemical Properties option. This asks for a .dnac file as input and gives the GC content, the melting temperature, and the cost for the total DNA. The Save button will help to save estimated
7. arXiv:1310.6992v2 [cs.ET] 16 May 2014
DNACloud: A Tool for Storing Big Data on DNA
Shalin Shah, Dixita Limbachiya and Manish K. Gupta
Laboratory of Natural Information Processing, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, 382007 India
Email: shalinshah1993@gmail.com, dlimbachiya@acm.org, mankg@computer.org

Abstract: The term Big Data is usually used to describe the huge amount of data generated by humans from digital media such as cameras, the internet, phones, sensors, etc. By building advanced analytics on top of big data, one can predict many things about the user, such as behavior and interests. However, before one can use the data, one has to address many issues of big data storage. Two main issues are the need for large storage devices and the cost associated with them. Synthetic DNA storage seems to be an appropriate solution to these issues. Recently, in 2013, Goldman and his colleagues from the European Bioinformatics Institute demonstrated the use of DNA as a storage medium with a capacity of 1 petabyte of information on one gram of DNA, and retrieved the data successfully with a low error rate [1]. This significant step shows the promise of synthetic DNA storage as a useful technology for future data storage. Motivated by this, we have developed a software called DNACloud, which makes it easy to store data on DNA. In this work we present d
8. "As data gets bigger, what comes after a yottabyte?" [Online]. Available: http://gigaom.com/2012/10/30/as-data-gets-bigger-what-comes-after-a-yottabyte
[20] M. Arita, "Writing information into DNA," in Aspects of Molecular Computing. Springer, 2004, pp. 23-35.
[21] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, 1952.
[22] P. Ribenboim, "How to recognize whether a natural number is a prime," in The New Book of Prime Number Records. Springer, 1996, pp. 19-178.
[23] J. Singh, "Vakratunda mahakaya prathameshwara ganadheeshwara." [Online]. Available: http music raag fm Bhakti_Sangeet songs 9797 Shri_ Ganesh Jagjit_Singh
[24] J. Rover, "National Geographic Television: Megastructures 53, ultimate skyscraper NYC." [Online]. Available: https://www.youtube.com/watch?v=71V1SQTqhl0
[25] Wikipedia, "DNA structure image." [Online]. Available: upload.wikimedia.org/wikipedia/commons/thumb/d/d8/Benzopyrene_DNA_adduct_1JDG.png/433px-Benzopyrene_DNA_adduct_1JDG.png
[26] T. Weimann, "Code for barcode." [Online]. Available: https://bitbucket.org/whitie/python-barcode
[27] A. Pillai, "Convert text to pdf." [Online]. Available: http://code.activestate.com/recipes/189858-python-text-to-pdf-converter
9. ating a sentence from the biblical book of Genesis into Morse code and converting the Morse code into DNA base pairs according to a conversion principle [5]. In the 20th century, many researchers translated English text, mathematical equations [6], Latin text, and simple musical notations to DNA using different DNA coding principles [9]-[11]. All of these efforts were successful on a small scale, giving birth to the idea of data storage on DNA. The most prolific work was done in 2012 by Church et al. of Harvard University. They successfully encoded the entire book Regenesis: How Synthetic Biology Will Reinvent Nature and Ourselves, including 53,426 words, 11 JPG images, and a JavaScript program, into DNA using a 1 bit per base encoding. The main drawback of their method was its high error rate [12]. In the subsequent year, 2013, this limitation was overcome by Goldman and his group. They implemented a modified approach that includes error correction and scaled DNA-based data storage [1]. Based on this method of DNA data storage [1], in this work we present the software called DNACloud, which converts a data file to DNA sequences and vice versa. The reader is referred to excellent short reviews of synthetic DNA storage [14] for an overview of this new area. This paper is organized as follows. Section 2 includes the algorithms used for encoding and decoding data into DNA. Section 3 provides an overview of Graphical Us
10. des the user various estimations related to the data storage on DNA, as shown in Figure 1. There are three basic modules of the software, as discussed in Sections A, B and C.

A. DNA Encoder (File to DNA)
To store data on DNA, one has to find ways of encoding the given data into a DNA sequence. There are many encoding techniques available to convert the data into DNA sequences

Require: DNA string obtained from Algorithm 3
Ensure: Original computer file
1: if file size < chunk size then
      Do not divide the file into chunks.
   else
      Divide the file into chunks, where no. of chunks = file size / chunk size + 1.
   end if
2: if no. of chunks > 1 then
      Read the DNA string for chunk 1.
      Convert the DNA string to a base-3 string.
      Convert the base-3 string to a list of Huffman values, if possible.
      while not decoded do
         Remove the last base and try decoding again.
         Add the removed base to the prepend string.
      end while
      Convert the Huffman list to the corresponding ASCII list.
      Convert the ASCII list to a string of bytes and write it to the file.
   3: for chunk number 1 to total number of chunks 1 do
         Read the DNA string for the chunk and prepend the prepend string to it, if not null.
         Convert the DNA string to a list of Huffman values, if possible.
         while not decoded do
            Remove the last base and try decoding again.
            Add the removed base to the prepend string, after clearing it.
         end while
         Convert the Huffman list to the corresponding ASCII list.
         Convert the ASCII list to a string of bytes and write it to the file.
      end for
   4: Read la
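The carry-over idea in the decoding loop above (undecodable trailing symbols are kept in a prepend string and put in front of the next chunk) can be sketched as follows. This is not the DNACloud implementation: it works on base-3 strings instead of DNA bases, it finds the decodable prefix directly with a greedy prefix-code scan instead of removing one base at a time and retrying, and `TOY_CODE` is a made-up codebook, since the real Huffman table is not given in the text.

```python
# Hypothetical prefix-free codebook with codewords of length 5 and 6,
# mapping base-3 codewords to byte values. Not the real DNACloud table.
TOY_CODE = {"00000": 65, "00001": 66, "000020": 67}


def decode_prefix(trits: str, codebook: dict[str, int]) -> tuple[list[int], str]:
    """Decode as many whole codewords as possible; return values and leftover."""
    lengths = sorted({len(word) for word in codebook})
    values, pos = [], 0
    while pos < len(trits):
        for n in lengths:
            word = trits[pos:pos + n]
            if len(word) == n and word in codebook:
                values.append(codebook[word])
                pos += n
                break
        else:
            break  # the remaining tail is an incomplete codeword
    return values, trits[pos:]


def decode_chunks(chunks: list[str], codebook: dict[str, int]) -> bytes:
    """Decode chunk by chunk, carrying the undecodable tail to the next chunk."""
    out = bytearray()
    prepend = ""
    for chunk in chunks:
        values, prepend = decode_prefix(prepend + chunk, codebook)
        out.extend(values)
    if prepend:
        raise ValueError("trailing symbols could not be decoded")
    return bytes(out)


# The codeword stream for b"ACB" split at an arbitrary position.
print(decode_chunks(["0000000", "002000001"], TOY_CODE))
```

Because the split falls inside a codeword, the first chunk leaves a two-trit tail that only becomes decodable once it is prepended to the second chunk.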
11. er Interface (GUI). Section 4 describes the GUI in detail, while Section 5 has remarks on the limitations and assumptions of the software. Section 6 concludes with challenges in the area of synthetic hard drives, and the last section provides a link for downloading the software and related material.

II. ALGORITHMS FOR ENCODING AND DECODING DATA FILES
While implementing the methods of [1], we modified the algorithms slightly so that they are memory efficient. For encoding, Algorithm 1 generates a DNA string from the given data file, which is further divided into DNA chunks of length 117 using Algorithm 2. For decoding, Algorithm 3 takes the DNA file containing DNA chunks of length 117 and produces a DNA string, which is further decoded to get the original data file using Algorithm 4. In order to describe these algorithms, we define the term index info and also give remarks for Algorithms 3 and 4.

Definition 1 (Index Info): Index info is a base-3 string of length 15 with the format [ID | no. of chunk | parity of the chunk], where ID has length 2, no. of chunk has length 12, and parity of the chunk has length 1 [1]. Later on, every chunk is also appended with G or C and prepended with A or T.

Remark 2 (for Algorithm 3): The decoding is not always possible, since the format of the .dnac file is [chunk_1, chunk_2, ..., chunk_n]. Now while reading x chunks

TABLE I. HOW BIG IS THE BIG DATA
Data Unit | Size | How big
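Definition 1 fixes the field widths of the index info (2 + 12 + 1 = 15 trits), which together with a 100-base payload and the two flanking bases is consistent with the chunk length of 117. A minimal sketch of building such a field is given below. The field widths come from the definition; the parity rule used here (sum of the other 14 trits modulo 3) is an assumption for illustration only, because the text does not state how the parity trit is computed, and the function names are hypothetical.

```python
def to_base3(n: int, width: int) -> str:
    """Return n in base 3, left-padded with zeros to exactly `width` digits."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while n:
        n, r = divmod(n, 3)
        digits.append(str(r))
    text = "".join(reversed(digits)).rjust(width, "0")
    if len(text) > width:
        raise ValueError(f"{text} does not fit in {width} trits")
    return text


def index_info(file_id: int, chunk_no: int) -> str:
    """Build a 15-trit index: ID (2) + chunk number (12) + parity (1)."""
    body = to_base3(file_id, 2) + to_base3(chunk_no, 12)
    parity = sum(int(t) for t in body) % 3  # assumed parity rule
    return body + str(parity)


info = index_info(file_id=1, chunk_no=5)
print(info, len(info))
```

With a 12-trit chunk number, at most 3^12 = 531441 chunks can be indexed per file ID under this layout.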
12. etailed description of the software.

Keywords: DNA storage, Biostorage, DNA Computing, DNA codes, Huffman Coding, Software, Open source, DNA hard disk, Error correction, Synthetic DNA, Organic data storage

I. INTRODUCTION
Storage has been a fundamental requirement for humans. In the modern era of computing and communication, a huge amount of data is being generated, and there is a pressing need for a dense storage medium that is cost effective. Table I shows the typical amounts of data generated and the kind of storage device required to store such data. It is predicted that by 2015 the amount of data generated by the NSA (National Security Agency) will be so large that it may need 1000 billion terabytes of hard disk space, worth $1,000 trillion [2]. At present the world is producing 1 exabyte of data per day, and soon the devices, machines, and sensors of the Internet of Things (IoT) will generate data on the order of brontobytes, where 1 brontobyte is 10^27 bytes, for which a dense storage medium is needed. For the past 30 years, DNA, the blueprint of life, has been used as a storage medium. Unlike existing storage devices, DNA requires no maintenance and can be stored without electricity in a cold and dark place. One venture to use DNA as an artistic material, converting a graphic image to the language of the genetic code, was initiated by Joe Davis in the work Microvenus [4]. In 1999, a synthetic gene was created by Kac by transl
13. information.

D. Export Button
This helps the user export the generated file to different formats. DNA strings can be exported to a file format that can be used as input for the synthesizer, and a file can be exported to the file format required by the sequencer. These options are available in the File menu and generate output of the DNA strings that the respective machines can use. To export the DNA file for the synthesizer to synthesize the DNA, use the Export DNA synthesizer File option. To decode a file stored in DNA, use the Import DNA sequencer file option to get the DNA sequences to be decoded. These options will be available in the next version of the software. The .dnac file can be exported to a PDF or a LaTeX file, with all software output details in a single document, by using the options Export dnac to PDF and Export to latex, respectively.

E. Clear temp files
This clears the entire history of the software and removes all the temporary files it has generated.

F. Exit
Exit quits the software.

G. User details
This option is available in the Preferences menu. For data security, the user has to enter his details, after which a dialogue box to enter the password appears. The password can be reset with the Change Password option in the same menu. This helps the user retrieve his files stored in DNA safely. The generated barcode can be used by biotech companies to tag the DNA on which the particular file is stored.
14. n will generate a barcode of them, which biotech companies can use as a unique identification when performing the experiments. Filling in the details is not mandatory but recommended; otherwise the box will keep reminding you. To encode or decode a file, select either of the options from the File menu. The software includes options A to G in the menu.

A. Encode (File to DNA) Button
This option is available under the File menu and converts any type of data to DNA strings. The user can select the file to be converted into a DNA string by clicking on the Choose File button. Once the user selects the file to be encoded, the following information for the encoded file is displayed:
1) Length of DNA string
2) No. of DNA oligonucleotides (chunks)
3) Length of each DNA oligonucleotide
4) File size in bytes
To save the encoded file under a specific name at a specific location, use the Encode your File button. It generates a file with the extension .dnac that holds the DNA string for the selected file. The RESET button clears the selected file so that the user can select a new one; it works like a clear button.

B. Decode (DNA to File) Button
This option is available under the File menu and retrieves the data stored in DNA. The user can enter the DNA string from which data is to be retrieved in the text box next to the Please write DNA string option. The user can also decode an encoded file from the system with the option Select dnac file. Once
15. pful while doing the experiments for storing data in DNA. The estimator has two main sections:
1) Memory Required: The user can select the file to be encoded from the system, and the following values are estimated for it. This helps the user decide how much of the system's memory will be occupied for encoding and storing a particular data file in DNA. These values are approximate.
   a) File size in bytes
   b) Size of DNA string
   c) Free memory required
   d) Amount of DNA required
2) Biochemical Properties and Cost: This estimates the biochemical properties of the DNA sequence used to store the data. Select the .dnac file from the system that contains the DNA sequences whose properties are to be estimated. It takes the salt concentration (mM) and the cost per base as input, and estimates the GC content of the DNA, the melting temperature of the DNA, and the total cost of storing the file in DNA. All the values are approximate. This helps the user work out the budget for the experiments depending on the total amount of DNA.

IV. DETAILED DESCRIPTION OF GUI
When the program is executed, a dialogue box pops up for choosing the workspace where one can save his work. All the generated files are automatically saved in this workspace. You can switch to another workspace. After this dialogue box, the user details are requested. These include the name, contact number, email address, and the file you are using as input. This will save your details, and the Generate Barcode butto
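Two of the estimates listed above, GC content and total cost, follow directly from the DNA sequence and the cost per base, and can be sketched as shown here. The function names are hypothetical and the code is not taken from DNACloud. The melting temperature is left out on purpose: the text says it depends on the salt concentration but does not state which formula the software uses.

```python
def gc_content(dna: str) -> float:
    """Return the percentage of G and C bases in a DNA string."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty DNA string")
    gc = sum(1 for base in dna if base in "GC")
    return 100.0 * gc / len(dna)


def synthesis_cost(dna: str, cost_per_base: float) -> float:
    """Return the total cost, assuming a flat price per synthesized base."""
    return len(dna) * cost_per_base


chunk = "ATGCGCTA"
print(gc_content(chunk))            # 50.0
print(synthesis_cost(chunk, 0.05))  # about 0.4
```

For a whole .dnac file, the same two functions could be applied to the concatenation of all chunks, or per chunk and then summed for the cost.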
16. rithm 2
Ensure: DNA string for the chunks
1: if file size < chunk size then
      Do not divide the file into chunks.
   else
      Divide the file into chunks, where no. of chunks = file size / chunk size + 1.
   end if
2: if no. of chunks > 1 then
      Decode the chunks read to the corresponding DNA string, if possible.
      while not decoded do
         Remove the last base from the buffer of the .dnac file and try decoding again.
         Store this base at the end of the prepend string if a bit is removed.
      end while
   3: for chunk number 1 to total number of chunks 1 do
         Prepend the last stored string to the buffer read, if the prepend string is not null.
         while not decoded do
            Remove the last base from the buffer of the .dnac file and try decoding again.
            Store this base at the end of the prepend string if a bit is removed.
         end while
         Write the DNA string obtained to the original.
      end for
   4: Prepend the last stored string to the buffer read if the prepend string is not null, decode it, and append it to the original DNA string.
   else if no. of chunks = 1 then
      Trivial case, so read the entire file at once and convert it to a DNA string.
   end if
   end if

III. GRAPHICAL USER INTERFACE (GUI) OVERVIEW
DNACloud has been developed primarily to facilitate the storage of data on DNA. The software converts any type of data (text, image, audio, video, etc.) into DNA strings, enables storing it on DNA, and helps retrieve the data stored on DNA. The GUI of DNACloud is developed to enable this feature. Along with the encoding and decoding facility, DNACloud provi
17. st chunk, convert it to a base-3 string, then to the corresponding Huffman list, then to the corresponding ASCII list, then to a string of bytes, and write these bytes to the file.
   else if no. of chunks = 1 then
      Trivial case: process the entire file at once and convert it to base 3, then to a Huffman list, which in turn is converted to an ASCII list and then to a stream of bytes, which are then written to a file.
   end if
   end if

by using DNA codes [20]. One of the most efficient source coding techniques, Huffman coding, is well known for data compression [21]. The DNA encoding by Huffman codes is uniquely decodable. In this software, a similar Huffman encoding is implemented [1]. For error correction, the overlapping codes of [1] are implemented, and data is retrieved from DNA with reduced error rates. The encoding module takes a data file of any format (text, png, jpg, mp3, mkv, etc.) as input. The encoded DNA sequence is divided into DNA chunks of fixed length, and parts of the DNA chunks are overlapped, implementing four-fold redundancy for error correction. The

Fig. 1. Functionality of DNACloud. This flowchart represents the basic function of DNACloud. It shows the three main modules: 1) Encode, 2) Decode, and 3) Estimator. Encode takes a data file of any type as input and gives DNA sequences as output. Decode takes DNA sequences as input and converts them back to the original data. Estimator is developed to estimate certain numerical values like memory req
18. tore last 75 DNA bases of the DNA string read.
   3: for chunk number 1 to total number of chunks 1 do
         Read the DNA string in the chunk.
         Append the read DNA string to the last stored temporary DNA string of 75 bases.
         Again divide the string into chunks of length 100, add index info, and store its last 75 DNA bases.
         Concatenate the list of chunks to the original list of chunks.
      end for
   4: Read the last chunk and append the read DNA string to the last stored temporary DNA string of 75 bases.
      Divide the string into chunks of length 100 and add index info to these chunks.
      Concatenate the list of chunks to the original list of chunks.
   else if no. of chunks = 1 then
      Read the entire file, divide the string into chunks of length 100, and add index info to these chunks (trivial case).
   end if
   end if
   This entire list obtained is stored in the .dnac file.

it may happen that the last chunk is not completely read; hence we keep removing the last byte from the read string until we reach a point before which the entire chunk is decodable.

Remark 3 (for Algorithm 4): The decoding here is also not always possible, since Huffman values are either of length 5 or of length 6. So we keep removing the last byte from the read string and try decoding again and again until it is decoded.

Algorithm 3: Algorithm for regenerating DNA string from DNA chunks

Algorithm 4: Algorithm for regenerating original file from DNA chunks

Require: .dnac file containing DNA chunks of length 117, obtained from algo
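The chunking in Algorithm 2 keeps the last 75 bases of each 100-base segment and reuses them at the start of the next one, so consecutive segments advance by 25 bases. A minimal in-memory sketch of that overlap is shown below; it ignores the buffered, chunk-by-chunk file reading of the algorithm and the index info, and the function name `overlapping_chunks` is hypothetical.

```python
def overlapping_chunks(dna: str, size: int = 100, step: int = 25) -> list[str]:
    """Split a DNA string into `size`-base segments that start every `step` bases.

    Assumes len(dna) is a multiple of `step` and at least `size`, which the
    zero padding described for Algorithm 1 is meant to guarantee.
    """
    if len(dna) < size or len(dna) % step:
        raise ValueError("DNA length must be >= size and a multiple of step")
    return [dna[i:i + size] for i in range(0, len(dna) - size + 1, step)]


dna = ("ACGT" * 50)[:200]             # 200-base toy sequence
chunks = overlapping_chunks(dna)
print(len(chunks))                    # 5
print(chunks[1][:75] == chunks[0][25:])  # True
```

Away from the two ends of the sequence, every base is covered by four segments, which is the four-fold redundancy the encoder relies on for error correction.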
19. uired for DNA storage, which depends on the memory configuration of your system, and the other estimates the biochemical properties of the DNA employed in the wet-lab experiments.

original file is converted to a Huffman base-3 code (0, 1, 2) with a code length of 5, which is transformed to DNA code according to the following conversion principle: each trit is substituted with one of the three nucleotides different from the preceding one, i.e., if G is the preceding base, then A, T, or C will be placed. This ensures that no homopolymers are generated, which reduces sequencing errors. With this code, if any DNA chunk or base is deleted, it can be regenerated by reading the overlapping code sequence. This module saves the encoded file with the extension <file format extension>.dnac; e.g., an image file will be encoded and saved as .png.dnac.

B. DNA Decoder (DNA to File)
To retrieve the data stored on DNA, the data has to be decoded from the DNA. Decoding follows the encoding steps in reverse. The data stored in DNA can be retrieved by excluding the index bits and converting the base-3 Huffman DNA codes back to the original data. This module takes the DNA sequence as input and gives the original stored data as output. The output of the sequencer can be used as input for this module. It takes a .dnac file as input.

C. Storage Estimator
This module gives various statistics and biochemical properties of the DNA for the encoded file. These estimated values are hel
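The conversion principle described for the encoder (each trit selects one of the three nucleotides that differ from the preceding base) can be sketched as below. The rotation table `NEXT_BASE` and the starting base "A" are assumptions chosen for illustration; the text only states the rule, not which nucleotide each trit maps to, so DNACloud's actual table may differ.

```python
# Assumed rotation table: for each preceding base, trits 0, 1, 2 select
# the three other bases in a fixed order. The base itself never appears.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits: str, prev_base: str = "A") -> str:
    """Encode a base-3 string so that no two adjacent bases are equal."""
    out = []
    for t in trits:
        prev_base = NEXT_BASE[prev_base][int(t)]
        out.append(prev_base)
    return "".join(out)


def dna_to_trits(dna: str, prev_base: str = "A") -> str:
    """Invert trits_to_dna, given the same starting base."""
    out = []
    for base in dna:
        out.append(str(NEXT_BASE[prev_base].index(base)))
        prev_base = base
    return "".join(out)


encoded = trits_to_dna("0000")
print(encoded)                 # CGTA
print(dna_to_trits(encoded))   # 0000
```

Passing the last base of the previous chunk as `prev_base` corresponds to the step in Algorithm 1 that stores the last base of each chunk's DNA string, so the no-repeat property also holds across chunk boundaries.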