Volume 50, Number 2, April-June 2016
Special issue: Recent Advances in Operations Research in Computational Biology, Bioinformatics and Medicine
Page(s): 351-361
Published online: 24 March 2016
High-order statistical compressor for long-term storage of DNA sequencing data
1 Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
2 Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland
Accepted: 21 September 2015
We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Because the method is optimized for compression quality, it is especially suitable for long-term storage and for genome research centers that process huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding; these are similar to Markov models, but the whole input is considered when building a symbol context. DNA reads are compressed in an LZ-style manner using a 5th- to 7th-order model, while nucleotide quality scores are encoded with a 3rd-order model.
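To illustrate why a higher-order context model helps a range coder on repetitive nucleotide data, the following minimal sketch estimates the ideal code length of a sequence under an adaptive order-k model. It is not the authors' compressor: the function name `estimated_bits`, the five-symbol alphabet, and the add-one smoothing are assumptions made for illustration, and the context here is simply the preceding k symbols rather than the whole-input context described in the abstract.

```python
import math
from collections import defaultdict

# Assumed alphabet for illustration; symbols outside it raise KeyError.
ALPHABET = "ACGTN"

def estimated_bits(seq, order=3):
    """Return the ideal adaptive code length (in bits) of `seq`
    under an order-`order` context model with add-one smoothing.
    A range coder driven by the same model would approach this size."""
    # counts[context][symbol]; every symbol starts at 1 so that
    # no symbol ever has zero probability.
    counts = defaultdict(lambda: {s: 1 for s in ALPHABET})
    bits = 0.0
    for i, sym in enumerate(seq):
        # Context = up to `order` preceding symbols (shorter at the start).
        ctx = seq[max(0, i - order):i]
        table = counts[ctx]
        total = sum(table.values())
        # Cost of coding `sym` with its current estimated probability.
        bits += -math.log2(table[sym] / total)
        # Adaptive update: the decoder can mirror this step exactly.
        table[sym] += 1
    return bits

if __name__ == "__main__":
    read = "ACGT" * 200
    print(estimated_bits(read, order=0))
    print(estimated_bits(read, order=3))
```

On a periodic input such as the one above, the order-3 estimate is far smaller than the order-0 estimate, because each three-symbol context almost always predicts the next symbol. Real reads are much less regular, so the gain in practice depends on the data.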
Mathematics Subject Classification: 68P20 / 68P30 / 68W32 / 92D20
Key words: High-throughput DNA sequencing / data compression / FASTQ files
© EDP Sciences, ROADEF, SMAI 2016