High-order statistical compressor for long-term storage of DNA sequencing data∗
1 Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
2 Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland.
We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Because the method is optimized for compression quality, it is especially suitable for long-term storage and for genome research centers that process huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding; these are similar to Markov models, but the whole input is considered when building a symbol context. DNA reads are compressed in an LZ style using a model of order 5 to 7, while nucleotide quality scores are encoded with a 3rd-order model.
Mathematics Subject Classification: 68P20 / 68P30 / 68W32 / 92D20
Key words: High-throughput DNA sequencing / data compression / FASTQ files
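
As a rough illustration of the order-k context modelling mentioned in the abstract, the sketch below estimates how many bits an ideal entropy coder (such as a range coder) would spend on a nucleotide string under an adaptive order-k model. It is a plain Markov-style simplification and not the compressor described here: the function name `estimated_bits`, the four-letter alphabet, and the add-one count initialisation are all assumptions made for this example, and no actual range coder or LZ-style matching is implemented.

```python
import math
from collections import defaultdict

# Assumed alphabet for this sketch; real FASTQ reads may also contain 'N',
# which this example does not handle (it would raise ValueError).
ALPHABET = "ACGT"


def estimated_bits(seq, order):
    """Ideal code length in bits of `seq` under an adaptive order-`order` model.

    Each context (the preceding `order` symbols, shorter at the start of the
    sequence) keeps its own symbol counts, initialised to 1 so that no symbol
    ever has zero probability. The cost of a symbol is -log2 of its current
    estimated probability; counts are updated after the symbol is "coded".
    """
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - order):i]
        freq = counts[context]
        idx = ALPHABET.index(symbol)
        total_bits += -math.log2(freq[idx] / sum(freq))
        freq[idx] += 1  # adaptive update, mirrored by a decoder
    return total_bits


if __name__ == "__main__":
    read = "ACGT" * 50  # hypothetical, highly repetitive toy input
    for k in (0, 2, 5):
        print(f"order {k}: {estimated_bits(read, k):.1f} bits")
```

On repetitive input such as the toy string above, a higher-order context predicts the next nucleotide far better than an order-0 model, which is the intuition behind using higher-order models for reads.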
