Abstract:
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments
show that our algorithm outperforms existing compressors on typical DNA and protein
sequence datasets while maintaining a practical running time.

Most compression algorithms fall into one of two categories, namely substitutional
compression and statistical compression. Those in the former class replace a long
repeated subsequence by a pointer to an earlier instance of the subsequence or to
an entry in a dictionary.
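As a rough illustration of the expert-panel idea in the abstract, the sketch below blends the predictions of two toy experts into one distribution and records the information content of each symbol in bits. The abstract does not say how expert probabilities are combined, so the weighted average, the weight update, and the names `uniform_expert`, `frequency_expert`, `combine` and `information_sequence` are all assumptions made for illustration, not the paper's method.

```python
import math

ALPHABET = "ACGT"

def uniform_expert(history):
    """Hypothetical expert: every nucleotide is equally likely."""
    return {s: 1.0 / len(ALPHABET) for s in ALPHABET}

def frequency_expert(history):
    """Hypothetical expert: add-one smoothed symbol frequencies seen so far."""
    total = len(history) + len(ALPHABET)
    return {s: (history.count(s) + 1) / total for s in ALPHABET}

def combine(predictions, weights):
    """Blend expert distributions as a weighted average (an assumed rule)."""
    norm = sum(weights)
    return {s: sum(w * p[s] for w, p in zip(weights, predictions)) / norm
            for s in ALPHABET}

def information_sequence(seq, experts):
    """Return the per-symbol information content, -log2 p(symbol), in bits."""
    weights = [1.0] * len(experts)
    info = []
    for i, symbol in enumerate(seq):
        history = seq[:i]
        predictions = [expert(history) for expert in experts]
        mixed = combine(predictions, weights)
        info.append(-math.log2(mixed[symbol]))
        # Assumed update: scale each expert's weight by the probability it
        # gave to the symbol that actually occurred, then renormalise.
        weights = [w * p[symbol] for w, p in zip(weights, predictions)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return info

bits = information_sequence("AAAAAAAACG", [uniform_expert, frequency_expert])
```

Under these assumptions, the repeated `A` symbols become cheaper as the frequency expert gains weight, while the unexpected `C` and `G` cost more. An arithmetic coder driven by the same combined distribution would spend close to `sum(bits)` bits on the sequence.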
1. Introduction
Modelling DNA and protein sequences is an important step in understanding biology.
Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),
cytosine (C), guanine (G) and thymine (T). In its double-helix form, two complementary
strands are joined by hydrogen bonds joining A with T and C with G. The
reverse complement of a DNA sequence is also considered when comparing DNA
sequences. Certain regions in a…
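The base pairing and reverse complement described above can be sketched in a few lines. The pairing of A with T and C with G comes from the text; the names `COMPLEMENT` and `reverse_complement` are hypothetical and chosen only for this example.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Complement each base and read the strand in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("AACGT"))  # ACGTT
```

Applying the function twice returns the original sequence, which is a quick sanity check when comparing a sequence against both strands.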