Genome Database Mining

K.RAMESH

M.Tech(C.S.E)

Roll. No: 01410116

kramesh@iitg.ernet.in

Objectives:

Extraction or discovery of patterns from biological databases such as genetic and protein databases.
Classification of the sequences in these databases based on the frequency of occurrence of patterns.

Description:

Genome database mining is the activity of extracting relevant information from raw databases of biological data. Biological databases are classified by their contents into types such as genetic databases and protein databases. Computer-based analysis of biosequences increasingly affects the field of biology. Computational biosequence analysis and database searching tools are now an integrated and essential part of the field and have led to numerous scientific discoveries in the last few years. Most of these have resulted from database searches revealing unexpected similarities between molecules not previously known to be related.

My work aims at discovering patterns in these biological databases. First, I extract patterns from a protein database; the approach is then generalized to any biological database. The database consists of protein sequences, where each sequence is a string of characters of varying length and each character represents an amino acid. The alphabet consists of 20 different amino acids. Real protein sequences are available in FASTA format, which is more widely used for storing sequences than the flat-file format. Each protein sequence can have a length in the range of 250 to 1000 characters.
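
As a minimal sketch of how such input could be read, the following Python fragment parses FASTA-formatted text into sequences and checks them against the 20-letter amino acid alphabet. The function names (parse_fasta, is_valid_protein) and the sample records are hypothetical and are not taken from my implementation.

```python
from typing import Dict, Iterable

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines.

    A record starts with a '>' header line; the sequence may span
    several following lines, which are concatenated.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def is_valid_protein(seq: str) -> bool:
    """True if seq is non-empty and uses only the 20 standard codes."""
    return bool(seq) and set(seq) <= AMINO_ACIDS


# Hypothetical sample input, for illustration only.
sample = """>seq1 hypothetical example
MKVLA
AGIVG
>seq2 hypothetical example
MKVLSGG
""".splitlines()

records = parse_fasta(sample)
```

Real databases may also use ambiguity codes such as X or B; this sketch treats them as invalid, which is an assumption rather than a requirement of the format.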

Implementation Details:

In order to mine the sequences in the database, a tree structure called a “Generalized Suffix Tree” is built. This structure represents all the sequences in the database and embeds every suffix of each sequence. For a fixed alphabet such as the 20 amino acids, the time and space needed to construct it are O(n), where ‘n’ is the total length of all the sequences.
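
To illustrate the idea of embedding all suffixes in one structure, here is a simplified sketch in Python. It builds a naive, uncompressed suffix trie over several sequences, not a true generalized suffix tree: construction here takes quadratic time and space, whereas the O(n) bound requires a linear-time algorithm such as Ukkonen's. All names (Node, build_generalized_suffix_trie, sequences_containing) and the toy sequences are hypothetical.

```python
from typing import Dict, List, Set


class Node:
    """One trie node; the path from the root spells a substring."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        # Indices of the sequences that contain this node's substring.
        self.seq_ids: Set[int] = set()


def build_generalized_suffix_trie(sequences: List[str]) -> Node:
    """Insert every suffix of every sequence (naive, O(n^2) overall)."""
    root = Node()
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node.children.setdefault(ch, Node())
                node.seq_ids.add(seq_id)
    return root


def sequences_containing(root: Node, pattern: str) -> Set[int]:
    """Return the indices of the sequences in which pattern occurs."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return set()
    return set(node.seq_ids)


# Toy sequences, for illustration only.
trie = build_generalized_suffix_trie(["MKVLA", "AKVLG", "GGMKV"])
print(sequences_containing(trie, "KVL"))
```

Because every node records which sequences pass through it, the number of sequences sharing a substring can be read off after a single walk from the root, which is the property the mining step relies on.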

The user provides requirements such as:

· Type of pattern to be extracted from the database.

· Minimum occurrence of the pattern.

· Allowed number of mutations.

· Minimum length of the pattern.

Based on these requirements, the candidate patterns are obtained by traversing the generalized suffix tree. By evaluating the frequency of occurrence of these patterns in the total database of sequences, the most-likely patterns are obtained. These are the desired patterns that meet the user’s requirements. Using the frequency of occurrence of the patterns as an evaluation criterion, the sequences in the database can be classified.
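
The steps above can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the suffix-tree traversal itself: candidates are all substrings of exactly the minimum length, a mutation is assumed to be a single-character substitution, occurrence is counted as the number of sequences containing the pattern, and the classification rule (a sequence is a member if it contains at least a threshold number of frequent patterns) is a hypothetical example. The "type of pattern" requirement is omitted, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Requirements:
    min_length: int      # minimum length of the pattern
    min_occurrence: int  # minimum number of sequences containing it
    max_mutations: int   # allowed mismatches (substitutions only, assumed)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(pattern: str, seq: str, max_mutations: int) -> bool:
    """True if some window of seq matches pattern within max_mutations."""
    k = len(pattern)
    return any(hamming(pattern, seq[i:i + k]) <= max_mutations
               for i in range(len(seq) - k + 1))


def mine_patterns(sequences: List[str],
                  req: Requirements) -> Dict[str, List[int]]:
    """Map each sufficiently frequent pattern to the sequences it hits."""
    k = req.min_length
    candidates = {s[i:i + k] for s in sequences
                  for i in range(len(s) - k + 1)}
    frequent: Dict[str, List[int]] = {}
    for pat in sorted(candidates):
        hits = [idx for idx, s in enumerate(sequences)
                if occurs_in(pat, s, req.max_mutations)]
        if len(hits) >= req.min_occurrence:
            frequent[pat] = hits
    return frequent


def classify(n_sequences: int, frequent: Dict[str, List[int]],
             threshold: int = 1) -> List[bool]:
    """Label a sequence True if it contains >= threshold frequent patterns."""
    counts = [0] * n_sequences
    for hits in frequent.values():
        for idx in hits:
            counts[idx] += 1
    return [c >= threshold for c in counts]


# Toy sequences, for illustration only.
sequences = ["MKVLA", "AKVLG", "GGMKV", "PPPPP"]
frequent = mine_patterns(sequences, Requirements(3, 2, 0))
labels = classify(len(sequences), frequent)
```

With a generalized suffix tree, the candidate enumeration and the per-pattern counting would be replaced by a traversal of the tree, but the filtering by the user's thresholds would follow the same logic.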

Current status:

Currently, I am performing genome database mining, i.e., pattern discovery, on sample sequences in FASTA format. I still have to extend this approach to the real protein sequences of a protein database. The real sequences can be downloaded from the Internet. There are different families of protein sequences; for my work, I chose the kinase family. Based on the pattern frequency, I can classify the sequences in the database.

Advantages:

· It reveals homologies between sequences that are not discernible from the raw databases.

· It helps in classification of sequences.

· Pattern discovery is widely used in many biological fields, such as gene therapy and the evaluation of drug performance.

Resources:

· http://www.ncbi.nlm.nih.gov/

· http://www.biodatabases.com/

· http://www.kdnuggets.com/

· http://www.ebi.ac.uk/

· www.uq.edu.au/vdu/Geneticslinks.htm

Relevant papers:

· John L. Houle, Wanda Cadigan, Sylvian Henry, Anu Pinnamaneni and Sonny Lundahl. Database Mining in the Human Genome Initiative.

· A. Floratos, I. Jurisica and I. Rigoutsos. Knowledge Discovery in Biological Domains.

· Martin Tompa. Lecture Notes on Biological Sequence Analysis. Technical Report #2000-06-01.

· Usama Fayyad, David Haussler and Paul Stolorz. Mining Scientific Data.

· K. A. Frenkel. The human genome project and informatics. Communications of the ACM, 34(11): 41-51, Nov. 1991.