Motif Detection in Yeast

- Vishakh, Joe Bertolami, Nick Urrea, Jeff Weiss

Objective

The objective of our project is to detect regulatory sequences in genetic data.

Biological Perspective

Regulatory sequences are segments of DNA where proteins can bind to enhance transcription of a gene. We will specifically be looking at upstream promoters in yeast, a single-celled eukaryote. We hope to find sequences whose observed frequencies significantly exceed their expected values, as calculated using statistical tools.

The sequences we ultimately select could have practical value. Yeast and humans share many genes, including yeast counterparts of genes implicated in human diseases such as cystic fibrosis. The ability to isolate regulatory sequences in simpler organisms might therefore lead to medical advances in treating genetic disorders in humans.

Computational Perspective

We will be given several genomic strings consisting of the characters 'A', 'T', 'C' and 'G'. We will pick out the most frequently occurring substrings and check which ones are unusually frequent relative to the whole genome. The process can be illustrated by considering the strings below:

TCGAAAGATTTGCT

GATTGCTAACGTCC

TATGGATTGCGCAT

TTTTTTTTTTGATT

In the above strings, the substring 'GATT' would be very interesting to us: it occurs in all four strings and is fairly long. Since copies of a regulatory sequence often differ slightly from one another, our code will have to handle minor variations such as the ones below:

TCGAAAGATTTGCT

GAATGCTAACGTCC

TATGGACTGCGCAT

TTTTTTTTTTGAAT
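The mismatch handling described above can be sketched as follows. This is only an illustration, assuming that "minor variance" means a bounded number of substituted bases (Hamming distance) with no insertions or deletions; the function names and the `max_mismatches` parameter are our own and not part of any existing tool.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(motif, sequence, max_mismatches=0):
    """True if some window of `sequence` is within `max_mismatches` of `motif`."""
    k = len(motif)
    return any(
        hamming(motif, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def support(motif, sequences, max_mismatches=0):
    """Count how many sequences contain the motif, tolerating substitutions."""
    return sum(occurs_in(motif, s, max_mismatches) for s in sequences)


# The two sets of example strings from the text.
exact = ["TCGAAAGATTTGCT", "GATTGCTAACGTCC", "TATGGATTGCGCAT", "TTTTTTTTTTGATT"]
noisy = ["TCGAAAGATTTGCT", "GAATGCTAACGTCC", "TATGGACTGCGCAT", "TTTTTTTTTTGAAT"]

print(support("GATT", exact))                    # 4
print(support("GATT", noisy))                    # 1
print(support("GATT", noisy, max_mismatches=1))  # 4
```

Allowing one substitution recovers 'GATT' in all four noisy strings. The trade-off is that short motifs with a mismatch allowance also match by chance far more often, which is why the statistical comparison against the genome matters.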

In effect, our code will construct a table of substrings of length two or greater along with their frequencies. It will then pick out the most common substrings and look up their frequencies in the whole genome. The ones that stand out, i.e. are much more frequent in the given strings than in the genome, will be reported. This decision will be based on the expected frequencies, calculated using Poisson distributions.
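A minimal sketch of this pipeline is given below, under several assumptions of ours: substrings are counted with overlaps, the upper length limit `max_len` is arbitrary, and the number of occurrences is modelled as a Poisson variable whose mean is the genome-wide per-position rate times the number of windows in the given strings. All function names are hypothetical, and the uniform background rate used in the demonstration is a stand-in for a rate measured on the real genome.

```python
import math
from collections import Counter


def kmer_counts(sequences, min_len=2, max_len=6):
    """Table of every substring of length min_len..max_len with its frequency."""
    counts = Counter()
    for seq in sequences:
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


def background_rate(motif, genome):
    """Occurrences of `motif` per window of the genome (overlaps included)."""
    k = len(motif)
    windows = len(genome) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(genome[i:i + k] == motif for i in range(windows))
    return hits / windows


def expected_count(motif, sequences, rate):
    """Expected occurrences in `sequences` if the motif appeared at `rate`."""
    windows = sum(max(len(s) - len(motif) + 1, 0) for s in sequences)
    return rate * windows


def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i)
        for i in range(observed)
    )
    return max(0.0, 1.0 - below)


sequences = ["TCGAAAGATTTGCT", "GATTGCTAACGTCC", "TATGGATTGCGCAT", "TTTTTTTTTTGATT"]
table = kmer_counts(sequences)

# Assumption: with no genome at hand, use a uniform background in which
# every 4-mer occurs at rate (1/4)**4; real use would call background_rate().
rate = 0.25 ** 4
observed = table["GATT"]
expected = expected_count("GATT", sequences, rate)
print(observed, expected, poisson_tail(observed, expected))
```

A small tail probability marks a substring as over-represented relative to the background. The Poisson model treats occurrences as independent, which is only approximate for self-overlapping words such as 'TTTT'.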