- Vishakh, Joe Bertolami, Nick Urrea, Jeff Weiss
Objective
The objective of our project is to detect regulatory sequences in genetic data.
Biological Perspective
Regulatory sequences are segments of DNA where proteins can bind to enhance transcription of a gene. We will look specifically at upstream promoters in yeast, a simple eukaryote. We hope to find sequences whose frequencies exceed their expected values, calculated using statistical tools.
The sequences that we finally select could be of practical use. Since yeast and humans share many genes, including relatives of genes implicated in human diseases such as cystic fibrosis, the ability to isolate regulatory sequences in simpler organisms might lead to medical advances that treat genetic disorders in humans.
Computational Perspective
We will be given several genomic strings consisting of the characters 'A', 'T', 'C' and 'G'. We will pick out the most frequently occurring substrings and check which ones are unusually frequent relative to the whole genome. The process can be illustrated by considering the strings below:
TCGAAAGATTTGCT
GATTGCTAACGTCC
TATGGATTGCGCAT
TTTTTTTTTTGATT
In the above strings, the substring 'GATT' would be very interesting to us: it appears in all four strings and is fairly long. Since regulatory sequences are prone to errors, our code will have to handle minor variations such as those below, where each string contains a copy of 'GATT' with at most one substituted base:
TCGAAAGATTTGCT
GAATGCTAACGTCC
TATGGACTGCGCAT
TTTTTTTTTTGAAT
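One possible way to tolerate such variations is to accept any window that differs from the candidate substring in at most a fixed number of positions (Hamming distance). The sketch below is only an illustration under that assumption: the function names hamming and find_approx and the threshold of one mismatch are ours and are not part of the project specification.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approx(text, motif, max_mismatches=1):
    """Start indices where motif matches text with at most
    max_mismatches substituted characters (no insertions or deletions)."""
    k = len(motif)
    return [i for i in range(len(text) - k + 1)
            if hamming(text[i:i + k], motif) <= max_mismatches]


# The two sets of example strings from the text.
exact = ["TCGAAAGATTTGCT", "GATTGCTAACGTCC", "TATGGATTGCGCAT", "TTTTTTTTTTGATT"]
noisy = ["TCGAAAGATTTGCT", "GAATGCTAACGTCC", "TATGGACTGCGCAT", "TTTTTTTTTTGAAT"]

# With zero mismatches allowed this reduces to an exact search.
exact_hits = [find_approx(s, "GATT", 0) for s in exact]
# Allowing one mismatch recovers the altered copies (GAAT, GACT) as well,
# at the cost of also admitting other near-matches elsewhere in each string.
noisy_hits = [find_approx(s, "GATT", 1) for s in noisy]

print(exact_hits)
print(all(noisy_hits))
```

Note that a looser threshold admits more chance matches, so the mismatch limit would have to be chosen together with the statistical test that decides which substrings are reported.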
In effect, our code will construct a table of all substrings of length two or greater in the given strings, together with their frequencies. It will then pick out the most common substrings and look up their frequencies in the whole genome. The ones that stand out, i.e. those that are much more frequent in the given strings than in the genome, will be reported. Expected frequencies will be modeled with Poisson distributions, so that a substring is reported when its observed count is improbably high under that model.
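The procedure described above could be sketched roughly as follows. This is a minimal illustration and not the final design: the function names, the cap on substring length, and the randomly generated stand-in "genome" are all assumptions made for the example. The expected count is taken to be the substring's per-window frequency in the genome multiplied by the number of windows in the given strings, and the score is the Poisson probability of seeing at least the observed count.

```python
import math
import random
from collections import Counter


def count_substrings(strings, min_len=2, max_len=4):
    """Table of every substring of length min_len..max_len with its total count."""
    table = Counter()
    for s in strings:
        for k in range(min_len, max_len + 1):
            for i in range(len(s) - k + 1):
                table[s[i:i + k]] += 1
    return table


def count_overlapping(text, motif):
    """Occurrences of motif in text, counting overlapping matches."""
    k = len(motif)
    return sum(text[i:i + k] == motif for i in range(len(text) - k + 1))


def poisson_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam)."""
    if observed <= 0:
        return 1.0
    below = sum(math.exp(-lam) * lam ** j / math.factorial(j)
                for j in range(observed))
    return max(0.0, 1.0 - below)


def score(motif, strings, genome):
    """Observed count, expected count under the genome background,
    and the Poisson tail probability of the observed count."""
    k = len(motif)
    observed = sum(count_overlapping(s, motif) for s in strings)
    genome_freq = count_overlapping(genome, motif) / (len(genome) - k + 1)
    sample_windows = sum(max(len(s) - k + 1, 0) for s in strings)
    lam = genome_freq * sample_windows
    return observed, lam, poisson_tail(observed, lam)


strings = ["TCGAAAGATTTGCT", "GATTGCTAACGTCC", "TATGGATTGCGCAT", "TTTTTTTTTTGATT"]

# Stand-in background only: a real run would use the actual genome sequence.
random.seed(0)
genome = "".join(random.choices("ACGT", k=10000))

table = count_substrings(strings)
observed, lam, p = score("GATT", strings, genome)
print(table.most_common(5))
print(observed, lam, p)
```

A small tail probability marks a substring as over-represented relative to the background. Short substrings such as 'TT' will dominate the raw frequency table, which is why the comparison against the genome, rather than the raw count, decides what is reported.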