Methodology

The average free energy (E) over known promoter sequences and the difference (D) between E and the average free energy over downstream random sequences (REav) are used to search for promoters in genomic sequences. The difference in free energy (stability) between neighboring regions is calculated and compared with the assigned cutoff (obtained from the energy difference between upstream and downstream regions in the vicinity of known transcription start sites, TSSs) to predict promoters in genomic DNA sequences (Rangannan V and Bansal M 2007, 2009).

The stability of a double-stranded DNA molecule can be expressed as the sum of the free energies of its constituent base-paired dinucleotides. In the present study, the free energy over a long continuous stretch of DNA sequence was calculated by dividing the sequence into overlapping windows of 15 base pairs (14 dinucleotide steps). The energy values for the 16 dinucleotide steps (10 unique dinucleotides) are taken from the unified parameters obtained from melting studies on 108 oligonucleotides (Allawi and SantaLucia 1997; SantaLucia 1998).

Dinucleotide step	Free energy (kcal/mol)
AA	-1.0
TT	-1.0
AT	-0.88
TA	-0.58
CA	-1.45
TG	-1.45
AC	-1.44
GT	-1.44
CT	-1.28
AG	-1.28
GA	-1.30
TC	-1.30
CG	-2.17
GC	-2.24
GG	-1.84
CC	-1.84
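
The window calculation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names DINUC_ENERGY, window_energy and free_energy_profile are hypothetical, the 1 nt step between successive windows is an assumption (the text only says the windows overlap), and each window is reported as a sum of its 14 dinucleotide steps rather than an average.

```python
# Unified nearest-neighbor free energies (kcal/mol), copied from the table above.
DINUC_ENERGY = {
    "AA": -1.00, "TT": -1.00, "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45, "AC": -1.44, "GT": -1.44,
    "CT": -1.28, "AG": -1.28, "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24, "GG": -1.84, "CC": -1.84,
}

WINDOW = 15  # base pairs per window, i.e. 14 dinucleotide steps


def window_energy(window):
    """Sum the free energies of the overlapping dinucleotide steps in one window."""
    return sum(DINUC_ENERGY[window[i:i + 2]] for i in range(len(window) - 1))


def free_energy_profile(seq, window=WINDOW):
    """Return one summed free energy per window, sliding 1 nt at a time (assumed step)."""
    seq = seq.upper()
    return [window_energy(seq[i:i + window]) for i in range(len(seq) - window + 1)]
```

In this sketch, characters other than A, C, G and T raise a KeyError; how ambiguous bases are treated in the original method is not stated in the text.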

Protein-coding promoter sequences, each 1001 nt long, were grouped on the basis of their GC composition at 5 %GC intervals, and their average free energy was calculated. The average free energies over promoter regions (specifically the 1001 nt regions spanning -500 to +500 with respect to the TSSs) with similar GC composition are observed to be approximately the same within each group. The average free energy (E) over known promoter sequences and the difference (D) between E and the average free energy over downstream random sequences (REav) were used as cutoff values. These threshold values (E and D) have been generalized for every 5 %GC interval. The cutoff values were then applied to DNA sequences of 1001 nt length (spanning -500 to +500 with respect to the annotated TSSs) as well as to whole genome sequences to predict promoter regions.
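
The grouping by GC composition and the definitions of E and D can be illustrated with a short Python sketch. The function names are hypothetical, and two details are assumptions because the text does not specify them: intervals are taken as half-open bins starting at 0 (0 to 5, 5 to 10, and so on, with 100 %GC placed in the top bin), and D is computed as E minus REav.

```python
from statistics import mean


def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, width=5):
    """Return the (low, high) bounds of the %GC interval containing the sequence."""
    low = int(gc_percent(seq) // width) * width
    low = min(low, 100 - width)  # assumption: 100 %GC falls in the top bin
    return (low, low + width)


def e_and_d(promoter_energies, reav):
    """E is the mean free energy over promoters in one bin; D = E - REav (assumed sign)."""
    e = mean(promoter_energies)
    return e, e - reav
```

Sequences would first be assigned to a bin with gc_bin, after which e_and_d yields that bin's two threshold values from the per-sequence average free energies.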

Average free energy profile for 429 E. coli promoters that are more than 500 nt apart (taken from EcoCyc Database version 9.1). The blue line represents the average stability profile for shuffled sequences. E-cutoff and D-cutoff are used as threshold values to predict promoter regions.

True positive, false positive and false negative definition for known promoter sequences

A promoter is considered to be predicted correctly if it meets at least one of the following three conditions (as illustrated in the figure below): (A) a transcription start site (TSS) lies within the predicted promoter region; (B) the predicted promoter region lies within the 200 nt region spanning from 150 nt upstream of the TSS (-150) to 50 nt downstream of the TSS (+50); (C) the predicted promoter region overlaps with the 200 nt region mentioned above, meaning that at least 20 nt of the predicted promoter region overlaps with the 200 nt region or, failing that, the overhang lying outside the 200 nt region is less than 20 nt long. If a predicted promoter region for the 1001 nt long genomic sequence (spanning -500 to +500 with respect to the annotated TSS) meets any of the above criteria, it is considered a true positive. If more than one predicted region satisfies any of the above conditions, the one nearest to the TSS is considered the true positive. All other predicted promoter signals are considered false positives. If no promoter signal is located in a 1001 nt long sequence, it is considered a false negative.
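
The three conditions can be expressed as a small Python check. This is a sketch under stated assumptions rather than the authors' code: predicted regions are given as inclusive (start, end) coordinates relative to the TSS, the TSS is placed at position 0, condition (C) is read as "overlap of at least 20 nt, or otherwise an overhang shorter than 20 nt", and distance to the TSS is measured from the nearest end of a region. All function names are hypothetical.

```python
TARGET_START, TARGET_END = -150, 50  # region around the TSS given in the text


def is_true_positive(start, end, min_len=20):
    """Check conditions (A), (B) and (C) for one predicted region (TSS assumed at 0)."""
    if start > end:
        raise ValueError("start must not exceed end")
    contains_tss = start <= 0 <= end                              # condition (A)
    inside_target = TARGET_START <= start and end <= TARGET_END   # condition (B)
    overlap = max(0, min(end, TARGET_END) - max(start, TARGET_START) + 1)
    overhang = (end - start + 1) - overlap
    # condition (C), under the reading described in the lead-in
    partial = overlap >= min_len or (overlap > 0 and overhang < min_len)
    return contains_tss or inside_target or partial


def distance_to_tss(start, end):
    """0 if the region contains the TSS, else distance from its nearest end (assumed)."""
    return 0 if start <= 0 <= end else min(abs(start), abs(end))


def classify(regions):
    """Return (true_positive, false_positives) for the regions predicted in one sequence."""
    hits = [r for r in regions if is_true_positive(*r)]
    if not hits:
        return None, list(regions)
    tp = min(hits, key=lambda r: distance_to_tss(*r))
    return tp, [r for r in regions if r != tp]
```

An empty list of regions returns (None, []), which corresponds to the false negative case described above.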

Illustration of various cases of true positive prediction. In the following figures, the gray shaded region represents the promoter region as predicted by our method. A promoter is considered to be predicted correctly (TP) if it meets at least one of the three conditions.