Welcome to Pengyu Hong's Homepage

MotifBooster

Identifying transcription factor binding sites (TFBSs) is an important step toward understanding gene regulation. An accurate binding model for a transcription factor (TF) is essential for distinguishing its true binding targets from spurious sites that resemble the true motif pattern yet are non-functional. This paper describes a boosting approach for modeling TF-DNA binding. Unlike the widely used weight matrix model, which predicts TF-DNA binding from a linear combination of position-specific contributions, our approach builds a TF binding classifier by combining a set of weight-matrix-based linear classifiers, which yields a non-linear binding decision rule. We apply the approach to ChIP-chip data from Saccharomyces cerevisiae. Compared with the weight matrix method, the new approach shows significant improvements in specificity in a majority of cases.
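To make the contrast between the two models concrete, here is a minimal Python sketch. It is not the MotifBooster implementation (which is written in C#) and does not reproduce the exact formulation in the paper; it assumes a generic AdaBoost-style weighted vote over thresholded weight-matrix scores. All function names, the toy matrices, the thresholds, and the vote weights are made up for illustration.

```python
# Toy illustration only: names, matrices, thresholds and weights are assumptions.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_score(pwm, site):
    """Weight matrix model: sum of position-specific contributions (linear)."""
    return sum(pwm[i][BASE_INDEX[base]] for i, base in enumerate(site))

def best_site_score(pwm, seq):
    """Score of the best-matching window of the matrix width within seq."""
    width = len(pwm)
    return max(pwm_score(pwm, seq[i:i + width])
               for i in range(len(seq) - width + 1))

def base_classifier(pwm, threshold, seq):
    """Weight-matrix-based linear classifier: +1 (bound) or -1 (not bound)."""
    return 1 if best_site_score(pwm, seq) >= threshold else -1

def ensemble_classify(models, seq):
    """Weighted vote over base classifiers.

    models is a list of (vote_weight, pwm, threshold) tuples. Because each
    base classifier is thresholded before voting, the combined rule is no
    longer a single linear function of the sequence.
    """
    vote = sum(weight * base_classifier(pwm, threshold, seq)
               for weight, pwm, threshold in models)
    return 1 if vote > 0 else -1

# Two hypothetical width-2 matrices: one favours "AC", the other "GT".
pwm_ac = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
pwm_gt = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
models = [(0.7, pwm_ac, 2.0), (0.5, pwm_gt, 2.0)]

for seq in ("TTACTT", "TTGTTT", "TTTT"):
    print(seq, ensemble_classify(models, seq))
```

In this toy setup the classifier built on the "AC" matrix carries the larger vote weight, so a sequence containing only "GT" is still rejected. Real weight matrices normally hold log-odds scores rather than 0/1 entries.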

Reference: Hong, P., Liu, X. S., Zhou, Q., Lu, X., Liu, J. S., Wong, W. H. "A Boosting Approach for Motif Modeling Using ChIP-chip Data". Submitted to Bioinformatics.

Supplementary data (WinZip is required to decompress the files.):

(a) ChIP-chip data (Reference: Lee, T. I., et al., Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 2002. 298(5594): p. 799-804.)

(b) The upstream sequences (up to 800 bases) of yeast genes.

(c) MotifBooster (tested on Windows XP; implemented in C#). To use it: (1) Load the positive and negative sequence data in FASTA format. (2) Load the initial PSSM (weight matrix model). (3) Specify where to save the boosted model. (4) Click the "Boosting" button to run.
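For readers who want to inspect or prepare the positive and negative FASTA files before loading them, the following is a small Python sketch of a FASTA reader. It is independent of MotifBooster; the function name and the gene names in the example are hypothetical. The PSSM file format expected by the program is not described here, so it is not covered.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    Header lines start with ">"; the sequence lines that follow are
    upper-cased and joined until the next header. Blank lines are skipped.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}

# Hypothetical example input; real files would be read with open(path).
positive_text = """>geneA
acgtacgt
ACGT
>geneB
TTGACA
"""
positives = read_fasta(positive_text.splitlines())

# Pair each sequence with a class label (+1 for positive sequences).
labelled = [(seq, 1) for seq in positives.values()]
print(labelled)
```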

(d) Detailed data 1: Organized by TF; each directory is named after its TF. Each TF's directory contains: (1) the positive and negative sequences for that TF, (2) the seed matrix generated by Motif Regressor, if there is one, (3) the logo of the seed matrix, if there is one, (4) the ensemble models for that TF, and (5) the logos of the ensemble models. Each directory has a README file.

(e) Detailed data 2: An Excel file containing: (1) a data summary and cross-validation results for 31 ChIP-chip data sets; (2) the contributions of the base classifiers in the “leave-one-out” cross-validation tests; and (3) the sensitivity and specificity of the base classifiers in the ensemble models (includes several plots).
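As a reminder of what the leave-one-out, sensitivity, and specificity figures mean, here is a generic Python sketch. It does not reproduce the exact protocol behind the Excel file; the stand-in classifier (a threshold on a single numeric score) and all names and numbers are assumptions chosen only to keep the example short.

```python
def leave_one_out(examples, train, predict):
    """Hold out each example in turn, train on the rest, predict the held-out one."""
    predictions = []
    for i, (features, _label) in enumerate(examples):
        training_set = examples[:i] + examples[i + 1:]
        model = train(training_set)
        predictions.append(predict(model, features))
    return predictions

def sensitivity_specificity(labels, predictions):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == -1)
    tn = sum(1 for y, p in pairs if y == -1 and p == -1)
    fp = sum(1 for y, p in pairs if y == -1 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Stand-in classifier (an assumption): threshold halfway between class means.
def train_threshold(examples):
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, score):
    return 1 if score >= threshold else -1

# Made-up scores with labels: +1 = bound, -1 = not bound.
examples = [(0.9, 1), (0.8, 1), (0.7, 1), (0.2, -1), (0.1, -1), (0.6, -1)]
predictions = leave_one_out(examples, train_threshold, predict_threshold)
labels = [label for _score, label in examples]
sensitivity, specificity = sensitivity_specificity(labels, predictions)
print(predictions, sensitivity, specificity)
```

In this toy data the negative example with score 0.6 is misclassified when held out, so specificity is 2/3 while sensitivity is 1.0.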