Welcome to Pengyu Hong's Homepage

My group carries out interdisciplinary research to advance Informatics in the fields of biology, biochemistry, e-healthcare, fintech, intelligent education, and mobile social. Some of our research results are summarized below. More details on intelligent education (with a startup) and mobile social (online social content analysis) will follow. On the machine learning side, I continue to work on Automatic Spatial and Temporal Pattern Modeling and Detection, which I believe will come to the forefront and deliver useful tools thanks to the recent rise of Deep Learning and the availability of related large datasets. I have also done some research on Human-Computer Interaction.

Automatic Temporal-Spatial Pattern Modeling and Detection

Glycomics (A New Frontier)

We are developing efficient algorithms to accurately sequence glycans de novo from their tandem mass spectra with little or no help from an oligosaccharide database. Glycosylation is a common modification by which a glycan (or oligosaccharide) is covalently attached to a target macromolecule (such as a protein or lipid). This modification serves various important functions (such as protein folding, protein stability, cell-cell/matrix/environment interaction, and immunity) and is one of the essential factors in optimizing many glycoprotein-based drugs. Hence it is important to profile the changes of glycans under different conditions. Glycan structure analysis is a challenging and essential task in biological and biomedical research. We have developed an advanced algorithm, named GlycoDeNovo (executable software, source code), for reconstructing glycans from data of pure samples, and have proved theoretically that this problem can be solved with polynomial complexity, contrary to the conventional view that it is NP-hard. GlycoDeNovo features a Machine Learning based IonClassifier that effectively ranks the reconstructed topology candidates (our experimental results showed that IonClassifier is essential). We have also advanced into uncharted territory, where we need to reconstruct glycans from data of mixture samples, and have obtained excellent results under certain experimental conditions. This research is a collaboration with Boston University Medical School (NIH P41).

Intelligent Education

Knowledge tracing is an important task in Intelligent Education. It aims to quantify how well students master the knowledge (tags) being tutored by analyzing their learning activities (e.g., coursework interaction data), and it plays an important role in intelligent tutoring systems. In our work, we cast knowledge tracing as a performance-prediction problem, which predicts the performance of students on exercises labeled with multiple knowledge tags, and propose to tackle this problem using Deep Learning techniques. We applied several Recurrent Neural Network architectures to model complex representations of student knowledge and predict the future performance of students. Our experimental results demonstrate that the neural network architecture based on stacked Long Short-Term Memory and residual connections gives superior predictions of the future performance of learners. To model how a student answers a question that contains multiple knowledge tags, we explored three different setups to map knowledge states to predictions. Keywords: Recurrent Neural Network, LSTM, Stacked LSTM, Residual Connection.
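
To make the idea of a stacked LSTM with residual connections more concrete, here is a minimal forward-pass sketch in plain Python. It is only an illustration under stated assumptions, not our actual model: the weights are random and untrained, the one-hot interaction encoding is assumed, and averaging the per-tag estimates for a multi-tag exercise is just one hypothetical way to map knowledge states to a prediction. The names `LSTMCell`, `ResidualStackedLSTM`, `encode`, and `predict` are made up for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A single LSTM cell with randomly initialized (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # Rows are grouped by gate: input, forget, candidate, output.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                  for _ in range(4 * hidden_size)]
        self.b = [0.0] * (4 * hidden_size)

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + b
               for row, b in zip(self.W, self.b)]
        H = self.hidden_size
        i = [sigmoid(a) for a in pre[0:H]]
        f = [sigmoid(a) for a in pre[H:2 * H]]
        g = [math.tanh(a) for a in pre[2 * H:3 * H]]
        o = [sigmoid(a) for a in pre[3 * H:4 * H]]
        c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
        return h_new, c_new


class ResidualStackedLSTM:
    """Stacked LSTM layers; every layer above the first adds its input to its output."""

    def __init__(self, num_tags, hidden_size, num_layers, seed=0):
        rng = random.Random(seed)
        self.num_tags = num_tags
        self.hidden_size = hidden_size
        self.cells = [
            LSTMCell(2 * num_tags if layer == 0 else hidden_size, hidden_size, rng)
            for layer in range(num_layers)
        ]
        self.W_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                      for _ in range(num_tags)]

    def encode(self, tags, correct):
        # Assumed encoding: first half marks tags answered correctly,
        # second half marks tags answered incorrectly.
        x = [0.0] * (2 * self.num_tags)
        offset = 0 if correct else self.num_tags
        for t in tags:
            x[t + offset] = 1.0
        return x

    def predict(self, history, next_tags):
        H = self.hidden_size
        states = [([0.0] * H, [0.0] * H) for _ in self.cells]
        top = [0.0] * H
        for tags, correct in history:
            inp = self.encode(tags, correct)
            for layer, cell in enumerate(self.cells):
                h, c = cell.step(inp, *states[layer])
                states[layer] = (h, c)
                # No residual on the first layer: its input size differs from H.
                inp = h if layer == 0 else [a + b for a, b in zip(inp, h)]
            top = inp
        mastery = [sigmoid(sum(w * v for w, v in zip(row, top)))
                   for row in self.W_out]
        # Hypothetical multi-tag setup: average the per-tag estimates.
        return sum(mastery[t] for t in next_tags) / len(next_tags)


model = ResidualStackedLSTM(num_tags=3, hidden_size=4, num_layers=2, seed=0)
history = [([0, 1], True), ([2], False)]  # (knowledge tags, answered correctly?)
print(model.predict(history, next_tags=[0, 2]))
```

In practice such a model would be trained on student interaction logs with a deep learning framework; the sketch only shows how the layers and the residual additions fit together.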

BioImage Informatics

3D Neuron Tracing (Investigating Biological Neuronal Networks Using Artificial Neural Networks): Automated neuron tracing from microscopic images enables high-throughput quantitative analysis of neuronal morphology to elucidate the functions of neural circuits. A better understanding of biological neural networks can greatly help us improve the artificial neural networks that play important roles in Deep Learning. We have developed a Deep-Transfer-Learning approach that trains a deep neural network to accurately trace neurons in 3D image stacks. Our approach learns essential features from synthetic lines and transfers the learned knowledge to process real neuron images. One major advantage of our approach is that it does not require a large amount of manually labeled training data (Zheng and Hong, BrainInfo 2016; Zheng and Hong, BIH 2016).
High-Content Neuronal Image Analysis: Cell-based high-content screening is becoming a widely used high-throughput methodology in therapeutic drug discovery and functional genomics. One of the challenges in high-content screens is to reliably and automatically analyze a large quantity of high-content images. Images of neuronal cell cultures are particularly challenging to analyze because of their complex morphology. My group has developed a robust pipeline for quantifying and comparing the morphology of neuronal cell cultures (Wu et al. 2010). Our analysis pipeline has been a key innovation for high-throughput neuronal screening and has been applied to several large-scale high-content screening projects (Sepp et al. 2008, Wu et al. 2010, Schoemans et al. 2010, Schulte et al. 2011). Our computational analyses of high-throughput neuronal images have led to successful follow-up studies (Sepp et al. 2008, Schulte et al. 2010).
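
As a rough illustration of the synthetic-line idea used in the 3D neuron tracing work above, the sketch below generates a small noisy 3D stack containing one straight line, together with its voxel labels, so that no manual annotation is needed. This is an assumption-laden toy, not our actual data generator: the function name `synthetic_line_stack`, the straight-segment shape, the uniform background noise, and all parameter values are hypothetical.

```python
import random


def synthetic_line_stack(size=16, noise=0.1, seed=0):
    """Return (image, label) as nested lists indexed [z][y][x]."""
    rng = random.Random(seed)
    # Random endpoints of a straight segment inside the volume.
    start = [rng.uniform(0, size - 1) for _ in range(3)]
    end = [rng.uniform(0, size - 1) for _ in range(3)]
    # Background: low-intensity uniform noise; labels start at 0.
    image = [[[rng.uniform(0, noise) for _ in range(size)]
              for _ in range(size)] for _ in range(size)]
    label = [[[0] * size for _ in range(size)] for _ in range(size)]
    # Sample densely along the segment and mark the nearest voxels.
    steps = 4 * size
    for k in range(steps + 1):
        t = k / steps
        z, y, x = (int(round(s + t * (e - s))) for s, e in zip(start, end))
        image[z][y][x] = 1.0
        label[z][y][x] = 1
    return image, label


image, label = synthetic_line_stack(size=16, noise=0.1, seed=0)
line_voxels = sum(v for plane in label for row in plane for v in row)
print("labeled voxels:", line_voxels)
```

A network trained on many such automatically labeled stacks could then be fine-tuned or applied to real neuron images, which is the transfer step described above.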

Computational Systems Biology

Animal organogenesis is a complicated process controlled by a network of intercellular signaling, intracellular signal transduction, and transcriptional regulation. Constructing quantitative models of this network can provide insights into the underlying biological mechanisms. We have developed a mathematical model of the multi-level biological networks (spanning the molecular level to the tissue level) that govern C. elegans vulval induction. The model was automatically learned from heterogeneous biological data and is capable of simulating vulval induction under various genetic conditions (Sun X. and Hong P. 2007, Sun X. and Hong P. 2009). At the cellular level, together with our biological collaborators, we elucidated the topologies of complex molecular networks (Breitkopf S.B., et al. 2016, Kwon Y., et al. 2013, Sun and Hong 2013, Friedman AA. et al. 2011, Vinayagam A. et al. 2016), which advanced our knowledge of complex biological processes and related diseases.

Medical Informatics

We applied machine learning techniques to detect medical concepts and the relationships among them in the free-text (unstructured) clinical narratives of electronic healthcare records (Anick et al. presented at I2B2 Challenge 2010 and cited in Uzuner et al. 2011, Warner et al. 2011). In another study, we investigated various psychiatric attributes in the categories of depression, anxiety, and insomnia by comparing patients with liver cirrhosis, patients with non-cirrhosis chronic diseases, and healthy people. A set of psychiatric attributes was found to be uniquely associated with cirrhosis. The literature shows that these attributes are related to some mechanisms underlying liver cirrhosis, such as zinc/selenium deficiency and vitamin D deficiency. In addition, we investigated the relationships between those psychiatric attributes unique to cirrhosis and the “four diagnostic methods” of Traditional Chinese Medicine (TCM), and discovered interesting correlations. These two findings can shed new light on the diagnosis and prognosis of liver cirrhosis, and on the future integration of Traditional Chinese and Western Medicine (Bei et al. 2013).

Other Bioinformatics Efforts

We developed fast, memory-efficient algorithms for accurately calculating the isotopic fine structures of large molecules (Li et al. 2008, Li et al. 2010). The algorithms will greatly increase our capability of interpreting experimental mass spectrometry data. We also developed a new data processing pipeline for top-down mass spectrometry (Karabacak N.M. 2009). In addition, we have developed a boosting approach for building better TF-DNA binding models (Hong et al. 2005), which are important for differentiating the true binding targets of transcription factors from spurious ones and for understanding gene regulation.
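
To show what an isotopic fine structure is, the sketch below enumerates the isotopologues of a small molecule and their probabilities by naive combination of per-element isotope distributions. It is a toy baseline under stated assumptions, not the algorithms of Li et al.: the brute-force enumeration does not scale to large molecules, which is the problem those algorithms address. The function names, the `min_prob` pruning threshold, and the small isotope table (standard masses and natural abundances for C, H, N, and O) are choices made for this example.

```python
from collections import defaultdict

# (exact mass in Da, natural abundance) for each stable isotope.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.00335483507, 0.0107)],
    "H": [(1.00782503223, 0.999885), (2.01410177812, 0.000115)],
    "N": [(14.00307400443, 0.99636), (15.00010889888, 0.00364)],
    "O": [(15.99491461957, 0.99757), (16.99913175650, 0.00038),
          (17.99915961286, 0.00205)],
}


def element_distribution(isotopes, n, min_prob):
    """Distribution over isotope counts for n atoms of one element."""
    dist = {tuple([0] * len(isotopes)): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for counts, p in dist.items():
            for i, (_, abundance) in enumerate(isotopes):
                key = counts[:i] + (counts[i] + 1,) + counts[i + 1:]
                nxt[key] += p * abundance
        # Prune negligible isotopologues after merging identical compositions.
        dist = {k: v for k, v in nxt.items() if v >= min_prob}
    return [(sum(c * m for c, (m, _) in zip(counts, isotopes)), p)
            for counts, p in dist.items()]


def fine_structure(formula, min_prob=1e-12):
    """Return sorted (mass, probability) peaks for a formula like {"C": 1, "H": 4}."""
    peaks = [(0.0, 1.0)]
    for element, n in formula.items():
        elem_peaks = element_distribution(ISOTOPES[element], n, min_prob)
        peaks = [(m1 + m2, p1 * p2)
                 for m1, p1 in peaks
                 for m2, p2 in elem_peaks
                 if p1 * p2 >= min_prob]
    return sorted(peaks)


# Methane (CH4): list the peaks that carry a non-negligible probability.
for mass, prob in fine_structure({"C": 1, "H": 4}, min_prob=1e-9):
    print(f"{mass:.6f}  {prob:.3e}")
```

Peaks such as the 13C and 2H isotopologues of methane differ by only a few millidaltons, which is why resolving fine structure matters when interpreting high-resolution mass spectra.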