The publication data currently available has been vetted by Vanderbilt faculty, staff, administrators and trainees. The data itself is retrieved directly from NCBI's PubMed and is automatically updated on a weekly basis to ensure accuracy and completeness.
Enhancers and promoters both regulate gene expression by recruiting transcription factors (TFs); however, the degree to which enhancer versus promoter activity is due to differences in their sequences or to genomic context is the subject of ongoing debate. We examined this question by analyzing the sequences of thousands of transcribed enhancers and promoters from hundreds of cellular contexts previously identified by cap analysis of gene expression. Support vector machine classifiers trained on counts of all possible 6-bp-long sequences (6-mers) were able to accurately distinguish promoters from enhancers and distinguish their breadth of activity across tissues. Classifiers trained to predict enhancer activity also performed well when applied to promoter prediction tasks, but promoter-trained classifiers performed poorly on enhancers. This suggests that the learned sequence patterns predictive of enhancer activity generalize to promoters, but not vice versa. Our classifiers also indicate that there are functionally relevant differences in enhancer and promoter GC content beyond the influence of CpG islands. Furthermore, sequences characteristic of broad promoter or broad enhancer activity matched different TFs, with predicted ETS- and RFX-binding sites indicative of promoters, and AP-1 sites indicative of enhancers. Finally, we evaluated the ability of our models to distinguish enhancers and promoters defined by histone modifications. Separating these classes was substantially more difficult, and this difference may contribute to ongoing debates about the similarity of enhancers and promoters. In summary, our results suggest that high-confidence transcribed enhancers and promoters can largely be distinguished based on biologically relevant sequence properties.
Copyright © 2019 by the Genetics Society of America.
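As an illustration of the 6-mer count features mentioned in the abstract above, the following minimal sketch turns a DNA sequence into a fixed-length count vector. The function names (`kmer_counts`, `feature_vector`) are hypothetical, and details such as skipping windows with ambiguous bases and not merging reverse complements are assumptions, not the authors' documented pipeline.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=6):
    """Count overlapping k-mers, skipping windows with non-ACGT characters.

    Assumption: ambiguous bases (e.g. N) are simply skipped; the abstract
    does not say how they were handled.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


def feature_vector(seq, k=6):
    """Return counts for all 4**k possible k-mers in a fixed (lexicographic) order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = kmer_counts(seq, k)
    return [counts[kmer] for kmer in vocab]


if __name__ == "__main__":
    vec = feature_vector("ACGTACGTAC")
    print(len(vec), sum(vec))
```

Vectors of this kind (4096 entries for k = 6) could then be passed to any classifier that accepts numeric features, such as a support vector machine.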
Genomic regions with gene regulatory enhancer activity turn over rapidly across mammals. In contrast, gene expression patterns and transcription factor binding preferences are largely conserved between mammalian species. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. To investigate this hypothesis, we evaluated the extent to which sequence patterns that are predictive of enhancers in one species are predictive of enhancers in other mammalian species by training and testing two types of machine learning models. We trained support vector machine (SVM) and convolutional neural network (CNN) classifiers to distinguish enhancers defined by histone marks from the genomic background based on DNA sequence patterns in human, macaque, mouse, dog, cow, and opossum. The classifiers accurately identified many adult liver, developing limb, and developing brain enhancers, and the CNNs outperformed the SVMs. Furthermore, classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. We observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. This indicates that many short sequence patterns predictive of enhancers are largely conserved. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Our results suggest that short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution.
OBJECTIVE - Hepatorenal Syndrome (HRS) is a devastating form of acute kidney injury (AKI) in advanced liver disease patients with high morbidity and mortality, but phenotyping algorithms have not yet been developed using large electronic health record (EHR) databases. We evaluated and compared multiple phenotyping methods to achieve an accurate algorithm for HRS identification.
MATERIALS AND METHODS - A national retrospective cohort of patients with cirrhosis and AKI admitted to 124 Veterans Affairs hospitals was assembled from electronic health record data collected from 2005 to 2013. AKI was defined by the Kidney Disease: Improving Global Outcomes criteria. Five hundred and four hospitalizations were selected for manual chart review and served as the gold standard. EHR-based predictors were identified from structured and free-text clinical data, the latter processed with natural language processing (NLP) using the clinical Text Analysis and Knowledge Extraction System. We explored several dimension reduction techniques for the NLP data, including newer high-throughput phenotyping and word embedding methods, and ascertained their effectiveness in identifying the phenotype without structured predictor variables. With the combined structured and NLP variables, we analyzed five phenotyping algorithms: penalized logistic regression, naïve Bayes, support vector machines, random forest, and gradient boosting. Calibration and discrimination metrics were calculated using 100 bootstrap iterations. In the final model, we report odds ratios and 95% confidence intervals.
RESULTS - The area under the receiver operating characteristic curve (AUC) for the different models ranged from 0.73 to 0.93, with penalized logistic regression having the best discriminatory performance. Calibration for logistic regression was modest, but calibration for gradient boosting and support vector machines was superior. NLP identified 6985 variables; a priori variable selection performed similarly to dimensionality reduction using high-throughput phenotyping and semantic similarity informed clustering (AUC of 0.81-0.82).
CONCLUSION - This study demonstrated improved phenotyping of a challenging AKI etiology, HRS, over ICD-9 coding. We also compared performance among multiple approaches to EHR-derived phenotyping, and found similar results between methods. Lastly, we showed that automated NLP dimension reduction is viable for acute illness.
Copyright © 2018 Elsevier Inc. All rights reserved.
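The abstract above reports discrimination (AUC) computed over 100 bootstrap iterations. A minimal sketch of that general idea follows, assuming binary labels coded 0/1 and a simple percentile interval; the function names and the resampling details are illustrative assumptions, not the study's actual procedure.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_iter=100, seed=0):
    """Resample cases with replacement and summarize the AUC estimates.

    Returns (mean, lower, upper), where lower/upper are approximate
    2.5th and 97.5th percentiles of the bootstrap estimates.
    """
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lack one of the classes
        estimates.append(auc(ys, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(0.025 * n_iter)]
    upper = estimates[min(n_iter - 1, int(0.975 * n_iter))]
    return sum(estimates) / n_iter, lower, upper


if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 1, 0, 1]
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
    print(bootstrap_auc(y, s))
```

The pairwise AUC computation is quadratic in the number of cases, which is acceptable for a few hundred chart-reviewed hospitalizations but would need a rank-based formulation for much larger cohorts.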
Computational protein design has been successful in modeling fixed backbone proteins in a single conformation. However, when modeling large ensembles of flexible proteins, current methods in protein design have been insufficient. Large barriers in the energy landscape are difficult to traverse while redesigning a protein sequence, and as a result current design methods only sample a fraction of available sequence space. We propose a new computational approach that combines traditional structure-based modeling using the Rosetta software suite with machine learning and integer linear programming to overcome limitations in the Rosetta sampling methods. We demonstrate the effectiveness of this method, which we call BROAD, by benchmarking the performance on increasing predicted breadth of anti-HIV antibodies. We use this novel method to increase predicted breadth of naturally-occurring antibody VRC23 against a panel of 180 divergent HIV viral strains and achieve 100% predicted binding against the panel. In addition, we compare the performance of this method to state-of-the-art multistate design in Rosetta and show that we can outperform the existing method significantly. We further demonstrate that sequences recovered by this method recover known binding motifs of broadly neutralizing anti-HIV antibodies. Finally, our approach is general and can be extended easily to other protein systems. Although our modeled antibodies were not tested in vitro, we predict that these variants would have greatly increased breadth compared to the wild-type antibody.
OBJECTIVE - To characterize in vivo signatures of pathological diagnosis in a large cohort of patients with primary progressive aphasia (PPA) variants defined by current diagnostic classification.
METHODS - Extensive clinical, cognitive, neuroimaging, and neuropathological data were collected from 69 patients with sporadic PPA, divided into 29 semantic (svPPA), 25 nonfluent (nfvPPA), 11 logopenic (lvPPA), and 4 mixed PPA. Patterns of gray matter (GM) and white matter (WM) atrophy at presentation were assessed and tested as predictors of pathological diagnosis using support vector machine (SVM) algorithms.
RESULTS - A clinical diagnosis of PPA was associated with frontotemporal lobar degeneration (FTLD) with transactive response DNA-binding protein (TDP) inclusions in 40.5%, FTLD-tau in 40.5%, and Alzheimer disease (AD) pathology in 19% of cases. Each variant was associated with 1 typical pathology; 24 of 29 (83%) svPPA showed FTLD-TDP type C, 22 of 25 (88%) nfvPPA showed FTLD-tau, and all 11 lvPPA had AD. Within FTLD-tau, 4R-tau pathology was commonly associated with nfvPPA, whereas Pick disease was observed in a minority of subjects across all variants except for lvPPA. Compared with pathologically typical cases, svPPA-tau showed significant extrapyramidal signs, greater executive impairment, and severe striatal and frontal GM and WM atrophy. nfvPPA-TDP patients lacked general motor symptoms or significant WM atrophy. Combining GM and WM volumes, SVM analysis showed 92.7% accuracy to distinguish FTLD-tau and FTLD-TDP pathologies across variants.
INTERPRETATION - Each PPA clinical variant is associated with a typical and most frequent cognitive, neuroimaging, and neuropathological profile. Specific clinical and early anatomical features may suggest rare and atypical pathological diagnosis in vivo. Ann Neurol 2017;81:430-443.
© 2017 American Neurological Association.
BACKGROUND - Peptide sequence assignment is the central task in protein identification with MS/MS-based strategies. Although a number of post-database search algorithms for filtering target peptide spectrum matches (PSMs) have been developed, the discrepancy among the output PSMs is usually significant, leaving a few disputable PSMs. Current studies show that a number of target PSMs that are close to decoy PSMs can hardly be separated from those decoys using the discrimination function alone.
RESULTS - In this paper, we assign each target PSM a weight indicating its likelihood of being correct. We employ an SVM-based learning model to search for the optimal weight for each target PSM and develop a new scoring system, CRanker, to rank all target PSMs. Because routine database searches generate large PSM datasets, we use the Cholesky factorization technique for storing the kernel matrix to reduce the memory requirement.
CONCLUSIONS - Compared with PeptideProphet and Percolator, CRanker identified more PSMs at similar false discovery rates across different datasets. CRanker showed consistent performance on different test sets, supporting the reasonableness of the proposed model.
OBJECTIVE - Drug-drug interactions (DDIs) are an important consideration in both drug development and clinical application, especially for co-administered medications. While it is necessary to identify all possible DDIs during clinical trials, DDIs are frequently reported after the drugs are approved for clinical use, and they are a common cause of adverse drug reactions (ADR) and increasing healthcare costs. Computational prediction may assist in identifying potential DDIs during clinical trials.
METHODS - Here we propose a heterogeneous network-assisted inference (HNAI) framework to assist with the prediction of DDIs. First, we constructed a comprehensive DDI network that contained 6946 unique DDI pairs connecting 721 approved drugs based on DrugBank data. Next, we calculated drug-drug pair similarities using four features: phenotypic similarity based on a comprehensive drug-ADR network, therapeutic similarity based on the drug Anatomical Therapeutic Chemical classification system, chemical structural similarity from SMILES data, and genomic similarity based on a large drug-target interaction network built using the DrugBank and Therapeutic Target Database. Finally, we applied five predictive models in the HNAI framework: naive Bayes, decision tree, k-nearest neighbor, logistic regression, and support vector machine.
RESULTS - The area under the receiver operating characteristic curve of the HNAI models is 0.67 as evaluated using fivefold cross-validation. Using antipsychotic drugs as an example, several HNAI-predicted DDIs that involve weight gain and cytochrome P450 inhibition were supported by literature resources.
CONCLUSIONS - Through machine learning-based integration of drug phenotypic, therapeutic, structural, and genomic similarities, we demonstrated that HNAI is promising for uncovering DDIs in drug development and postmarketing surveillance.
Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
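The drug-drug similarity features in the abstract above can be illustrated with a set-based similarity score. The sketch below uses the Tanimoto (Jaccard) coefficient, a common choice for comparing chemical fingerprints or sets of associated ADRs; the abstract does not state which measure the authors used, so this is an assumption, and the feature sets shown are made up for illustration.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient between two feature sets:
    size of the intersection divided by size of the union."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical feature sets, e.g. fingerprint bits or ADR terms per drug.
    drug_x = {"bit_3", "bit_17", "bit_42"}
    drug_y = {"bit_17", "bit_42", "bit_99"}
    print(tanimoto(drug_x, drug_y))
```

One such score per feature type (phenotypic, therapeutic, structural, genomic) would give a four-value description of each drug pair that a classifier could consume.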
OBJECTIVES - Generalizable, high-throughput phenotyping methods based on supervised machine learning (ML) algorithms could significantly accelerate the use of electronic health records data for clinical and translational research. However, they often require large numbers of annotated samples, which are costly and time-consuming to review. We investigated the use of active learning (AL) in ML-based phenotyping algorithms.
METHODS - We integrated an uncertainty sampling AL approach with support vector machines-based phenotyping algorithms and evaluated its performance using three annotated disease cohorts including rheumatoid arthritis (RA), colorectal cancer (CRC), and venous thromboembolism (VTE). We investigated performance using two types of feature sets: unrefined features, which contained at least all clinical concepts extracted from notes and billing codes; and a smaller set of refined features selected by domain experts. The performance of the AL was compared with a passive learning (PL) approach based on random sampling.
RESULTS - Our evaluation showed that AL outperformed PL on three phenotyping tasks. When unrefined features were used in the RA and CRC tasks, AL reduced the number of annotated samples required to achieve an area under the curve (AUC) score of 0.95 by 68% and 23%, respectively. AL also achieved a reduction of 68% for VTE with an optimal AUC of 0.70 using refined features. As expected, refined features improved the performance of phenotyping classifiers and required fewer annotated samples.
CONCLUSIONS - This study demonstrated that AL can be useful in ML-based phenotyping methods. Moreover, AL and feature engineering based on domain knowledge could be combined to develop efficient and generalizable phenotyping methods.
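The uncertainty sampling strategy described in the abstract above can be sketched as follows, assuming the classifier exposes a signed decision value per unlabeled sample (as a support vector machine does) and that samples closest to the decision boundary are treated as most uncertain. The function names are hypothetical, and the random selector stands in for the passive learning baseline.

```python
import random


def select_most_uncertain(decision_values, batch_size=1):
    """Indices of the unlabeled samples whose decision values are closest
    to zero, i.e. nearest the classifier's decision boundary."""
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: abs(decision_values[i]))
    return ranked[:batch_size]


def select_random(n_unlabeled, batch_size=1, seed=0):
    """Passive-learning baseline: pick samples to annotate at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_unlabeled), batch_size)


if __name__ == "__main__":
    # Hypothetical decision values for five unlabeled records.
    values = [2.1, -0.05, 0.7, -1.3, 0.2]
    print(select_most_uncertain(values, batch_size=2))
```

In a full loop, the selected samples would be annotated, added to the training set, and the classifier retrained before the next round of selection.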
BACKGROUND - Recent observations suggest that immune-mediated tissue destruction is dependent upon coordinate activation of immune genes expressed by cells of the innate and adaptive immune systems.
METHODS - Here, we performed a retrospective pilot study to investigate whether the coordinate expression of a molecular signature mostly associated with NK cells could be used to segregate breast cancer patients into relapse and relapse-free outcomes.
RESULTS - By analyzing primary breast cancer specimens derived from patients who either experienced 58-116 months (~5-9 years) of relapse-free survival or developed tumor relapse within 9-76 months (~1-6 years), we found that the expression of molecules involved in NK cell activating signaling and in NK cell-target interaction is increased in patients with favorable prognosis.
CONCLUSIONS - The parameters identified in this study, together with the prognostic signature previously reported by our group, highlight the cooperation between the innate and adaptive immune components within the tumor microenvironment.
With the rapidly increasing availability of High-Throughput Screening (HTS) data in the public domain, such as the PubChem database, methods for ligand-based computer-aided drug discovery (LB-CADD) have the potential to accelerate and reduce the cost of probe development and drug discovery efforts in academia. We assemble nine data sets from realistic HTS campaigns representing major families of drug target proteins for benchmarking LB-CADD methods. Each data set is public domain through PubChem and carefully collated through confirmation screens validating active compounds. These data sets provide the foundation for benchmarking a new cheminformatics framework BCL::ChemInfo, which is freely available for non-commercial use. Quantitative structure activity relationship (QSAR) models are built using Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), and Kohonen networks (KNs). Problem-specific descriptor optimization protocols are assessed including Sequential Feature Forward Selection (SFFS) and various information content measures. Measures of predictive power and confidence are evaluated through cross-validation, and a consensus prediction scheme is tested that combines orthogonal machine learning algorithms into a single predictor. Enrichments ranging from 15 to 101 for a TPR cutoff of 25% are observed.
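The enrichment figure quoted in the abstract above can be illustrated with a small calculation. The sketch assumes enrichment is defined as the fraction of actives among the top-ranked compounds needed to recover a given share of all actives (the TPR cutoff), divided by the fraction of actives in the whole data set; the abstract does not spell out the definition, so treat this as one plausible reading. The function name is hypothetical.

```python
import math


def enrichment_at_tpr(labels, scores, tpr_cutoff=0.25):
    """Enrichment at the point in the ranked list where `tpr_cutoff`
    of all actives (label 1) have been recovered.

    Tied scores are kept in their input order; no special tie handling.
    """
    n_total = len(labels)
    n_active = sum(labels)
    if n_active == 0:
        raise ValueError("at least one active compound is required")
    needed = max(1, math.ceil(tpr_cutoff * n_active))
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    found = 0
    n_selected = 0
    for n_selected, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            break
    precision = found / n_selected   # active fraction among selected
    base_rate = n_active / n_total   # active fraction overall
    return precision / base_rate


if __name__ == "__main__":
    # Hypothetical screen: 4 actives among 20 compounds.
    labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1] + [0] * 10
    scores = [0.99 - 0.04 * i for i in range(20)]
    print(enrichment_at_tpr(labels, scores))
```

Under this definition, an enrichment of 1 means the model does no better than random selection, and larger values mean the top of the ranked list is proportionally richer in actives.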