The publication data currently available has been vetted by Vanderbilt faculty, staff, administrators and trainees. The data itself is retrieved directly from NCBI's PubMed and is automatically updated on a weekly basis to ensure accuracy and completeness.
The two key steps for analyzing proteomic data generated by high-resolution MS are database searching and postprocessing. While the two steps are interrelated, studies on their combinatory effects and the optimization of these procedures have not been adequately conducted. Here, we investigated the performance of three popular search engines (SEQUEST, Mascot, and MS Amanda) in conjunction with five filtering approaches, including respective score-based filtering, a group-based approach, local false discovery rate (LFDR), PeptideProphet, and Percolator. A total of eight data sets from various proteomes (e.g., E. coli, yeast, and human) produced by various instruments with high-accuracy survey scan (MS1) and high- or low-accuracy fragment ion scan (MS2) (LTQ-Orbitrap, Orbitrap-Velos, Orbitrap-Elite, Q-Exactive, Orbitrap-Fusion, and Q-TOF) were analyzed. It was found that combinations involving Percolator achieved markedly more peptide and protein identifications at the same FDR level than the other 12 combinations for all data sets. Among these, combinations of SEQUEST-Percolator and MS Amanda-Percolator provided slightly better performances for data sets with low-accuracy MS2 (ion trap or IT) and high-accuracy MS2 (Orbitrap or TOF), respectively, than did other methods. For approaches without Percolator, SEQUEST-group performed the best for data sets with MS2 produced by collision-induced dissociation (CID) and IT analysis; Mascot-LFDR gave more identifications for data sets generated by higher-energy collisional dissociation (HCD) and analyzed in Orbitrap (HCD-OT) and in Orbitrap Fusion (HCD-IT); MS Amanda-Group excelled for the Q-TOF data set and the Orbitrap Velos HCD-OT data set. Therefore, if Percolator is not used, a specific combination should be applied for each type of data set.
Moreover, a higher percentage of multiple-peptide proteins and lower variation of protein spectral counts were observed when analyzing technical replicates using Percolator-associated combinations; therefore, Percolator enhanced the reliability for both identification and quantification. The analyses were performed using the specific programs embedded in Proteome Discoverer, Scaffold, and an in-house algorithm (BuildSummary). These results provide valuable guidelines for the optimal interpretation of proteomic results and the development of fit-for-purpose protocols under different situations.
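The abstract above compares filtering approaches "at the same FDR level". As a rough illustration of what filtering to an FDR level can mean, the following is a minimal sketch of target-decoy filtering, assuming each peptide-spectrum match (PSM) carries a score (higher is better) and a decoy flag. The function name `filter_by_fdr`, the decoys/targets estimate, and the toy scores are assumptions for illustration; this is not the procedure used by Percolator, PeptideProphet, or BuildSummary.

```python
from typing import List, Tuple

# A PSM is modeled as (score, is_decoy); this representation is hypothetical.
PSM = Tuple[float, bool]


def filter_by_fdr(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Return target PSMs above the lowest score cutoff whose estimated FDR
    (decoys / targets, one common estimate) does not exceed max_fdr."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    best_cut = 0
    targets = decoys = 0
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            best_cut = i  # remember the deepest cutoff that still passes
    return [p for p in ranked[:best_cut] if not p[1]]


# Toy data: five targets and two decoys.
example = [(10, False), (9, False), (8, False), (7, True),
           (6, False), (5, False), (4, True)]
accepted = filter_by_fdr(example, max_fdr=0.25)
```

Under this sketch, a postprocessing method that separates targets from decoys better pushes decoys lower in the ranking, so more targets pass at the same threshold.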
Given the potential wealth of insights that large databases of personal data can provide, many organizations aim to share data while protecting privacy by sharing de-identified data, but are concerned because various demonstrations show that such data can be re-identified. Yet these investigations focus on how attacks can be perpetrated, not the likelihood that they will be realized. This paper introduces a game theoretic framework that enables a publisher to balance re-identification risk with the value of sharing data, leveraging a natural assumption that a recipient only attempts re-identification if its potential gains outweigh the costs. We apply the framework to a real case study, where the value of the data to the publisher is the actual grant funding dollar amounts from a national sponsor and the re-identification gain of the recipient is the fine paid to a regulator for violation of federal privacy rules. There are three notable findings: 1) it is possible to achieve zero risk, in that the recipient never gains from re-identification, while sharing almost as much data as the optimal solution that allows for a small amount of risk; 2) the zero-risk solution enables sharing much more data than a commonly invoked de-identification policy of the U.S. Health Insurance Portability and Accountability Act (HIPAA); and 3) a sensitivity analysis demonstrates that these findings are robust to order-of-magnitude changes in player losses and gains. In combination, these findings provide support that such a framework can enable pragmatic policy decisions about de-identified data sharing.
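The core assumption of the framework, that a recipient attempts re-identification only when the expected gain outweighs the cost, can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the paper's formulation: the `Policy` fields, the payoff formula, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    name: str
    benefit: float           # publisher's value from sharing under this policy
    reid_probability: float  # chance that a re-identification attempt succeeds


def recipient_attacks(policy: Policy, gain: float, cost: float) -> bool:
    """The recipient attacks only if the expected gain exceeds the attack cost."""
    return policy.reid_probability * gain > cost


def publisher_payoff(policy: Policy, gain: float, cost: float, loss: float) -> float:
    """Publisher's benefit, reduced by the expected loss if an attack occurs."""
    if recipient_attacks(policy, gain, cost):
        return policy.benefit - policy.reid_probability * loss
    return policy.benefit


def best_policy(policies: List[Policy], gain: float, cost: float, loss: float) -> Policy:
    """The publisher moves first and picks the policy with the highest payoff."""
    return max(policies, key=lambda p: publisher_payoff(p, gain, cost, loss))


# Hypothetical policies and payoffs, chosen only to show the trade-off.
policies = [
    Policy("raw", benefit=100.0, reid_probability=0.5),
    Policy("generalized", benefit=90.0, reid_probability=0.05),
    Policy("suppressed", benefit=40.0, reid_probability=0.0),
]
choice = best_policy(policies, gain=100.0, cost=10.0, loss=200.0)
```

In this toy setting the "generalized" policy deters the attack while retaining most of the sharing benefit, which mirrors the flavor of the zero-risk finding without reproducing its analysis.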
Isobaric labeling techniques coupled with high-resolution mass spectrometry have been widely employed in proteomic workflows requiring relative quantification. For each high-resolution tandem mass spectrum (MS/MS), isobaric labeling techniques can be used not only to quantify the peptide from different samples by reporter ions, but also to identify the peptide it is derived from. Because the ions related to isobaric labeling may act as noise in database searching, the MS/MS spectrum should be preprocessed before peptide or protein identification. In this article, we demonstrate that there are many high-frequency, high-abundance isobaric-related ions in the MS/MS spectrum, and that removing isobaric-related ions combined with deisotoping and deconvolution in MS/MS preprocessing procedures significantly improves the peptide/protein identification sensitivity. The user-friendly software package TurboRaw2MGF (v2.0) has been implemented for converting raw TIC data files to Mascot generic format files and can be downloaded for free from https://github.com/shengqh/RCPA.Tools/releases as part of the software suite ProteomicsTools. The data have been deposited to the ProteomeXchange with identifier PXD000994.
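The preprocessing idea described above, removing label-related ions before database searching, might look roughly like the following sketch. It covers only the removal of peaks near a supplied list of m/z values; deisotoping and deconvolution are not shown. The function `remove_reporter_ions`, the tolerance, and the peak list are hypothetical, and the listed reporter m/z values are approximate TMT 6-plex figures that should be checked against reagent documentation. This is not the TurboRaw2MGF implementation.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

# Approximate TMT 6-plex reporter ion m/z values (assumed; verify before use).
TMT6_REPORTERS = (126.1277, 127.1248, 128.1344, 129.1315, 130.1411, 131.1382)


def remove_reporter_ions(peaks: List[Peak],
                         reporter_mzs: Sequence[float] = TMT6_REPORTERS,
                         tol: float = 0.01) -> List[Peak]:
    """Drop every peak lying within `tol` (in m/z units) of a listed ion."""
    return [
        (mz, intensity)
        for mz, intensity in peaks
        if all(abs(mz - r) > tol for r in reporter_mzs)
    ]


# Toy spectrum: two reporter-like peaks and two fragment-like peaks.
spectrum = [(126.1277, 5000.0), (175.119, 300.0),
            (131.1382, 4000.0), (500.3, 120.0)]
cleaned = remove_reporter_ions(spectrum)
```

A narrow tolerance only makes sense for high-accuracy MS2 data; other isobaric-related ions (beyond the reporters) would need to be added to the list by the user.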
© 2015 by The American Society for Biochemistry and Molecular Biology, Inc.
BACKGROUND - Many cancer clinical trials now specify the particular status of a genetic lesion in a patient's tumor in the inclusion or exclusion criteria for trial enrollment. To facilitate search and identification of gene-associated clinical trials by potential participants and clinicians, it is important to develop automated methods to identify genetic information from narrative trial documents.
METHODS - We developed a two-stage classification method to identify genes and genetic lesion statuses in clinical trial documents extracted from the National Cancer Institute's (NCI's) Physician Data Query (PDQ) cancer clinical trial database. The method consists of two steps: 1) to distinguish gene entities from non-gene entities such as English words; and 2) to determine whether and which genetic lesion status is associated with an identified gene entity. We developed and evaluated the performance of the method using a manually annotated data set containing 1,143 instances of the eight most frequently mentioned genes in cancer clinical trials. In addition, we applied the classifier to a real-world task of cancer trial annotation and evaluated its performance using a larger sample size (4,013 instances from 249 distinct human gene symbols detected from 250 trials).
RESULTS - Our evaluation using a manually annotated data set showed that the two-stage classifier outperformed the single-stage classifier and achieved the best average accuracy of 83.7% for the eight most frequently mentioned genes when optimized feature sets were used. It also showed better generalizability when we applied the two-stage classifier trained on one set of genes to another independent gene. When a gene-neutral, two-stage classifier was applied to the real-world task of cancer trial annotation, it achieved a maximum accuracy of 89.8%, demonstrating the feasibility of developing a gene-neutral classifier for this task.
CONCLUSIONS - We presented a machine learning-based approach to detect gene entities and the genetic lesion statuses from clinical trial documents and demonstrated its use in cancer trial annotation. Such methods would be valuable for building information retrieval tools targeting gene-associated clinical trials.
Spectral counting has become a widely used approach for measuring and comparing protein abundance in label-free shotgun proteomics. However, when analyzing complex samples, the ambiguity of matching between peptides and proteins greatly affects the assessment of peptide and protein inventories, differentiation, and quantification. Meanwhile, the configuration of database searching algorithms that assign peptides to MS/MS spectra may produce different results in comparative proteomic analysis. Here, we present three strategies to improve comparative proteomics through spectral counting. We show that comparing spectral counts for peptide groups rather than for protein groups forestalls problems introduced by shared peptides. We demonstrate the advantage and flexibility of this new method in two datasets. We present four models to combine four popular search engines that lead to significant gains in spectral counting differentiation. Among these models, we demonstrate a powerful vote counting model that scales well for multiple search engines. We also show that semi-tryptic searching outperforms tryptic searching for comparative proteomics. Overall, these techniques considerably improve protein differentiation on the basis of spectral count tables.
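To make the peptide-group idea concrete, the sketch below assumes one plausible reading: peptides that map to exactly the same set of proteins form a group, and spectral counts are summed per group, so a shared peptide is never credited to a single protein. The function name and the toy mappings are hypothetical, and the abstract's own grouping rules may differ.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set


def peptide_group_counts(peptide_to_proteins: Dict[str, Set[str]],
                         spectral_counts: Dict[str, int]) -> Dict[FrozenSet[str], int]:
    """Sum spectral counts over peptides that share the same protein set."""
    groups: Dict[FrozenSet[str], int] = defaultdict(int)
    for peptide, count in spectral_counts.items():
        key = frozenset(peptide_to_proteins[peptide])
        groups[key] += count
    return dict(groups)


# Toy example: PEPC is shared between proteins P1 and P2.
mapping = {"PEPA": {"P1"}, "PEPB": {"P1"}, "PEPC": {"P1", "P2"}, "PEPD": {"P2"}}
counts = {"PEPA": 3, "PEPB": 2, "PEPC": 4, "PEPD": 1}
group_counts = peptide_group_counts(mapping, counts)
```

Here the shared peptide PEPC keeps its own group keyed by both proteins, so its four spectra do not inflate the count of either P1 or P2.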
Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
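For readers unfamiliar with the dot product score criticized above, a minimal sketch of a normalized dot product between two binned spectra follows. The bin width, function names, and toy peaks are assumptions; the sketch reflects the general technique, not the scoring used by SpectraST or Pepitome.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Sum intensities into fixed-width m/z bins."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def dot_product_score(query: List[Peak], library: List[Peak],
                      bin_width: float = 1.0) -> float:
    """Normalized dot product (cosine similarity) of two binned spectra."""
    q = bin_spectrum(query, bin_width)
    lib = bin_spectrum(library, bin_width)
    dot = sum(q[b] * lib[b] for b in q if b in lib)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in lib.values())))
    return dot / norm if norm else 0.0


# Two peaks 0.8 m/z apart land in the same bin and score as a perfect match.
same_bin = dot_product_score([(100.1, 1.0)], [(100.9, 1.0)])
disjoint = dot_product_score([(100.1, 1.0)], [(250.4, 1.0)])
```

The `same_bin` case shows the limitation the abstract points to: once peaks are binned, the score no longer reflects how far apart the fragment m/z values actually are.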
Peptide identification via tandem mass spectrometry sequence database searching is a key method in the array of tools available to the proteomics researcher. The ability to rapidly and sensitively acquire tandem mass spectrometry data and perform peptide and protein identifications has become a commonly used proteomics analysis technique because of advances in both instrumentation and software. Although many different tandem mass spectrometry database search tools are currently available from both academic and commercial sources, these algorithms share similar core elements while maintaining distinctive features. This review revisits the mechanism of sequence database searching and discusses how various parameter settings impact the underlying search.