The publication data currently available has been vetted by Vanderbilt faculty, staff, administrators and trainees. The data itself is retrieved directly from NCBI's PubMed and is automatically updated on a weekly basis to ensure accuracy and completeness.
If you have any questions or comments, please contact us.
Large-scale DNA databanks linked to electronic medical record (EMR) systems have been proposed as an approach for rapidly generating large, diverse cohorts for discovery and replication of genotype-phenotype associations. However, the extent to which such resources are capable of delivering on this promise is unknown. We studied whether an EMR-linked DNA biorepository can be used to detect known genotype-phenotype associations for five diseases. Twenty-one SNPs previously implicated as common variants predisposing to atrial fibrillation, Crohn disease, multiple sclerosis, rheumatoid arthritis, or type 2 diabetes were successfully genotyped in 9483 samples accrued over 4 mo into BioVU, the Vanderbilt University Medical Center DNA biobank. Previously reported odds ratios (OR(PR)) ranged from 1.14 to 2.36. For each phenotype, natural language processing techniques and billing-code queries were used to identify cases (n = 70-698) and controls (n = 808-3818) from deidentified health records. Each of the 21 tests of association yielded point estimates in the expected direction. Previous genotype-phenotype associations were replicated (p < 0.05) in 8/14 cases when the OR(PR) was > 1.25, and in 0/7 with lower OR(PR). Statistically significant associations were detected in all analyses that were adequately powered. In each of the five diseases studied, at least one previously reported association was replicated. These data demonstrate that phenotypes representing clinical diagnoses can be extracted from EMR systems, and they support the use of DNA resources coupled to EMR systems as tools for rapid generation of large data sets required for replication of associations found in research cohorts and for discovery in genome science.
(c) 2010 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Colorectal cancer (CRC) screening rates are low despite proven benefits. We developed natural language processing (NLP) algorithms to identify temporal expressions and status indicators, such as "patient refused" or "test scheduled." The authors incorporated the algorithms into the KnowledgeMap Concept Identifier system in order to detect references to completed colonoscopies within electronic text. The modified NLP system was evaluated using 200 randomly selected electronic medical records (EMRs) from a primary care population aged >/=50 years. The system detected completed colonoscopies with recall and precision of 0.93 and 0.92. The system was superior to a query of colonoscopy billing codes to determine screening status.
OBJECTIVE - Many healthcare organizations follow data protection policies that specify which patient identifiers must be suppressed to share "de-identified" records. Such policies, however, are often applied without knowledge of the risk of "re-identification". The goals of this work are: (1) to estimate re-identification risk for data sharing policies of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule; and (2) to evaluate the risk of a specific re-identification attack using voter registration lists.
MEASUREMENTS - We define several risk metrics: (1) expected number of re-identifications; (2) estimated proportion of a population in a group of size g or less, and (3) monetary cost per re-identification. For each US state, we estimate the risk posed to hypothetical datasets, protected by the HIPAA Safe Harbor and Limited Dataset policies by an attacker with full knowledge of patient identifiers and with limited knowledge in the form of voter registries.
RESULTS - The percentage of a state's population estimated to be vulnerable to unique re-identification (ie, g=1) when protected via Safe Harbor and Limited Datasets ranges from 0.01% to 0.25% and 10% to 60%, respectively. In the voter attack, this number drops for many states, and for some states is 0%, due to the variable availability of voter registries in the real world. We also find that re-identification cost ranges from $0 to $17,000, further confirming risk variability.
CONCLUSIONS - This work illustrates that blanket protection policies, such as Safe Harbor, leave different organizations vulnerable to re-identification at different rates. It provides justification for locally performed re-identification risk estimates prior to sharing data.
OBJECTIVE - De-identified medical records are critical to biomedical research. Text de-identification software exists, including "resynthesis" components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness and examine possible bias introduced by resynthesis on de-identification software.
DESIGN - We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files, including laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records.
MEASUREMENTS - We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule.
RESULTS - The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Results for training and testing on the real records were 0.990 accuracy and 0.960 F-measure. The results improved when trained and tested on resynthesized records with 0.998 accuracy and 0.980 F-measure but deteriorated moderately when trained on real records and tested on resynthesized records with 0.989 accuracy 0.862 F-measure. Moreover, the results declined significantly when trained on resynthesized records and tested on real records with 0.942 accuracy and 0.728 F-measure.
CONCLUSION - The de-identification tool achieves high accuracy when training and test sets are homogeneous (ie, both real or resynthesized records). The resynthesis component regularizes the data to make them less "realistic," resulting in loss of performance particularly when training on resynthesized data and testing on real data.
OBJECTIVES - Improvements in electronic health record (EHR) system development will require an understanding of psychiatric clinicians' views on EHR system acceptability, including effects on psychotherapy communications, data-recording behaviors, data accessibility versus security and privacy, data quality and clarity, communications with medical colleagues, and stigma.
DESIGN - Multidisciplinary development of a survey instrument targeting psychiatric clinicians who recently switched to EHR system use, focus group testing, data analysis, and data reliability testing.
MEASUREMENTS - Survey of 120 university-based, outpatient mental health clinicians, with 56 (47%) responding, conducted 18 months after transition from a paper to an EHR system.
RESULTS - Factor analysis gave nine item groupings that overlapped strongly with five a priori domains. Respondents both praised and criticized the EHR system. A strong majority (81%) felt that open therapeutic communications were preserved. Regarding data quality, content, and privacy, clinicians (63%) were less willing to record highly confidential information and disagreed (83%) with including their own psychiatric records among routinely accessed EHR systems.
LIMITATIONS - single time point; single academic medical center clinic setting; modest sample size; lack of prior instrument validation; survey conducted in 2005.
CONCLUSIONS - In an academic medical center clinic, the presence of electronic records was not seen as a dramatic impediment to therapeutic communications. Concerns regarding privacy and data security were significant, and may contribute to reluctances to adopt electronic records in other settings. Further study of clinicians' views and use patterns may be helpful in guiding development and deployment of electronic records systems.
Medication information is one of the most important types of clinical data in electronic medical records. It is critical for healthcare safety and quality, as well as for clinical research that uses electronic medical record data. However, medication data are often recorded in clinical notes as free-text. As such, they are not accessible to other computerized applications that rely on coded data. We describe a new natural language processing system (MedEx), which extracts medication information from clinical notes. MedEx was initially developed using discharge summaries. An evaluation using a data set of 50 discharge summaries showed it performed well on identifying not only drug names (F-measure 93.2%), but also signature information, such as strength, route, and frequency, with F-measures of 94.5%, 93.9%, and 96.0% respectively. We then applied MedEx unchanged to outpatient clinic visit notes. It performed similarly with F-measures over 90% on a set of 25 clinic visit notes.
OBJECTIVES - Healthcare organizations must adopt measures to uphold their patients' right to anonymity when sharing sensitive records, such as DNA sequences, to publicly accessible databanks. This is often achieved by suppressing patient identifiable information; however, such a practice is insufficient because the same organizations may disclose identified patient information, devoid of the sensitive information, for other purposes and patients' organization-visit patterns, or trails, can re-identify records to the identities from which they were derived. There exist various algorithms that healthcare organizations can apply to ascertain when a patient's record is susceptible to trail re-identification, but they require organizations to exchange information regarding the identities of their patients prior to data protection certification. In this paper, we introduce an algorithmic approach to formally thwart trail re-identification in a secure setting.
METHODS AND MATERIALS - We present a framework that allows data holders to securely collaborate through a third party. In doing so, healthcare organizations keep all sensitive information in an encrypted state until the third party certifies that the data to be disclosed satisfies a formal data protection model. The model adopted for this work is an extended form of k-unlinkability, a protection model that, until this work, was applied in a non-secure setting only. Given the framework and protection model, we develop an algorithm to generate data that satisfies the protection model. In doing so, we enable healthcare organizations to prevent trail re-identification without revealing identified information.
RESULTS - Theoretically, we prove that the proposed data protection model does not leak information, even in the context of an organization's prior knowledge. Empirically, we use real world hospital discharge records to demonstrate that, while the secure protocol induces additional suppression of patient information in comparison to an existing non-secure approach, the quantity of data disclosed by the secure protocol remains substantial. For instance, in a population of over 7700 sickle cell anemia patients, the non-secure protocol discloses 99.48% of DNA records whereas the secure protocol permits the disclosure of 99.41%.
CONCLUSIONS - Our results demonstrate healthcare organizations can collaborate to disclose significant quantities of personal biomedical data without violating their anonymity in the process.
2009 Elsevier B.V. All rights reserved.
OBJECTIVE - Clinical notes, typically written in natural language, often contain substructure that divides them into sections, such as "History of Present Illness" or "Family Medical History." The authors designed and evaluated an algorithm ("SecTag") to identify both labeled and unlabeled (implied) note section headers in "history and physical examination" documents ("H&P notes").
DESIGN - The SecTag algorithm uses a combination of natural language processing techniques, word variant recognition with spelling correction, terminology-based rules, and naive Bayesian scoring methods to identify note section headers. Eleven physicians evaluated SecTag's performance on 319 randomly chosen H&P notes.
MEASUREMENTS - The primary outcomes were the algorithm's recall and precision in identifying all document sections and a predefined list of twenty-nine major sections. A secondary outcome was to evaluate the algorithm's ability to recognize the correct start and end boundaries of identified sections.
RESULTS - The SecTag algorithm identified 16,036 total sections and 7,858 major sections. Physician evaluators classified 15,329 as true positives and identified 160 sections omitted by SecTag. The recall and precision of the SecTag algorithm were 99.0 and 95.6% for all sections, 98.6 and 96.2% for major sections, and 96.6 and 86.8% for unlabeled sections. The algorithm determined the correct starting and ending text boundaries for 94.8% of labeled sections and 85.9% of unlabeled sections.
CONCLUSIONS - The SecTag algorithm accurately identified both labeled and unlabeled sections in history and physical documents. This type of algorithm may assist in natural language processing applications, such as clinical decision support systems or competency assessment for medical trainees.