TY - JOUR
T1 - Computable phenotypes to identify respiratory viral infections in the All of Us research program
AU - Waxse, Bennett J.
AU - Bustos Carrillo, Fausto Andres
AU - Tran, Tam C.
AU - Mo, Huan
AU - Ricotta, Emily E.
AU - Denny, Joshua C.
N1 - Publisher Copyright: © This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2025.
PY - 2025/12
Y1 - 2025/12
N2 - Electronic health records (EHRs) contain rich temporal data about respiratory viral infections, but methods to identify these infections from EHR data vary widely and lack robust validation. We developed computable phenotypes by integrating virus-specific International Classification of Diseases (ICD) billing codes, prescriptions, and laboratory results within 90-day episodes. Analysis of 265,222 participants with EHR data from the All of Us Research Program yielded national cohorts of varied size: large cohorts for SARS-CoV-2 (n = 28,729) and influenza (n = 19,784); medium cohorts for rhinovirus, human coronavirus, and respiratory syncytial virus (n = 1,161–1,620); and smaller cohorts for the other viruses (n = 238–486). Using laboratory results as a reference standard, phenotypes using virus-specific ICD codes and medications had variable sensitivity (8–67%) but high positive predictive value (PPV, 90–97%) for most viruses, while influenza virus and SARS-CoV-2 phenotypes had lower PPV (69–70%) that improved with the inclusion of additional ICD codes. Identified infections exhibited expected seasonal patterns matching CDC data. This integrated approach identified infections more effectively than individual components alone and demonstrated utility for severe infections in hospital settings. This method enables large-scale studies of host genetics, health disparities, and clinical outcomes across episodic diseases, with flexibility to optimize sensitivity or PPV depending on the specific research question.
AB - Electronic health records (EHRs) contain rich temporal data about respiratory viral infections, but methods to identify these infections from EHR data vary widely and lack robust validation. We developed computable phenotypes by integrating virus-specific International Classification of Diseases (ICD) billing codes, prescriptions, and laboratory results within 90-day episodes. Analysis of 265,222 participants with EHR data from the All of Us Research Program yielded national cohorts of varied size: large cohorts for SARS-CoV-2 (n = 28,729) and influenza (n = 19,784); medium cohorts for rhinovirus, human coronavirus, and respiratory syncytial virus (n = 1,161–1,620); and smaller cohorts for the other viruses (n = 238–486). Using laboratory results as a reference standard, phenotypes using virus-specific ICD codes and medications had variable sensitivity (8–67%) but high positive predictive value (PPV, 90–97%) for most viruses, while influenza virus and SARS-CoV-2 phenotypes had lower PPV (69–70%) that improved with the inclusion of additional ICD codes. Identified infections exhibited expected seasonal patterns matching CDC data. This integrated approach identified infections more effectively than individual components alone and demonstrated utility for severe infections in hospital settings. This method enables large-scale studies of host genetics, health disparities, and clinical outcomes across episodic diseases, with flexibility to optimize sensitivity or PPV depending on the specific research question.
KW - Computable phenotype
KW - Electronic health records
KW - Influenza virus
KW - Precision medicine
KW - Reproducibility of results
KW - Respiratory tract infections
UR - http://www.scopus.com/inward/record.url?scp=105006829503&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-02183-9
DO - 10.1038/s41598-025-02183-9
M3 - Article
C2 - 40437102
AN - SCOPUS:105006829503
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 18680
ER -