Clean and prepare UK Biobank primary care EHR for research. Tested with the interim EHR data release.
- Install the
ukbbhelpr
R package from here. - Clone the EHR code set repository here.
- Clone this repository and follow the instructions below.
The following R packages are required. Install them using:
required <- c("zoo", "dplyr", "plyr", "ggplot2", "cowplot")
optional <- c("caret", "QDiabetes", "survival")
install.packages(required)
install.packages(optional) # needed to run code in the paper directory
Download the data for your UK Biobank application from the data showcase. The following fields are required to process the primary care EHR data:
Description | Field |
---|---|
Year and month of birth | 34 , 52 |
Date of assessment centre visit | 53 |
Linked date of death | 40000 |
The fields below are required to run the code in the 02_extract_records
and paper
directories:
Description | Field |
---|---|
Demographic data | 31 , 189 , 21000 |
Anthropomorphic measurements | 48 , 50 , 21002 |
HbA1c blood glucose | 30750 |
Self-reported non-cancer medical history | 2986 , 20002 , 20003 , 20008 |
Smoking history | 1249 , 2887 , 3456 , 20116 |
Summary secondary care data | 41270 , 41271 , 41272 , 41273 , 41280 , 41281 , 41282 , 41283 |
01_prepare_data/01_subset_visit_data.R
if any of the optional fields above are unavailable.
In addition, the primary care data is required:
Description | File |
---|---|
Participant registration records | gp_registrations.txt |
Clinical event records | gp_clinical.txt |
Prescription records | gp_scripts.txt |
- Update
file_paths.R
with the paths to your downloaded data. - Run the scripts in the
01_prepare_data
directory sequentially to infer periods of data collection for each participant. The results are saved indata/data_period.rds
by default. - Run the scripts in the
02_extract_records
directory sequentially to extract the files marked * in the table below.
Alternatively, run_all.R
can be run instead of steps 3 and 4.
run_all.R
in particular is very memory intensive. Use of a high performance computing service is recommended. UK Biobank data must be stored and processed as required under the Material Transfer Agreement.
Tested with the September 2019 interim EHR release on an Intel Xeon E5-2699 v4 processor (2.2 GHz, 22 cores, 55 MB cache) with 256Gb RAM running R 3.6 on CentOS Linux 7. The code has not been tested on R 4.0+.
The following files are saved in the data
directory by default:
File | Description |
---|---|
data_period.rds |
Period(s) of EHR data collection for each participant |
gp_event.rds |
Clean event/diagnosis data |
gp_presc.rds |
Clean prescription data |
biomarkers.rds * |
Extracted biomarkers |
demographic.rds * |
Ethnicity, smoking history and Townsend deprivation |
family_history.rds * |
Family history data |
diagnoses.rds * |
Extracted diagnosis codes for a range of common conditions |
prescriptions.rds * |
Estimated periods during which selected drugs were prescribed |
Files marked * are generated by the scripts in the 02_extract_records
directory.
visualisation/01_algorithm.R
can be used to plot the results of the algorithm used to infer periods of EHR data collection for a participant.
visualisation/02_phenotyping.R
can be used to plot the results of the diabetes phenotyping algorithm. paper/02_diabetes_phenotyping.R
must be run first.
If you use this work, please cite it as below:
@article{10.1093/jamia/ocab260,
author = {Darke, Philip and Cassidy, Sophie and Catt, Michael and Taylor, Roy and Missier, Paolo and Bacardit, Jaume},
title = "{Curating a longitudinal research resource using linked primary care EHR data - a UK Biobank case study}",
journal = {Journal of the American Medical Informatics Association},
volume = {29},
number = {3},
pages = {546-552},
year = {2021},
month = {12},
issn = {1527-974X},
doi = {10.1093/jamia/ocab260},
url = {https://doi.org/10.1093/jamia/ocab260},
eprint = {https://academic.oup.com/jamia/article-pdf/29/3/546/42333190/ocab260.pdf},
}
Made available under the MIT Licence.