Characterise databases
2025-06-26
datasetName <- "GiBleed"
dbdir <- here(paste0(datasetName, ".duckdb"))
con <- dbConnect(drv = duckdb(dbdir = dbdir))
cdm <- cdmFromCon(
con = con,
cdmSchema = "main",
writeSchema = "main",
writePrefix = "os_",
cdmName = datasetName
)
summariseOmopSnapshot()
function provides a real-time overview of the OMOP database, highlighting key characteristics at a specific moment.
vocabulary
table.person
and observation_period
tables.observation_period
table.summariseOmopSnapshot()
snapshot <- summariseOmopSnapshot(cdm = cdm)
glimpse(snapshot)
Rows: 13
Columns: 13
$ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ cdm_name <chr> "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBl…
$ group_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ group_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name <chr> "general", "general", "observation_period", "cdm", "general", "cdm", "cdm", "cdm", "cdm", "cd…
$ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ estimate_name <chr> "snapshot_date", "person_count", "count", "source_name", "vocabulary_version", "version", "ho…
$ estimate_type <chr> "date", "integer", "integer", "character", "character", "character", "character", "character"…
$ estimate_value <chr> "2025-06-26", "2694", "5343", "Synthea synthetic health database", "v5.0 18-JAN-19", "v5.3.1"…
$ additional_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
tableOmopSnapshot()
tableOmopSnapshot(result = snapshot, type = "gt")
Estimate |
Database name
|
---|---|
GiBleed | |
General | |
Snapshot date | 2025-06-26 |
Person count | 2,694 |
Vocabulary version | v5.0 18-JAN-19 |
Observation period | |
N | 5,343 |
Start date | 1908-09-22 |
End date | 2019-07-03 |
Cdm | |
Source name | Synthea synthetic health database |
Version | v5.3.1 |
Holder name | OHDSI Community |
Release date | 2019-05-25 |
Description | SyntheaTM is a Synthetic Patient Population Simulator. The goal is to output synthetic, realistic (but not real), patient data and associated health records in a variety of formats. |
Documentation reference | https://synthetichealth.github.io/synthea/ |
Source type | duckdb |
summariseObservationPeriod()
function generates essential statistics derived from the observation_period table.
summariseObservationPeriod()
result <- summariseObservationPeriod(observationPeriod = cdm$observation_period, sex = TRUE)
glimpse(result)
Rows: 9,306
Columns: 13
$ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name <chr> "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBl…
$ group_name <chr> "observation_period_ordinal", "observation_period_ordinal", "observation_period_ordinal", "ob…
$ group_level <chr> "all", "all", "all", "all", "all", "all", "all", "all", "all", "all", "all", "all", "all", "a…
$ strata_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name <chr> "Number records", "Number subjects", "Records per person", "Records per person", "Records per…
$ variable_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ estimate_name <chr> "count", "count", "mean", "sd", "min", "q05", "q25", "median", "q75", "q95", "max", "mean", "…
$ estimate_type <chr> "integer", "integer", "numeric", "numeric", "integer", "integer", "integer", "integer", "inte…
$ estimate_value <chr> "2694", "2694", "1", "0", "1", "1", "1", "1", "1", "1", "1", "21602.6035634744", "5460.692649…
$ additional_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ additional_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
tableObservationPeriod()
tableObservationPeriod(result = result, type = "gt")
Observation period ordinal | Variable name | Estimate name |
CDM name
|
---|---|---|---|
GiBleed | |||
overall | |||
all | Number records | N | 2,694 |
Number subjects | N | 2,694 | |
Records per person | mean (sd) | 1.00 (0.00) | |
median [Q25 - Q75] | 1 [1 - 1] | ||
Duration in days | mean (sd) | 21,602.60 (5,460.69) | |
median [Q25 - Q75] | 20,872 [17,495 - 24,702] | ||
1st | Number subjects | N | 2,694 |
Duration in days | mean (sd) | 21,602.60 (5,460.69) | |
median [Q25 - Q75] | 20,872 [17,495 - 24,702] | ||
Female | |||
all | Number records | N | 1,373 |
Number subjects | N | 1,373 | |
Records per person | mean (sd) | 1.00 (0.00) | |
median [Q25 - Q75] | 1 [1 - 1] | ||
Duration in days | mean (sd) | 21,666.77 (5,623.53) | |
median [Q25 - Q75] | 20,861 [17,382 - 24,683] | ||
1st | Number subjects | N | 1,373 |
Duration in days | mean (sd) | 21,666.77 (5,623.53) | |
median [Q25 - Q75] | 20,861 [17,382 - 24,683] | ||
Male | |||
all | Number records | N | 1,321 |
Number subjects | N | 1,321 | |
Records per person | mean (sd) | 1.00 (0.00) | |
median [Q25 - Q75] | 1 [1 - 1] | ||
Duration in days | mean (sd) | 21,535.91 (5,287.44) | |
median [Q25 - Q75] | 20,973 [17,557 - 24,704] | ||
1st | Number subjects | N | 1,321 |
Duration in days | mean (sd) | 21,535.91 (5,287.44) | |
median [Q25 - Q75] | 20,973 [17,557 - 24,704] |
plotObservationPeriod()
plotObservationPeriod(result = result,
plotType = "barplot",
variableName = "Number subjects",
colour = "sex")
plotObservationPeriod()
plotObservationPeriod(result = result,
plotType = "densityplot",
variableName = "Duration in days",
colour = "sex",
facet = "observation_period_ordinal")
summariseInObservation()
function summarises trends derived from the observation_period
table.
summariseInObservation()
result <- summariseInObservation(observationPeriod = cdm$observation_period,
interval = "years",
output = c("record","person", "person-days", "sex", "age"),
ageGroup = list(c(0,50), c(51, Inf)))
plotInObservation()
result |>
filter(variable_name == "Number person-days") |>
plotInObservation(colour = "age_group")
plotInObservation()
result |>
filter(variable_name == "Median age in observation") |>
plotInObservation(facet = "age_group")
Create a plot showing the quarterly trend of the number of people in observation, stratified by sex and age group (0-30, 31-60, 61+).
Create the summarised result using summariseInObservation()
:
Create the plot using plotInObservation()
, with:
facet = "sex"
→ Separate panels by sexcolour = "age_group"
→ Different colors for age groups (0-30
, 31-60
, 61+
)result <- summariseInObservation(observationPeriod = cdm$observation_period,
output = "person",
interval = "quarters",
sex = TRUE,
ageGroup = list(c(0,30), c(31,60), c(61, Inf)))
plotInObservation(result = result, facet = "sex", colour = "age_group")
summariseMissingData()
function summarises missingness in a OMOP table
summariseMissingData()
result <- summariseMissingData(cdm = cdm,
omopTableName = "drug_exposure",
col = NULL,
sample = 1000)
tableMissingData()
tableMissingData(result = result, type = "gt")
Column name | Estimate name |
Database name
|
---|---|---|
GiBleed | ||
drug_exposure | ||
drug_exposure_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 0 (0.00%) | |
person_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 0 (0.00%) | |
drug_concept_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 0 (0.00%) | |
drug_exposure_start_date | N missing data (%) | 0 (0.00%) |
drug_exposure_start_datetime | N missing data (%) | 0 (0.00%) |
drug_exposure_end_date | N missing data (%) | 0 (0.00%) |
drug_exposure_end_datetime | N missing data (%) | 0 (0.00%) |
verbatim_end_date | N missing data (%) | 70 (7.00%) |
drug_type_concept_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 0 (0.00%) | |
stop_reason | N missing data (%) | 1,000 (100.00%) |
refills | N missing data (%) | 0 (0.00%) |
quantity | N missing data (%) | 0 (0.00%) |
days_supply | N missing data (%) | 0 (0.00%) |
sig | N missing data (%) | 1,000 (100.00%) |
route_concept_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 1,000 (100.00%) | |
lot_number | N missing data (%) | 0 (0.00%) |
provider_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 1,000 (100.00%) | |
visit_occurrence_id | N missing data (%) | 12 (1.20%) |
N zeros (%) | 0 (0.00%) | |
visit_detail_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 1,000 (100.00%) | |
drug_source_value | N missing data (%) | 0 (0.00%) |
drug_source_concept_id | N missing data (%) | 0 (0.00%) |
N zeros (%) | 0 (0.00%) | |
route_source_value | N missing data (%) | 1,000 (100.00%) |
dose_unit_source_value | N missing data (%) | 1,000 (100.00%) |
summariseClinicalRecords()
function provides a comprehensive overview of the content of a clinical table.
summariseClinicalRecords()
result <- summariseClinicalRecords(cdm = cdm,
omopTableName = "condition_occurrence",
recordsPerPerson = c("mean", "sd", "median", "q25", "q75", "min", "max"),
inObservation = TRUE,
standardConcept = TRUE,
sourceVocabulary = TRUE,
domainId = TRUE,
typeConcept = TRUE)
tableClinicalRecords()
tableClinicalRecords(result = result, type = "gt")
Variable name | Variable level | Estimate name |
Database name
|
---|---|---|---|
GiBleed | |||
condition_occurrence | |||
Number records | - | N | 65,332 |
Number subjects | - | N (%) | 2,694 (100.00%) |
Records per person | - | Mean (SD) | 24.25 (7.41) |
Median [Q25 - Q75] | 23 [19 - 29] | ||
Range [min to max] | [5 to 65] | ||
In observation | No | N (%) | 450 (0.69%) |
Yes | N (%) | 64,882 (99.31%) | |
Domain | Condition | N (%) | 65,332 (100.00%) |
Source vocabulary | Icd10cm | N (%) | 479 (0.73%) |
No matching concept | N (%) | 27 (0.04%) | |
Snomed | N (%) | 64,826 (99.23%) | |
Standard concept | S | N (%) | 65,332 (100.00%) |
Type concept id | Ehr encounter diagnosis | N (%) | 65,332 (100.00%) |
summariseRecordCount()
function summarises the record trend over specified time intervals. Only records that fall within the observation period are considered.
result <- summariseRecordCount(cdm = cdm,
omopTableName = "visit_occurrence",
interval = "years",
dateRange = as.Date(c("2000-01-01", "2015-12-31")))
plotRecordCount()
plotRecordCount(result = result, colour = "omop_table")
Create a plot showing the yearly trend in the number of drug and condition records, stratified by sex and in the date range 2010-2018.
Create the summarised result using summariseRecordCount()
:
Create the plot using plotRecordCount()
, with:
facet = "omop_table"
→ Separate panels by tablecolour = "sex"
→ Different colors for males and femalesresult <- summariseRecordCount(cdm = cdm,
omopTableName = c("drug_exposure", "condition_occurrence"),
interval = "years",
sex = TRUE,
dateRange = as.Date(c("2010-01-01", "2018-12-31")))
plotRecordCount(result = result, facet = "omop_table", colour = "sex")
summariseConceptIdCounts()
provides the counts for each concept id available in the table
summariseConceptIdCounts()
countBy = "record"
by default, or “person” to count distinct subjects)result <- summariseConceptIdCounts(cdm = cdm,
omopTableName = "procedure_occurrence",
countBy = c("record", "person"))
glimpse(result)
Rows: 102
Columns: 13
$ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cdm_name <chr> "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBl…
$ group_name <chr> "omop_table", "omop_table", "omop_table", "omop_table", "omop_table", "omop_table", "omop_tab…
$ group_level <chr> "procedure_occurrence", "procedure_occurrence", "procedure_occurrence", "procedure_occurrence…
$ strata_name <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ strata_level <chr> "overall", "overall", "overall", "overall", "overall", "overall", "overall", "overall", "over…
$ variable_name <chr> "Surgical manipulation of joint of knee", "Surgical manipulation of joint of knee", "Review o…
$ variable_level <chr> "44783196", "44783196", "4187458", "4187458", "4062501", "4062501", "4330583", "4330583", "43…
$ estimate_name <chr> "count_records", "count_subjects", "count_records", "count_subjects", "count_records", "count…
$ estimate_type <chr> "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "inte…
$ estimate_value <chr> "269", "254", "100", "100", "101", "101", "61", "61", "100", "100", "130", "76", "438", "236"…
$ additional_name <chr> "source_concept_id &&& source_concept_name", "source_concept_id &&& source_concept_name", "so…
$ additional_level <chr> "44783196 &&& Surgical manipulation of joint of knee", "44783196 &&& Surgical manipulation of…
tableConceptIdCounts()
display
option:
display = "standard"
: Show only standard conceptsdisplay = "source"
: Show only source codesdisplay = "missing standard"
: Show source codes without a mapped standard conceptdisplay = "missing source
“: Show standard concepts without a mapped source codedisplay = "overall" (default)
: Show all recordstableConceptIdCounts()
tableConceptIdCounts(result = result, display = "standard",type = "reactable")
tableTopConceptCounts()
top = 10
)tableTopConceptCounts()
tableTopConceptCounts(result = result, countBy = "record", top = 10, type = "gt")
Top |
Cdm name
|
---|---|
GiBleed | |
procedure_occurrence | |
1 | Standard: Subcutaneous immunotherapy (4107731) Source: Subcutaneous immunotherapy (4107731) 17520 |
2 | Standard: Cognitive and behavioral therapy (4043071) Source: Cognitive and behavioral therapy (4043071) 2820 |
3 | Standard: Suture open wound (4125906) Source: Suture open wound (4125906) 2487 |
4 | Standard: Bone immobilization (4170947) Source: Bone immobilization (4170947) 1789 |
5 | Standard: Sputum examination (4151422) Source: Sputum examination (4151422) 1659 |
6 | Standard: Direct current cardioversion (4078793) Source: Direct current cardioversion (4078793) 1342 |
7 | Standard: Intramuscular injection (4295880) Source: Intramuscular injection (4295880) 908 |
8 | Standard: Pulmonary rehabilitation (4035793) Source: Pulmonary rehabilitation (4035793) 908 |
9 | Standard: Allergy screening test (4191853) Source: Allergy screening test (4191853) 704 |
10 | Standard: Bone density scan (4195803) Source: Bone density scan (4195803) 655 |
Identify the top 5 concepts in the drug_exposure table based on the number of subjects
Top |
Cdm name
|
---|---|
GiBleed | |
drug_exposure | |
1 | Standard: tetanus and diphtheria toxoids, adsorbed, preservative free, for adult use (40213227) Source: tetanus and diphtheria toxoids, adsorbed, preservative free, for adult use (40213227) 2660 |
2 | Standard: Acetaminophen 325 MG Oral Tablet (1127433) Source: Acetaminophen 325 MG Oral Tablet (1127433) 2580 |
3 | Standard: poliovirus vaccine, inactivated (40213160) Source: poliovirus vaccine, inactivated (40213160) 2140 |
4 | Standard: Amoxicillin 250 MG / Clavulanate 125 MG Oral Tablet (1713671) Source: Amoxicillin 250 MG / Clavulanate 125 MG Oral Tablet (1713671) 2021 |
5 | Standard: Aspirin 81 MG Oral Tablet (19059056) Source: Aspirin 81 MG Oral Tablet (19059056) 1927 |
Create the summarised result using summariseConceptIdCount()
Display the 5 most frequent concept in using tableTopConceptCounts()
, with:
top = 5
→ Show the 5 most frequentresult <- summariseConceptIdCounts(cdm = cdm, omopTableName = "drug_exposure", countBy = "person")
tableTopConceptCounts(result = result, top = 5)
databaseCharacteristics()
function brings together all the key summaries in one place. It includes:
Snapshot of the database
Characterisation of the observation period
Quality checks on the tables
Overview of clinical table content
databaseCharacteristics()
Select the tables to analyse
Choose whether to stratify by sex and age groups
Define the time interval for temporal summaries
Set the date range to focus the analysis
Optionally include concept ID counts (conceptIdCounts = FALSE
by default)
shinyCharacteristics()
creates a Shiny app to explore the results of the database characterisation.
Specify the directory where the app should be created
Optionally customise the title, logo, and visual theme
shinyCharacteristics(result = result, directory = here())
shinyCharacteristics()