library(CDMConnector)
library(CodelistGenerator)
library(CohortConstructor)
library(CohortCharacteristics)
library(dplyr)
db <- DBI::dbConnect(duckdb::duckdb(),
                     dbdir = eunomiaDir(datasetName = "synthea-covid19-10k"))
cdm <- cdmFromCon(db, cdmSchema = "main", writeSchema = "main")
7 Adding cohorts to the CDM
7.1 What is a cohort?
When performing research with the OMOP common data model, we often want to identify groups of individuals who share some set of characteristics. The criteria for including individuals can range from the seemingly simple (e.g. people diagnosed with asthma) to the much more complicated (e.g. adults diagnosed with asthma who had at least a year of observation time in the database before their diagnosis, no prior history of chronic obstructive pulmonary disease, and no history of use of short-acting beta-agonists).
Each such group of people is a cohort, and the OMOP CDM has a specific structure by which cohorts can be represented, with a cohort table having four required fields: 1) cohort definition id (a unique identifier for each cohort), 2) subject id (a foreign key to the subject in the cohort, typically referring to records in the person table), 3) cohort start date, and 4) cohort end date. Individuals can enter a cohort multiple times, but the time periods in which they are in the cohort cannot overlap. Individuals are only considered to be in a cohort while they have an ongoing observation period.
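To make this structure concrete, the following is a minimal sketch using a small, hypothetical cohort table built locally with dplyr (toy_cohort and its values are invented for illustration and are not part of the synthetic dataset). It shows the four required fields and one simple way we might check that no person has overlapping entries within a cohort.

```r
library(dplyr)

# Hypothetical toy cohort table with the four required fields
toy_cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 1L),
  subject_id = c(1L, 1L, 2L),
  cohort_start_date = as.Date(c("2020-01-01", "2020-03-01", "2020-02-10")),
  cohort_end_date = as.Date(c("2020-01-31", "2020-03-15", "2020-02-20"))
)

# Flag any entry that starts on or before the end of the same person's
# previous entry in the same cohort (i.e. an overlapping entry)
overlaps <- toy_cohort |>
  group_by(cohort_definition_id, subject_id) |>
  arrange(cohort_start_date, .by_group = TRUE) |>
  mutate(previous_end = lag(cohort_end_date)) |>
  ungroup() |>
  filter(!is.na(previous_end), cohort_start_date <= previous_end)

# Number of overlapping entries found
nrow(overlaps)
```

Here subject 1 enters the cohort twice, but the second entry starts after the first has ended, so the table respects the no-overlap rule.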
It is beyond the scope of this book to describe all the different ways cohorts could be created. In this chapter we instead summarise some of the key building blocks for cohort creation, which can be combined into cohort-building pipelines to create a wide range of study cohorts.
7.2 Set up
We’ll use our synthetic dataset for demonstrating how cohorts can be constructed.
7.3 General concept based cohort
Often study cohorts will be based around a specific clinical event identified by some set of clinical codes. Here, for example, we use the CohortConstructor package to create a cohort of people with Covid-19. For this we are identifying any clinical records with the code 37311061.
cdm$covid <- conceptCohort(cdm = cdm,
                           conceptSet = list("covid" = 37311061),
                           name = "covid")
cdm$covid
# Source: table<covid> [?? x 4]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/C:\Users\eburn\AppData\Local\Temp\Rtmpor3Jzw\file62807441748d.duckdb]
cohort_definition_id subject_id cohort_start_date cohort_end_date
<int> <int> <date> <date>
1 1 5780 2020-11-20 2020-12-14
2 1 466 2020-12-04 2020-12-25
3 1 8959 2020-12-29 2021-01-13
4 1 10057 2020-10-31 2020-11-21
5 1 668 2020-09-27 2020-10-24
6 1 4918 2021-04-07 2021-05-04
7 1 378 2020-12-13 2021-01-11
8 1 8308 2020-10-28 2020-11-12
9 1 8954 2020-07-15 2020-08-09
10 1 7037 2020-04-08 2020-04-29
# ℹ more rows
In defining the cohort above we needed to provide concept IDs. But where do these come from?
We can search for codes of interest using the CodelistGenerator package. This can be done using a text search with the function CodelistGenerator::getCandidateCodes(). For example, we could have found the code we used above (and many others) like so:
getCandidateCodes(cdm = cdm,
keywords = c("coronavirus","covid"),
domains = "condition",
includeDescendants = TRUE)
Limiting to domains of interest
Getting concepts to include
Adding descendants
Search completed. Finishing up.
✔ 37 candidate concepts identified
Time taken: 0 minutes and 1 seconds
# A tibble: 37 × 6
concept_id found_from concept_name domain_id vocabulary_id standard_concept
<int> <chr> <chr> <chr> <chr> <chr>
1 3655977 From initia… Rhabdomyoly… Condition SNOMED S
2 40479642 From initia… Pneumonia d… Condition SNOMED S
3 37310283 From initia… Gastroenter… Condition SNOMED S
4 3661406 From initia… Acute respi… Condition SNOMED S
5 3661631 From initia… Lymphocytop… Condition SNOMED S
6 3661632 From initia… Thrombocyto… Condition SNOMED S
7 3655976 From initia… Acute hypox… Condition SNOMED S
8 3655973 From initia… At increase… Condition SNOMED S
9 439676 From initia… Coronavirus… Condition SNOMED S
10 3656668 From initia… Conjunctivi… Condition SNOMED S
# ℹ 27 more rows
We can also do automated searches that make use of the hierarchies in the vocabularies. Here, for example, we find the code for the drug ingredient acetaminophen and all of its descendants.
CodelistGenerator::getDrugIngredientCodes(cdm = cdm,
                                          name = "acetaminophen")
── 1 codelist ──────────────────────────────────────────────────────────────────
- 161_acetaminophen (25747 codes)
Note that in practice clinical expertise is vital when identifying appropriate codes, so as to decide which codes are in line with the clinical idea at hand.
As well as containing the cohort entries shown above, our cohort table is associated with several attributes.
First, we can see the settings associated with the cohort.
settings(cdm$covid) |>
glimpse()
Rows: 1
Columns: 4
$ cohort_definition_id <int> 1
$ cohort_name <chr> "covid"
$ cdm_version <chr> "5.3"
$ vocabulary_version <chr> "v5.0 22-JUN-22"
Second, we can get counts of the cohort.
cohortCount(cdm$covid) |>
glimpse()
Rows: 1
Columns: 3
$ cohort_definition_id <int> 1
$ number_records <int> 964
$ number_subjects <int> 964
And last we can see attrition related to the cohort.
attrition(cdm$covid) |>
glimpse()
Rows: 4
Columns: 7
$ cohort_definition_id <int> 1, 1, 1, 1
$ number_records <int> 964, 964, 964, 964
$ number_subjects <int> 964, 964, 964, 964
$ reason_id <int> 1, 2, 3, 4
$ reason <chr> "Initial qualifying events", "Record start <= rec…
$ excluded_records <int> 0, 0, 0, 0
$ excluded_subjects <int> 0, 0, 0, 0
As we will see below, these cohort attributes become particularly useful as we apply further restrictions to our cohort.
7.4 Applying inclusion criteria
7.4.1 Only include first cohort entry per person
Let’s say we first want to restrict each person to their first cohort entry.
cdm$covid <- cdm$covid |>
  requireIsFirstEntry()
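As a rough illustration of what restricting to the first entry means, the sketch below applies the same idea to a small, hypothetical local table using plain dplyr (toy_cohort and its values are invented for illustration). This is only meant to convey the logic, and is not how CohortConstructor implements it; among other things, it does not record attrition.

```r
library(dplyr)

# Hypothetical toy cohort in which subject 1 has two entries
toy_cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 1L),
  subject_id = c(1L, 1L, 2L),
  cohort_start_date = as.Date(c("2020-01-01", "2020-03-01", "2020-02-10")),
  cohort_end_date = as.Date(c("2020-01-31", "2020-03-15", "2020-02-20"))
)

# Keep only the earliest entry per person within each cohort
first_entries <- toy_cohort |>
  group_by(cohort_definition_id, subject_id) |>
  slice_min(cohort_start_date, n = 1, with_ties = FALSE) |>
  ungroup()

first_entries
```

After this step each subject contributes a single record, so the number of records equals the number of subjects.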
7.4.2 Restrict to study period
cdm$covid <- cdm$covid |>
  requireInDateRange(dateRange = c(as.Date("2020-09-01"), NA))
7.4.3 Applying demographic inclusion criteria
Say for our study we want to include only males who were aged 18 to 64 at the time of their cohort entry. We can apply these criteria with requireDemographics(), which will also update our cohort attributes, as we will see below.
cdm$covid <- cdm$covid |>
  requireDemographics(ageRange = c(18, 64), sex = "Male")
7.4.4 Applying cohort-based inclusion criteria
As well as requirements about specific demographics, we may also want to use another cohort for inclusion criteria. Let’s say we want to exclude anyone with a history of cardiac conditions before their Covid-19 cohort entry.
We can first generate this new cohort table with records of cardiac conditions.
cdm$cardiac <- conceptCohort(
  cdm = cdm,
  conceptSet = list("myocardial_infarction" = c(
    317576, 313217, 321042, 4329847
  )),
  name = "cardiac"
)
cdm$cardiac
# Source: table<cardiac> [?? x 4]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/C:\Users\eburn\AppData\Local\Temp\Rtmpor3Jzw\file62807441748d.duckdb]
cohort_definition_id subject_id cohort_start_date cohort_end_date
<int> <int> <date> <date>
1 1 4593 2017-01-05 2017-01-05
2 1 5411 2019-08-29 2019-08-29
3 1 7705 1966-02-27 1966-02-27
4 1 8051 2007-06-25 2007-06-25
5 1 9128 1983-06-09 1983-06-09
6 1 3167 2017-09-24 2017-09-24
7 1 4816 2007-09-14 2007-09-14
8 1 5490 2003-03-12 2003-03-12
9 1 2780 2014-04-23 2014-04-23
10 1 6587 1982-04-06 1982-04-06
# ℹ more rows
And now we can apply the inclusion criteria that individuals have zero intersections with the table in the time prior to their Covid-19 cohort entry.
cdm$covid <- cdm$covid |>
  requireCohortIntersect(targetCohortTable = "cardiac",
                         indexDate = "cohort_start_date",
                         window = c(-Inf, -1),
                         intersections = 0)
Note that if we had instead wanted to require that individuals did have a history of a cardiac condition, we would have set intersections = c(1, Inf) above.
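To illustrate how the window and intersections arguments interact, here is a simplified sketch on small, hypothetical local tables using plain dplyr (toy_covid, toy_cardiac, and their values are invented for illustration). It assumes each cardiac record is a single date, as in our cardiac cohort where start and end dates coincide, and it is not the package's actual implementation.

```r
library(dplyr)

# Hypothetical toy data: Covid-19 cohort entries and cardiac records
toy_covid <- tibble(
  subject_id = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2020-10-01", "2020-11-15", "2020-12-01"))
)
toy_cardiac <- tibble(
  subject_id = c(1L, 3L),
  cardiac_date = as.Date(c("2015-06-01", "2021-02-01"))
)

# Count cardiac records in the window c(-Inf, -1), i.e. any time up to
# the day before cohort entry
prior_counts <- toy_covid |>
  left_join(toy_cardiac, by = "subject_id") |>
  group_by(subject_id, cohort_start_date) |>
  summarise(
    n_prior = sum(!is.na(cardiac_date) & cardiac_date < cohort_start_date),
    .groups = "drop"
  )

# Analogous to intersections = 0: no prior cardiac record
no_history <- prior_counts |>
  filter(n_prior == 0)

# Analogous to intersections = c(1, Inf): at least one prior cardiac record
with_history <- prior_counts |>
  filter(n_prior >= 1)
```

In this toy example subject 3 has a cardiac record, but it falls after their cohort entry, so it lies outside the window and they are still counted as having no prior history.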
7.5 Cohort attributes
We can see that the attributes of the cohort were updated as we applied the inclusion criteria.
settings(cdm$covid) |>
glimpse()
Rows: 1
Columns: 8
$ cohort_definition_id <int> 1
$ cohort_name <chr> "covid"
$ age_range <chr> "18_64"
$ sex <chr> "Male"
$ min_prior_observation <dbl> 0
$ min_future_observation <dbl> 0
$ cdm_version <chr> "5.3"
$ vocabulary_version <chr> "v5.0 22-JUN-22"
cohortCount(cdm$covid) |>
glimpse()
Rows: 1
Columns: 3
$ cohort_definition_id <int> 1
$ number_records <int> 158
$ number_subjects <int> 158
attrition(cdm$covid) |>
glimpse()
Rows: 11
Columns: 7
$ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ number_records <int> 964, 964, 964, 964, 964, 793, 363, 171, 171, 171,…
$ number_subjects <int> 964, 964, 964, 964, 964, 793, 363, 171, 171, 171,…
$ reason_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
$ reason <chr> "Initial qualifying events", "Record start <= rec…
$ excluded_records <int> 0, 0, 0, 0, 0, 171, 430, 192, 0, 0, 13
$ excluded_subjects <int> 0, 0, 0, 0, 0, 171, 430, 192, 0, 0, 13
For attrition, we can use CohortCharacteristics::summariseCohortAttrition() and then CohortCharacteristics::plotCohortAttrition() to better view the impact of applying the additional inclusion criteria.
attrition_summary <- summariseCohortAttrition(cdm$covid)
plotCohortAttrition(attrition_summary)
8 Further reading
- …