library(DBI)
library(duckdb)
library(CDMConnector)
library(CodelistGenerator)
library(CohortConstructor)
library(CohortCharacteristics)
library(dplyr)
db <- dbConnect(drv = duckdb(),
                dbdir = eunomiaDir(datasetName = "synthea-covid19-10k"))
cdm <- cdmFromCon(db, cdmSchema = "main", writeSchema = "main")
8 Adding cohorts to the CDM
8.1 What is a cohort?
When performing research with the OMOP common data model we often want to identify groups of individuals who share some set of characteristics. The criteria for including individuals can range from the seemingly simple (e.g. people diagnosed with asthma) to the much more complicated (e.g. adults diagnosed with asthma who had a year of observation time in the database prior to their diagnosis, had no prior history of chronic obstructive pulmonary disease, and no history of use of short-acting beta-agonists).
The set of people we identify is a cohort, and the OMOP CDM has a specific structure by which cohorts can be represented, with a cohort table having four required fields: 1) cohort definition id (a unique identifier for each cohort), 2) subject id (a foreign key to the subject in the cohort, typically referring to records in the person table), 3) cohort start date, and 4) cohort end date. Individuals can enter a cohort multiple times, but the time periods in which they are in the cohort cannot overlap. Individuals are only considered to be in a cohort while they have an ongoing observation period.
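To make this structure concrete, here is a minimal sketch of a cohort table built as a local tibble rather than through a CDM reference. The data values and the helper name has_overlap are invented for illustration. It shows the four required fields and a simple check that no person has overlapping entries within a cohort.

```r
library(dplyr)

# Toy cohort table with the four required fields (values are invented)
cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 1L),
  subject_id           = c(10L, 10L, 20L),
  cohort_start_date    = as.Date(c("2020-01-01", "2020-03-01", "2020-02-15")),
  cohort_end_date      = as.Date(c("2020-01-31", "2020-03-20", "2020-02-28"))
)

# Hypothetical helper: TRUE if any person has overlapping entries in a cohort
has_overlap <- function(x) {
  overlapping <- x |>
    group_by(cohort_definition_id, subject_id) |>
    arrange(cohort_start_date, .by_group = TRUE) |>
    # End date of the same person's previous entry (NA for their first entry)
    mutate(previous_end = lag(cohort_end_date)) |>
    ungroup() |>
    filter(!is.na(previous_end), cohort_start_date <= previous_end)
  nrow(overlapping) > 0
}

has_overlap(cohort)
```

Subject 10 appears twice here, which is allowed because the two periods do not overlap.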
It is beyond the scope of this book to describe all the different ways cohorts could be created. In this chapter, however, we summarise some of the key building blocks for cohort creation. Cohort-building pipelines that follow these principles can be used to create a wide range of study cohorts.
8.2 Set up
We’ll use our synthetic dataset for demonstrating how cohorts can be constructed.
8.3 General concept based cohort
Often study cohorts will be based around a specific clinical event identified by some set of clinical codes. Here, for example, we use the CohortConstructor package to create a cohort of people with Covid-19. For this we identify any clinical records with the concept ID 37311061.
cdm$covid <- conceptCohort(cdm = cdm,
                           conceptSet = list("covid" = 37311061),
                           name = "covid")
cdm$covid
# Source: table<covid> [?? x 4]
# Database: DuckDB v1.2.1 [unknown@Linux 6.11.0-1012-azure:R 4.5.0//tmp/RtmprVfhda/file2f0c37a9187.duckdb]
cohort_definition_id subject_id cohort_start_date cohort_end_date
<int> <int> <date> <date>
1 1 1635 2020-07-14 2020-08-08
2 1 2000 2020-11-15 2020-12-10
3 1 4649 2021-09-08 2021-10-12
4 1 9822 2021-02-06 2021-02-25
5 1 10394 2020-11-29 2020-12-31
6 1 4783 2020-05-30 2020-06-12
7 1 5796 2020-11-04 2020-11-23
8 1 6583 2021-01-10 2021-02-13
9 1 8721 2020-04-14 2020-04-23
10 1 2386 2021-01-29 2021-02-24
# ℹ more rows
In defining the cohort above we needed to provide concept IDs. But where do these come from?
We can search for codes of interest using the CodelistGenerator package. This can be done using a text search with the function CodelistGenerator::getCandidateCodes(). For example, we could have found the code we used above (and many others) like so:
getCandidateCodes(cdm = cdm,
keywords = c("coronavirus","covid"),
domains = "condition",
includeDescendants = TRUE)
Limiting to domains of interest
Getting concepts to include
Adding descendants
Search completed. Finishing up.
✔ 37 candidate concepts identified
Time taken: 0 minutes and 1 seconds
# A tibble: 37 × 6
concept_id found_from concept_name domain_id vocabulary_id standard_concept
<int> <chr> <chr> <chr> <chr> <chr>
1 3656668 From initia… Conjunctivi… Condition SNOMED S
2 37310254 From initia… Otitis medi… Condition SNOMED S
3 37310287 From initia… Myocarditis… Condition SNOMED S
4 37310283 From initia… Gastroenter… Condition SNOMED S
5 3661631 From initia… Lymphocytop… Condition SNOMED S
6 3661748 From initia… Acute kidne… Condition SNOMED S
7 3655976 From initia… Acute hypox… Condition SNOMED S
8 3655977 From initia… Rhabdomyoly… Condition SNOMED S
9 705076 From initia… Post-acute … Condition OMOP Extensi… S
10 1340294 From initia… Exacerbatio… Condition OMOP Extensi… S
# ℹ 27 more rows
We can also do automated searches that make use of the hierarchies in the vocabularies. Here, for example, we find the code for the drug ingredient acetaminophen and all of its descendants.
getDrugIngredientCodes(cdm = cdm, name = "acetaminophen")
── 1 codelist ──────────────────────────────────────────────────────────────────
- 161_acetaminophen (25747 codes)
Note that in practice clinical expertise is vital when identifying codes, so as to decide which codes are in line with the clinical idea at hand.
We can see that as well as having the cohort entries above, our cohort table is associated with several attributes.
First, we can see the settings associated with the cohort.
settings(cdm$covid) |>
glimpse()
Rows: 1
Columns: 4
$ cohort_definition_id <int> 1
$ cohort_name <chr> "covid"
$ cdm_version <chr> "5.3"
$ vocabulary_version <chr> "v5.0 22-JUN-22"
Second, we can get counts of the cohort.
cohortCount(cdm$covid) |>
glimpse()
Rows: 1
Columns: 3
$ cohort_definition_id <int> 1
$ number_records <int> 964
$ number_subjects <int> 964
And last we can see attrition related to the cohort.
attrition(cdm$covid) |>
glimpse()
Rows: 4
Columns: 7
$ cohort_definition_id <int> 1, 1, 1, 1
$ number_records <int> 964, 964, 964, 964
$ number_subjects <int> 964, 964, 964, 964
$ reason_id <int> 1, 2, 3, 4
$ reason <chr> "Initial qualifying events", "Record start <= rec…
$ excluded_records <int> 0, 0, 0, 0
$ excluded_subjects <int> 0, 0, 0, 0
As we will see below these attributes of the cohorts become particularly useful as we apply further restrictions on our cohort.
8.4 Applying inclusion criteria
8.4.1 Only include first cohort entry per person
Let’s say we first want to restrict the cohort to each person’s first entry.
cdm$covid <- cdm$covid |>
  requireIsFirstEntry()
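Conceptually, restricting to the first entry keeps, for each person in each cohort, the record with the earliest start date. The sketch below illustrates that idea with plain dplyr on an invented local tibble (toy_cohort and first_entries are hypothetical names). It is not how CohortConstructor implements requireIsFirstEntry(), which works on the database table and also updates the cohort attributes.

```r
library(dplyr)

# Toy cohort table: subject 1 has two entries, subject 2 has one
toy_cohort <- tibble(
  cohort_definition_id = c(1L, 1L, 1L),
  subject_id           = c(1L, 1L, 2L),
  cohort_start_date    = as.Date(c("2020-05-01", "2021-01-10", "2020-11-04")),
  cohort_end_date      = as.Date(c("2020-05-20", "2021-02-01", "2020-11-23"))
)

# Keep only the earliest entry per person within each cohort
first_entries <- toy_cohort |>
  group_by(cohort_definition_id, subject_id) |>
  slice_min(cohort_start_date, n = 1, with_ties = FALSE) |>
  ungroup()

first_entries
```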
8.4.2 Restrict to study period
cdm$covid <- cdm$covid |>
  requireInDateRange(dateRange = c(as.Date("2020-09-01"), NA))
8.4.3 Applying demographic inclusion criteria
Say for our study we want to include only males who were aged 18 to 64 at the time of their cohort entry. The function CohortConstructor::requireDemographics() applies these restrictions and updates our cohort attributes, as we will see below.
cdm$covid <- cdm$covid |>
  requireDemographics(ageRange = c(18, 64), sex = "Male")
8.4.4 Applying cohort-based inclusion criteria
As well as requirements about specific demographics, we may also want to use another cohort for inclusion criteria. Let’s say we want to exclude anyone with a history of cardiac conditions before their Covid-19 cohort entry.
We can first generate this new cohort table with records of cardiac conditions.
cdm$cardiac <- conceptCohort(
  cdm = cdm,
  conceptSet = list("myocardial_infarction" = c(
    317576, 313217, 321042, 4329847
  )), name = "cardiac"
)
cdm$cardiac
# Source: table<cardiac> [?? x 4]
# Database: DuckDB v1.2.1 [unknown@Linux 6.11.0-1012-azure:R 4.5.0//tmp/RtmprVfhda/file2f0c37a9187.duckdb]
cohort_definition_id subject_id cohort_start_date cohort_end_date
<int> <int> <date> <date>
1 1 502 2022-10-30 2022-10-30
2 1 4403 1990-05-05 1990-05-05
3 1 8464 1968-04-03 1968-04-03
4 1 9181 1993-11-01 1993-11-01
5 1 9737 2017-09-02 2017-09-02
6 1 10639 2011-12-06 2011-12-06
7 1 777 2021-08-30 2021-08-30
8 1 1761 2015-10-02 2015-10-02
9 1 3334 1990-02-19 1990-02-19
10 1 5010 2021-09-30 2021-09-30
# ℹ more rows
And now we can apply the inclusion criteria that individuals have zero intersections with the table in the time prior to their Covid-19 cohort entry.
cdm$covid <- cdm$covid |>
  requireCohortIntersect(targetCohortTable = "cardiac",
                         indexDate = "cohort_start_date",
                         window = c(-Inf, -1),
                         intersections = 0)
Note that if we had instead wanted to require that individuals did have a history of a cardiac condition, we would have set intersections = c(1, Inf) above.
8.5 Cohort attributes
We can see that the attributes of the cohort were updated as we applied the inclusion criteria.
settings(cdm$covid) |>
glimpse()
Rows: 1
Columns: 8
$ cohort_definition_id <int> 1
$ cohort_name <chr> "covid"
$ age_range <chr> "18_64"
$ sex <chr> "Male"
$ min_prior_observation <dbl> 0
$ min_future_observation <dbl> 0
$ cdm_version <chr> "5.3"
$ vocabulary_version <chr> "v5.0 22-JUN-22"
cohortCount(cdm$covid) |>
glimpse()
Rows: 1
Columns: 3
$ cohort_definition_id <int> 1
$ number_records <int> 158
$ number_subjects <int> 158
attrition(cdm$covid) |>
glimpse()
Rows: 11
Columns: 7
$ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
$ number_records <int> 964, 964, 964, 964, 964, 793, 363, 171, 171, 171,…
$ number_subjects <int> 964, 964, 964, 964, 964, 793, 363, 171, 171, 171,…
$ reason_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
$ reason <chr> "Initial qualifying events", "Record start <= rec…
$ excluded_records <int> 0, 0, 0, 0, 0, 171, 430, 192, 0, 0, 13
$ excluded_subjects <int> 0, 0, 0, 0, 0, 171, 430, 192, 0, 0, 13
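Each row of the attrition table reports the records remaining after a step, and excluded_records is the drop from the previous row. As a quick check of that arithmetic, the sketch below re-enters by hand the number_records values printed above. The final value is truncated in the glimpse() output, so 158 is taken from the cohortCount() result.

```r
# Records remaining after each attrition step (copied from the output above;
# the last value, 158, comes from cohortCount() because glimpse() truncates it)
number_records <- c(964, 964, 964, 964, 964, 793, 363, 171, 171, 171, 158)

# Records excluded at each step: nothing is excluded at the first step,
# then each step excludes the drop from the previous row
excluded_records <- c(0, -diff(number_records))

excluded_records
```

The result can be compared directly with the excluded_records column shown above.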
For attrition, we can use CohortCharacteristics::summariseCohortAttrition() and then CohortCharacteristics::plotCohortAttrition() to better view the impact of applying the additional inclusion criteria.
attrition_summary <- summariseCohortAttrition(cdm$covid)
plotCohortAttrition(attrition_summary, type = 'png')
9 Further reading
- …