9 Working with cohorts

9.1 Cohort intersections

When conducting research, it is often necessary to study patients who meet multiple clinical criteria simultaneously. For example, we may be interested in analysing outcomes among patients who have both diabetes and hypertension. Using the OMOP CDM, this typically involves first creating two separate cohorts: one for patients with diabetes and another for those with hypertension. To identify patients who meet both conditions, the next step is to compute the intersection of these cohorts. This ensures that the final study population includes only individuals who satisfy all specified criteria. Hence, finding cohort intersections is a common and essential task when working with the OMOP CDM, enabling researchers to define precise target populations that align with their research objectives.

Depending on the research question, the definition of a cohort intersection may vary. For instance, you might require patients to have a diagnosis of hypertension before developing diabetes, or that both diagnoses occur within a specific time window. These additional temporal or clinical criteria can make cohort intersection more complex. The PatientProfiles R package addresses these challenges by providing a suite of flexible functions to support the calculation of cohort intersections under various scenarios.
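To make these variants concrete, here is a minimal base R sketch on made-up data. The two toy tables, their column names, and the 365-day threshold are assumptions chosen for illustration only; they are not OMOP cohort tables and this is not how PatientProfiles computes intersections.

```r
# Two toy cohorts (made-up data): one row per person with a diagnosis date
diabetes <- data.frame(
  subject_id = c(1L, 2L, 3L),
  diabetes_date = as.Date(c("2010-05-01", "2012-03-15", "2015-07-20"))
)
hypertension <- data.frame(
  subject_id = c(2L, 3L, 4L),
  hypertension_date = as.Date(c("2011-01-10", "2016-02-01", "2009-09-09"))
)

# Plain intersection: people present in both cohorts
both <- merge(diabetes, hypertension, by = "subject_id")

# Temporal variant: hypertension diagnosed before diabetes
hypertension_first <- both[both$hypertension_date < both$diabetes_date, ]

# Time-window variant: both diagnoses within 365 days of each other
gap_days <- abs(as.numeric(
  difftime(both$diabetes_date, both$hypertension_date, units = "days")
))
within_year <- both[gap_days <= 365, ]
```

The same pair of cohorts yields three different study populations depending on the definition, which is why the intersection criteria need to be stated explicitly.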

9.2 Intersection between two cohorts

Suppose we are interested in studying patients with gastrointestinal (GI) bleeding who have also been exposed to acetaminophen. First, we would create two separate cohorts: one for patients with GI bleeding and another for patients with exposure to acetaminophen. Below is an example of the code used to create these cohorts within the GiBleed synthetic database. A characterisation of this dataset can be found here.

library(omock)
library(dplyr)
library(PatientProfiles)
library(CohortConstructor)
library(omopgenerics) # TODO https://github.com/OHDSI/omock/issues/189 

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")

# gi_bleed contains all records of gi bleed, end date is 30 days after index 
# date
cdm$gi_bleed <- conceptCohort(
  cdm = cdm,
  conceptSet = list("gi_bleed" = 192671L), 
  name = "gi_bleed", 
  exit = "event_start_date"
) |>
  padCohortEnd(days = 30)

# drugs cohort contains records of acetaminophen using start and end dates of 
# the drug records and collapsing records separated by less than 30 days
cdm$drugs <- conceptCohort(
  cdm = cdm,
  conceptSet = list("acetaminophen" = c(
    1125315L, 1127078L, 1127433L, 40229134L, 40231925L, 40162522L, 19133768L
  )), 
  name = "drugs", 
  exit = "event_end_date"
) |>
  collapseCohorts(gap = 30)

The PatientProfiles package contains functions to obtain the intersection flag, count, date, or number of days between cohorts. To get a binary indicator showing the presence of an intersection between the cohorts within a given time window, we can use addCohortIntersectFlag().

9.2.1 Flag

x <- cdm$gi_bleed |>
  addCohortIntersectFlag(
    targetCohortTable = "drugs",
    window = list("prior" = c(-Inf, -1), "index" = c(0, 0), "post" = c(1, Inf))
  )

x |>
  summarise(
    acetaminophen_prior = sum(acetaminophen_prior, na.rm = TRUE),
    acetaminophen_index = sum(acetaminophen_index, na.rm = TRUE),
    acetaminophen_post = sum(acetaminophen_post, na.rm = TRUE)
  ) |>
  collect()

# A tibble: 1 × 3
  acetaminophen_prior acetaminophen_index acetaminophen_post
                <dbl>               <dbl>              <dbl>
1                 467                   1                315
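The logic behind a windowed flag or count can be sketched in base R for a single person. The helper `count_in_window()` and the dates are hypothetical, and the sketch assumes that a target record intersects a window when the record's start-to-end interval overlaps the window (expressed in days relative to the index date). It is an illustration of the idea, not the PatientProfiles implementation.

```r
# Hypothetical helper (not part of PatientProfiles): count the target records
# whose [start, end] interval overlaps a window given in days from the index
count_in_window <- function(index_date, target_start, target_end, window) {
  window_start <- as.numeric(index_date) + window[1] # may be -Inf
  window_end <- as.numeric(index_date) + window[2]   # may be Inf
  overlaps <- as.numeric(target_start) <= window_end &
    as.numeric(target_end) >= window_start
  sum(overlaps)
}

# Made-up index date and drug records for one person
index_date <- as.Date("2010-06-15")
drug_start <- as.Date(c("2009-01-01", "2010-06-10", "2011-03-01"))
drug_end <- as.Date(c("2009-01-14", "2010-06-14", "2011-03-10"))

windows <- list(prior = c(-Inf, -1), index = c(0, 0), post = c(1, Inf))
counts <- sapply(windows, function(w) {
  count_in_window(index_date, drug_start, drug_end, w)
})

# A flag is simply "count greater than zero"
flags <- as.numeric(counts > 0)
names(flags) <- names(counts)
```

In this toy example two records fall before the index date, none covers the index date itself, and one falls after it, so the counts are 2, 0 and 1 while the flags are 1, 0 and 1.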

Window naming

Windows work much like the age groups we saw earlier. If a name is not provided, one is generated automatically from the window limits:

cdm$gi_bleed |>
  addCohortIntersectFlag(
    targetCohortTable = "drugs",
    window = list(c(-Inf, -1), c(0, 0), c(1, Inf))
  ) |>
  glimpse()

Rows: ??
Columns: 7
Database: DuckDB 1.4.0 [unknown@Linux 6.11.0-1018-azure:R 4.4.1//tmp/RtmprUA31E/file39391caa398f.duckdb]
$ cohort_definition_id     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ subject_id               <int> 2541, 3719, 3658, 5252, 1118, 915, 3540, 5245…
$ cohort_start_date        <date> 2017-12-17, 2003-11-03, 1999-09-01, 1995-12-…
$ cohort_end_date          <date> 2018-01-16, 2003-12-03, 1999-10-01, 1996-01-…
$ acetaminophen_minf_to_m1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ acetaminophen_0_to_0     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ acetaminophen_1_to_inf   <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, …

Note that, to avoid conflicts in column naming, all names are converted to lower case, spaces are not allowed, and the - symbol for negative values is replaced by m. For this reason it is usually better to provide your own custom names:

cdm$gi_bleed |>
  addCohortIntersectFlag(
    targetCohortTable = "drugs",
    window = list("prior" = c(-Inf, -1), "index" = c(0, 0), "post" = c(1, Inf))
  ) |>
  glimpse()

Rows: ??
Columns: 7
Database: DuckDB 1.4.0 [unknown@Linux 6.11.0-1018-azure:R 4.4.1//tmp/RtmprUA31E/file39391caa398f.duckdb]
$ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ subject_id           <int> 2541, 3719, 3658, 5252, 1118, 915, 3540, 5245, 27…
$ cohort_start_date    <date> 2017-12-17, 2003-11-03, 1999-09-01, 1995-12-19, …
$ cohort_end_date      <date> 2018-01-16, 2003-12-03, 1999-10-01, 1996-01-18, …
$ acetaminophen_post   <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ acetaminophen_prior  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ acetaminophen_index  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
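The automatic naming rule described above can be reproduced with a few lines of base R. The helper `window_name()` is hypothetical and written only to mirror the names seen in the output (for example `minf_to_m1`); the package's internal code may differ.

```r
# Hypothetical helper that mimics the automatic window names shown above
window_name <- function(window) {
  label <- function(x) {
    # infinite limits are written as "inf", finite ones as their absolute value
    value <- if (is.infinite(x)) "inf" else as.character(abs(x))
    # the minus sign is replaced by the letter "m"
    if (x < 0) paste0("m", value) else value
  }
  paste0(label(window[1]), "_to_", label(window[2]))
}

windows <- list(c(-Inf, -1), c(0, 0), c(1, Inf))
auto_names <- vapply(windows, window_name, character(1))
auto_names
```

The three resulting names are `minf_to_m1`, `0_to_0` and `1_to_inf`, matching the column suffixes in the unnamed-window example.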

New column naming

By default, new columns are named ‘{cohort_name}_{window_name}’, as we have seen in the previous examples. In some cases you only have one cohort or one window and may want to name the column differently. In that case, you can use the nameStyle argument to change the naming of the new columns:

cdm$gi_bleed |>
  addCohortIntersectFlag(
    targetCohortTable = "drugs",
    window = list("prior" = c(-Inf, -1), "index" = c(0, 0), "post" = c(1, Inf)),
    nameStyle = "my_column_{window_name}"
  ) |>
  glimpse()

Rows: ??
Columns: 7
Database: DuckDB 1.4.0 [unknown@Linux 6.11.0-1018-azure:R 4.4.1//tmp/RtmprUA31E/file39391caa398f.duckdb]
$ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ subject_id           <int> 2541, 3719, 3658, 5252, 1118, 915, 3540, 5245, 27…
$ cohort_start_date    <date> 2017-12-17, 2003-11-03, 1999-09-01, 1995-12-19, …
$ cohort_end_date      <date> 2018-01-16, 2003-12-03, 1999-10-01, 1996-01-18, …
$ my_column_post       <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ my_column_prior      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ my_column_index      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

If multiple windows are provided but ‘{window_name}’ is not included in nameStyle, an error is raised:

cdm$gi_bleed |>
  addCohortIntersectFlag(
    targetCohortTable = "drugs",
    window = list("prior" = c(-Inf, -1), "index" = c(0, 0), "post" = c(1, Inf)),
    nameStyle = "my_new_column"
  ) |>
  glimpse()

Error in `.addIntersect()`:
! The following elements are not present in nameStyle:
• {window_name}

Many functions that create new columns (usually named add*()) have this nameStyle argument, which lets you control the names of the new columns.
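One way to picture what nameStyle does is as a template that is filled in once per cohort and window combination. The helper `expand_name_style()` below is a hypothetical base R sketch of that idea, including the check that produces an error when several windows would otherwise share one column name. It is an assumption about the behaviour, not the package's actual code.

```r
# Hypothetical helper (not the PatientProfiles implementation): fill in a
# nameStyle template for every cohort/window combination
expand_name_style <- function(name_style, cohort_names, window_names) {
  # With several windows the template must contain {window_name}, otherwise
  # different windows would be written to the same column
  has_window <- grepl("{window_name}", name_style, fixed = TRUE)
  if (length(window_names) > 1 && !has_window) {
    stop("The following elements are not present in nameStyle: {window_name}")
  }
  combos <- expand.grid(
    cohort_name = cohort_names,
    window_name = window_names,
    stringsAsFactors = FALSE
  )
  out <- character(nrow(combos))
  for (i in seq_len(nrow(combos))) {
    x <- gsub("{cohort_name}", combos$cohort_name[i], name_style, fixed = TRUE)
    out[i] <- gsub("{window_name}", combos$window_name[i], x, fixed = TRUE)
  }
  out
}

window_names <- c("prior", "index", "post")
default_names <- expand_name_style(
  "{cohort_name}_{window_name}", "acetaminophen", window_names
)
custom_names <- expand_name_style(
  "my_column_{window_name}", "acetaminophen", window_names
)
# A template without {window_name} fails when there are several windows
failed <- tryCatch(
  {
    expand_name_style("my_new_column", "acetaminophen", window_names)
    FALSE
  },
  error = function(e) TRUE
)
```

With a single window, a fixed template such as "my_new_column" is unambiguous, which is why the check only applies when more than one window is supplied.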

9.2.2 Count

To get the count of occurrences of intersection between two cohorts, we can use addCohortIntersectCount():

x <- cdm$gi_bleed |>
  addCohortIntersectCount(
    targetCohortTable = "drugs",
    window = list("prior" = c(-Inf, -1), "index" = c(0, 0), "post" = c(1, Inf))
  )

x |>
  summarise(
    sum_prior = sum(acetaminophen_prior, na.rm = TRUE),
    mean_prior = mean(acetaminophen_prior, na.rm = TRUE),
    sum_index = sum(acetaminophen_index, na.rm = TRUE),
    mean_index = mean(acetaminophen_index, na.rm = TRUE),
    sum_post = sum(acetaminophen_post, na.rm = TRUE),
    mean_post = mean(acetaminophen_post, na.rm = TRUE)
  ) |>
  collect()

# A tibble: 1 × 6
  sum_prior mean_prior sum_index mean_index sum_post mean_post
      <dbl>      <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1      1669       3.48         1    0.00209      758      1.58

Handling the observation period

Note that only intersections in the current observation period are considered.

The new count and flag columns can also contain NA values, meaning that the individual was not in observation during that window of interest. For example, individual 2070 has 3748 days of future observation:

cdm$gi_bleed |>
  filter(subject_id == 2070) |>
  addFutureObservation() |>
  glimpse()

Rows: ??
Columns: 5
Database: DuckDB 1.4.0 [unknown@Linux 6.11.0-1018-azure:R 4.4.1//tmp/RtmprUA31E/file39391caa398f.duckdb]
$ cohort_definition_id <int> 1
$ subject_id           <int> 2070
$ cohort_start_date    <date> 2008-08-15
$ cohort_end_date      <date> 2008-09-14
$ future_observation   <int> 3748

Now we will perform the intersection with the following windows of interest: c(2000, 3000), c(3000, 4000), and c(4000, 5000).

cdm$gi_bleed |>
  filter(subject_id == 2070) |>
  addCohortIntersectCount(
    targetCohortTable = "drugs",
    window = list(c(2000, 3000), c(3000, 4000), c(4000, 5000))
  ) |>
  glimpse()

Rows: ??
Columns: 7
Database: DuckDB 1.4.0 [unknown@Linux 6.11.0-1018-azure:R 4.4.1//tmp/RtmprUA31E/file39391caa398f.duckdb]
$ cohort_definition_id       <int> 1
$ subject_id                 <int> 2070
$ cohort_start_date          <date> 2008-08-15
$ cohort_end_date            <date> 2008-09-14
$ acetaminophen_2000_to_3000 <dbl> 0
$ acetaminophen_3000_to_4000 <dbl> 0
$ acetaminophen_4000_to_5000 <dbl> NA

For the window 2000 to 3000, where the individual is still in observation, a 0 is reported. The same happens for the window 3000 to 4000, even though the individual is not in observation for the whole window. For the last window, however, the individual is not in observation at any point, so NA is reported.
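The three outcomes (0, 0 and NA) follow from a simple rule that can be sketched in base R. The helper `count_in_observation()` is hypothetical. The 3748 days of future observation come from the example above, whereas the 1000 days of prior observation and the event days are made-up assumptions, and events are simplified to single days rather than intervals.

```r
# Hypothetical helper: NA when the window lies entirely outside the
# observation period, otherwise a count restricted to the observed part
count_in_observation <- function(event_days, window, obs_start, obs_end) {
  if (window[1] > obs_end || window[2] < obs_start) {
    return(NA_real_)
  }
  lower <- max(window[1], obs_start)
  upper <- min(window[2], obs_end)
  as.numeric(sum(event_days >= lower & event_days <= upper))
}

# Days relative to the index date; negative values are before the index date
event_days <- c(-400, -30, 12) # made-up event days
obs_start <- -1000             # assumed prior observation (not from the text)
obs_end <- 3748                # future observation of individual 2070

windows <- list(c(2000, 3000), c(3000, 4000), c(4000, 5000))
counts <- sapply(windows, function(w) {
  count_in_observation(event_days, w, obs_start, obs_end)
})
prior_count <- count_in_observation(event_days, c(-Inf, -1), obs_start, obs_end)
```

The first window is fully observed and the second only partly, so both return 0, while the third starts after day 3748 and returns NA. An unbounded window such as c(-Inf, -1) is truncated to the observed period, which here gives a count of 2.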

9.2.3 Date and times

To get the date of the intersection with a cohort within a given time window, we can use addCohortIntersectDate(). To get the number of days between the index date and the intersection, we can use addCohortIntersectDays().

Both functions allow the order argument to specify which value to return:

"first" returns the first date/days that satisfy the window.
"last" returns the last date/days that satisfy the window.

x <- cdm$gi_bleed |>
  addCohortIntersectDate(
    targetCohortTable = "drugs",
    window = list("post" = c(1, Inf)),
    order = "first"
  )

x |>
  summarise(acetaminophen_post = median(acetaminophen_post, na.rm = TRUE)) |>
  collect()

# A tibble: 1 × 1
  acetaminophen_post 
  <dttm>             
1 2004-02-01 00:00:00

x <- cdm$gi_bleed |>
  addCohortIntersectDays(
    targetCohortTable = "drugs",
    window = list("prior" = c(-Inf, -1)),
    order = "last"
  )

x |>
  summarise(acetaminophen_prior = median(acetaminophen_prior, na.rm = TRUE)) |>
  collect()

# A tibble: 1 × 1
  acetaminophen_prior
                <dbl>
1               -3159

Note that for the window in the future we used order = "first" and for the window in the past we used order = "last", because in both cases we wanted the intersection closest to the index date. Individuals with no intersection will have NA in the newly created columns.
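The effect of the order argument can be illustrated for one person in base R. The helper `pick_days()` and the dates are hypothetical, and the sketch assumes that only the start date of each target record is used.

```r
# Made-up index date and target record start dates for one person
index_date <- as.Date("2010-06-15")
target_dates <- as.Date(
  c("2005-02-01", "2010-01-20", "2010-08-01", "2012-11-30")
)

# Days from the index date to each target record (negative = before index)
days <- as.numeric(difftime(target_dates, index_date, units = "days"))

# Hypothetical helper: first or last value of days that falls in the window
pick_days <- function(days, window, order = c("first", "last")) {
  order <- match.arg(order)
  in_window <- days[days >= window[1] & days <= window[2]]
  if (length(in_window) == 0) {
    return(NA_real_) # no intersection in this window
  }
  if (order == "first") min(in_window) else max(in_window)
}

# Closest record after the index date: first one in the future window
days_post <- pick_days(days, c(1, Inf), order = "first")
# Closest record before the index date: last one in the past window
days_prior <- pick_days(days, c(-Inf, -1), order = "last")

# The corresponding dates are the index date shifted by those days
date_post <- index_date + days_post
date_prior <- index_date + days_prior
```

Swapping the two order values would instead return the records furthest from the index date (2012-11-30 and 2005-02-01 in this toy example).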

9.3 Intersection between a cohort and tables with patient data

Sometimes we might want to get the intersection between a cohort and another OMOP table. PatientProfiles also includes several addTableIntersect* functions to obtain intersection flags, counts, days, or dates between a cohort and clinical tables.

For example, if we want to get the number of general practitioner (GP) visits for individuals in the cohort, we can use the visit_occurrence table:

x <- cdm$gi_bleed |>
  addTableIntersectCount(
    tableName = "visit_occurrence",
    window = list(c(-Inf, -1)),
    nameStyle = "number_visits"
  )

x |>
  summarise(visit_occurrence_prior = median(number_visits, na.rm = TRUE)) |>
  collect()

# A tibble: 1 × 1
  visit_occurrence_prior
                   <dbl>
1                      0
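Conceptually, a table intersection joins each cohort record to the rows of another table for the same person and then applies the window. The base R sketch below uses made-up toy tables whose column names follow OMOP conventions. It ignores observation periods and is only an illustration, not how PatientProfiles runs the query.

```r
# Toy cohort and visit tables (made-up data)
cohort <- data.frame(
  subject_id = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2015-03-01", "2016-07-10", "2018-01-05"))
)
visits <- data.frame(
  subject_id = c(1L, 1L, 1L, 2L),
  visit_start_date = as.Date(
    c("2014-01-10", "2015-02-20", "2015-06-01", "2017-01-01")
  )
)

# Keep every cohort record, even for people without any visit
joined <- merge(cohort, visits, by = "subject_id", all.x = TRUE)

# Window c(-Inf, -1): visits starting strictly before the index date
joined$is_prior <- !is.na(joined$visit_start_date) &
  joined$visit_start_date < joined$cohort_start_date

number_visits <- aggregate(is_prior ~ subject_id, data = joined, FUN = sum)
names(number_visits)[2] <- "number_visits"
number_visits
```

Person 1 has two visits before the index date, person 2 only has a visit afterwards, and person 3 has no visits at all, so the counts are 2, 0 and 0.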

9.4 Disconnecting

Once we have finished our analysis we can close our connection to the database behind our cdm reference.

cdmDisconnect(cdm)

9.5 Further reading

Full details on the intersection functions in PatientProfiles can be found on the package website: https://darwin-eu.github.io/PatientProfiles/.