5  Creating a cdm reference

5.1 The OMOP CDM layout

The OMOP CDM standardises the structure of health care data. Data is stored across a system of tables with established relationships between them. In other words, the OMOP CDM provides a relational database structure, with version 5.4 of the OMOP CDM shown below.

5.2 Creating a reference to the OMOP common data model

As we saw in Chapter 4, creating a data model in R to represent the OMOP CDM can provide a basis for analytic pipelines using the data. Luckily for us, we won’t have to create functions and methods for this ourselves. Instead we will use the omopgenerics package, which defines a data model for OMOP CDM data, and the CDMConnector package, which provides functions for connecting to OMOP CDM data held in a database.

To see how this works, we will use the omock package to create example data in the format of the OMOP CDM, which we will then copy to a duckdb database.

library(DBI)
library(here)
library(dplyr)
library(omock)
library(omopgenerics)
library(CDMConnector)

cdm_local <- omock::mockCdmReference() |>
    omock::mockPerson(nPerson = 100) |>
    omock::mockObservationPeriod() |>
    omock::mockConditionOccurrence() |>
    omock::mockDrugExposure() |>
    omock::mockObservation() |>
    omock::mockMeasurement() |>
    omock::mockVisitOccurrence() |>
    omock::mockProcedureOccurrence()

db <- DBI::dbConnect(duckdb::duckdb())
CDMConnector::copyCdmTo(con = db,
                        cdm = cdm_local,
                        schema = "main", 
                        overwrite = TRUE)

Now that we have OMOP CDM data in a database, we can use cdmFromCon() to create our cdm reference. Note that as well as specifying the schema containing our OMOP CDM tables, we also specify a write schema, where any database tables we create during our analysis will be stored. Often our OMOP CDM tables will be in a schema to which we only have read access, with a separate schema where we have write access and can create intermediate tables for a given study.

cdm <- cdmFromCon(db, 
                  cdmSchema = "main", 
                  writeSchema = "main",
                  cdmName = "example_data")
cdm
── # OMOP CDM reference (duckdb) of example_data ───────────────────────────────
• omop tables: person, observation_period, visit_occurrence,
condition_occurrence, drug_exposure, procedure_occurrence, measurement,
observation, cdm_source, concept, vocabulary, concept_relationship,
concept_synonym, concept_ancestor, drug_strength
• cohort tables: -
• achilles tables: -
• other tables: -

We can also specify a write prefix, which will be used whenever permanent tables are created in the write schema. This can be useful when we’re sharing our write schema with others: it helps us avoid table name conflicts and makes it easy to drop the tables created as part of a particular study.

cdm <- cdmFromCon(con = db,
                  cdmSchema = "main", 
                  writeSchema = "main", 
                  writePrefix = "my_study_",
                  cdmName = "example_data")
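With this prefix set, any permanent table we compute is stored in the write schema under a prefixed name. As a sketch, assuming the cdm reference created above (the table name "denominator" is purely illustrative):

```r
# Because writePrefix = "my_study_" was set, this permanent table would
# be stored in the write schema as "my_study_denominator", while in R we
# still refer to it as cdm$denominator.
cdm$denominator <- cdm$person |>
  dplyr::filter(year_of_birth >= 1970) |>
  compute(name = "denominator", temporary = FALSE)
```

Dropping every table that starts with "my_study_" at the end of the study then removes everything we created, without touching anyone else’s tables.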

We can see that we now have an object that contains references to all the OMOP CDM tables. We can reference specific tables using the “$” or “[[ … ]]” operators.

cdm$person
# Source:   table<person> [?? x 18]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
   person_id gender_concept_id year_of_birth month_of_birth day_of_birth
       <int>             <int>         <int>          <int>        <int>
 1         1              8532          1968              3           27
 2         2              8532          1964              7           17
 3         3              8532          1983              1           22
 4         4              8532          1996             10           24
 5         5              8507          1974              8            2
 6         6              8532          1974             12           20
 7         7              8507          1960             10           27
 8         8              8532          1958              7           15
 9         9              8532          1971              3           15
10        10              8532          1954              5           22
# ℹ more rows
# ℹ 13 more variables: race_concept_id <int>, ethnicity_concept_id <int>,
#   birth_datetime <dttm>, location_id <int>, provider_id <int>,
#   care_site_id <int>, person_source_value <chr>, gender_source_value <chr>,
#   gender_source_concept_id <int>, race_source_value <chr>,
#   race_source_concept_id <int>, ethnicity_source_value <chr>,
#   ethnicity_source_concept_id <int>
cdm[["observation_period"]]
# Source:   table<observation_period> [?? x 5]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
   observation_period_id person_id observation_period_s…¹ observation_period_e…²
                   <int>     <int> <date>                 <date>                
 1                     1         1 2007-02-28             2019-11-01            
 2                     2         2 1997-10-29             2014-07-24            
 3                     3         3 1996-04-06             2011-05-20            
 4                     4         4 2017-03-18             2017-08-06            
 5                     5         5 1982-03-19             1985-08-02            
 6                     6         6 1986-02-23             2010-12-11            
 7                     7         7 2017-10-07             2017-12-26            
 8                     8         8 1976-06-28             1987-09-09            
 9                     9         9 2015-06-30             2019-12-07            
10                    10        10 1981-11-06             2004-05-24            
# ℹ more rows
# ℹ abbreviated names: ¹​observation_period_start_date,
#   ²​observation_period_end_date
# ℹ 1 more variable: period_type_concept_id <int>

5.3 CDM attributes

5.3.1 CDM name

Our cdm reference will be associated with a name. By default this name is taken from the cdm_source_name field of the cdm_source table.

cdm <- cdmFromCon(db,
  cdmSchema = "main", 
  writeSchema = "main")
cdm$cdm_source
# Source:   table<cdm_source> [?? x 10]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
  cdm_source_name cdm_source_abbreviation cdm_holder source_description
  <chr>           <lgl>                   <lgl>      <lgl>             
1 mock            NA                      NA         NA                
# ℹ 6 more variables: source_documentation_reference <lgl>,
#   cdm_etl_reference <lgl>, source_release_date <lgl>, cdm_release_date <lgl>,
#   cdm_version <dbl>, vocabulary_version <lgl>
cdmName(cdm)
[1] "mock"

However, we can instead set this name when creating our cdm reference.

cdm <- cdmFromCon(db,
  cdmSchema = "main", 
  writeSchema = "main", 
  cdmName = "my_cdm")
cdmName(cdm)
[1] "my_cdm"

Note that we can also get our cdm name from any of the tables in our cdm reference.

cdmName(cdm$person)
[1] "my_cdm"

The cdm reference itself has the class cdm_reference.

class(cdm)
[1] "cdm_reference"
class(cdm$person)
[1] "omop_table"            "cdm_table"             "tbl_duckdb_connection"
[4] "tbl_dbi"               "tbl_sql"               "tbl_lazy"             
[7] "tbl"                  

Meanwhile, each of the tables has the class cdm_table. If a table is one of the standard OMOP CDM tables it will also have the class “omop_table”. This latter class is useful when we want different behaviour for these core tables compared to other tables added to the cdm reference during the course of running a study.

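As a minimal sketch of what this dispatch makes possible (describeTable() here is a hypothetical helper, not part of omopgenerics), we could write a generic whose methods differ for standard and non-standard tables:

```r
# A hypothetical generic, just a sketch of how S3 dispatch on the
# extra "omop_table" class could be used.
describeTable <- function(x) UseMethod("describeTable")

# Standard OMOP CDM tables (e.g. cdm$person) hit this method first
describeTable.omop_table <- function(x) "a standard OMOP CDM table"

# Other cdm tables (e.g. cohort or helper tables) fall through here
describeTable.cdm_table <- function(x) "a table added to the cdm reference"
```

Calling describeTable(cdm$person) would dispatch to the omop_table method, while a table we add ourselves would dispatch to the cdm_table method.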

We can see that cdmName() is a generic function, which works for both the cdm reference as a whole and individual tables.

library(sloop)
s3_dispatch(cdmName(cdm))
=> cdmName.cdm_reference
 * cdmName.default
s3_dispatch(cdmName(cdm$person))
   cdmName.omop_table
=> cdmName.cdm_table
   cdmName.tbl_duckdb_connection
   cdmName.tbl_dbi
   cdmName.tbl_sql
   cdmName.tbl_lazy
   cdmName.tbl
 * cdmName.default

5.3.2 CDM version

We can also easily check the OMOP CDM version that is being used.

cdmVersion(cdm)
[1] "5.3"

5.4 Including cohort tables in the cdm reference

We’ll see how to create cohorts in more detail in a couple of chapters’ time. For the moment, let’s just see that we can include a cohort table in our cdm reference. For this we’ll use omock to add a cohort to our local cdm and upload it to a duckdb database again.

cdm_local <- cdm_local |> 
  omock::mockCohort(name = "my_study_cohort")
db <- DBI::dbConnect(duckdb::duckdb())
CDMConnector::copyCdmTo(con = db,
                        cdm = cdm_local,
                        schema = "main", 
                        overwrite = TRUE)
── # OMOP CDM reference (duckdb) of mock database ──────────────────────────────
• omop tables: person, observation_period, cdm_source, concept, vocabulary,
concept_relationship, concept_synonym, concept_ancestor, drug_strength,
condition_occurrence, drug_exposure, observation, measurement,
visit_occurrence, procedure_occurrence
• cohort tables: my_study_cohort
• achilles tables: -
• other tables: -

Now we can specify this existing cohort table when creating our cdm reference.

cdm <- cdmFromCon(db, 
                  cdmSchema = "main", 
                  writeSchema = "main",
                  cohortTables = "my_study_cohort",
                  cdmName = "example_data")
cdm
cdm$my_study_cohort |> 
  glimpse()
Rows: ??
Columns: 4
Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
$ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ subject_id           <int> 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 5, 5, 7, 9…
$ cohort_start_date    <date> 2007-06-10, 2008-05-02, 2008-08-14, 2008-09-16, …
$ cohort_end_date      <date> 2008-05-01, 2008-08-13, 2008-09-15, 2009-04-23, …

5.5 Including achilles tables in the cdm reference

If we have the results tables from the Achilles R package (https://ohdsi.github.io/Achilles/) in our database, we can also include them in our cdm reference.

Just to show how this can be done, let’s upload some empty results tables in the Achilles format.

DBI::dbWriteTable(db, 
                  "achilles_analysis",
                  tibble(
                    analysis_id = NA_integer_,
                    analysis_name = NA_character_,
                    stratum_1_name = NA_character_,
                    stratum_2_name = NA_character_,
                    stratum_3_name = NA_character_,
                    stratum_4_name = NA_character_,
                    stratum_5_name = NA_character_,
                    is_default = NA_character_,
                    category = NA_character_))
DBI::dbWriteTable(db, 
                  "achilles_results",
                  tibble(
                    analysis_id = NA_integer_,
                    stratum_1 = NA_character_,
                    stratum_2 = NA_character_,
                    stratum_3 = NA_character_,
                    stratum_4 = NA_character_,
                    stratum_5 = NA_character_,
                    count_value = NA_character_))
DBI::dbWriteTable(db, 
                  "achilles_results_dist",
                  tibble(
                    analysis_id = NA_integer_,
                    stratum_1 = NA_character_,
                    stratum_2 = NA_character_,
                    stratum_3 = NA_character_,
                    stratum_4 = NA_character_,
                    stratum_5 = NA_character_,
                    count_value = NA_character_,
                    min_value = NA_character_,
                    max_value = NA_character_,
                    avg_value = NA_character_,
                    stdev_value = NA_character_,
                    median_value = NA_character_,
                    p10_value = NA_character_,
                    p25_value = NA_character_,
                    p75_value = NA_character_,
                    p90_value = NA_character_))

We can now include these achilles tables in our cdm reference.

cdm <- cdmFromCon(db, 
                  cdmSchema = "main", 
                  writeSchema = "main",
                  cohortTables = "my_study_cohort",
                  achillesSchema = "main",
                  cdmName = "example_data")
cdm

5.6 Adding other tables to the cdm reference

Let’s say we have some local data that we want to add to our cdm reference. We can copy it to the same source as the rest of our cdm reference (in this case, a duckdb database) and add it to the reference using insertTable().

cars |> 
  glimpse()
Rows: 50
Columns: 2
$ speed <dbl> 4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13…
$ dist  <dbl> 2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, 34…
cdm <- insertTable(cdm = cdm, 
                   name = "cars", 
                   table = cars, 
                   temporary = FALSE)

We can see that this extra table has now been uploaded to the database behind our cdm reference and added to the reference.

cdm
── # OMOP CDM reference (duckdb) of example_data ───────────────────────────────
• omop tables: person, observation_period, visit_occurrence,
condition_occurrence, drug_exposure, procedure_occurrence, measurement,
observation, cdm_source, concept, vocabulary, concept_relationship,
concept_synonym, concept_ancestor, drug_strength
• cohort tables: my_study_cohort
• achilles tables: achilles_analysis, achilles_results, achilles_results_dist
• other tables: cars
cdm$cars
# Source:   table<cars> [?? x 2]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
   speed  dist
   <dbl> <dbl>
 1     4     2
 2     4    10
 3     7     4
 4     7    22
 5     8    16
 6     9    10
 7    10    18
 8    10    26
 9    10    34
10    11    17
# ℹ more rows

If we already had the table in the database, we could instead have simply assigned it to our existing cdm reference. To see this, let’s upload the penguins table to our duckdb database.

DBI::dbWriteTable(db, 
                  "penguins",
                  palmerpenguins::penguins)

Once we have this table in the database, we can just assign it to our cdm reference.

cdm$penguins <- tbl(db, "penguins")

cdm
── # OMOP CDM reference (duckdb) of example_data ───────────────────────────────
• omop tables: person, observation_period, visit_occurrence,
condition_occurrence, drug_exposure, procedure_occurrence, measurement,
observation, cdm_source, concept, vocabulary, concept_relationship,
concept_synonym, concept_ancestor, drug_strength
• cohort tables: my_study_cohort
• achilles tables: achilles_analysis, achilles_results, achilles_results_dist
• other tables: cars, penguins

5.7 Mutability of the cdm reference

An important characteristic of our cdm reference is that we can alter the tables in R, but the OMOP CDM data will not be affected.

For example, let’s say we want to perform a study with only people born in 1970. For this we could filter our person table to only people born in this year.

cdm$person <- cdm$person |> 
  filter(year_of_birth == 1970)

cdm$person
# Source:   SQL [?? x 18]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1        56              8532          1970              9            3
2        92              8507          1970              1           24
# ℹ 13 more variables: race_concept_id <int>, ethnicity_concept_id <int>,
#   birth_datetime <dttm>, location_id <int>, provider_id <int>,
#   care_site_id <int>, person_source_value <chr>, gender_source_value <chr>,
#   gender_source_concept_id <int>, race_source_value <chr>,
#   race_source_concept_id <int>, ethnicity_source_value <chr>,
#   ethnicity_source_concept_id <int>

From now on, when we work with our cdm reference this restriction will remain in place.

cdm$person |> 
    tally()
# Source:   SQL [?? x 1]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
      n
  <dbl>
1     2

The original OMOP CDM data itself, however, will remain unaffected, and if we create our reference again we can see that the underlying data is unchanged.

cdm <- cdmFromCon(con = db,
                  cdmSchema = "main", 
                  writeSchema = "main", 
                  cdmName = "Synthea Covid-19 data")
cdm$person |> 
    tally()
# Source:   SQL [?? x 1]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
      n
  <dbl>
1   100

The mutability of our cdm reference is a useful feature for studies as it means we can easily tweak our OMOP CDM data if needed. Meanwhile, leaving the underlying data unchanged is essential so that other study code can run against the data unaffected by any of our changes.

One thing we can’t do, though, is alter the structure of the OMOP CDM tables. For example, this code would cause an error, as the person table must contain the column person_id.

cdm$person <- cdm$person |> 
    rename("new_id" = "person_id")
Error in `newOmopTable()`:
! person_id is not present in table person

In such a case we would have to call the table something else.

cdm$person_new <- cdm$person |> 
    rename("new_id" = "person_id") |> 
    compute(name = "person_new", 
            temporary = TRUE)

We would now have this new table as an additional table in our cdm reference, recognised as not being in the format of one of the core OMOP CDM tables.

cdm
── # OMOP CDM reference (duckdb) of Synthea Covid-19 data ──────────────────────
• omop tables: person, observation_period, visit_occurrence,
condition_occurrence, drug_exposure, procedure_occurrence, measurement,
observation, cdm_source, concept, vocabulary, concept_relationship,
concept_synonym, concept_ancestor, drug_strength
• cohort tables: -
• achilles tables: -
• other tables: -

5.8 Working with temporary and permanent tables

When our cdm reference is backed by a database, we have a choice between temporary and permanent tables whenever we create new tables. In most cases we can work with these interchangeably. Below we create one temporary table and one permanent table. We can see that both of these tables have been added to our cdm reference and that we can use them in the same way.

cdm$person_new_temp <- cdm$person |> 
  head(5) |> 
  compute()
cdm$person_new_permanent <- cdm$person |> 
  head(5) |> 
  compute(name = "person_new_permanent", 
          temporary = FALSE)
cdm

cdm$person_new_temp
# Source:   table<og_001_1742636789> [?? x 18]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1         1              8532          1968              3           27
2         2              8532          1964              7           17
3         3              8532          1983              1           22
4         4              8532          1996             10           24
5         5              8507          1974              8            2
# ℹ 13 more variables: race_concept_id <int>, ethnicity_concept_id <int>,
#   birth_datetime <dttm>, location_id <int>, provider_id <int>,
#   care_site_id <int>, person_source_value <chr>, gender_source_value <chr>,
#   gender_source_concept_id <int>, race_source_value <chr>,
#   race_source_concept_id <int>, ethnicity_source_value <chr>,
#   ethnicity_source_concept_id <int>
cdm$person_new_permanent
# Source:   table<person_new_permanent> [?? x 18]
# Database: DuckDB v1.1.3 [eburn@Windows 10 x64:R 4.4.0/:memory:]
  person_id gender_concept_id year_of_birth month_of_birth day_of_birth
      <int>             <int>         <int>          <int>        <int>
1         1              8532          1968              3           27
2         2              8532          1964              7           17
3         3              8532          1983              1           22
4         4              8532          1996             10           24
5         5              8507          1974              8            2
# ℹ 13 more variables: race_concept_id <int>, ethnicity_concept_id <int>,
#   birth_datetime <dttm>, location_id <int>, provider_id <int>,
#   care_site_id <int>, person_source_value <chr>, gender_source_value <chr>,
#   gender_source_concept_id <int>, race_source_value <chr>,
#   race_source_concept_id <int>, ethnicity_source_value <chr>,
#   ethnicity_source_concept_id <int>

One benefit of working with temporary tables is that they will be automatically dropped at the end of the session, whereas permanent tables will be left in the database until explicitly dropped. One disadvantage of temporary tables, however, is that we will generally accumulate more and more of them as we go. With permanent tables we can instead overwrite them as we go: if we’re inside a loop that requires a compute, we could overwrite an intermediate permanent table 100 times rather than creating 100 temporary tables.
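As a sketch of that last point, assuming the cdm reference from above and that compute() supports overwriting (the table name "working_table" is purely illustrative):

```r
# Overwrite a single permanent intermediate table on each iteration,
# rather than accumulating one temporary table per iteration.
for (i in 1:10) {
  cdm$working_table <- cdm$person |>
    head(i) |>
    compute(name = "working_table",
            temporary = FALSE,
            overwrite = TRUE)
}
```

At the end of the loop only one extra table, working_table, exists in the write schema, and it can be dropped once we are done with it.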

5.9 Disconnecting

Once we have finished our analysis, we can close the connection to the database behind our cdm reference like so.

cdmDisconnect(cdm) 

5.10 Further reading