7 Conventions

This chapter documents the naming and structural conventions observed across the ecosystem. Many will feel familiar if you have already worked with the packages, but having them written down in one place makes it easier to apply them consistently in new code and to identify deviations worth addressing.

Where real inconsistencies exist across the current packages, they are noted — the goal is for new packages to conform to these conventions even when older ones pre-date them.

7.1 Casing

The ecosystem uses camelCase for all R identifiers: exported function names, unexported helper function names, argument names, and local variable names.

# Correct — camelCase throughout
generateIngredientCohortSet(cdm, name = "dus_cohort", gapEra = 7)
addCohortIntersectFlag(cohort, targetCohortTable = "cohort2", window = c(-Inf, -1))
summariseClinicalRecords(cdm, omopTableName = "condition_occurrence")

The camelCase rule is enforced by lintr:

lintr::lint_package(
  linters = lintr::linters_with_defaults(
    lintr::object_name_linter(styles = "camelCase")
  )
)

Exception: strings stored in data. Column names in database tables, variable_name values inside summarised_result objects, result_type strings, cohort names, and table names in the CDM are all snake_case. The camelCase rule applies to R object names and arguments, not to string values that end up in data or in the CDM.

# The R argument is camelCase...
addAge(cohort, ageGroup = list(c(0, 17), c(18, 64)))

# ...but values stored as data are snake_case:
# group_level  = "acetaminophen"
# variable_name = "number_subjects"
# result_type   = "summarised_characteristics"
# cohort name   = "dus_cohort"
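
The boundary between the two casings can be sketched in base R. The helpers below, `isCamelCase()` and `toSnakeCase()`, are hypothetical names used only for illustration and are not exported by any package in the ecosystem. They assume identifiers made of ASCII letters and digits, with no runs of consecutive capitals.

```r
# Hypothetical helpers for illustration; not part of any ecosystem package.

# TRUE when a name is camelCase: starts lower-case, letters and digits only.
isCamelCase <- function(x) {
  grepl("^[a-z][a-zA-Z0-9]*$", x)
}

# Convert a camelCase identifier to the snake_case form used for stored strings
# by inserting "_" before each capital that follows a lower-case letter or digit.
toSnakeCase <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

isCamelCase(c("ageGroup", "age_group", "AgeGroup"))
# TRUE FALSE FALSE

toSnakeCase(c("numberSubjects", "variableName"))
# "number_subjects" "variable_name"
```

A check like `isCamelCase()` belongs on R object and argument names only; strings that end up in tables or results are expected to look like the output of `toSnakeCase()`.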

7.2 Function naming

7.2.1 Verb prefixes

Every exported function begins with a standard verb prefix. The prefix is not merely a naming choice — it signals the return type, the side effects, and the role the function plays in a pipeline.

Prefix	Returns	Side effects	Representative examples
`generate*`	`cohort_table` written to CDM	Writes table to database	`generateConceptCohortSet()`, `generateIngredientCohortSet()`, `generateDenominatorCohortSet()`
`require*`	Modified `cohort_table`	May rewrite cohort table in DB; updates attrition	`requireIsFirstEntry()`, `requireAge()`, `requireSex()`, `requireCohortIntersect()`, `requireConceptIntersect()`
`add*`	Input table with extra columns appended	None	`addAge()`, `addSex()`, `addDemographics()`, `addCohortIntersectFlag()`, `addCohortIntersectCount()`, `addCohortIntersectDate()`, `addDrugUse()`
`summarise*`	`summarised_result`	None	`summariseCharacteristics()`, `summariseClinicalRecords()`, `summariseDrugUtilisation()`, `summariseCohortAttrition()`, `summariseCohortCount()`
`estimate*`	`summarised_result`	None	`estimateIncidence()`, `estimatePointPrevalence()`, `estimatePeriodPrevalence()`
`plot*`	`ggplot2` object	None	`plotRecordCount()`, `plotIncidence()`, `plotPrevalence()`, `plotCohortAttrition()`
`table*`	Formatted table (`gt`, `flextable`, etc.)	None	`tableCharacteristics()`, `tableClinicalRecords()`, `tableOmopSnapshot()`, `tableCohortCount()`
`mock*`	`cdm_reference` (in-memory)	None	`mockCohortConstructor()`, `mockPatientProfiles()`, `mockDrugUtilisation()`, `mockOmopSketch()`, `mockIncidencePrevalence()`
`get*`	Plain R object	None	`getDrugIngredientCodes()`, `getVocabVersion()`
`new*`	New instance of a class	None	`newCodelist()`, `newSummarisedResult()`, `newCdmSource()`
`compute*`	Materialised table	Writes temp table to DB	`computeTable()` (internal convention in PatientProfiles)

Note

estimate* is used in IncidencePrevalence rather than summarise* because incidence and prevalence are computed statistics with a specific analytic meaning (rates and proportions over time), not summaries in the general sense. New packages that compute similar analytic estimates should consider whether estimate* or summarise* better describes the result.

Note

require* is specific to CohortConstructor and means “apply this inclusion or exclusion criterion and update the cohort attrition”. Do not use require* for functions that merely check or validate inputs — use validate* or assert* for those.
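
Because the prefix carries meaning, it can be read mechanically from a function name. The following base R sketch encodes the table above as a lookup and extracts the verb from a name. `prefixReturns` and `verbPrefix()` are hypothetical names introduced here, not part of any package, and the return descriptions are shortened from the table.

```r
# Hypothetical lookup built from the prefix table; descriptions are abbreviated.
prefixReturns <- c(
  generate  = "cohort_table written to CDM",
  require   = "modified cohort_table",
  add       = "input table with extra columns",
  summarise = "summarised_result",
  estimate  = "summarised_result",
  plot      = "ggplot2 object",
  table     = "formatted table",
  mock      = "cdm_reference",
  get       = "plain R object",
  new       = "new instance of a class",
  compute   = "materialised table"
)

# Hypothetical helper: the leading run of lower-case letters is taken as the
# verb; NA is returned when that verb is not one of the standard prefixes.
verbPrefix <- function(functionName) {
  prefix <- sub("^([a-z]+).*$", "\\1", functionName)
  ifelse(prefix %in% names(prefixReturns), prefix, NA_character_)
}

verbPrefix(c("addAge", "estimateIncidence", "checkInput"))
# "add" "estimate" NA

unname(prefixReturns[verbPrefix("plotIncidence")])
# "ggplot2 object"
```

Run over the exported names of a package, a helper of this kind gives a quick list of functions whose names do not start with a standard verb.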

7.2.2 The `*CohortSet` suffix

Functions that generate multiple cohorts at once and return a full cohort_table in the CDM use the *CohortSet suffix:

generateConceptCohortSet(cdm, conceptSet = ..., name = "my_cohort")
generateIngredientCohortSet(cdm, ingredient = "acetaminophen", name = "dus_cohort")
generateDenominatorCohortSet(cdm, name = "denominator", ageGroup = list(...))

Functions that return a single cohort or modify an existing table typically do not use this suffix.

7.2.3 Noun component

The noun should describe what the function operates on or returns clearly enough that users can understand it without reading the documentation:

summariseClinicalRecords — summarises records in clinical OMOP tables
addCohortIntersectFlag — adds a flag column indicating cohort intersection
requireConceptIntersect — requires intersection with a concept set

Avoid abbreviations unless they are standard in the domain (e.g. Cdm, Omop). Spell out words fully: Incidence not Inc, Characteristics not Chars.

7.2.4 `mock*` functions

Every package should provide at least one mock* function so that examples and tests can run without a real database. The naming convention is mock{PackageName}():

mockPatientProfiles()        # PatientProfiles
mockCohortConstructor()      # CohortConstructor
mockDrugUtilisation(numberIndividual = 100, source = "duckdb")
mockCohortCharacteristics()  # CohortCharacteristics
mockOmopSketch()             # OmopSketch
mockIncidencePrevalence(sampleSize = 1000)

Internally, these functions typically call omock helpers such as mockCdmFromDataset() or mockCdmFromTables(). The mock* function should return a cdm_reference containing whatever tables the package needs, and should support a source argument ("duckdb" or "local") where appropriate.

Examples that use a mock* function should call mockDisconnect(cdm) at the end to release the database connection, a convention from PatientProfiles that is now used widely.

7.3 Argument naming

7.3.1 Standard argument names

The following argument names are reserved. Use them whenever your function has an argument that matches the described role.

Argument	Type	Role
`cdm`	`cdm_reference`	CDM reference object. First argument if present.
`cohort`	`cohort_table`	Cohort table to operate on.
`cohortId`	`integer` or `NULL`	IDs of cohorts to include; `NULL` = all.
`conceptSet`	codelist / expression	Clinical concept set.
`name`	`character(1)`	Name of the output table written into the CDM.
`nameStyle`	`character(1)`	Glue template for naming output columns.
`targetCohortTable`	`character(1)`	Name of a cohort table to intersect with.
`targetCohortId`	`integer` or `NULL`	Cohort IDs within `targetCohortTable`.
`tableName`	`character(1)`	Name of an OMOP CDM clinical table.
`omopTableName`	`character(1)`	As `tableName`; used in OmopSketch-style functions.
`indexDate`	`character(1)`	Column name to use as the index date.
`strata`	`list` of `character` vectors	Stratification variable sets.
`ageGroup`	`list` (optionally named) or `NULL`	Age group definitions, e.g. `list(c(0, 17), c(18, 64))`.
`sex`	`character`	Sex filter or stratification: `"Male"`, `"Female"`, `"Both"`.
`window`	length-2 numeric vector (whole days or `-Inf`/`Inf`) or named `list`	Time window(s) relative to index date.
`intersections`	single number or length-2 numeric vector, e.g. `c(1, Inf)`	Number of required intersections.
`gapEra`	`integer(1)`	Days between records below which they are merged into one era.
`minCellCount`	`integer(1)`	Minimum cell count for suppression. Default `5`.
`type`	`character(1)`	Output format for `table*`: `"gt"`, `"flextable"`, `"tibble"`, `"datatable"`.
`facet`	`character` or formula	Columns to facet by in `plot*` functions.
`colour`	`character`	Columns to colour by in `plot*` functions.

7.3.2 Argument order

Arguments should appear in this order:

1. cdm (if the function takes the CDM directly)
2. cohort (if the function takes a cohort table)
3. cohortId
4. Content arguments (what to compute, concept sets, windows, etc.)
5. Modifier arguments (flags and options that adjust behaviour)
6. name (near the end, since it is often left at its default)

# cdm first, then content, then name
generateIngredientCohortSet(
  cdm = cdm,
  ingredient = "acetaminophen",
  gapEra = 7,
  name = "dus_cohort"
)

# cohort first, then content (targetCohortTable, intersections, window), then name
requireCohortIntersect(
  cohort = cdm$fractures,
  targetCohortTable = "gibleed",
  intersections = c(1, Inf),
  window = c(-Inf, 0),
  name = "fractures_with_gi_bleed"
)
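
The leading part of this ordering can be checked programmatically. The sketch below is a minimal illustration in base R: `requireExampleIntersect()` is a hypothetical function skeleton with no real body, and `hasStandardLeadingArguments()` is a hypothetical helper that only tests whether `cdm`, `cohort` and `cohortId` (whichever are present) come first and in that order. It does not check where `name` sits.

```r
# Hypothetical function skeleton that follows the ordering; body omitted.
requireExampleIntersect <- function(cohort,
                                    cohortId = NULL,
                                    targetCohortTable,
                                    window = c(0, Inf),
                                    name = NULL) {
  invisible(NULL)
}

# Hypothetical check: whichever of cdm, cohort and cohortId are present must
# occupy the first positions, in that order.
hasStandardLeadingArguments <- function(fun) {
  argumentNames <- names(formals(fun))
  if (is.null(argumentNames)) {
    argumentNames <- character(0)
  }
  leading <- intersect(c("cdm", "cohort", "cohortId"), argumentNames)
  identical(argumentNames[seq_along(leading)], leading)
}

hasStandardLeadingArguments(requireExampleIntersect)
# TRUE

hasStandardLeadingArguments(function(window, cohort) NULL)
# FALSE
```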

7.3.3 The `nameStyle` argument

add* functions in PatientProfiles that produce multiple columns use a nameStyle argument — a glue-style template that controls how the new column names are constructed. Available template variables depend on the function but commonly include {cohort_name}, {window_name}, {concept_name}, and {value}.

# Default nameStyle — column is named automatically
addCohortIntersectFlag(cohort, targetCohortTable = "cohort2", window = c(-Inf, -1))

# Custom nameStyle — single column named "prior_infection"
addCohortIntersectFlag(
  cohort,
  targetCohortTable = "cohort2",
  window = c(-Inf, -1),
  nameStyle = "prior_infection"
)

# Multi-window nameStyle — columns like "prior_morphine", "future_aspirin"
addCohortIntersectCount(
  cohort,
  targetCohortTable = "drugs",
  window = list("prior" = c(-Inf, -1), "future" = c(1, Inf)),
  nameStyle = "{window_name}_{cohort_name}"
)

When your add* function can produce multiple columns, provide a sensible default nameStyle and document the available template variables in the function’s roxygen entry.
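
To make the template mechanics concrete, here is a minimal base R sketch of how a nameStyle string could expand into column names. `expandNameStyle()` is a hypothetical helper written for this illustration; it is not how PatientProfiles implements the feature, and it handles only the `{cohort_name}` and `{window_name}` variables.

```r
# Hypothetical helper: fill a nameStyle template for every combination of
# cohort name and window name, returning the distinct column names.
expandNameStyle <- function(nameStyle, cohortName, windowName) {
  combos <- expand.grid(
    cohort_name = cohortName,
    window_name = windowName,
    stringsAsFactors = FALSE
  )
  columnNames <- vapply(seq_len(nrow(combos)), function(i) {
    out <- nameStyle
    for (variable in names(combos)) {
      placeholder <- paste0("{", variable, "}")
      out <- gsub(placeholder, combos[[variable]][i], out, fixed = TRUE)
    }
    out
  }, character(1))
  unique(columnNames)
}

expandNameStyle(
  nameStyle = "{window_name}_{cohort_name}",
  cohortName = c("morphine", "aspirin"),
  windowName = c("prior", "future")
)
# "prior_morphine" "prior_aspirin" "future_morphine" "future_aspirin"

expandNameStyle("prior_infection", cohortName = "cohort_2", windowName = "minf_to_m1")
# "prior_infection"
```

The second call shows why a constant nameStyle is only safe when a single column is produced: with several cohorts or windows, every combination would collapse to the same name.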

7.4 Windows

Time windows are numeric vectors of length 2: c(start, end), where each value is a whole number of days relative to an index date (typically cohort_start_date), or -Inf/Inf for an open bound.

0 means the index date itself.
-Inf means no lower bound (any time before).
Inf means no upper bound (any time after).

When multiple windows are needed, they are passed as a named list. The names become the window labels in nameStyle templates and in additional_name/additional_level columns of summarised_result objects:

window = list(
  "short_term" = c(1, 30),
  "mid_term"   = c(31, 180),
  "long_term"  = c(181, 365)
)
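
A package that accepts windows usually needs to validate them and give unnamed windows a label. The base R sketch below shows one way to do both. `windowLabel()` and `validateWindow()` are hypothetical helpers, and the default label format (`minf` for -Inf, an `m` prefix for negative days) is an assumption inferred from the `cohort_2_minf_to_m1` column name shown elsewhere in this chapter.

```r
# Hypothetical helper: label one end of a window, e.g. -Inf -> "minf", -1 -> "m1".
windowLabel <- function(x) {
  if (is.infinite(x)) {
    return(if (x < 0) "minf" else "inf")
  }
  label <- as.character(as.integer(abs(x)))
  if (x < 0) paste0("m", label) else label
}

# Hypothetical helper: accept a single window or a list of windows, check each
# one, and fill in names for windows that were passed without one.
validateWindow <- function(window) {
  if (!is.list(window)) {
    window <- list(window)
  }
  for (w in window) {
    # Two non-missing numbers, start not after end.
    stopifnot(is.numeric(w), length(w) == 2, !anyNA(w), w[1] <= w[2])
    # Finite bounds must be whole days.
    finite <- w[is.finite(w)]
    stopifnot(all(finite == round(finite)))
  }
  windowNames <- names(window)
  if (is.null(windowNames)) {
    windowNames <- rep("", length(window))
  }
  unnamed <- windowNames == ""
  windowNames[unnamed] <- vapply(
    window[unnamed],
    function(w) paste0(windowLabel(w[1]), "_to_", windowLabel(w[2])),
    character(1),
    USE.NAMES = FALSE
  )
  names(window) <- windowNames
  window
}

names(validateWindow(c(-Inf, -1)))
# "minf_to_m1"

names(validateWindow(list("short_term" = c(1, 30), c(31, 180))))
# "short_term" "31_to_180"
```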

7.5 Sex values

When sex is stored as a value — in strata_level, as a column value, or in cohort settings — the standard values are "Male", "Female", and "Both". These use an initial capital. This convention is established in IncidencePrevalence and used consistently across the ecosystem.

7.6 Data values in `summarised_result`

String values stored inside summarised_result columns use snake_case. This applies to variable_name, variable_level, estimate_name, result_type, and values in group_level and strata_level, apart from the capitalised sex values described in Section 7.5:

# Typical variable_name values
"number_records"
"number_subjects"
"age"
"sex"

# Typical estimate_name values
"count"
"mean"
"sd"
"q25"
"median"
"q75"

# Typical result_type values
"summarised_characteristics"
"summarised_drug_utilisation"
"cohort_count"

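A package's tests can verify that the strings it writes into a result follow this rule. The sketch below uses a plain data frame as a simplified stand-in for a summarised_result; `isSnakeCase()` is a hypothetical helper, and the data frame holds only a few of the real columns.

```r
# Hypothetical helper: TRUE for lower-case words and digits joined by single
# underscores, e.g. "number_records" or "q25".
isSnakeCase <- function(x) {
  grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", x)
}

isSnakeCase(c("number_records", "q25", "numberRecords", "Male"))
# TRUE TRUE FALSE FALSE

# Simplified stand-in for a summarised_result (only three of its columns).
result <- data.frame(
  variable_name = c("number_records", "age"),
  estimate_name = c("count", "median"),
  result_type = "summarised_characteristics",
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column: do all of its values follow snake_case?
vapply(result, function(column) all(isSnakeCase(column)), logical(1))
```

Note that the capitalised sex values ("Male", "Female", "Both") fail this check by design, so a real test would need to allow for them in strata_level.
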
7.7 Column naming in cohort tables

Columns added to cohort tables by add* functions use snake_case. The names are derived from the nameStyle template and typically combine a descriptor of the value type, a cohort or concept name, and a window name when multiple windows are used:

Pattern	Example column name
Single target, default style	`cohort_2_minf_to_m1`
Custom `nameStyle = "prior_infection"`	`prior_infection`
`nameStyle = "{window_name}_{cohort_name}"`	`prior_morphine`, `future_aspirin`
`addAge()`	`age`
`addSex()`	`sex`
`addPriorObservation()`	`prior_observation`

7.8 The `summarise` → `table`/`plot` pattern

All analytics packages follow the same two-step workflow:

1. A summarise* (or estimate*) function queries the CDM and returns a summarised_result.
2. A paired table* or plot* function formats that result for display.

In production code, assign the result of summarise* to a variable instead of piping it straight into table* or plot*. The summarised_result object is the exportable artefact: you may want to save it, combine it with results from other databases, or apply minimum cell count suppression to it before rendering.

# Correct workflow
result <- summariseCharacteristics(cdm$my_cohort)
result |> tableCharacteristics(type = "gt")

# Also fine for quick exploration, but not recommended in scripts
# cdm$my_cohort |>
#   summariseCharacteristics() |>
#   tableCharacteristics()
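
The value of keeping the intermediate object can be illustrated without a database. The base R sketch below uses two small data frames as simplified, hypothetical stand-ins for summarised_result objects from two databases, and a hypothetical `suppressCounts()` helper. It is not the ecosystem's own suppression routine, which should be preferred in real code.

```r
# Hypothetical, simplified stand-ins for two summarised_result objects.
resultA <- data.frame(
  cdm_name = "database_a",
  variable_name = c("number_subjects", "number_records"),
  estimate_name = "count",
  estimate_value = c("120", "3"),
  stringsAsFactors = FALSE
)
resultB <- data.frame(
  cdm_name = "database_b",
  variable_name = c("number_subjects", "number_records"),
  estimate_name = "count",
  estimate_value = c("4", "310"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: hide counts that are above zero but below minCellCount.
suppressCounts <- function(result, minCellCount = 5) {
  value <- suppressWarnings(as.numeric(result$estimate_value))
  toHide <- result$estimate_name == "count" &
    !is.na(value) & value > 0 & value < minCellCount
  result$estimate_value[toHide] <- NA_character_
  result
}

# Because the results were kept as objects, they can be combined and
# suppressed before anything is rendered.
combined <- rbind(resultA, resultB)
suppressed <- suppressCounts(combined, minCellCount = 5)
suppressed$estimate_value
# "120" NA NA "310"
```

None of these steps would be possible if the summarise* call had been piped directly into a table* function.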

7.9 Inconsistencies to be aware of

The ecosystem has grown over several years, and a small number of inconsistencies currently exist:

omopTableName vs tableName: OmopSketch uses omopTableName for the name of a CDM clinical table; other packages use tableName. New packages should prefer tableName unless interoperating closely with OmopSketch patterns.
sex argument type: Some older functions accept sex = TRUE (logical, meaning “stratify by sex”) while newer ones accept sex = c("Male", "Female", "Both") (character, meaning “filter to these groups”). Check the package you are building on and match its convention; prefer the character form in new code.
estimate* vs summarise*: IncidencePrevalence uses estimate* for its main output functions. Other packages use summarise*. Both conventions will coexist; choose the one that best describes what your function computes.
mockDisconnect() location: The mockDisconnect() helper is exported from PatientProfiles. Packages that use it in examples need PatientProfiles in Suggests.

When you find a convention that is not documented here or that appears applied inconsistently, please open an issue on the book repository.