Creating a Databricks cdm reference • OmopSparkConnector

So far we’ve used a local Spark connection to introduce the OmopSparkConnector package. In practice, however, when working with patient-level health data, our data will most likely be in the cloud-based Databricks platform, which is built around Apache Spark. Once we have created our cdm reference, the same code we have seen working with a local Spark dataset will also work with Databricks. Only the way we connect differs.

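To create your connection to Databricks, follow the instructions at https://spark.posit.co/deployment/databricks-connect.html. Briefly, you would first save your workspace URL and personal token as environment variables by opening your .Renviron file and adding the two entries below.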

usethis::edit_r_environ()

DATABRICKS_HOST = "Enter here your Workspace URL"
DATABRICKS_TOKEN = "Enter here your personal token"
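
Changes to .Renviron only take effect in a new R session, so it can be worth confirming that both variables are visible after restarting R. The following is a minimal sketch using base R only; the helper name `missing_databricks_vars` is hypothetical and is not part of sparklyr or OmopSparkConnector.

```r
# Hypothetical helper: return the names of any required variables that are unset or empty.
missing_databricks_vars <- function(vars = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

missing <- missing_databricks_vars()
if (length(missing) > 0) {
  message("Not set: ", paste(missing, collapse = ", "))
} else {
  message("Both Databricks variables are set.")
}
```

If a variable is reported as not set, restarting R after saving .Renviron is usually the first thing to try.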

With these saved you should now be able to connect with sparklyr, specifying your cluster ID.

library(sparklyr)
con <- spark_connect(
  cluster_id = "Enter here your cluster ID",
  method = "databricks_connect"
)
con

With this, we can check that everything is working and that we have an open connection.

connection_is_open(con)

With this, we should be able to create a reference to a table. Let’s say our OMOP CDM data is in a catalog called “my_catalog” and a schema called “my_omop_schema”. We should be able to create a reference to our person table.

library(dplyr)
tbl(con, I("my_catalog.my_omop_schema.person"))
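
The table above is identified by a three-part name: catalog, schema, and table, joined by dots. As a small sketch, the name could be assembled from its parts so that the catalog and schema are written only once; the helper `full_table_name` is hypothetical and shown purely for illustration.

```r
# Hypothetical helper: build "catalog.schema.table" from its three parts.
full_table_name <- function(catalog, schema, table) {
  paste(catalog, schema, table, sep = ".")
}

person_name <- full_table_name("my_catalog", "my_omop_schema", "person")
person_name
# With an open connection, this could then be passed on as: tbl(con, I(person_name))
```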

We should be able to collect the first five rows of this table into R.

tbl(con, I("my_catalog.my_omop_schema.person")) |> 
  head(5) |> 
  collect()

We should also be able to go in the other direction and copy data from R to a Spark dataframe.

spark_cars_df <- sdf_copy_to(con,
                             cars,
                             overwrite = TRUE)
spark_cars_df

If these basics are working, we should be well set up to start working with OmopSparkConnector. Here we would specify our cdm schema as we’ve seen above. Now let’s say we have another schema called “my_results_schema” where we want to save any study-specific tables. We’ll use this when specifying the write schema. In addition, we can give a write prefix, and all the tables we create while working with this cdm reference will start with this prefix.

library(OmopSparkConnector)
cdm <- cdmFromSpark(con,
  cdmSchema = "my_catalog.my_omop_schema",
  writeSchema = "my_catalog.my_results_schema",
  writePrefix = "study_1_"
)
cdm
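
Because every table created through this cdm reference starts with the write prefix, the prefix can also be used to tell one study's tables apart from anything else stored in the same results schema. The sketch below illustrates the idea with base R only; the table names are made up for the example and are not produced by OmopSparkConnector.

```r
# Hypothetical table names, as they might appear in "my_results_schema".
results_tables <- c("study_1_my_cohort", "study_1_summary", "other_project_cohort")

# Keep only the tables created under this study's write prefix.
write_prefix <- "study_1_"
study_tables <- results_tables[startsWith(results_tables, write_prefix)]
study_tables
```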