{dm} is fantastic!

SQL

A quick overview of the {dm} package for working with relational data models in R.

Published

May 9, 2021

Sometimes as an R user you may find yourself needing to work with data that is stored in a remote database, but want to continue to use your R-orientated workflow. The dm package by Kirill Müller provides some great tools for doing exactly that, and in this post I wanted to share some of the functions I’ve found particularly useful.

Now, you may already be familiar with the dbplyr package which allows you to use the majority of dplyr/tidyr functions on a remote table by auto-magically converting these commands into the necessary SQL and returning the result back to R when you call collect(). dm extends this concept and adds extra functionality for working with the whole database, making it a very valuable tool. For the examples in this post, I’ve set up a local SQL server, and copied over the dm::dm_nycflights13() example database that is included in the dm package.

# Setup for examples using SQL server

con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "ODBC Driver 17 for SQL Server",
                      Server   = "localhost",
                      UID      = Sys.getenv("SQL_user"),
                      PWD      = Sys.getenv("SQL_password"),
                      Port     = 1433)

flights <- dm_nycflights13()

DBI::dbSendQuery(con, "CREATE DATABASE nycflights")

copy_dm_to(con, dm = flights, temporary = FALSE)

Next I rerun the connection, but this time specifying the newly created nycflights database.

con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "ODBC Driver 17 for SQL Server",
                      Server   = "localhost",
                      Database = "nycflights",
                      UID      = Sys.getenv("SQL_user"),
                      PWD      = Sys.getenv("SQL_password"),
                      Port     = 1433)

Now our data is in the database we can begin working with it. To connect to the database and learn the connections, we can use dm_from_src(). Setting the learn_keys argument to TRUE means dm will attempt to discover the primary and foreign keys between the tables, however this currently only works with Postgres and SQL Server databases.

flights <- dm_from_src(con, learn_keys = TRUE)

flights

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`, `mtcars`
#> Columns: 64
#> Primary keys: 3
#> Foreign keys: 2

A really great feature of dm is the ability to plot the database to visualise the links between tables. We can do this by using the dm_draw() function

dm_draw(flights)

Now, there is a small issue with the nycflights data in that some of the tailnum values are not present in both the flights and planes table. However, we can manually create the link by adding an additional foreign key using dm_add_fk().

flights <- flights %>% 
  dm_add_fk(table = flights, columns = tailnum, ref_table = planes)

flights

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`, `mtcars`
#> Columns: 64
#> Primary keys: 3
#> Foreign keys: 3

And if we draw the model again we can see that planes is now connected to the flights table. I’ve also added some extra styling to show how easy it is to customise these plots with dm.

You can use dm_get_available_colors() to see all possible colours

flights %>% 
  dm_set_colors(
    salmon4 = flights,
    violetred1 = airlines,
    royalblue4 = planes,
    wheat = airports
  ) %>% 
  dm_draw(rankdir = "TB",
          view_type = "title_only",
          edge_attrs = "arrowhead = vee")

Working with the data

So, now we have a data model set up we can begin working with it. dm contains a range of dm_* functions that work in a similar way to their dpylr equivalents, but they can affect the whole data model object. For example, we can select certain tables and use filters on the whole data model. Lets drop the weather table as we aren’t too interested in weather for the time being.

flights <- flights %>% 
  dm_select_tbl(-weather)

flights

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `mtcars`
#> Columns: 49
#> Primary keys: 3
#> Foreign keys: 3

By printing the dm object, we can see the weather table is no longer part of the dm object.

Lets say we are particular interested in finding out about Alaska Airlines Inc. We can use dm_filter() to find all the linked information for this airline in our data model.

flights_filt <- flights %>% 
  dm_filter(flights, carrier == "AA") 

flights_filt

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `mtcars`
#> Columns: 49
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> flights: carrier == "AA"

Now we can see we have some additional information regarding the filter when we print the data model. In order to apply this filter, we have a couple of possibilities. If we were interested in returning all the rows in the airports table, we could apply this filter directly to that table using dm_apply_filters_to_tbl().

flights_filt %>% 
  dm_apply_filters_to_tbl(airports)

#> # Source:   lazy query [?? x 8]
#> # Database: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#>   faa   name                  lat   lon   alt    tz dst   tzone           
#>   <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
#> 1 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York
#> 2 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
#> 3 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York

Sometimes, we might want to return all tables with the filter applied. dm provides the dm_flattern_to_tbl() function to do exactly that. First we need to apply the filter to the dm object, and then we can “flattern” to a single table, specifying the type of join we would like to use. dm will create the joins based on the keys already defined in the dm object.

flights_filt %>% 
  dm_apply_filters() %>% 
  dm_flatten_to_tbl(start = flights, airports, planes, join = left_join)

#> Renamed columns:
#> * year -> flights.year, planes.year

#> # Source:   lazy query [?? x 34]
#> # Database: Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#>    flights.year month   day dep_time sched_dep_time dep_delay arr_time
#>           <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1         2013     1    10      531            540        -9      832
#>  2         2013     1    10      553            600        -7      837
#>  3         2013     1    10      555            600        -5      733
#>  4         2013     1    10      604            610        -6      851
#>  5         2013     1    10      604            610        -6      858
#>  6         2013     1    10      625            630        -5      753
#>  7         2013     1    10      633            630         3     1142
#>  8         2013     1    10      652            659        -7      942
#>  9         2013     1    10      659            700        -1     1013
#> 10         2013     1    10      700            700         0      837
#> # … with more rows, and 27 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>,
#> #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, planes.year <int>,
#> #   type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> #   seats <int>, speed <int>, engine <chr>

The start argument is the table that we want to join the others onto. You can optionally supply tables names that you want to join (in this case airports and planes), or specify nothing to join all tables. Finally you can also specify the type of join you would like to use (left_join is the default).

Before flattening to a single table, it can sometimes be useful to see how many rows each tables has after you have applied the filter.

flights_filt %>% 
  dm_apply_filters() %>% 
  dm_nrow()

#> airlines airports  flights   planes   mtcars 
#>        1        3     1089      118       32

This can be good to double check the filter has done what you think it was going to do!

To see what’s going on under the hood here, we can use show_query().

flights_filt %>% 
  dm_apply_filters() %>% 
  dm_flatten_to_tbl(start = flights, airports, planes, join = left_join) %>% 
  show_query()

#> Renamed columns:
#> * year -> flights.year, planes.year

#> <SQL>
#> SELECT "flights.year", "month", "day", "dep_time", "sched_dep_time", "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "LHS"."tailnum" AS "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour", "name", "lat", "lon", "alt", "tz", "dst", "tzone", "planes.year", "type", "manufacturer", "model", "engines", "seats", "speed", "engine"
#> FROM (SELECT "flights.year", "month", "day", "dep_time", "sched_dep_time", "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour", "name", "lat", "lon", "alt", "tz", "dst", "tzone"
#> FROM (SELECT "year" AS "flights.year", "month", "day", "dep_time", "sched_dep_time", "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight", "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute", "time_hour"
#> FROM "dbo"."flights"
#> WHERE ("carrier" = 'AA')) "LHS"
#> LEFT JOIN (SELECT * FROM "dbo"."airports" AS "LHS"
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM "dbo"."flights"
#> WHERE ("carrier" = 'AA')) "RHS"
#>   WHERE ("LHS"."faa" = "RHS"."origin")
#> )) "RHS"
#> ON ("LHS"."origin" = "RHS"."faa")
#> ) "LHS"
#> LEFT JOIN (SELECT "tailnum", "year" AS "planes.year", "type", "manufacturer", "model", "engines", "seats", "speed", "engine"
#> FROM (SELECT * FROM "dbo"."planes" AS "LHS"
#> WHERE EXISTS (
#>   SELECT 1 FROM (SELECT *
#> FROM "dbo"."flights"
#> WHERE ("carrier" = 'AA')) "RHS"
#>   WHERE ("LHS"."tailnum" = "RHS"."tailnum")
#> )) "q01") "RHS"
#> ON ("LHS"."tailnum" = "RHS"."tailnum")

So with just a few line of code we were able to generate this fairly lengthy SQL statement. Pretty neat 😎.

Zooming with dm’s

Anther cool feature in dm is the ability to “zoom” into a table and use a dplyr style workflow on that table. This will be very familiar if you have ever used dbplyr. Lets zoom into the planes table and find the mean number of seats for each manufacturer.

zoomed <- flights_filt %>% 
  dm_zoom_to(planes) %>% 
  group_by(manufacturer) %>% 
  summarise(mean_seats = mean(seats, na.rm = TRUE)) %>% 
  arrange(desc(mean_seats))

zoomed

#> # Zoomed table: planes
#> # Source:       lazy query [?? x 2]
#> # Database:     Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> # Ordered by:   desc(mean_seats)
#>    manufacturer                  mean_seats
#>    <chr>                              <int>
#>  1 AIRBUS                               221
#>  2 AIRBUS INDUSTRIE                     187
#>  3 BOEING                               175
#>  4 MCDONNELL DOUGLAS                    162
#>  5 MCDONNELL DOUGLAS CORPORATION        142
#>  6 MCDONNELL DOUGLAS AIRCRAFT CO        142
#>  7 DOUGLAS                              102
#>  8 BOMBARDIER INC                        74
#>  9 CANADAIR                              55
#> 10 EMBRAER                               45
#> # … with more rows

After using dm_zoom_to() to select the planes table the rest of the code should look fairly familiar. We now have a couple of options for what we want to do with our new zoomed in table. We might want to overwrite the existing planes table with our new summary table. If that was the case we could use the dm_update_zoomed() to replace the original table with the one we have just created. An alternative (a potentially less destructive) approach is to create a new table containing the summary information.

flights_updated <- zoomed %>% 
  dm_insert_zoomed(new_tbl_name = "planes_summary")

flights_updated

#> ── Table source ───────────────────────────────────────────────────────────
#> src:  Microsoft SQL Server 15.00.4138[dbo@h-xps/nycflights]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `mtcars`, `planes_summary`
#> Columns: 51
#> Primary keys: 3
#> Foreign keys: 3
#> ── Filters ────────────────────────────────────────────────────────────────
#> flights: carrier == "AA"

Now this particular summary table doesn’t need to link to any of the other tables, but the dm zoomed vingette provides lots of great examples for how one might link summary tables into the dm object.

dm_draw(flights_updated)

However, we may want to access this table and include it in our report. We can access a specific table easily by using the pull_tbl() function.

flights_updated %>% 
  pull_tbl(planes_summary) %>% 
  collect() %>% 
  reactable::reactable()

Nice!

Conclusion

So that was a quick overview of some of the functions I’ve found particularly useful. dm can do much more though so check out the site https://cynkra.github.io/dm/index.html which contains a great mix of tutorials and technical articles which make the package extremely accessible and fun to use.

Thanks for reading!