Dplyr Cheat Sheet

4 min read2020/04/16

Summarize Data Make New Columns Combine Data Sets df'w'.valuecounts Count number of rows with each unique value of variable len(df) # of rows in DataFrame. When calculating nested counts use dplyr::vars to specify 2 variables for targetvar. Setdistinctby(e, distinctby) - Specifies variable(s) to use to calculate distinct occurrences. DISTINCT VS EVENT COUNTS FORMATTING setformatstrings and fstr are used to specify the occurrence and proportion variables and how they will be presented.

Motivation

I use R to access data held in Microsoft SQL Server databases on a daily basis. As a result of running into problems, I’ve realized I don’t have an understanding of the roles different components, notably dplyr, dbplyr, odbc and dbi each play in the process. This contributes to the fact that my efforts to resolve or mitigate issues are often an inefficient combination of Google searches and trial and error. Additionally, as I find or develop workarounds, I am unwilling to promote them with others because I don’t fully understand the cause of the issue and, as a result, I am not confident I have addressed the problem at the appropriate level.

Specifically, this is most motivated by the, “Invalid descriptor index,” error documented here.

I am writing what I learn to solidly my thinking and to help others who may experience the same challenge.

Scope

Given that the source of my motivation is encountering problems when working with data in Microsoft SQL Server databases, and that I prefer to use packages in the tidyverse, this investigation will be focused on how these packages work together to collect data from databases managed by Microsoft SQL server.

I won’t be writing about other options like RODBC, RJDBC or database-specific packages.

I will also not include comments about database management systems (DBMS) other than Microsoft SQL Server.

While it’s possible to generalize many of the concepts I write about here to other DBMS systems, I will not explicitly call them out. There are plenty of resources that do that. I aim to be very focused on how these components interact in a tidyverse and Microsoft SQL Server environment in the hopes it will help paint a simpler, clearer picture for others working in that same configuration.

Additionally, I almost never write data to a DBMS, and I suspect this is the case for many people working as analysts in Enterprise environments. In light of that I will be focused on how these components work together to extract data from SQL, and not how they write data to it.

Role of dplyr and dbplyr packages

Dplyr cheat sheet 2021

Dplyr Cheat Sheet 2020

dbplyr is the database back-end for dplyr - it does not need to be loaded explicitly, it is loaded by dplyr when working with data in a database.

dbplyr translates dplyr syntax into Microsoft SQL Server specific SQL code so dplyr can be used to retrieve data from a database system without the need to write SQL code.

dbplyr relies on the DBI and odbc packages as an intermediaries for connections with a SQL Server database.

Extra help tiers explained. Dplyr can also pass on explicitly written (not translated from dplyr) SQL code to DBI.

dbplyr generates, or captures, the SQL code that is then passed into the front-end of the database stack provided by DBI, odbc and the ODBC driver

Role of the DBI package

DBI segments the connectivity to the SQL database into a, “front-end,” and a, “back-end.”

DBI implements a standardized front-end to dbplyr, and the odbc package acts as a driver for DBI to interface with SQL Server.

An example of front-end functionality provided by DBI…

  • connect/disconnect to the database
  • create and execute statements in the DBMS
  • extract results/output from statements
  • error/exception handling
  • information (meta-data) from database objects
  • transaction management (optional)

I think of the DBI package as the front end for interactive user at the R console, a script or package, into the other components, odbc and an ODBC driver, that make it possible to extract data from SQL server.

Role of the odbc package

The odbc package provides the DBI back-end to any odbc driver connection, including those for Microsoft SQL Server.

This enables a connection to any database with ODBC drivers available.

I think of the odbc package as the “back-end” of DBI and the “front-end” into the ODBC driver.

Role of ODBC drivers

Open Database Connectivity (ODBC) drivers are the last leg of the link between dplyr and SQL Server. They are what enable the odbc package to interface with SQL server.

I think of the SQL Server ODBC driver as the “front-end” into SQL server, and again, back to the user, script or package

Summary

Dplyr Cheat Sheet R

  • Dbplyr translates dplyr syntax into SQL code, or captures explicitly written SQL, and hands it off to DBI
  • DBI uses odbc to interface with the SQL Server ODBC driver
  • ODBC driver communicates with SQL server and retrieves results

User or package code -> DBI -> odbc -> SQL Server ODBC driver -> SQL server

Sources and notes

I haven’t written anything new here, just focused it on the configuration I use day to day in the hopes it helps someone else.

Most was gleaned from the following and I’d recommend reviewing them for a broader perspective, and deeper insights into specific areas:

  • vignette('DBI', package = 'DBI')
  • vignette('dbplyr', 'package = 'dbplyr)