Using ‘R’ Services within SQL Server 2016

 Using 'R' Services Within SQL Server 2016

‘R’ Services in SQL Server 2016 is a new technology within the SQL Server suite. This is an Enterprise feature allowing data scientists to explore and model data directly against the database, thereby minimising data movement and potentially stale data.

 

Enabling ‘R’ services requires the selection of the ‘R’ Services ‘In-Database’ option during the setup process of the SQL Server installation.

On the Feature Selection page, select these options:

  • Database Engine ServicesAt least one instance of the database engine is required to use SQL Server R Services. You can use either a default or named instance.
  • R Services (In-Database) This option configures the database services used by R jobs and installs the extensions that support external scripts and processes.

 

Once the installation is complete, you will see:

image

 

So… we’ve got the binaries installed, and now we need to enable the use of ‘R’ scripts within the instance itself:

image

 

This allows SQL Server to execute the ‘external’ ‘R’ scripts by passing the scripts and the associated data set to the external R server process. Now… first gotcha…. the execution of ‘R’ scripts can be very resource intensive. These ‘R’ scripts are running on the same server as the SQL Server Database Engine…. and we don’t really want the execution of the ‘R’ services script to consume all of the processor and memory resources on the box.

 

Resource Governor to the rescue…. We can setup a new Resource Governor Pool to control how much of our precious server resources can be taken by the ‘R’ services:

image

 

And now we are about ready to actually execute our first ‘R’ script!….

The ‘R’ scripts can be developed in Visual Studio, or in R-Studio, or any other ‘R’ IDE of your choice. Once the script is ready for use within SQL Server, we need to wrap it with an

  • EXECUTE sp_execute_external_script call, which we offer in the following parameters:

@language = N’R’ <- Execute an ‘R’ script within the sp_execute_external_script procedure call.

@script = N’<insert your ‘R’ script here>’

WITH RESULTS SETS (([column definitions for the returning data set], …. ));

 

So, as a first example:

EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'data(iris); head(iris);'
, @output_data_1_name = N'iris'
WITH RESULT SETS ((
[Sepal_Length] float not null,
[Sepal_Width] float not null,
[Petal_Length] float not null,
[Petal_Width] float not null,
[Species] varchar(20) not null));

 

This will execute an ‘R’ script based on the IRIS dataset and return the 5 columns within the result set, mapped to the column definitions within the WITH RESULTS SET clause.

 

We can also expand on this to start doing some basic modelling within ‘R’, using the same concept:

 

EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'data(iris);
head(iris);
x <- iris$Petal.Length;
y <- iris$Petal.Width;
model <- lm (y ~ x);
prediction <- data.frame(predict(object = model, newdata = data.frame(x = c(2,5,7))))'
, @output_data_1_name = N'prediction'
WITH RESULT SETS ((
[Predict] nvarchar(max) not null));

 

Here we use the same concepts as before, but are now using a linear modelling function ‘lm (y ~ x)’ to allow us to start predicting values based on the line of best fit within the existing data set.

Once the basic concepts of how to execute ‘R’ scripts using the “EXECUTE sp_execute_external_script” procedure are grasped, and the mechanisms behind mapping the resulting ‘R’ data sets to T-SQL columns are understood, we can run any ‘R’ based script which returns a tabular data set.

 

So... next step... how about trying to access data from within SQL Server databases? How easy is that to achieve?

Answer.... kind of easy... Here's how....

Firstly, to connect to a SQL Server database from within an 'R' script, we need an ODBC connection. This is achieved using the R library called RODBC...

  • library(RODBC);

 

Unfortunately, the RODBC package isn't installed by default, so we need to get it.

To install 'R' packages, using the Windows command prompt...

Open a Windows command prompt as administrator, and navigate to the directory where the RTerm.Exe or RGui.exe files are located.

In a default install, this is the R \bin directory. For example, in SQL Server 2016, the R tools are located here:

C:\Program files\MSSQL13.MSSQLSERVER\R_SERVICES\bin or
C:\Program files\MSSQL13.<instanceName>\R_SERVICES\bin\x64

Run R.Exe.

 

Set the library to be the SQL Server library location:
In the Command Window execute:
lib.SQL <- "C:/Program Files/Microsoft SQL Server/MSSQL12.MSSQLSERVER/R_SERVICES/library"
Note the / rather than the \ in the path otherwise a "library is not writable" will result.

To install packages into the R library associated with SQL Server run:
install.packages("RODBC", lib = lib.SQL)

To list the installed packages
EXECUTE sp_execute_external_script
@language=N'R'
,@script = N'str(OutputDataSet);
packagematrix <- installed.packages();
NameOnly <- packagematrix[,1];
OutputDataSet <- as.data.frame(NameOnly);'
,@input_data_1 = N'SELECT 1 as col'
WITH RESULT SETS ((PackageName nvarchar(250) ))

Continue to install any other packages which are needed.  Useful R libraries are listed here:  https://cran.r-project.org/web/packages/available\_packages\_by\_name.html

 

If you need additional packages from the c-ran website, simply replace the "RODBC" line above with the package required and you are good to go.

So, with the RODBC library installed into the 'R' SQL library (so that SQL Server can use it when the 'R' script is executed within the context of SQL Server itself), we are ready to make the connection into SQL Server:

 

dbhandle <- odbcDriverConnect("driver={SQL Server};server=<your_server>;database=AdventureWorks2014;trusted_connection=true")

then to execute a bit of T-SQL to return a dataset:

OutputDataSet <- sqlQuery(dbhandle, "SELECT AddressID, AddressLine1, AddressLine2, City, StateProvinceID, PostalCode, rowguid, ModifiedDate from Person.Address");'

and all we then need to do is map the OutputDataSet to the "WITH RESULTS SET" clause as such:

EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'library(RODBC);
dbhandle <- odbcDriverConnect("driver={SQL Server};server=<your_server>;database=AdventureWorks2014;trusted_connection=true")
OutputDataSet <- sqlQuery(dbhandle, "SELECT AddressID, AddressLine1, AddressLine2, City, StateProvinceID, PostalCode, rowguid, ModifiedDate from Person.Address");'
, @output_data_1_name = N'OutputDataSet'
WITH RESULT SETS ((
[AddressID] int not null,
[AddressLine1] nvarchar(60) not null,
[AddressLine2] nvarchar(60) null,
[City] nvarchar(30) not null,
[StateProvinceID] int not null,
[PostalCode] nvarchar(15) not null,
--[spatiallocation] nvarchar(max) not null,
[rowguid] uniqueidentifier not null,
[ModifiedDate] datetime not null));

And you should be able to pull a data set using T-SQL directly into an 'R' script for manipulation. The construction of T-SQL for table joins, aggregates, sorts and all of the other familiar T-SQL syntax is fully available to you, it simply needs to return a table data set which we then map to the WITH RESULTS SET clause.

As a final script, a connection is made to the AdventureworksDW2014 database to run a semi-complex T-SQL statement which returns a data set. This data set then has some analysis done on it to determine the mean of the aggregated [Unitprice] column and bind the results into the overall data set.

EXECUTE sp_execute_external_script
@language = N'R'
, @script = N'library(RODBC);
dbhandle <- odbcDriverConnect("driver={SQL Server};server=<your_server>;database=AdventureWorksDW2014;trusted_connection=true")
SQLResultSet <- sqlQuery(dbhandle, "
SELECT DC.Title, DC.FirstName, DC.MiddleName, DC.LastName, SUM(UnitPrice) AS [TotalSpend]
FROM
FactInternetSales FIS
INNER JOIN DimCustomer DC ON FIS.CustomerKey = DC.CustomerKey
GROUP BY
DC.Title, DC.FirstName, DC.MiddleName, DC.LastName
HAVING SUM(UnitPrice) > 10000
ORDER BY
SUM(UnitPrice) DESC");
MeanDataSet <- data.frame(mean(SQLResultSet$TotalSpend, na.rm=TRUE))
OutputDataSet <- cbind(SQLResultSet, MeanDataSet)'
, @output_data_1_name = N'OutputDataSet'
WITH RESULT SETS ((
[Title] nvarchar(8) null,
[FirstName] nvarchar(50) null,
[MiddleName] nvarchar(50) null,
[LastName] nvarchar(50) null,
[TotalSpend] float not null,
[Mean] float not null));

Hopefully that has been useful, stay tuned for more posts on 'R' and how to implement it.