This post is authored by Yiwen Sun, Data Scientist, Esin Saka, Senior Software Engineer, and Shravan Matthur Narayanamurthy, Senior Software Engineering Manager, at Microsoft.
As petabyte-scale data becomes available, successfully enabling and scaling the traditional machine learning paradigm on big data infrastructure is no longer a trivial matter for AI engineers and data scientists. Microsoft Azure Data Lake (ADL) offers such a service: it stores structured and unstructured data of virtually any size and runs optimized analytics that scale instantly. The advantage of processing data with Azure Data Lake Analytics (ADLA) comes from the unique characteristics of U-SQL, a big data query language that combines SQL-like declarative benefits with the expressiveness and extensibility of C#. It can also reuse existing libraries written in .NET languages, R, or Python.
In this post, we demonstrate how to deliver on a fraud detection scenario with big data stored in a data lake, by using ADLA and R.
Before starting, make sure you have an ADLA account and have enabled “U-SQL Advanced Analytics”, which includes the package for the R extension. Check this tutorial for more details on setting up U-SQL extensions.
Recent years have seen steady growth in e-commerce. One of the biggest challenges e-commerce companies face today is the financial loss caused by fraudulent transactions. This post uses online purchase transactions as an example to demonstrate a basic three-step fraud detection process: feature engineering, training, and scoring.
The input data consists of two parts:
1) Transaction data: each line is an individual transaction, including numerical variables such as transactionAmountUSD and itemCount, and categorical attributes such as transactionCurrencyCode and cardType. The following table shows a subset of the data schema.
2) Account data: Each line represents a user account, with categorical variables like accountCountry, accountState, and numerical ones like accountAge.
The goal is to use labelled historical transaction data to train a classification model that can identify whether a new transaction is fraudulent.
Step 1: Feature Engineering
To enhance the predictive power of the ML algorithms, an important step is feature engineering, where additional features are created from raw data based on domain knowledge. For example, if an account has not made a big purchase in the last month, then a sudden thousand-dollar transaction could be suspicious. The new features generated in this scenario include:
- Aggregated variables, such as the aggregated transaction amount and transaction count per account over the last 24 hours and the last 30 days;
- Mismatch variables, such as a mismatch between shippingCountry and billingCountry, which potentially indicates abnormal behavior;
- Risk tables, where fraud risks are calculated as historical fraud probabilities grouped by country, state, IPAddress, etc.
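To make the three feature types concrete, here is a small base-R sketch of each (the column and feature names are our own illustration, not taken from the sample script):

```r
# Toy transaction data mirroring the schema described above
tx <- data.frame(
  accountID            = c("A1", "A1", "A2"),
  transactionAmountUSD = c(25, 1200, 60),
  shippingCountry      = c("US", "US", "DE"),
  billingCountry       = c("US", "CA", "DE"),
  Label                = c(0, 1, 0)
)

# Aggregated variable: total transaction amount per account
agg <- aggregate(transactionAmountUSD ~ accountID, data = tx, FUN = sum)
names(agg)[2] <- "sumAmountPerAccount"
tx <- merge(tx, agg, by = "accountID")

# Mismatch variable: shipping vs. billing country
tx$addressMismatch <- as.integer(tx$shippingCountry != tx$billingCountry)

# Risk table: historical fraud probability grouped by billing country
risk <- aggregate(Label ~ billingCountry, data = tx, FUN = mean)
names(risk)[2] <- "countryRisk"
tx <- merge(tx, risk, by = "billingCountry")
```

In the actual pipeline these features would be computed in U-SQL or in the R reducer over the full history, not over a single batch as in this sketch.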
Sample U-SQL code that demonstrates how to conduct feature engineering for training can be found on GitHub here.
Step 2: Training
After adding new features and cleaning up the dataset, we can apply an ML algorithm to train the model. Training is done using an R script, either outside of ADLA (e.g., using RStudio) or with the R extension on ADLA. The example below uses the latter approach in a local ADLA environment, producing a trained model file, “trainedmodel-rxbtree.rds”, which can later be uploaded to ADLA storage for scoring.
The algorithm used for training is the rxBTrees method from the RevoScaleR package. It fits stochastic gradient boosted decision trees on very large datasets using a parallel external-memory algorithm. Read the specification here for more information about this method.
# Drop ID, timestamp, and partition columns that should not be used as features
data_train <- inputFromUSQL[, !(colnames(inputFromUSQL) %in% c("accountID", "transactionID", "transactionDateTime", "transactionScenario", "transactionType", "Par"))]
# Normalize the various missing-value markers to NA
data_train[data_train == "" | data_train == " " | data_train == "na" | data_train == "NA"] <- NA
# For numeric columns, fill missing values with 0
numcol <- sapply(data_train, is.numeric)
data_train[, numcol][is.na(data_train[, numcol])] <- 0
# Convert string columns to factors
data_train <- as.data.frame(unclass(data_train))
data_train$Label <- as.factor(data_train$Label)
# Build the model formula: Label ~ feature1 + feature2 + ...
names <- colnames(data_train)[which(colnames(data_train) != "Label")]
equation <- paste("Label ~ ", paste(names, collapse = "+", sep = ""), sep = "")
# Fit stochastic gradient boosted trees
boosted_fit <- rxBTrees(formula = as.formula(equation),
                        data = data_train,
                        learningRate = 0.2,
                        minSplit = 10,
                        minBucket = 10,
                        nTree = 100,
                        seed = 5,
                        lossFunction = "bernoulli")
# Save the trained model and return its summary to U-SQL
filename_model <- "F:/USQL/USQLDataRoot/Samples/Output/training/trainedmodel-rxbtree.rds"
saveRDS(boosted_fit, filename_model)
outputToUSQL <- summary(boosted_fit)
This script demonstrates a few features of the U-SQL R extension:
- The R extension supports a set of R packages by default, such as RevoScaleR used in this example; read this document for the list of supported R modules. Additional packages can be uploaded as a zip file, declared as a RESOURCE in U-SQL, and then installed in the R script.
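For example, a package not on the supported list could be uploaded as a zip file and installed at run time. A minimal sketch (the package name, version, and ADLS path here are placeholders, not from the sample):

```
// U-SQL: reference the R extension and deploy the uploaded package zip
// so it is available in the working directory of every vertex
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE @"/usqlext/assembly/R/magrittr_1.5.zip";
```

and then, at the top of the R script:

```r
# Install the deployed zip from the vertex working directory, then load it
install.packages("magrittr_1.5.zip", repos = NULL)
library(magrittr)
```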
- The R extension defines two variables, inputFromUSQL and outputToUSQL, to import data from U-SQL and export results back to U-SQL. By default, both are data frames.
- The R Reducer in U-SQL also provides an option (rReturnType) to pass “charactermatrix” instead of “dataframe” as the returned object from R. For example, in the U-SQL script corresponding to the training R script, the following statement is used to return the model summary as a character matrix.
USING new Extension.R.Reducer(scriptFile : "FraudDetectionTrainingAdvanced.R", rReturnType : "charactermatrix");
In the U-SQL script, the REDUCE operation is executed to partition the input data on a key (Par in this case) to enable parallel processing. The R extension for U-SQL includes a built-in reducer (Extension.R.Reducer) that runs R code on each vertex assigned to the reducer.
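Putting these pieces together, a training invocation might look roughly like the following (the rowset names are illustrative, not from the sample script; with rReturnType “charactermatrix”, the output schema follows the Par/RowId/ROutput pattern used in the U-SQL R extension samples):

```
@ModelSummary =
    REDUCE @TrainingData ON Par                 // partition the input on the key Par
    PRODUCE Par, RowId int, ROutput string      // character-matrix output schema
    READONLY Par
    USING new Extension.R.Reducer(scriptFile : "FraudDetectionTrainingAdvanced.R",
                                  rReturnType : "charactermatrix");
```

Each distinct value of Par gets its own vertex, so the R script runs once per partition in parallel.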
To evaluate the model, we can split the dataset into training and testing sets, train a model on the training set, and then evaluate its performance on the test set using metrics such as accuracy or AUC. Check this sample script for more details.
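The split-and-evaluate step can be sketched in base R as follows (data_all stands for the engineered feature set and is an assumption; the AUC is computed directly from the rank statistic rather than with an evaluation package):

```r
set.seed(5)
n <- nrow(data_all)
test_idx <- sample(n, size = round(0.3 * n))   # hold out 30% for testing
train <- data_all[-test_idx, ]
test  <- data_all[test_idx, ]

# ... fit the model on `train` as in the training script, then score `test` ...

# AUC = probability that a random positive outscores a random negative
auc <- function(p, y) {
  r   <- rank(p)                 # ranks of the predicted probabilities
  pos <- sum(y == 1)
  neg <- sum(y == 0)
  (sum(r[y == 1]) - pos * (pos + 1) / 2) / (pos * neg)
}
```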
Step 3: Scoring
Once we have a trained model, we can use it to predict whether a new transaction is fraudulent. The following R script demonstrates how to load a pre-trained model and run predictions to obtain probability scores.
# Load the pre-trained model
filename_model <- "trainedmodel-rxbtree.rds"
boosted_fit <- readRDS(filename_model)
# Drop the partition column before scoring
data_test <- inputFromUSQL[, !(colnames(inputFromUSQL) %in% c("Par"))]
# Normalize missing-value markers and fill numeric NAs with 0
data_test[data_test == "" | data_test == " " | data_test == "na" | data_test == "NA"] <- NA
numcol <- sapply(data_test, is.numeric)
data_test[, numcol][is.na(data_test[, numcol])] <- 0
# Convert string columns to factors
data_test <- as.data.frame(unclass(data_test))
# Score the new transactions; type = "response" returns probabilities
scores <- rxPredict(
  modelObject = boosted_fit,
  data = data_test,
  type = "response",
  extraVarsToWrite = c("accountID", "transactionID", "transactionAmountUSD"))
# Return the IDs and the fraud probability to U-SQL
# (the prediction column name depends on the model; Label_prob assumed here)
outputdata <- data.frame(accountID = scores$accountID,
                         transactionID = scores$transactionID,
                         fraudProb = scores$Label_prob)
outputToUSQL <- outputdata
In the corresponding U-SQL script, we can increase the variable @PartitionCount to increase parallelism when the scoring input is large.
DECLARE @PartitionCount int = 100;
// Assign each row a random partition key so scoring can run in parallel
// (@InputData, @ScoredData, and the PRODUCE schema are illustrative)
@ExtendedData =
    SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
    FROM @InputData;
@ScoredData =
    REDUCE @ExtendedData ON Par
    PRODUCE Par, accountID string, transactionID string, fraudProb double
    READONLY Par
    USING new Extension.R.Reducer(scriptFile : "FraudDetectionTrainingAdvanced-Scoring.R",
                                  rReturnType : "dataframe");
Another option is to adjust the number of AUs (analytics units) when submitting the scoring job to ADLA. Keep in mind, however, that increasing the number of AUs improves performance but also adds cost.
In this post, we demonstrated how to predict fraudulent transactions with a three-step process using the U-SQL R extension on Azure Data Lake Analytics. This work was inspired by the fraud detection template we had earlier built using SQL Server R Services here.
With Azure Data Lake Analytics, AI engineers and data scientists can easily enable their machine learning solutions on petabyte-scale infrastructure instantly, without having to worry about cluster provisioning, management, and so on, and their code can automatically be parallelized to the scale they need.
For additional information about U-SQL and the R extension for ADLA, please be sure to check the references below.
Yiwen, Esin & Shravan