Category Archives: Analytical CRM

Calculating Customer Lifetime Value with Recency, Frequency, and Monetary (RFM)


Introducing Customer Lifetime Value (CLV)

Customer Lifetime Value is “the present value of the future cash flows attributed to the customer during his/her entire relationship with the company.”1 There are various formulas, from simplified to advanced, for calculating CLV, but the following one is probably the most commonly used:

CLV = Σ (t = 1 to n) [ r^t × Pt / (1 + d)^t ]

where
t is a period, e.g. the first year (t=1), the second year (t=2)

n is the total number of periods the customer will stay before he/she finally churns

r is the retention rate/probability

Pt is the profit the customer will contribute in Period t

d is the discount rate

Here we assume that r is constant in the above formula; however, that is not always the case. The factors that influence r include demographics (age, geography, profession, etc.), behavior (Recency, Frequency, Monetary, etc.), tenure, and competition2. There are improved formulas that forecast r with approaches such as Logistic Regression.
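Under the constant-r assumption the formula is easy to evaluate; here is a small numeric sketch in R where all inputs (retention rate, profit, discount rate, number of periods) are made up for illustration:

```r
# Illustrative inputs: 3 periods, constant retention r, constant profit P per period
r <- 0.8   # retention probability per period (made up)
P <- 100   # profit per period in dollars (made up)
d <- 0.02  # discount rate (made up)
n <- 3

# CLV = sum over t of r^t * P / (1 + d)^t
t <- 1:n
clv <- sum(r^t * P / (1 + d)^t)
round(clv, 2)  # 188.19
```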

In this article, we will demonstrate how to calculate a customer’s CLV by predicting the retention/repurchase rate r of customers in each future purchasing cycle with a Logistic Regression model based on the predictors of Recency, Frequency, and Monetary.

We will use the CDNow full data set as a concrete case study to build the above model.

The CDNow data set can be downloaded here. The sample data contains 23570 unique customers who made their first-ever purchase at CDNOW in the first quarter of 1997, with a total of 69659 transaction records covering the period from the start of January 1997 to the end of June 1998.

For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie) or another blog on this website, RFM Customer Analysis with R Language.

Exploring the relationships between Repurchase Rate and Recency, Frequency, and Monetary

First we count the customers grouped by Recency value, then further group them into “Buy” and “No Buy” according to the data in the next purchasing cycle, and finally get the percentage of customers with a given Recency value who repurchase in the next period. Here we leverage the R function “ddply” to do the grouping and calculation. Below is the list of Recency values and percentages we calculated. Please note that the smaller the Recency value, the more recent the purchase.

Recency Buy Number Percentage

0              1   1180       0.45

1              1    581       0.28

2              1    279       0.22

3              1    198       0.17

4              1    163       0.14

5              1    249       0.05

6              1    316       0.03

7              1     13       0.03

The first row means that 45% of the customers who purchased CDs in the most recent period (Recency=0) purchased CDs again in the next period. We selected the transactions that took place from Jan 1st, 1997 through Feb 28th, 1998 for the calculation. The duration of the purchasing cycle is set to two months.
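The grouping step can be sketched with the ddply() function from the plyr package; the toy data frame below stands in for the real customer table, and the column names (Recency, Buy) and values are illustrative:

```r
library(plyr)

# Toy data: one row per customer, with the Recency value and a 0/1 flag
# for whether the customer bought again in the next period (made-up values)
df <- data.frame(Recency = c(0, 0, 0, 0, 1, 1),
                 Buy     = c(1, 1, 1, 0, 1, 0))

# For each Recency value, count the buyers and compute the repurchase percentage
rates <- ddply(df, .(Recency), summarise,
               Number     = sum(Buy),
               Percentage = round(mean(Buy), 2))
rates
#   Recency Number Percentage
# 1       0      3       0.75
# 2       1      1       0.50
```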

In the same way, we can get the percentage lists for Frequency and Monetary. The relationships between the repurchase rate and Recency, Frequency, and Monetary are plotted below.


The scatter plots above suggest an obvious linear or exponentially falling relationship between the repurchase percentage and Recency, and an obvious exponentially rising relationship between the repurchase percentage and Frequency. However, there is no obvious relationship between the repurchase percentage and Monetary.

Building the model

Based on the above observations, we use only Recency and Frequency as the predictors in this case and conduct the logistic regression in R to get the model.
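The model-fitting code was shown as an image in the original post; a minimal sketch of the idea with R’s glm(), using a made-up toy data frame in place of the real training table (the column name Buy and all values are illustrative):

```r
# Toy training data: 0/1 repurchase outcome with Recency and Frequency (made up)
train <- data.frame(Buy       = c(1, 0, 0, 0, 1, 0, 0, 1, 0, 1),
                    Recency   = c(0, 0, 1, 2, 0, 3, 4, 1, 5, 0),
                    Frequency = c(3, 2, 1, 1, 1, 1, 1, 2, 1, 1))

# Logistic regression of the repurchase outcome on Recency and Frequency
model <- glm(Buy ~ Recency + Frequency, data = train,
             family = binomial(link = "logit"))
summary(model)
```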


Given a customer’s status of Recency and Frequency, we can predict the probability of repurchasing with the above model.

> pred<-predict(model, data.frame(Recency=c(0), Frequency=c(1)), type="response")

> pred



As shown above, a customer, say Tom, who became a new customer in the most recent period (so Recency=0 and Frequency=1), has a 26% probability of purchasing again in the next period (Period 1).

Calculating CLV

Suppose Tom would stay for 3 more periods before churning, the average profit he would contribute per period is 100 dollars, and the discount rate is 0.02. How do we calculate Tom’s CLV?


As shown in the above figure, the rectangles in light blue are Tom’s possible Recency and Frequency statuses in each period.

In Period 0, his Recency is 0 and his Frequency is 1.

In Period 1, there is a 0.26 probability, which we calculated with the model above, that he buys again, and a 0.74 probability that he does not. In the first case, his status transits to Recency=0 and Frequency=2; in the second case, to Recency=1 and Frequency=1. The forecast profit Tom would contribute in Period 1 is 0.26 × 100 / (1+0.02) = 25.5 dollars.

In Period 2, Tom can transit to four possible statuses. Taking the leftmost status for illustration, we first get the transition probabilities from the model with the input status of R=0 and F=2 from Period 1.

> pred<-predict(model, data.frame(Recency=c(0), Frequency=c(2)), type="response")

> pred


0.2873471  (about 29%)

Then the probability that Tom purchases again in Period 2 after purchasing in Period 1 is 0.26 × 0.29 = 0.08. Similarly, the probability that Tom does not purchase in Period 1 but purchases in Period 2 is 0.14. The forecast profit Tom would contribute in Period 2 is (0.08 + 0.14) × 100 / (1+0.02)^2 = 21.1 dollars.

In Period 3, in the same way, the forecast profit is (0.03 + 0.04 + 0.04 + 0.08) × 100 / (1+0.02)^3 = 17.9 dollars.

Tom’s CLV is 64.5 dollars, obtained by summing the forecast profits of Periods 1, 2, and 3.
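The arithmetic above can be checked in a few lines of R (the probabilities are the ones derived in the text):

```r
# Per-period repurchase probabilities from the transition diagram above
p <- c(0.26,                       # Period 1
       0.08 + 0.14,                # Period 2: the two purchasing paths
       0.03 + 0.04 + 0.04 + 0.08)  # Period 3: the four purchasing paths
profit <- 100  # average profit per purchase
d <- 0.02      # discount rate

# Discounted expected profit per period, summed over the three periods
clv <- sum(p * profit / (1 + d)^(1:3))
round(clv, 1)  # 64.5
```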

R Source Code

The R source code we used in this article can be downloaded here.



2. Customer Lifetime Value (CLV) – A Methodology for Quantifying and Managing Future Cash Flows, David C. Ogden

3. Chapter 5, The Migration Model, Segmentation and Lifetime Value Models Using SAS, Edward C. Malthouse

Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


Increasing Repeat Purchase Rate by Analyzing Customer Latency


For online businesses, Repeat Purchase Rate is one of the critical metrics of business performance. A higher repeat purchase rate means more active members and thus leads to higher profit.

Customer Latency refers to “the average time between customer activity events, for example, making a purchase, calling the help desk, or visiting a web site”1, said Jim Novo.

In this article, we will demonstrate how to find the right trigger points for marketing campaigns by analyzing Customer Latency, and thus increase the Repeat Purchase Rate.

Exploring and Preparing the CDNOW Data Set

We will use the CDNOW Sample data during this demonstration; you can download the data here.

The sample data contains 2357 unique customers who made their first-ever purchase at CDNOW in the first quarter of 1997, with a total of 6919 transaction records covering the period from the start of January 1997 to the end of June 1998.

For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie).

We will keep the columns “ID”, “Date”, and “Amount” from the original data set and prepare several additional columns, “Interval”, “Times”, and “TotalTimes”, so that we can manipulate the data set more conveniently, where,

“ID” is the customer ID;

“Date” is the transaction date;

“Amount” is the money amount paid by a customer per transaction;

“Interval” is the Customer Latency, the number of days between a customer’s consecutive transactions, say 10 days between the first purchase and the second, 15 days between the second and the third, and so on;

“Times” is 1 to n; 1 means the customer’s first purchase, 2 means the second, and so on;

“TotalTimes” is the total number of transactions for a customer.

The prepared data set looks like the following.

> head(df)

ID       Date           Amount Interval Times TotalTimes

1    4       1997-01-01  29.33        0           1          4

2    4      1997-01-18  29.73       17          2          4

3    4      1997-08-02  14.96      196         3          4

4    4      1997-12-12  26.48      132         4          4

158 18  1997-01-04  14.96        0           1          1

5   21    1997-01-01  63.34        0           1          2

Calculating the Repeat Purchase Rate and Percentages

First of all, let’s examine the average number of purchases per customer and the average amount spent per transaction during that period.

> 6919/2357

[1] 2.935511

> sum(df$Amount)/6919

[1] 35.2785

The average number of purchases per customer is about 3. Obviously that is not a high rate for an online CD store. Let’s further study the distribution of the customers’ total purchase counts.

# get the table of customer ID ~ the customer’s total number of transactions

> TimesByID <-$ID))

# get the table of total number of transactions ~ number of customers who have that total

> GroupByTimes <-$Freq))

> head(GroupByTimes,12)

Times   Customers

1      1      1205

2      2       406

3      3       208

4      4       150

5      5        98

6      6        56

7      7        65

8      8        35

9      9        23

10    10        21

11    11         8

12    12        10

> plot(GroupByTimes, xlab="Total Number of Purchases", ylab="Number of Customers", pch=16, col="blue", type="o")

> text(2,1220,"1205")

> text(3,425,"406")

> text(4,220,"208")

> text(5,170,"150")

> text(6,120,"98")

> text(12,50,"10")

> text(30,50,"1")

Figure – 1: Customers by Total Number of Purchases

As we can see from Figure – 1 above, the number of customers decreases very quickly as the total number of purchases increases from 1 to 6. Almost half of the customers made only one purchase during the 1.5-year period!

Let’s examine the percentage of customers making (x) purchases more closely.

> percentages<-round(GroupByTimes$Customers / 2357 , 3)

> percentages

[1] 0.511 0.172 0.088 0.064 0.042 0.024 0.028 0.015 0.010 0.009

[11] 0.003 0.004 0.006 0.003 0.003 0.003 0.004 0.001 0.001 0.000

[21] 0.001 0.002 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000

[31] 0.000 0.000 0.001 0.000 0.000

> x<-barplot(percentages[1:10]*100, col="blue", main="Percentage of Customers Making (x) Purchases", xlab="Number of Purchases", ylab="Repeat Purchase Rate (%)", ylim=range(0:55), axisnames=TRUE, names.arg=GroupByTimes$Times[1:10], cex.names=TRUE)

> text(x, percentages[1:10]*100+2, paste(percentages[1:10]*100, "%"))

Figure – 2: Percentage of Customers Making (x) Purchases

As shown in Figure – 2, which displays the percentages of customers who made 1 to 10 purchases respectively, 51.1% of customers made only one purchase, 17.2% made two purchases, 8.8% made three purchases, and so on.

Based on the above data, to increase the average repeat purchase rate, CDNOW should try to increase the percentage of customers who make more than one purchase, especially those who make two or three purchases, because the percentage drops very quickly from one purchase to two, and from two to three. In the following parts we will leverage the Customer Latency concept to find ways to increase the repeat purchase rate.

Calculating the Customer Latency and Increasing the Repeat Purchase Rate

Here Customer Latency refers to the average time between customers’ purchases, for example the average number of days between customers’ first and second purchases, between the second and third purchases, and so on.

Let’s calculate the Customer Latency between the 1st and 2nd purchases first, since increasing the 2nd purchase rate is important for increasing the overall repeat purchase rate.

> # keep the customers who made more than one purchase and their intervals between the 1st and the 2nd purchase

> df2<-df[df$TotalTimes>=2 & df$Times==2,]

> # see how many 2nd transactions there are

> nrow(df2)

[1] 1152

> # get the mean days of customer latency

> mean(df2$Interval)

[1] 105.6276

There are 1152 second purchases in total and the average customer latency is about 106 days.

Let’s take a further look at the distribution of the Customer Latency.

> hist(df2$Interval, main="Distribution of Customer Latency (1st – 2nd purchase)", xlab="Days", ylab="Number of 2nd Transactions")

Figure – 3: Distribution of Customer Latency (1st – 2nd purchase)

As shown in Figure – 3, more than half of the second purchases happened within 50 days after the first purchase, and the distribution declines from left to right.

A customer with a longer Latency than the norm’s average means something has happened. It might be that the customer was unhappy with the product or service, or it might be for his own reasons. Either way, the Latency data is speaking to us: “it is a rising of the hand by the customer, and the Data-Driven marketer or service provider not only sees the raised hand, but also reacts to it”3, as Jim Novo mentioned in his book.

So, based on the above analysis, CDNOW should do something to increase the second purchase rate when customers’ Latency in the database exceeds 50 days, and again at 100 days. It could be an email to the customers with a coupon, a discount, or something else to attract them back to CDNOW. Otherwise, the longer the Latency, the more likely the customers are to defect.

In the same way, we can also calculate the average Latency between the second and the third purchases in order to increase the third purchase rate.
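A minimal sketch of that calculation, following the same filtering pattern as above (the toy rows below stand in for the prepared data set and their values are made up):

```r
# Toy rows standing in for the prepared data set (ID, Times, TotalTimes, Interval)
df <- data.frame(ID         = c(4, 4, 4, 21, 21, 21),
                 Times      = c(1, 2, 3, 1, 2, 3),
                 TotalTimes = c(3, 3, 3, 3, 3, 3),
                 Interval   = c(0, 17, 196, 0, 12, 40))

# keep the 3rd transactions of customers who made at least three purchases
df3 <- df[df$TotalTimes >= 3 & df$Times == 3, ]
mean(df3$Interval)  # average latency between the 2nd and 3rd purchases: 118
```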

Thus the overall average repeat purchase rate will likely be increased.

R Source Codes

You can download the complete R source codes here.


  3. Drilling Down – Turning Customer Data into Profits with a Spreadsheet, Jim Novo, 2004

Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


Identifying Potential Customers with Classification Techniques in R Language


Data mining techniques and algorithms such as Decision Tree, Naïve Bayes, Support Vector Machine, Random Forest, and Logistic Regression are “most commonly used for predicting a specific outcome such as response / no-response, high / medium / low-value customer, likely to buy / not buy.”1

In this article, we will demonstrate how to use R to build classification models to identify the potential customers who are likely to buy an insurance product. We will build models with Decision Tree, Random Forest, Naïve Bayes, and SVM respectively and then compare the models to find the best one. The scenario and data are based on the public tutorial “Using Oracle Data Miner 11g Release 2”2.

Data Observation and Preparation

The dataset is exported from the database mentioned in the above tutorial. Let’s take a look at the variables in the dataset.

> dataset <- read.csv(file.choose(),header=T)

> dim(dataset)

[1] 1015   31

> names(dataset)

 [1] “BUY_INSURANCE”      “CUSTOMER_ID”      “LAST”         “FIRST”                 

 [5] “STATE”            “REGION”         “SEX”          “PROFESSION”             

 [9]”AGE”              “HAS_CHILDREN”        “SALARY”       “N_OF_DEPENDENTS”       



[21] “T_AMOUNT_AUTOM_PAYMENTS”            “MONTHLY_CHECKS_WRITTEN”  “MORTGAGE_AMOUNT”                             “N_TRANS_ATM”           

[25] “N_MORTGAGES”             “N_TRANS_TELLER”          “CREDIT_CARD_LIMITS”      “N_TRANS_KIOSK”         

[29] “N_TRANS_WEB_BANK”          “LTV”                     “LTV_BIN”               

There are 1015 cases and 31 variables in the dataset. The variable of “BUY_INSURANCE” is the dependent variable. Other variables are customers’ basic information, geographical information, and bank account information. The data types for the variables should be “factor” and “numeric” in R. 

> table(dataset$BUY_INSURANCE)

No Yes

742 273

In the 1015 cases, 273 people bought the insurance product in the past.

Checking Missing Value

> sum(complete.cases(dataset))

[1] 1015

There is no missing value in the dataset for us to deal with.

Removing Unnecessary Variables

The variables “CUSTOMER_ID”, “LAST”, and “FIRST” don’t help with the data mining. We can remove them.

> dataset <-subset(dataset,select = -c(CUSTOMER_ID, LAST, FIRST))

Some of the algorithms have a limitation on the number of categorical levels. If a variable has too many levels, we need to combine the lower levels into higher ones to reduce the total number of levels, or we can simply remove the variable if doing so doesn’t influence the data mining result. Let’s check how many levels the following categorical variables have.

> dim(table(dataset$PROFESSION))

[1] 95

> dim(table(dataset$STATE))

[1] 22

> dim(table(dataset$REGION))

[1] 5

The number of PROFESSION levels exceeds the limit, and although there are 50 states, only 22 appear in the dataset. For simplification, we remove these two variables here.

> dataset <-subset(dataset, select = -c(PROFESSION, STATE))

Since the LTV variable has already been binned into LTV_BIN in the dataset, we remove LTV as well.

> dataset <-subset(dataset, select = -c(LTV))

Converting the Data Types

> dataset$REGION <-as.factor(dataset$REGION)

> dataset$SEX <-as.factor(dataset$SEX)

> dataset$CAR_OWNERSHIP <-as.factor(dataset$CAR_OWNERSHIP)

> dataset$HOUSE_OWNERSHIP <-as.factor(dataset$HOUSE_OWNERSHIP)

> dataset$MARITAL_STATUS <-as.factor(dataset$MARITAL_STATUS)

> dataset$HAS_CHILDREN <-as.factor(dataset$HAS_CHILDREN)

> dataset$LTV_BIN <-as.ordered(dataset$LTV_BIN)


> dataset$AGE <-as.numeric(dataset$AGE)

> dataset$SALARY <-as.numeric(dataset$SALARY)

> dataset$N_OF_DEPENDENTS <-as.numeric(dataset$N_OF_DEPENDENTS)

> dataset$TIME_AS_CUSTOMER <-as.numeric(dataset$TIME_AS_CUSTOMER)

> dataset$CREDIT_BALANCE <-as.numeric(dataset$CREDIT_BALANCE)

> dataset$BANK_FUNDS <-as.numeric(dataset$BANK_FUNDS)

> dataset$CHECKING_AMOUNT <-as.numeric(dataset$CHECKING_AMOUNT)




> dataset$MORTGAGE_AMOUNT <-as.numeric(dataset$MORTGAGE_AMOUNT)

> dataset$N_TRANS_ATM <-as.numeric(dataset$N_TRANS_ATM)

> dataset$N_MORTGAGES <-as.numeric(dataset$N_MORTGAGES)

> dataset$N_TRANS_TELLER <-as.numeric(dataset$N_TRANS_TELLER)

> dataset$CREDIT_CARD_LIMITS <-as.numeric(dataset$CREDIT_CARD_LIMITS)

> dataset$N_TRANS_KIOSK <-as.numeric(dataset$N_TRANS_KIOSK)

> dataset$N_TRANS_WEB_BANK <-as.numeric(dataset$N_TRANS_WEB_BANK)

Checking the Correlations between Numeric Variables

We could use the function “pairs20x()” to check the correlations visually. Because there are more than 20 numeric variables, the output figure is too large to display here, so we only use the function “cor()” to get the correlations.

> cor(dataset$TIME_AS_CUSTOMER, dataset$N_OF_DEPENDENTS)

[1] 0.7667451


[1] 0.8274963

> cor(dataset$N_TRANS_WEB_BANK, dataset$MORTGAGE_AMOUNT)

[1] 0.7679546

We found that the above three pairs of variables are highly correlated. Ideally we should try removing one variable of each pair in turn during model building to see if the models’ performance improves. However, for simplification, we don’t deal with the correlated variables in this article.

Breaking Data into Training and Test Sample

> # breaking the data set into training and test samples by half

> d = sort(sample(nrow(dataset), nrow(dataset)*.5))

> train<-dataset[d,]

> test<-dataset[-d,]

Building the Models

In this part, we will use the Decision Tree, Random Forest, Naive Bayes, and SVM classifiers in R to build models respectively. For simplification, we will not conduct k-fold cross-validation during modeling for the classifiers that have no embedded cross-validation.

Decision Tree

Decision Tree is one of the most commonly used classifiers. It can handle both numerical and categorical variables, and it is insensitive to data errors or even missing data. Most importantly, it produces human-readable rules.

> library("rpart")

> model.tree <- rpart(BUY_INSURANCE~., data=train, method="class")

> plot(model.tree, uniform=TRUE, margin=0.1)

> text(model.tree, use.n=T, cex=0.8)

We can view the output tree structure in the Figure 1.

Figure 1: Decision Tree before Pruning

The tree is a bit complex, so we will prune it. First, we need to find the right complexity parameter (cp) value, and hence the number of splits (or size) of the tree, for pruning. The right cp is the threshold point where the increased cost of further splitting outweighs the reduction in lack of fit.

># plot cp

> plotcp(model.tree)

Figure 2: Complexity Parameter (cp) Plot

“A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line.” In Figure 2 above, we can see that the optimal cp value is 0.042.

># Prune the tree with the optimal cp value

> pTree<- prune(model.tree, 0.042)

# draw the pruned tree



Figure 3: Pruned Decision Tree

As shown in Figure 3, it is easy to describe the rules to decide if a customer is more likely to buy.

Take the rightmost leaf for example; the rules can be described as follows.

IF BANK_FUNDS >= 320.5

         IF CHECKING_AMOUNT < 162

                  IF MONEY_MONTHLY_WITHDRAWN >=53.68

                          IF CREDIT_BALANCE < 3850

                                   THEN YES for BUY

The numbers below the leaf show that, under the above conditions in the training dataset, 4 customers didn’t buy and 50 customers bought.

After we build the model, we can use it to predict for the customers in the test dataset.

> pred.tree <- predict(pTree,test[,-1])

The prediction result can be No or Yes for each customer, or it can provide the probabilities of No and Yes for each customer as follows.

> head(pred.tree)

         No               Yes

1 0.8380952         0.16190476

2 0.9823009         0.01769912

3 0.8380952         0.16190476

7 0.9823009         0.01769912

8 0.9823009         0.01769912

9 0.9823009         0.01769912

Random Forest

Random Forest is based on Decision Tree. It can handle large datasets and thousands of input variables without variable deletion. In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It also gives estimates of which variables are important in the classification.3

> library(randomForest)

> model.randomForest <- randomForest(BUY_INSURANCE ~ ., data = train, importance=TRUE, proximity=TRUE)

> # look at variable importance

> round(importance(model.randomForest),2)

                          No   Yes  MeanDecreaseAccuracy  MeanDecreaseGini

REGION                  0.18  0.74                  0.34              4.68

SEX                     0.01  0.09                  0.04              0.96

AGE                     0.40  0.85                  0.53             11.10

HAS_CHILDREN           -0.12  0.21                 -0.03              1.12

SALARY                 -0.06  0.01                 -0.04              8.98

N_OF_DEPENDENTS        -0.04  1.08                  0.39              4.01

CAR_OWNERSHIP          -0.06 -0.15                 -0.08              0.44

HOUSE_OWNERSHIP         0.08  0.38                  0.19              1.11

TIME_AS_CUSTOMER        0.09  0.32                  0.17              2.90

MARITAL_STATUS          0.22  0.83                  0.45              3.62

CREDIT_BALANCE          0.76  0.62                  0.67              4.26

BANK_FUNDS              1.02  2.69                  1.27             27.15

CHECKING_AMOUNT         1.26  1.78                  1.18             15.12

MONEY_MONTLY_OVERDRAWN  0.93  2.31                  1.16             24.09

T_AMOUNT_AUTOM_PAYMENTS 0.85  1.58                  0.99             15.97

MONTHLY_CHECKS_WRITTEN  0.27  1.35                  0.64             10.43

MORTGAGE_AMOUNT         0.10  1.45                  0.60              8.26

N_TRANS_ATM             0.66  1.95                  0.96             13.97

N_MORTGAGES            -0.01  0.29                  0.10              1.58

N_TRANS_TELLER          0.66  1.52                  0.82              7.62

CREDIT_CARD_LIMITS      0.00  0.82                  0.25              6.35

N_TRANS_KIOSK          -0.01 -0.51                 -0.17              4.03

N_TRANS_WEB_BANK        0.29  1.20                  0.60              9.27

LTV_BIN                -0.01  0.15                  0.04              2.47

Higher values in the above table indicate variables that are more important to the classification. We can see that BANK_FUNDS, CHECKING_AMOUNT, and MONEY_MONTLY_OVERDRAWN are the most helpful to the classification.

Let’s use the model to predict the cases in the test data set.

> pred.randomForest <- predict(model.randomForest, test[,-1], type="prob")

> head(pred.randomForest)

     No      Yes

1 0.764    0.236

2 0.990    0.010

3 0.600    0.400

7 0.864    0.136

8 0.996    0.004

9 0.996    0.004

Naïve Bayes

“A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. Despite the fact that the far-reaching independence assumptions are often inaccurate, it has several properties that make it surprisingly useful in practice”4. Naive Bayes can deal with both categorical and numeric data. Since the sample size of the training data set is not very large, for simplification we will not discretize the continuous values of the variables by binning.

> library(e1071)

> model.naiveBayes <- naiveBayes(BUY_INSURANCE ~ ., data = train, laplace = 3)

> pred.naiveBayes <- predict(model.naiveBayes, test[,-1], type="raw")

> head(pred.naiveBayes)

     No                    Yes

[1,] 1.0000000       5.244713e-18

[2,] 0.9953059       4.694106e-03

[3,] 0.5579982       4.420018e-01

[4,] 0.2221896       7.778104e-01

[5,] 0.9857277       1.427231e-02

[6,] 0.9923343       7.665676e-03


SVM

“A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”5 We will use the svm() function in the e1071 package to build the classification models. The kernel and cost parameters are important for the svm() function to yield sensible results. We will try linear and radial kernel functions respectively in the following.

> # build the models with two different kernel functions respectively

> model.svm.linear <- svm(BUY_INSURANCE ~ ., data = train, kernel="linear", probability = TRUE)

> model.svm.radial <- svm(BUY_INSURANCE ~ ., data = train, kernel="radial", probability = TRUE)

> # prediction with the two models respectively

> pred.svm.linear <- predict(model.svm.linear, test[,-1], probability=TRUE)

> attr(pred.svm.linear, "probabilities")[1:6,]

         No                 Yes

1 0.9816020           0.01839796

2 0.9391826           0.06081737

3 0.5237238           0.47627615

7 0.9310071           0.06899288

8 0.9531510           0.04684897

9 0.9444462           0.05555381

> pred.svm.radial <- predict(model.svm.radial, test[,-1], probability=TRUE)

> attr(pred.svm.radial, "probabilities")[1:6,]

         No                Yes

1 0.8849981           0.11500191

2 0.9664234           0.03357663

3 0.5672350           0.43276502

7 0.9591768           0.04082316

8 0.9624121           0.03758789

9 0.9862672           0.01373277

Comparing the Models

To compare the models generated above, we will plot the ROC curves and calculate the area under the ROC curve (AUC for short) with the ROCR package.

> # load the ROCR package for the ROC and AUC calculations

> library(ROCR)

> # prepare the legend strings for the ROC figure

> c.legend<-c("decision tree, auc=","random forest, auc=","naive Bayes, auc=","svm.linear, auc=","svm.radial, auc=")

> # ROC for Decision Tree

> pred <- prediction(pred.tree[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, col="red", lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[1]<-paste(c.legend[1], round((performance(pred,'auc')@y.values)[[1]],3))

> # ROC for Random Forest

> pred <- prediction(pred.randomForest[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add=TRUE, col="green", lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[2]<-paste(c.legend[2], round((performance(pred,'auc')@y.values)[[1]],3))

> # ROC for Naive Bayes

> pred <- prediction(pred.naiveBayes[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add=TRUE, col="blue", lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[3]<-paste(c.legend[3], round((performance(pred,'auc')@y.values)[[1]],3))

> # ROC for SVM with linear kernel

> pred <- prediction(attr(pred.svm.linear, "probabilities")[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add=TRUE, col="purple", lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[4]<-paste(c.legend[4], round((performance(pred,'auc')@y.values)[[1]],3))

> # ROC for SVM with radial kernel

> pred <- prediction(attr(pred.svm.radial, "probabilities")[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf, add=TRUE, col="black", lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[5]<-paste(c.legend[5], round((performance(pred,'auc')@y.values)[[1]],3))

> # draw the legend

> legend(0.5, 0.6, c.legend, lty=c(1,1,1,1,1), lwd=c(2,2,2,2,2), col=c("red","green","blue","purple","black"))


Figure 4: ROC Curves of the Models

As shown in Figure 4, the model built by Random Forest (green line) has the best performance, with an AUC of 0.921 in this case. We can use this model in actual work to predict Buy or Not Buy for new customers who are not in the existing data set.
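To act on the chosen model, the predicted probabilities must be turned into class labels with a cutoff; the sketch below uses an illustrative 0.5 cutoff and made-up stand-ins for the prediction matrix and the actual test labels:

```r
# Toy stand-ins for the predicted probability matrix and the actual outcomes
pred.randomForest <- matrix(c(0.764, 0.236,
                              0.990, 0.010,
                              0.600, 0.400,
                              0.136, 0.864),
                            ncol = 2, byrow = TRUE,
                            dimnames = list(NULL, c("No", "Yes")))
actual <- c("No", "No", "Yes", "Yes")

# Label a customer "Yes" when the predicted probability of buying exceeds 0.5,
# then cross-tabulate the predictions against the actual outcomes
labels <- ifelse(pred.randomForest[, "Yes"] > 0.5, "Yes", "No")
table(Predicted = labels, Actual = actual)
```

In practice the cutoff can be tuned on the ROC curve to trade off true and false positive rates instead of defaulting to 0.5.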


In this article, we built and compared models generated by the Decision Tree, Random Forest, Naïve Bayes, and SVM algorithms implemented in R packages. The performance of Random Forest exceeded the others in this insurance buying use case.




Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


RFM Customer Analysis with R Language


For database marketing or direct marketing people, two questions always arise before they send out mails or make calls to their customers:

  1. How can they segment the customers in the database to find out who is more likely to respond to their mails or buy their products?
  2. Which types of customers should they send the mails to so that they can reach breakeven and make a profit?

The RFM method is a very simple but effective customer analysis approach to address the above questions, where R, F, and M stand for1

         Recency – How recently did the customer purchase?

         Frequency – How often do they purchase?

         Monetary Value – How much do they spend (each time on average)?

         (The above definitions are from Wikipedia.)

Besides Wikipedia, you can get more details on RFM from the following paper and case study:

Segmentation and RFM Basics (Bill Ruppert)

Case Study: La Quinta narrows target to boost guest loyalty

There seems to be no out-of-the-box package for RFM analysis in the R language area, while other data analysis tools such as IBM SPSS have easy-to-use RFM functionality. So in this article, we will demonstrate how to implement RFM analysis in R.

Explore Data

We will use the CDNOW Sample data for this demonstration; you can download the data here. For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie).

Let’s explore the data.

> # read CDNOW_SAMPLE.txt

> df <- read.table(file.choose(), header = FALSE)

> # construct a data frame with the necessary columns of customer ID, transaction date, and money amount paid by a customer per transaction

> df <- as.data.frame(cbind(df[,1], df[,3], df[,5]))

> # add appropriate column names for the above three columns

> names <- c("ID", "Date", "Amount")

> names(df) <- names

> # transform the text column type to Date type

> df[,2] <- as.Date(as.character(df[,2]), "%Y%m%d")

> head(df)

   ID  Date        Amount
1   4  1997-01-01   29.33
2   4  1997-01-18   29.73
3   4  1997-08-02   14.96
4   4  1997-12-12   26.48
5  21  1997-01-01   63.34
6  21  1997-01-13   11.77

> dim(df)

[1] 6919    3

> # remove the rows with the duplicated IDs to see how many customers in total

> uid <- df[!duplicated(df[,”ID”]),]

> dim(uid)

[1] 2357    3

As shown in the output above, the data frame now contains the three columns of customer ID, transaction date, and money amount paid by a customer per transaction. There are 6919 transaction records in the sample dataset and 2357 unique customers in total.

Segment the Customers into RFM Cells

Since the data is ready now, we can segment the customers into RFM cells along the three dimensions of R, F, and M. Usually we rate the customers from 1 to 5 points in each dimension; the higher the score, the better.

According to experience, Recency is the most important factor (weight ×100) for a high response rate, Frequency is the second (×10), and Monetary is the last (×1). Customers segmented into the same RFM cell get the same score. A score of "542" means that the customer gets 5 points in Recency, 4 points in Frequency, and 2 points in Monetary.

There are usually two ways to segment the customers in each dimension, one is Nested, the other is Independent.

The nested method divides the customers into aliquots in the Recency dimension first; then, within each Recency aliquot, it divides the customers into aliquots in the Frequency dimension, and likewise for Monetary. The advantage of the nested method is that the quantity of customers in each RFM cell is roughly the same across cells. The disadvantage is that Frequency and Monetary scores with the same point value may mean different things: e.g. for one customer with a "543" RFM score and another with a "443" score, the actual purchase frequencies behind their common Frequency score of "4" may be quite different, because the Frequency aliquots are computed separately within each Recency group.

The independent method rates the customers into aliquots in the R, F, and M dimensions independently. The advantage is that Frequency and Monetary scores with the same point value mean the same thing for every customer; the disadvantage is that the quantities of customers in the RFM cells might be quite different.

You can read the discussion in Quora about the pros and cons of the above two methods.

In this article, we will implement the independent method in R language first, and then we will implement another method that allows users to input the parameters of the breaks for each dimension according to their own business requirement.

R Source Codes

The complete R source codes can be downloaded from RFM_Analysis_R_Source_Codes. Please read the descriptions in the source codes before you continue.

Calculate the Recency, Frequency, and Monetary

To implement the RFM analysis, we need to further process the data set in the CDNOW Sample by the following steps:-

  1. Remove the duplicate records with the same customer ID
  2. Find the most recent purchase date for each ID and calculate the number of days from it to now (or some other reference date), to get the Recency data
  3. Count the number of transactions of each customer, to get the Frequency data
  4. Sum the amount of money a customer spent and divide it by the Frequency, to get the average amount per transaction, that is the Monetary data.

We have already implemented the above procedure in the function "getDataFrame" in the source codes.
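As a rough illustration (the actual implementation is in the downloadable source codes and differs in details, e.g. it also keeps the amount and date of the last transaction), a simplified getDataFrame could be sketched in base R like this, assuming the data frame built above with columns ID, Date, and Amount:

```r
# Simplified sketch of the getDataFrame step (not the packaged function):
# compute Recency, Frequency, and Monetary per customer.
getDataFrame <- function(df, startDate, endDate) {
  # keep only the transactions inside the analysis window
  df <- df[df$Date >= startDate & df$Date <= endDate, ]
  # Recency: days from each customer's most recent purchase to endDate
  recency <- aggregate(Date ~ ID, data = df, FUN = max)
  recency$Recency <- as.numeric(endDate - recency$Date)
  # Frequency: number of transactions per customer
  frequency <- aggregate(Amount ~ ID, data = df, FUN = length)
  names(frequency)[2] <- "Frequency"
  # Monetary: average amount spent per transaction
  monetary <- aggregate(Amount ~ ID, data = df, FUN = mean)
  names(monetary)[2] <- "Monetary"
  merge(merge(recency, frequency, by = "ID"), monetary, by = "ID")
}
```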

> # set the startDate and endDate; we will only analyze the records in this date range

> startDate <- as.Date("19970101", "%Y%m%d")

> endDate <- as.Date("19980701", "%Y%m%d")

> df <- getDataFrame(df, startDate, endDate)

> head(df)

ID     Date        Amount  Recency  Frequency  Monetary
4      1997-12-12   26.48      201          4    25.125
18     1997-01-04   14.96      543          1    14.960
21     1997-01-13   11.77      534          2    37.555
50     1997-01-01    6.79      546          1     6.790
60     1997-02-01   21.75      515          1    21.750
71     1997-01-01   13.97      546          1    13.970

After we calculate the Recency, Frequency, and Monetary, we can implement the RFM scoring.

Independent RFM Scoring

We have implemented the independent scoring in the function "getIndependentScore". Below is the description of the function. For details, please read the source codes.

# Function

#     getIndependentScore(df,r=5,f=5,m=5)

# Description

#       Scoring the Recency, Frequency, and Monetary in r, f, and m in aliquots independently

# Arguments

#       df – A data frame returned by the function of getDataFrame

#       r –  The highest point of Recency, it is 5 by default

#       f –  The highest point of Frequency, it is 5 by default

#       m –  The highest point of Monetary, it is 5 by default

# Return Value

#       Returns a new data frame with four new columns of "R_Score", "F_Score", "M_Score", and "Total_Score".
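For illustration only, independent scoring can be sketched with rank-based aliquots. The helper scoreDim below is a hypothetical simplification, not the packaged function, which may handle ties and edge cases differently:

```r
# Hypothetical helper: rate one dimension into `points` aliquots of
# roughly equal size, based on ranks.
scoreDim <- function(x, points = 5, higherIsBetter = TRUE) {
  q <- cut(rank(x, ties.method = "first"), breaks = points, labels = FALSE)
  if (higherIsBetter) q else points + 1 - q
}

# Combining the three dimensions into the weighted total score:
#   df$R_Score <- scoreDim(df$Recency, higherIsBetter = FALSE)  # fewer days = better
#   df$F_Score <- scoreDim(df$Frequency)
#   df$M_Score <- scoreDim(df$Monetary)
#   df$Total_Score <- 100 * df$R_Score + 10 * df$F_Score + df$M_Score
```

Note that Recency is reversed: a customer who bought recently has a small Recency value but should get a high R score.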

Let’s execute the function and take a look at the output.

> df1 <- getIndependentScore(df)

> head(df1[-(2:3)])

ID      Recency  Frequency  Monetary  R_Score  F_Score  M_Score  Total_Score
11462        52          4  191.6425        5        5        5          555
8736         55          9  148.3944        5        5        5          555
7856         54          5  130.0980        5        5        5          555
15105        58          9  129.5256        5        5        5          555
22356       106          8  127.3650        5        5        5          555
8481         57         13  117.3492        5        5        5          555

Let’s explore the customer distribution in each RFM cell to see if the customers are distributed evenly.

> # draw the histograms in the R, F, and M dimensions so that we can see the distribution of customers in each RFM cell

> drawHistograms(df1)


The distribution is not very even because the Frequency dimension doesn't have the scores of 1 and 2. Let's further find out how many customers have a total score larger than 500 or 400.

> S500<-df1[df1$Total_Score>500,]

> dim(S500)

[1] 471  10

> S400<-df1[df1$Total_Score>400,]

> dim(S400)

[1] 945  10

We can consider those customers the more important ones for getting a higher response rate.

RFM Scoring with Breaks

Sometimes users want to determine the breaks for each dimension themselves according to their own business requirements. For example, a user can set 0-30 days, 30-90 days, 90-180 days, 180-360 days, and more than 360 days as the 5 breaks for Recency, rather than just letting the computer segment the customers into 5 aliquots without considering the specific business requirement.

We have implemented this scoring method in the function "getScoreWithBreaks". Below is the description of the function. For details, please read the source codes.

# Function

#     getScoreWithBreaks(df,r,f,m)

# Description

#       Scoring the Recency, Frequency, and Monetary with r, f, and m, which are vectors containing a series of breaks

# Arguments

#       df – A data frame returned by the function of getDataFrame

#       r –  A vector of Recency breaks

#       f –  A vector of Frequency breaks

#       m –  A vector of Monetary breaks

# Return Value

#       Returns a new data frame with four new columns of "R_Score", "F_Score", "M_Score", and "Total_Score".
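The core of such a function can be sketched with base R's findInterval. The helper below is a hypothetical simplification; the real getScoreWithBreaks in the source codes may treat boundary values differently:

```r
# Hypothetical helper: map values into score bands defined by user breaks.
# With breaks c(b1, b2, b3, b4), values fall into 5 bands scored 1..5.
scoreWithBreaks <- function(x, breaks, higherIsBetter = TRUE) {
  band <- findInterval(x, breaks) + 1          # 1 .. length(breaks) + 1
  if (higherIsBetter) band else (length(breaks) + 2) - band
}
```

For example, with the Frequency breaks c(2,5,8,10) used below, a customer with 6 purchases lands in the third band and gets an F score of 3; for Recency the scale is reversed, since fewer days are better.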

Before we execute the function, we can take a look at the distributions of Recency, Frequency, and Monetary.

> par(mfrow = c(1,3))

> hist(df$Recency)

> hist(df$Frequency)

> hist(df$Monetary)


We can find that the distributions are not even. Most of the customers have a Recency of more than 450 days, a Frequency of less than 5 times, and a Monetary of less than 50 dollars. That’s the reason why the customers are not distributed evenly in the RFM cells when using the Independent RFM scoring method in the previous part.

Based on the above observation and relevant business sense, we can construct the breaks for each RFM dimension as follows, so that we can get segmented RFM cells more evenly and reasonably.

> # set the Recency ranges as 0-120 days, 120-240 days, 240-450 days, 450-500 days, and more than 500 days

> r <- c(120,240,450,500)

> # set the Frequency ranges as 0-2 times, 2-5 times, 5-8 times, 8-10 times, and more than 10 times

> f <- c(2,5,8,10)

> # set the Monetary ranges as 0-10 dollars, 10-20 dollars, and so on

> m <- c(10,20,30,100)

Then we can execute the function "getScoreWithBreaks" and see the customer distribution in the RFM cells.

> df2<-getScoreWithBreaks(df,r,f,m)

> drawHistograms(df2)


We can also calculate how many customers have a total score of more than 500 or 400.

> S500<-df2[df2$Total_Score>500,]

> dim(S500)

[1] 399  10


> S400<-df2[df2$Total_Score>400,]

> dim(S400)

[1] 641  10

There are 641 customers with an RFM score of more than 400.

Estimate response rate for each RFM cell4

After we segment the customers into RFM cells, we can assign a response rate to each RFM cell according to historical response data. If it is the first time using RFM analysis and there is no historical data, we can randomly select some customers, say 10 percent, from each RFM cell, send mails to the selected customers as a trial, and count the response rate for each cell. Below is an example of the response rate table.

Response Rate Table

RFM cell (Total Score)   Response Rate
555                      8.5%
441                      5.6%
435                      4.8%
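The trial-sampling step described above (randomly picking about 10% of the customers from each cell when there is no historical response data) can be sketched as follows; sampleTrial is a hypothetical helper, not part of the downloadable source codes:

```r
# Hypothetical helper: draw a random trial sample of `rate` (e.g. 10%)
# of the customers from each RFM cell (grouped by Total_Score).
sampleTrial <- function(df, rate = 0.1) {
  picked <- lapply(split(df, df$Total_Score), function(cell) {
    n <- max(1, round(rate * nrow(cell)))  # at least one customer per cell
    cell[sample(nrow(cell), n), ]
  })
  do.call(rbind, picked)
}
```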

Calculate the breakeven point to select valuable customers4

To break even,

P - C/R = 0

where


P is the price or revenue per deal or per response

C is the cost per mail sent out, including production cost, mailing cost etc.

R is the response rate.

Suppose P = 100 dollars and C = 5 dollars; to break even:-

R = C/P = 5/100 = 5%

That means we should choose the customers in the RFM cells that have a response rate equal to or more than 5% to send the direct marketing mails to, in order to make money.
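The breakeven arithmetic above can be checked directly in R:

```r
P <- 100          # price / revenue per response, dollars
C <- 5            # cost per mail sent out, dollars
breakeven <- C / P
breakeven         # 0.05, i.e. a 5% response rate is needed to break even
```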

Select the Target Customers

According to the Response Rate Table and the breakeven point, we need to select the customers in the cells with a response rate of more than 5%; that means we need to select the customers with a total score equal to or greater than 441.

> target <- df2[df2$Total_Score>=441,]

> dim(target)

[1] 416  10

There are 416 customers to whom we will send the direct marketing mails.


We introduced the basic concepts of RFM customer analysis and implemented the analysis procedures in the R language, providing complete R source code.


  4. Recency, Frequency and Monetary (RFM) Analysis (Charlotte Mason)
  5. Case Study: La Quinta narrows target to boost guest loyalty

Author: Jack Han. All rights reserved. Reposts must cite the original source and author with a hyperlink.
