Calculating Customer Lifetime Value with Recency, Frequency, and Monetary (RFM)


Introducing Customer Lifetime Value (CLV)

Customer Lifetime Value is “the present value of the future cash flows attributed to the customer during his/her entire relationship with the company.”1 There are different kinds of formulas, from simplified to advanced, for calculating CLV, but the following one is probably the most commonly used:

CLV = Σ (t = 1 to n) [ Pt * r^t / (1 + d)^t ]

Where,

t is the period, e.g. the first year (t = 1), the second year (t = 2)

n is the total number of periods the customer will stay before he/she finally churns

r is the retention rate/probability

Pt is the profit the customer will contribute in Period t

d is the discount rate
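As a quick illustration, a minimal R sketch of this constant-retention formula (the function name and inputs are ours, not from the article's downloadable code) could be:

# pt: vector of per-period profits, r: retention rate, d: discount rate
clv.simple <- function(pt, r, d) {
  t <- seq_along(pt)                  # periods 1..n
  sum(pt * r^t / (1 + d)^t)           # retention-weighted, discounted profits
}
clv.simple(pt = rep(100, 3), r = 0.26, d = 0.02)  # e.g. 100 dollars per period for 3 periods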

Here we assume that r is constant in the above formula; however, that is not always the case. The factors that influence r include demographics (age, geography, profession, etc.), behavior (Recency, Frequency, Monetary, etc.), tenure, competition, and so on2. There are improved formulas that forecast r with approaches such as logistic regression.

In this article, we will demonstrate how to calculate a customer's CLV by predicting the retention/repurchasing rate r of customers in each future purchasing cycle with a logistic regression model based on the predictors of Recency, Frequency, and Monetary.

We will use the full CDNOW data set as a concrete case study to build the above model.

The CDNOW data set can be downloaded here. It contains 23,570 unique customers who made their first-ever purchase at CDNOW in the first quarter of 1997, and 69,659 transaction records in total, covering the period from the start of January 1997 to the end of June 1998.

For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie) or the post RFM Customer Analysis with R Language on this website.

Exploring the relationships between Repurchase Rate and Recency, Frequency, and Monetary

First we count the customers grouped by Recency value, then split each group into “Buy” and “No Buy” according to the data in the next purchasing cycle, and finally compute the percentage of customers with a given Recency value who repurchase in the next period. We leverage the R function “ddply” (from the plyr package) for the grouping and aggregation; a sketch follows the table below. Below is the list of Recency values and repurchase percentages we calculated. Note that the smaller the Recency value, the more recently the purchase took place.

Recency  Buy  Number  Percentage
      0    1    1180        0.45
      1    1     581        0.28
      2    1     279        0.22
      3    1     198        0.17
      4    1     163        0.14
      5    1     249        0.05
      6    1     316        0.03
      7    1      13        0.03

The first row means that 45% of the customers who purchased CDs in the most recent period (Recency = 0) purchased CDs again in the next period. We selected the transactions that took place from Jan 1st, 1997 through Feb 28th, 1998 for the calculation. The duration of the purchasing cycle is set to two months.
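A minimal sketch of that ddply call might look as follows, assuming the prepared data frame has a Recency column and a Buy column that is 1 if the customer purchased again in the next period and 0 otherwise (these column names are our assumption):

library(plyr)
rec_rate <- ddply(df, .(Recency), summarise,
                  Number     = sum(Buy),             # customers who bought again
                  Percentage = round(mean(Buy), 2))  # share of the Recency group that bought again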

In the same way, we can get the percentage lists for Frequency and Monetary. The relationships between the repurchase rate and Recency, Frequency, and Monetary are plotted below.

[Figure: Repurchase percentage versus Recency, Frequency, and Monetary]

The scatter plots above suggest an obvious falling (roughly linear or exponential) relationship between the repurchasing percentage and Recency, and an obvious exponential rising relationship between the repurchasing percentage and Frequency. However, there is no obvious relationship between the repurchasing percentage and Monetary.

Building the model

Based on the above observation, we use only Recency and Frequency as predictors in this case and fit a logistic regression model in R.

> model <- glm(Buy~Recency+Frequency, family=quasibinomial(link='logit'), data=train)

Given a customer’s status of Recency and Frequency, we can predict the probability of repurchasing with the above model.

> pred<-predict(model,data.frame(Recency=c(0),Frequency=c(1)),type='response')

> pred

1

0.2579282

As shown above, a customer, say Tom, who became a new customer in the most recent period (so Recency = 0 and Frequency = 1), has a 26% probability of purchasing again in the next period (Period 1).

Calculating CLV

Suppose Tom would remain for 3 more periods before he churns, the average profit he would contribute per period is 100 dollars, and the discount rate is 0.02. How do we calculate Tom's CLV?

[Figure: Tom's Recency/Frequency transition tree over three periods]

As shown in the figure above, the light blue rectangles are Tom's possible Recency and Frequency statuses in each period.

In Period 0, his Recency is 0 and his Frequency is 1.

In Period 1, there is a 0.26 probability, which we calculated with the model above, that he buys again, and a 0.74 probability that he does not. In the first case, his status transitions to Recency = 0 and Frequency = 2; in the second case, to Recency = 1 and Frequency = 1. The forecast profit Tom would contribute in Period 1 is 0.26 * 100 / (1 + 0.02) = 25.5 dollars.

In Period 2, Tom can transition to four possible statuses. Taking the leftmost status for illustration, we first get the transition probability from the model, using Tom's Period 1 status of R = 0 and F = 2 as the input.

> pred<-predict(model,data.frame(Recency=c(0),Frequency=c(2)),type='response')

> pred

1

0.2873471  (about 29%)

Then the probability that Tom purchases again in Period 2 after purchasing in Period 1 is 0.26 * 0.29 = 0.08. Similarly, the probability that Tom does not purchase in Period 1 but purchases in Period 2 is 0.14. The forecast profit Tom would contribute in Period 2 is (0.08 + 0.14) * 100 / (1 + 0.02)^2 = 21.1 dollars.

In Period 3, in the same way, the forecast profit is (0.03 + 0.04 + 0.04 + 0.08) * 100 / (1 + 0.02)^3 = 17.9 dollars.

Tom's CLV is 64.5 dollars, the sum of the forecast profits in Periods 1, 2, and 3.
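Putting the whole tree into code, a minimal sketch using the model fitted above (the loop and variable names are ours, not the article's downloadable source) might be:

profit <- 100; d <- 0.02
states <- data.frame(Recency = 0, Frequency = 1, prob = 1)  # Tom's status in Period 0
clv <- 0
for (t in 1:3) {
  # probability of buying in this period for every possible status
  p.buy <- predict(model, states[, c("Recency", "Frequency")], type = "response")
  clv <- clv + sum(states$prob * p.buy) * profit / (1 + d)^t
  states <- rbind(
    data.frame(Recency = 0, Frequency = states$Frequency + 1,
               prob = states$prob * p.buy),              # bought this period
    data.frame(Recency = states$Recency + 1, Frequency = states$Frequency,
               prob = states$prob * (1 - p.buy)))        # did not buy this period
}
clv  # close to the hand-calculated 64.5 dollars, which used rounded probabilities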

R Source Code

The R source code we used in this article can be downloaded here.

References

1. http://en.wikipedia.org/wiki/Customer_lifetime_value

2. Customer Lifetime Value (CLV) – A Methodology for Quantifying and Managing Future Cash Flows, David C. Ogden

3. Chapter 5 The Migration Model , Segmentation and Lifetime Value Models Using SAS, Edward C. Malthouse

Author: Jack Han. All rights reserved. Reposting requires a hyperlink indicating the article's original source and author.


Sentiment and Topicmodel Analysis on Microblog


I gave a talk on social media marketing & CRM at the Capital of Statistics Salon in Guangzhou on Oct 19th. In the talk, I demonstrated how to conduct sentiment analysis and topic-model analysis with the R language on Sina Weibo, the largest microblogging social media platform in China.

The slides, in Chinese, can be downloaded here.

Author: Jack Han. All rights reserved. Reposting requires a hyperlink indicating the article's original source and author.


Increasing Repeat Purchase Rate by Analyzing Customer Latency


For online businesses, Repeat Purchase Rate is one of the critical metrics of business performance. A higher repeat purchase rate means more active members and thus higher profit.

Customer Latency refers to “the average time between customer activity events, for example, making a purchase, calling the help desk, or visiting a web site”1, said Jim Novo.

In this article, we will demonstrate how to find the right trigger points for marketing campaigns by analyzing Customer Latency, and thus increase the Repeat Purchase Rate.

Exploring and Preparing the CDNOW Data Set

We will use the CDNOW Sample data during this demonstration; you can download the data here.

The sample data contains 2,357 unique customers who made their first-ever purchase at CDNOW in the first quarter of 1997, and 6,919 transaction records in total, covering the period from the start of January 1997 to the end of June 1998.

For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie).

We will keep the columns “ID”, “Date”, and “Amount” from the original data set and add several additional columns, “Interval”, “Times”, and “TotalTimes”, so that we can manipulate the data set more conveniently, where:

“ID” is the customer ID;

“Date” is the transaction date;

“Amount” is the money amount paid by a customer per transaction;

“Interval” is the Customer Latency, the number of days between a customer's consecutive transactions, say 10 days between the first and second purchases, 15 days between the second and third, and so on;

“Times” ranges from 1 to n, where 1 is the customer's first purchase, 2 the second, and so on;

“TotalTimes” is the total number of transactions for a customer.

The prepared data set  looks like the following.

> head(df)

     ID  Date        Amount  Interval  Times  TotalTimes
1     4  1997-01-01   29.33         0      1           4
2     4  1997-01-18   29.73        17      2           4
3     4  1997-08-02   14.96       196      3           4
4     4  1997-12-12   26.48       132      4           4
158  18  1997-01-04   14.96         0      1           1
5    21  1997-01-01   63.34         0      1           2
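A minimal plyr-based sketch of how the extra columns could be derived (assuming df already holds ID, Date as a Date object, and Amount):

library(plyr)
df <- df[order(df$ID, df$Date), ]                        # order transactions per customer
df <- ddply(df, .(ID), transform,
            Interval   = c(0, diff(as.numeric(Date))),   # days since the previous purchase
            Times      = seq_along(Date),                # 1st, 2nd, ... purchase
            TotalTimes = length(Date))                   # total purchases of the customer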

Calculating the Repeat Purchase Rate and Percentages

First of all, let’s examine the average number of repeat purchases per customer and the average spending amount per transaction during that period.

> 6919/2357

[1] 2.935511

> sum(df$Amount)/6919

[1] 35.2785

The average number of purchases per customer is about 3, and the average spending is about 35 dollars per transaction. Obviously this is not a high repeat rate for an online CD store. Let's further study the distribution of the customers' total purchase counts.

# get the matrix of customer ID ~ the customer’s total number of transactions

> TimesByID <-as.data.frame(table(df$ID))

#get the matrix of total number of transactions ~ number of customers who have the total number

> GroupByTimes <- as.data.frame(table(TimesByID$Freq))

> head(GroupByTimes,12)

Times   Customers

1      1      1205

2      2       406

3      3       208

4      4       150

5      5        98

6      6        56

7      7        65

8      8        35

9      9        23

10    10        21

11    11         8

12    12        10

> plot(GroupByTimes,xlab="Total Number of Purchases",ylab="Number of Customers",pch=16,col="blue",type="o")

> text(2,1220,"1205")

> text(3,425,"406")

> text(4,220,"208")

> text(5,170,"150")

> text(6,120,"98")

> text(12,50,"10")

> text(30,50,"1")

[Figure 1: Number of customers by total number of purchases]

As we can see from Figure 1 above, the number of customers decreases very quickly as the total number of purchases increases from 1 to 6. Almost half of the customers made only one purchase during the 1.5-year period!

Let’s examine the percentage of customers making (x) purchases more closely.

> percentages<-round(GroupByTimes$Customers / 2357 , 3)

> percentages

[1] 0.511 0.172 0.088 0.064 0.042 0.024 0.028 0.015 0.010 0.009

[11] 0.003 0.004 0.006 0.003 0.003 0.003 0.004 0.001 0.001 0.000

[21] 0.001 0.002 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000

[31] 0.000 0.000 0.001 0.000 0.000

> x<-barplot(percentages[1:10]*100,col="blue",main="Percentage of Customers Making (x) Purchases", xlab="Number of Purchases", ylab="Repeat Purchase Rate (%)",ylim=range(0:55),axisnames=TRUE,names.arg=GroupByTimes$Times[1:10],cex.names=TRUE)

> text(x, percentages[1:10]*100+2,paste(percentages[1:10]*100,"%"))

[Figure 2: Percentage of customers making (x) purchases]

As shown in Figure 2, which displays the percentages of customers who made 1 to 10 purchases respectively, 51.1% of customers made only one purchase, 17.2% made two purchases, 8.8% made three purchases, and so on.

Based on the above data, to increase the average repeat purchase rate, CDNOW should try to increase the percentage of customers who make more than one purchase, especially those who make two or three purchases, because the percentage drops very quickly from one purchase to two and from two to three. We will leverage the Customer Latency concept to find ways to increase the repeat purchase rate in the following parts.

Calculating the Customer Latency and Increasing the Repeat Purchase Rate

Here Customer Latency refers to the average time between customers' purchases: for example, the average number of days between the first and second purchases, between the second and third purchases, and so on.

Let's first calculate the Customer Latency between the 1st and 2nd purchases, since increasing the 2nd purchase rate is important for increasing the overall repeat purchase rate.

> # keep the 2nd transactions of customers who made more than one purchase

> df2<-df[df$TotalTimes>=2 & df$Times==2,]

> # see how many 2nd transactions there are

> nrow(df2)

[1] 1152

> # get the mean days of customer latency

> mean(df2$Interval)

[1] 105.6276

There are 1,152 second purchases in total, and the average customer latency is about 106 days.

Let's take a further look at the distribution of the Customer Latency.

> hist(df2$Interval, main="Distribution of Customer Latency (1st - 2nd purchase)", xlab="Days", ylab="Number of 2nd Transactions")

[Figure 3: Distribution of Customer Latency between the 1st and 2nd purchases]

As shown in Figure 3, more than half of the second purchases happened within 50 days of the first purchase, and the distribution declines from left to right.

When a customer's Latency is longer than the norm's average, something has happened: perhaps the customer was unhappy with the product or service, or perhaps it is for personal reasons. Either way, the Latency data is speaking to us; “it is a rising of the hand by the customer, and the Data-Driven marketer or service provider not only sees the raised hand, but also reacts to it”3, as Jim Novo put it in his book.

So, based on the above analysis, CDNOW should act to increase the second purchase rate when a customer's Latency in the database exceeds 50 days, and again at 100 days. This could be an email with a coupon, a discount, or another offer to attract the customer back to CDNOW. Otherwise, the longer the Latency, the more likely the customer is to defect.

In the same way, we can also calculate the average Latency between the second and the third purchase in order to increase the third purchase rate.
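A sketch of that calculation, mirroring the code above:

> # keep the 3rd transactions of customers who made at least three purchases
> df3 <- df[df$TotalTimes >= 3 & df$Times == 3, ]
> nrow(df3)            # number of 3rd purchases
> mean(df3$Interval)   # average days between the 2nd and the 3rd purchase
> hist(df3$Interval, main="Distribution of Customer Latency (2nd - 3rd purchase)", xlab="Days", ylab="Number of 3rd Transactions")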

Thus the overall average repeat purchase rate will likely be increased.

R Source Codes

You can download the complete R source codes here.

Reference

  1. http://www.jimnovo.com/Customer-Latency.htm
  2. http://rjmetrics.com/metrics/repeat-purchase-rate
  3. Drilling Down – Turning Customer Data into Profits with a Spreadsheet, Jim Novo, 2004

Author: Jack Han. All rights reserved. Reposting requires a hyperlink indicating the article's original source and author.


Identifying Potential Customers with Classification Techniques in R Language


Data mining techniques and algorithms such as Decision Tree, Naïve Bayes, Support Vector Machine, Random Forest, and Logistic Regression are “most commonly used for predicting a specific outcome such as response / no-response, high / medium / low-value customer, likely to buy / not buy.”1

In this article, we will demonstrate how to use R to build classification models to identify the potential customers who are likely to buy an insurance product. We will build models with Decision Tree, Random Forest, Naïve Bayes, and SVM respectively and then compare the models to find the best one. The scenario and data are based on the public tutorial “Using Oracle Data Miner 11g Release 2”2.

Data Observation and Preparation

The dataset is exported from the database mentioned in the above tutorial. Let’s take a look at the variables in the dataset.

> dataset <- read.csv(file.choose(),header=T)

> dim(dataset)

[1] 1015   31

> names(dataset)

 [1] "BUY_INSURANCE"           "CUSTOMER_ID"             "LAST"                    "FIRST"
 [5] "STATE"                   "REGION"                  "SEX"                     "PROFESSION"
 [9] "AGE"                     "HAS_CHILDREN"            "SALARY"                  "N_OF_DEPENDENTS"
[13] "CAR_OWNERSHIP"           "HOUSE_OWNERSHIP"         "TIME_AS_CUSTOMER"        "MARITAL_STATUS"
[17] "CREDIT_BALANCE"          "BANK_FUNDS"              "CHECKING_AMOUNT"         "MONEY_MONTLY_OVERDRAWN"
[21] "T_AMOUNT_AUTOM_PAYMENTS" "MONTHLY_CHECKS_WRITTEN"  "MORTGAGE_AMOUNT"         "N_TRANS_ATM"
[25] "N_MORTGAGES"             "N_TRANS_TELLER"          "CREDIT_CARD_LIMITS"      "N_TRANS_KIOSK"
[29] "N_TRANS_WEB_BANK"        "LTV"                     "LTV_BIN"

There are 1015 cases and 31 variables in the dataset. The variable of “BUY_INSURANCE” is the dependent variable. Other variables are customers’ basic information, geographical information, and bank account information. The data types for the variables should be “factor” and “numeric” in R. 

> table(dataset$BUY_INSURANCE)

No Yes

742 273

In the 1015 cases, 273 people bought the insurance product in the past.

Checking Missing Value

> sum(complete.cases(dataset))

[1] 1015

There is no missing value in the dataset for us to deal with.

Removing Unnecessary Variables

The variables “CUSTOMER_ID”, “LAST”, and “FIRST” don't help with the data mining, so we remove them.

> dataset <-subset(dataset,select = -c(CUSTOMER_ID, LAST, FIRST))

Some of the algorithms have a limitation on the number of categorical levels. If there are too many levels in a variable, we need to combine lower levels into higher ones to reduce the total number of levels, or simply remove the variable if doing so doesn't affect the data mining result. Let's check how many levels there are in the following categorical variables.

> dim(table(dataset$PROFESSION))

[1] 95

> dim(table(dataset$STATE))

[1] 22

> dim(table(dataset$REGION))

[1] 5

The number of PROFESSION levels exceeds that limit, and although only 22 states appear in this dataset, there are 50 in total. For simplicity, we remove these two variables here.

> dataset <-subset(dataset, select = -c(PROFESSION, STATE))

Since the LTV variable has already been binned into LTV_BIN in the dataset, we remove LTV as well.

> dataset <-subset(dataset, select = -c(LTV))

Converting the Data Types

> dataset$REGION <-as.factor(dataset$REGION)

> dataset$SEX <-as.factor(dataset$SEX)

> dataset$CAR_OWNERSHIP <-as.factor(dataset$CAR_OWNERSHIP)

> dataset$HOUSE_OWNERSHIP <-as.factor(dataset$HOUSE_OWNERSHIP)

> dataset$MARITAL_STATUS <-as.factor(dataset$MARITAL_STATUS)

> dataset$HAS_CHILDREN <-as.factor(dataset$HAS_CHILDREN)

> dataset$LTV_BIN <-as.ordered(dataset$LTV_BIN)

>

> dataset$AGE <-as.numeric(dataset$AGE)

> dataset$SALARY <-as.numeric(dataset$SALARY)

> dataset$N_OF_DEPENDENTS <-as.numeric(dataset$N_OF_DEPENDENTS)

> dataset$TIME_AS_CUSTOMER <-as.numeric(dataset$TIME_AS_CUSTOMER)

> dataset$CREDIT_BALANCE <-as.numeric(dataset$CREDIT_BALANCE)

> dataset$BANK_FUNDS <-as.numeric(dataset$BANK_FUNDS)

> dataset$CHECKING_AMOUNT <-as.numeric(dataset$CHECKING_AMOUNT)

>dataset$MONEY_MONTLY_OVERDRAWN <-as.numeric(dataset$MONEY_MONTLY_OVERDRAWN)

>dataset$T_AMOUNT_AUTOM_PAYMENTS <-as.numeric(dataset$T_AMOUNT_AUTOM_PAYMENTS)

> dataset$MONTHLY_CHECKS_WRITTEN <-as.numeric(dataset$MONTHLY_CHECKS_WRITTEN)

> dataset$MORTGAGE_AMOUNT <-as.numeric(dataset$MORTGAGE_AMOUNT)

> dataset$N_TRANS_ATM <-as.numeric(dataset$N_TRANS_ATM)

> dataset$N_MORTGAGES <-as.numeric(dataset$N_MORTGAGES)

> dataset$N_TRANS_TELLER <-as.numeric(dataset$N_TRANS_TELLER)

> dataset$CREDIT_CARD_LIMITS <-as.numeric(dataset$CREDIT_CARD_LIMITS)

> dataset$N_TRANS_KIOSK <-as.numeric(dataset$N_TRANS_KIOSK)

> dataset$N_TRANS_WEB_BANK <-as.numeric(dataset$N_TRANS_WEB_BANK)
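As a side note, the same conversions can be written more compactly; a sketch (the two column groupings simply restate the lists above):

factor.cols  <- c("REGION", "SEX", "CAR_OWNERSHIP", "HOUSE_OWNERSHIP",
                  "MARITAL_STATUS", "HAS_CHILDREN")
numeric.cols <- c("AGE", "SALARY", "N_OF_DEPENDENTS", "TIME_AS_CUSTOMER",
                  "CREDIT_BALANCE", "BANK_FUNDS", "CHECKING_AMOUNT",
                  "MONEY_MONTLY_OVERDRAWN", "T_AMOUNT_AUTOM_PAYMENTS",
                  "MONTHLY_CHECKS_WRITTEN", "MORTGAGE_AMOUNT", "N_TRANS_ATM",
                  "N_MORTGAGES", "N_TRANS_TELLER", "CREDIT_CARD_LIMITS",
                  "N_TRANS_KIOSK", "N_TRANS_WEB_BANK")
dataset[factor.cols]  <- lapply(dataset[factor.cols],  as.factor)
dataset[numeric.cols] <- lapply(dataset[numeric.cols], as.numeric)
dataset$LTV_BIN <- as.ordered(dataset$LTV_BIN)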

Checking the Correlations between Numeric Variables

We could use the “pairs20x()” function to inspect the correlations visually, but because there are more than 20 numeric variables, the output figure is too large to display here. Instead, we use the “cor()” function to get the correlations.

> cor(dataset$TIME_AS_CUSTOMER, dataset$N_OF_DEPENDENTS)

[1] 0.7667451

> cor(dataset$T_AMOUNT_AUTOM_PAYMENTS, dataset$CREDIT_BALANCE)

[1] 0.8274963

> cor(dataset$N_TRANS_WEB_BANK, dataset$MORTGAGE_AMOUNT)

[1] 0.7679546

The above three pairs of variables are highly correlated. Ideally we would try removing one variable of each pair in turn during model building to see whether performance improves. However, for simplicity, we don't deal with the correlated variables in this article.

Breaking Data into Training and Test Sample

> # breaking the data set into training and test samples by half

> d = sort(sample(nrow(dataset), nrow(dataset)*.5))

> train<-dataset[d,]

> test<-dataset[-d,]
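Note that the split is random, so results will vary from run to run; for a reproducible split one could fix the seed first (a suggestion, not part of the original code):

> set.seed(1234)   # any fixed seed makes the sample() call reproducible
> d = sort(sample(nrow(dataset), nrow(dataset)*.5))
> train<-dataset[d,]
> test<-dataset[-d,]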

Building the Models

In this part, we will use the Decision Tree, Random Forest, Naive Bayes, and SVM classifiers in R to build models respectively. For simplicity, we will not conduct k-fold cross-validation during modeling for the classifiers that have no embedded cross-validation.

Decision Tree

Decision Tree is one of the most commonly used classifiers. It is able to handle both numerical and categorical variables and is insensitive to data errors or even missing data. Most importantly, it provides human-readable rules.

> library("rpart")

> model.tree <- rpart(BUY_INSURANCE~.,data=train,method="class")

> plot(model.tree,uniform=TRUE,margin=0.1)

> text(model.tree,use.n=T,cex=0.8)

We can view the output tree structure in the Figure 1.

[Figure 1: Decision tree before pruning]

The tree is a bit complex, so we will prune it. First, we need to find the right complexity parameter (cp) value, and hence the number of splits (the size) of the tree, for pruning. The right cp is the threshold at which the increased cost of further splitting outweighs the reduction in lack-of-fit.

># plot cp

> plotcp(model.tree)

[Figure 2: plotcp() output, cross-validated error versus cp]

“A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line.” In the Figure 2 above, we can see that the optimal cp value is 0.042.

># Prune the tree with the optimal cp value

> pTree<- prune(model.tree, 0.042)

# draw the pruned tree

>plot(pTree,uniform=TRUE,margin=0.1)

>text(pTree,use.n=T,cex=0.8)

[Figure 3: Pruned decision tree]

As shown in Figure 3, it is easy to describe the rules to decide if a customer is more likely to buy.

Take the rightmost leaf for example; the rules can be described as follows.

IF BANK_FUNDS >= 320.5

         IF CHECKING_AMOUNT < 162

                  IF MONEY_MONTLY_OVERDRAWN >= 53.68

                          IF CREDIT_BALANCE < 3850

                                   THEN YES for BUY

The numbers below the leaf show that 4 customers didn't buy and 50 customers bought under the above conditions in the training dataset.

After we build the model, we can use it to predict outcomes for the customers in the test dataset.

> pred.tree <- predict(pTree,test[,-1])

The prediction result can be No or Yes for each customer, or it can provide the probabilities of No and Yes for each customer as follows.

> head(pred.tree)

         No               Yes

1 0.8380952         0.16190476

2 0.9823009         0.01769912

3 0.8380952         0.16190476

7 0.9823009         0.01769912

8 0.9823009         0.01769912

9 0.9823009         0.01769912
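To obtain class labels instead of probabilities, plus a quick confusion matrix on the test set, a short sketch:

> pred.class <- predict(pTree, test[,-1], type="class")
> table(pred.class, test$BUY_INSURANCE)   # rows: predicted class, columns: actual class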

Random Forest

Random Forest is based on Decision Tree. It can handle large datasets and thousands of input variables without variable deletion. In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It gives estimates of which variables are important in the classification.3

> library(randomForest)

>model.randomForest<- randomForest(BUY_INSURANCE ~ ., data = train, importance=TRUE,proximity=TRUE)

> # look at variable importance

> round(importance(model.randomForest),2)

                          No   Yes MeanDecreaseAccuracy MeanDecreaseGini
REGION                  0.18  0.74                 0.34             4.68
SEX                     0.01  0.09                 0.04             0.96
AGE                     0.40  0.85                 0.53            11.10
HAS_CHILDREN           -0.12  0.21                -0.03             1.12
SALARY                 -0.06  0.01                -0.04             8.98
N_OF_DEPENDENTS        -0.04  1.08                 0.39             4.01
CAR_OWNERSHIP          -0.06 -0.15                -0.08             0.44
HOUSE_OWNERSHIP         0.08  0.38                 0.19             1.11
TIME_AS_CUSTOMER        0.09  0.32                 0.17             2.90
MARITAL_STATUS          0.22  0.83                 0.45             3.62
CREDIT_BALANCE          0.76  0.62                 0.67             4.26
BANK_FUNDS              1.02  2.69                 1.27            27.15
CHECKING_AMOUNT         1.26  1.78                 1.18            15.12
MONEY_MONTLY_OVERDRAWN  0.93  2.31                 1.16            24.09
T_AMOUNT_AUTOM_PAYMENTS 0.85  1.58                 0.99            15.97
MONTHLY_CHECKS_WRITTEN  0.27  1.35                 0.64            10.43
MORTGAGE_AMOUNT         0.10  1.45                 0.60             8.26
N_TRANS_ATM             0.66  1.95                 0.96            13.97
N_MORTGAGES            -0.01  0.29                 0.10             1.58
N_TRANS_TELLER          0.66  1.52                 0.82             7.62
CREDIT_CARD_LIMITS      0.00  0.82                 0.25             6.35
N_TRANS_KIOSK          -0.01 -0.51                -0.17             4.03
N_TRANS_WEB_BANK        0.29  1.20                 0.60             9.27
LTV_BIN                -0.01  0.15                 0.04             2.47

Higher values in the above table indicate variables that are more important to the classification. We can see that BANK_FUNDS, CHECKING_AMOUNT, and MONEY_MONTLY_OVERDRAWN contribute most to the classification.
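The randomForest package can also plot this table directly; a one-line sketch:

> varImpPlot(model.randomForest, main="Variable Importance")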

Let's use the model to predict the cases in the test data set.

> pred.randomForest <- predict(model.randomForest, test[,-1],type="prob")

> head(pred.randomForest)

     No      Yes

1 0.764    0.236

2 0.990    0.010

3 0.600    0.400

7 0.864    0.136

8 0.996    0.004

9 0.996    0.004

Naïve Bayes

“A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. Despite the fact that the far-reaching independence assumptions are often inaccurate, it has several properties that make it surprisingly useful in practice”4. Naive Bayes can deal with both categorical and numeric data. Since the sample size in the training data set is not very large, we will not discretize the continuous values in some of the variables by binning for simplification.

> library(e1071)

> model.naiveBayes <- naiveBayes(BUY_INSURANCE ~ ., data = train, laplace = 3)

> pred.naiveBayes <- predict(model.naiveBayes, test[,-1],type="raw")

> head(pred.naiveBayes)

     No                    Yes

[1,] 1.0000000       5.244713e-18

[2,] 0.9953059       4.694106e-03

[3,] 0.5579982       4.420018e-01

[4,] 0.2221896       7.778104e-01

[5,] 0.9857277       1.427231e-02

[6,] 0.9923343       7.665676e-03

SVM

“A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.”5 We will use the svm() function in the e1071 package to build the classification models. Kernel and cost parameters are important for svm() function to yield sensible results. We will try linear and radial kernel functions respectively in the following.

> # build the models with two different kernel functions respectively

> model.svm.linear <- svm(BUY_INSURANCE ~ ., data = train, kernel="linear",probability = TRUE)

> model.svm.radial <- svm(BUY_INSURANCE ~ ., data = train, kernel="radial",probability = TRUE)

> # prediction with the two models respectively

> pred.svm.linear <- predict(model.svm.linear, test[,-1],probability=TRUE)

> attr(pred.svm.linear, "probabilities")[1:6,]

         No                 Yes

1 0.9816020           0.01839796

2 0.9391826           0.06081737

3 0.5237238           0.47627615

7 0.9310071           0.06899288

8 0.9531510           0.04684897

9 0.9444462           0.05555381

> pred.svm.radial <- predict(model.svm.radial, test[,-1],probability=TRUE)

> attr(pred.svm.radial, "probabilities")[1:6,]

         No                Yes

1 0.8849981           0.11500191

2 0.9664234           0.03357663

3 0.5672350           0.43276502

7 0.9591768           0.04082316

8 0.9624121           0.03758789

9 0.9862672           0.01373277

Comparing the Models

To compare the models generated above, we will plot their ROC curves and calculate the area under the curve (AUC).

> # load the ROCR package, which provides the prediction() and performance() functions used below
> library(ROCR)

> # prepare the legend strings for the ROC figure

> c.legend<-c("decision tree, auc=","random forest, auc=","naive Bayes, auc=","svm.linear, auc=","svm.radial, auc=")

> #ROC for Decision Tree

> pred <- prediction(pred.tree[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf,col="red",lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[1]<-paste(c.legend[1],round((performance(pred,'auc')@y.values)[[1]],3))

> #ROC for Random Forest

> pred <- prediction(pred.randomForest[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf,add=TRUE,col="green",lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[2]<-paste(c.legend[2],round((performance(pred,'auc')@y.values)[[1]],3))

> #ROC for Naive Bayes

> pred <- prediction(pred.naiveBayes[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf,add=TRUE,col="blue",lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[3]<-paste(c.legend[3],round((performance(pred,'auc')@y.values)[[1]],3))

> #ROC for SVM with linear kernel

> pred <- prediction(attr(pred.svm.linear, "probabilities")[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf,add=TRUE,col="purple",lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[4]<-paste(c.legend[4],round((performance(pred,'auc')@y.values)[[1]],3))

> #ROC for SVM with radial kernel

> pred <- prediction(attr(pred.svm.radial, "probabilities")[,2], test[,1])

> perf <- performance(pred, "tpr", "fpr")

> plot(perf,add=TRUE,col="black",lwd=2)

> # calculate the AUC and add it to the legend vector

> c.legend[5]<-paste(c.legend[5],round((performance(pred,'auc')@y.values)[[1]],3))

> # draw the legend

> legend(0.5,0.6,c.legend,lty=c(1,1,1,1,1),lwd=c(2,2,2,2,2),col=c("red","green","blue","purple","black"))

[Figure 4: ROC curves of the five models]

As shown in Figure 4, the model built by Random Forest (green line) has the best performance, with an AUC of 0.921 in this case. We can use this model in production to predict Buy or Not Buy for new customers who are not in the existing data set.
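A sketch of that scoring step, where new.customers is a hypothetical data frame with the same predictor columns and factor levels as the training data:

> new.pred <- predict(model.randomForest, new.customers, type="prob")
> head(new.pred[,"Yes"])   # predicted probability of buying for each new customer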

Summary

In this article, we built and compared the models generated by the Decision Tree, Random Forest, Naïve Bayes, and SVM algorithms implemented in R packages. The performance of Random Forest exceeded the others in this insurance buying use case.

References

  1. http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/odm-techniques-algorithms-097163.html
  2. http://www.oracle.com/webfolder/technetwork/tutorials/obe/db/11g/r2/prod/bidw/datamining/ODM11gR2.htm
  3. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#workings
  4. http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  5. http://en.wikipedia.org/wiki/Support_vector_machine
  6. http://heuristically.wordpress.com/2009/12/23/compare-performance-machine-learning-classifiers-r/
  7. http://www.r-bloggers.com/modelling-with-r-part-4/
  8. http://heuristically.wordpress.com/2009/12/18/plot-roc-curve-lift-chart-random-forest/
  9. http://cran.r-project.org/doc/contrib/Sharma-CreditScoring.pdf
  10. http://xccds1977.blogspot.jp/2012/07/blog-post_06.html
  11. http://depts.washington.edu/landecol/PDFS/RF.pdf

 

Author: Jack Han. All rights reserved. Reposting requires a hyperlink indicating the article's original source and author.


RFM Customer Analysis with R Language


Database marketing or direct marketing people are always concerned about two questions before they send out mails or make calls to their customers:

  1. How can they segment the customers in the database to find out who is more likely to respond to their mails or buy their products?
  2. Which types of customers should they send the mails to so that they can reach breakeven and make a profit?

The RFM method is a very simple but effective customer analysis approach to the above questions, where R, F, and M stand for1:

         Recency – How recently did the customer purchase?

         Frequency – How often do they purchase?

         Monetary Value – How much do they spend (each time on average)?

         (The above definitions are from Wikipedia.)

You can get more details on RFM from the following paper and case study, besides Wikipedia:

Segmentation and RFM Basics (Bill Ruppert)

Case Study: La Quinta narrows target to boost guest loyalty

There seems to be no out-of-the-box R package for RFM analysis, while other data analysis tools such as IBM SPSS offer easy-to-use RFM functionality. So in this article, we will demonstrate how to implement RFM analysis with the R language.

Explore Data

We will use the CDNOW Sample data during this demonstration; you can download the data here. For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie).

Let’s explore the data.

> # read CDNOW_SAMPLE.txt

> df <- read.table(file.choose(),header=F)

> # construct a data frame with the necessary columns of customer ID, transaction date, and money amount paid by a customer per transaction

> df <- as.data.frame(cbind(df[,1],df[,3],df[,5]))

> # add appropriate column names for the above three columns

> names <- c("ID","Date","Amount")

> names(df) <- names

> # transform the text column type to Date type

> df[,2] <- as.Date(as.character(df[,2]),"%Y%m%d")

> head(df)

  ID  Date        Amount
1  4  1997-01-01   29.33
2  4  1997-01-18   29.73
3  4  1997-08-02   14.96
4  4  1997-12-12   26.48
5 21  1997-01-01   63.34
6 21  1997-01-13   11.77

> dim(df)

[1] 6919    3

#remove the rows with the duplicated IDs to see how many customers in total

> uid <- df[!duplicated(df[,"ID"]),]

> dim(uid)

[1] 2357    3

As shown in the output above, the data set now contains three columns: customer ID, transaction date, and the amount paid per transaction. There are 6,919 transaction records in the sample data set and 2,357 unique customers.

Segment the Customers into RFM Cells

Now that the data is ready, we can segment the customers into RFM cells along the three dimensions of R, F, and M. Usually we rate the customers from 1 to 5 points in each dimension; the higher the score, the better.

Experience suggests that Recency is the most important factor for a high response rate (weighted x100), Frequency the second (x10), and Monetary the last (x1). Customers segmented into the same RFM cell share the same score. A score of “542” means that the customer gets 5 points in Recency, 4 points in Frequency, and 2 points in Monetary.

There are usually two ways to segment the customers in each dimension, one is Nested, the other is Independent.

The nested method divides the customers into aliquots along the Recency dimension first, then, within each Recency aliquot, divides the customers into aliquots along the Frequency dimension, and likewise for Monetary. The advantage of the nested method is that the number of customers in each RFM cell is roughly the same for all cells; the disadvantage is that the meaning of a given Frequency or Monetary score can differ between cells. For example, a customer with a “543” RFM score and another with a “443” score both have a Frequency score of 4, but their actual purchase frequencies might be quite different because the Frequency aliquots are nested within different Recency groups.

The independent method rates the customers into aliquots in the R, F, and M dimensions independently. The advantage is that a given Frequency or Monetary score has the same meaning everywhere; the disadvantage is that the numbers of customers in the RFM cells can be quite different.

You can read the discussion in Quora about the pros and cons of the above two methods.

In this article, we will implement the independent method in R language first, and then we will implement another method that allows users to input the parameters of the breaks for each dimension according to their own business requirement.

R Source Codes

The complete R source code can be downloaded from RFM_Analysis_R_Source_Codes. Please read the descriptions in the source code before you continue.

Calculate the Recency, Frequency, and Monetary

To implement the RFM analysis, we need to further process the data set in the CDNOW Sample by the following steps:-

  1. Remove the duplicate records with the same customer ID
  2. Find the most recent date for each ID and calculate the days to the now or some other date, to get the Recency data
  3. Count the number of transactions of a customer, to get the Frequency data
  4. Sum the amount of money a customer spent and divide it by Frequency, to get the amount per transaction on average, that is the Monetary data.

We have already implemented the above procedures in the functions of “getDataFrame” in the source codes.

# set the startDate and endDate; we will only analyze the records in this date range

> startDate <- as.Date("19970101","%Y%m%d")

> endDate <- as.Date("19980701","%Y%m%d")

> df <- getDataFrame(df,startDate,endDate)

> head(df)

ID    Date        Amount  Recency  Frequency  Monetary
4     1997-12-12   26.48      201          4    25.125
18    1997-01-04   14.96      543          1    14.960
21    1997-01-13   11.77      534          2    37.555
50    1997-01-01    6.79      546          1     6.790
60    1997-02-01   21.75      515          1    21.750
71    1997-01-01   13.97      546          1    13.970
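For reference, a minimal ddply-based sketch of the same Recency/Frequency/Monetary computation (the downloadable getDataFrame() is the authoritative version; this assumes the df, startDate, and endDate above):

library(plyr)
sel <- df[df$Date >= startDate & df$Date <= endDate, ]
rfm <- ddply(sel, .(ID), summarise,
             Recency   = as.numeric(endDate - max(Date)),  # days from last purchase to endDate
             Frequency = length(Date),                     # number of transactions
             Monetary  = sum(Amount) / length(Date))       # average amount per transaction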

After we calculate the Recency, Frequency, and Monetary, we can implement the RFM scoring.

Independent RFM Scoring

We have implemented the independent scoring in function of “getIndependentScore”. Below is the description of the function. For details, please read the source codes.

# Function

#     getIndependentScore(df,r=5,f=5,m=5)

# Description

#       Scoring the Recency, Frequency, and Monetary in r, f, and m in aliquots independently

# Arguments

#       df – A data frame returned by the function of getDataFrame

#       r –  The highest point of Recency, it is 5 by default

#       f –  The highest point of Frequency, it is 5 by default

#       m –  The highest point of Monetary, it is 5 by default

# Return Value

#       Returns a new data frame with four new columns of “R_Score”,”F_Score”,”M_Score”, and

#  “Total_Score”.
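Roughly speaking, independent scoring assigns each customer a 1-5 rank per dimension; a simplified sketch of the idea (the downloadable implementation handles ties and other details differently, so this is illustrative only):

# rank-based aliquot scoring; Recency is reversed because smaller (more recent) is better
score <- function(x, points = 5, reverse = FALSE) {
  s <- ceiling(rank(x, ties.method = "first") / length(x) * points)
  if (reverse) points + 1 - s else s
}
df$R_Score <- score(df$Recency, reverse = TRUE)
df$F_Score <- score(df$Frequency)   # note: heavy ties are split arbitrarily here
df$M_Score <- score(df$Monetary)
df$Total_Score <- df$R_Score * 100 + df$F_Score * 10 + df$M_Score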

Let's execute the function and take a look at the output.

> df1 <- getIndependentScore(df)

> head(df1[-(2:3)])

   ID  Recency  Frequency  Monetary  R_Score  F_Score  M_Score  Total_Score
11462       52          4  191.6425        5        5        5          555
 8736       55          9  148.3944        5        5        5          555
 7856       54          5  130.0980        5        5        5          555
15105       58          9  129.5256        5        5        5          555
22356      106          8  127.3650        5        5        5          555
 8481       57         13  117.3492        5        5        5          555

Let’s explore the customer distribution in each RFM cell to see if the customers are distributed evenly.

#Draw the histograms in the R, F, and M dimensions so that we can see the distribution of customers in each RFM cell.

> drawHistograms(df1)

[Figure: Customer distribution across the RFM cells, independent scoring]

The distribution is not very even because Frequency has no scores of 1 or 2. Let's find out how many customers have a total score above 500 or 400.

> S500<-df1[df1$Total_Score>500,]

> dim(S500)

[1] 471  10

> S400<-df1[df1$Total_Score>400,]

> dim(S400)

[1] 945  10

We can treat those customers as the more important targets for achieving a higher response rate.

RFM Scoring with Breaks

Sometimes users want to determine the breaks for each dimension themselves according to their own business requirements. For example, a user can set 0-30 days, 30-90 days, 90-180 days, 180-360 days, and more than 360 days as the 5 breaks for Recency, rather than letting the computer segment the customers into 5 aliquots without considering the specific business requirement.

We have implemented this scoring method in the function of “getScoreWithBreaks”. Below is the description of the function. For details, please read the source codes.

# Function

#     getScoreWithBreaks(df,r,f,m)

# Description

#       Scoring the Recency, Frequency, and Monetary in r, f, and m which are vector object containing a series of breaks

# Arguments

#       df – A data frame returned by the function of getDataFrame

#       r –  A vector of Recency breaks

#       f –  A vector of Frequency breaks

#       m –  A vector of Monetary breaks

# Return Value

#       Returns a new data frame with four new columns of “R_Score”,”F_Score”,”M_Score”, and “Total_Score”.

Before we execute the function, we can take a look at the distributions of Recency, Frequency, and Monetary.

> par(mfrow = c(1,3))

> hist(df$Recency)

> hist(df$Frequency)

> hist(df$Monetary)

[Figure: Histograms of Recency, Frequency, and Monetary]

We can find that the distributions are not even. Most of the customers have a Recency of more than 450 days, a Frequency of less than 5 times, and a Monetary of less than 50 dollars. That’s the reason why the customers are not distributed evenly in the RFM cells when using the Independent RFM scoring method in the previous part.

Based on the above observation and relevant business sense, we can construct the breaks for each RFM dimension as follows, so that we can get segmented RFM cells more evenly and reasonably.

# set the Recency ranges as 0-120 days, 120-240 days, 240-450 days, 450-500 days, and more than 500 days

> r <-c(120,240,450,500)

# set the Frequency ranges as 0-2 times, 2-5 times, 5-8 times, 8-10 times, and more than 10 times

> f <-c(2,5,8,10)

# set the Monetary ranges as 0-10 dollars, 10-20 dollars, and so on.

> m <-c(10,20,30,100)

Then we can execute the “getScoreWithBreaks” function and see the customer distribution in the RFM cells.

> df2<-getScoreWithBreaks(df,r,f,m)

> drawHistograms(df2)

[Figure: Customer distribution across the RFM cells, scoring with breaks]

We can also calculate how many customers have a total score of more than 500 or 400.

> S500<-df2[df2$Total_Score>500,]

> dim(S500)

[1] 399  10


> S400<-df2[df2$Total_Score>400,]

> dim(S400)

[1] 641  10

There are 641 customers with an RFM score above 400.

Estimate response rate for each RFM cell4

After we segment the customers into RFM cells, we can assign a response rate to each RFM cell according to historical response data. If it is the first time using RFM analysis and there is no historical data, we can randomly select some customers, say 10 percent, from each RFM cell, send mails to them as a trial, and count the response rate for each cell. Below is an example of a response rate table.

Response Rate Table

RFM cell (Total Score)    Response Rate
555                       8.5%
441                       5.6%
435                       4.8%

Calculate the breakeven point to select valuable customers4

To be breakeven,

P – C/R= 0

Where,

P is the price or revenue per deal or per response

C is the cost per mail sent out, including production cost, mailing cost etc.

R is the response rate.

Suppose P = 100 dollars and C = 5 dollars; to break even:

R = C/P = 5/100 = 5%

That means we should choose the customers in the RFM cells with a response rate equal to or greater than 5% to receive the direct marketing mails in order to make money.

Select the Target Customers

According to the Response Rate Table and the breakeven point, we need to select customers in the cells with a response rate of more than 5%, which means selecting the customers with a total score equal to or greater than 441.

> target <- df2[df2$Total_Score>=441,]

> dim(target)

[1] 416  10

There are 416 customers we will send the direct marketing mails to.

Summary

We introduced the basic concept of RFM customer analysis and implemented the analysis procedures in R language by providing complete R source code.

References

  1. http://en.wikipedia.org/wiki/RFM
  2. http://brucehardie.com/datasets/CDNOW sample.zip
  3. http://www.brucehardie.com/notes/022/RFM_summary_in_Excel.pdf
  4. Recency, Frequency and Monetary (RFM) Analysis (Charlotte Mason)
  5. Case Study: La Quinta narrows target to boost guest loyalty
  6. http://www.responseb2b.com/files/rfm.ppt

Author: Jack Han. All rights reserved. Reposting requires a hyperlink indicating the article's original source and author.


Uncertain Demand Forecasting and Inventory Optimizing for Short-life-cycle Products


For short-life-cycle products such as newspapers and fashion, it is important to match supply with demand. However, due to uncertain demand, we sometimes order too little from the supplier and sometimes too much. Ordering too little loses sales and leaves customers unsatisfied; ordering too much leaves inventory at the end of the season. To reduce the mismatch between demand and supply, on the one hand we need to improve demand forecasting, and on the other hand we need to optimize inventory management with an appropriate service level.

In this article, a fashion company, ABC, is preparing inventory for a newly designed product, Cool-7, for the upcoming season. We will use expert judgment in conjunction with the A/F ratio for demand forecasting, and the Newsvendor model for optimizing inventory management. The R language is used for the statistical computing where needed.

Demand Forecasting

Since Cool-7 is a new product, there is no direct historical data for reference. ABC Company formed a committee, consisting of experts from Marketing, Sales, Channels, and so on, to forecast the demand for Cool-7 in the coming summer season.

The committee's initial demand forecast is 3,500 units. Is that number reliable?

We need to further account for the committee's bias (optimistic, pessimistic, etc.) and estimation error, using the forecast and actual demands of other products in the past. Click here to download the fabricated forecast and actual demand data, and then explore the data.

> #read the dataset from .csv file

> df <- read.csv(file.choose(),header=T)

>

> #list each data column name and the first six lines of the dataset

> head(df)

  Product  Forecast  Actual.Demand  A.F.Ratio (A/F Ratio)
1      P1      1600           1261       0.79
2      P2      2200           2890       1.31
3      P3      3500           3172       0.91
4      P4      1200            895       0.75
5      P5      1800           2500       1.39
6      P6      3000           2330       0.78

Where, the A/F ratio = Actual Demand / Forecast.

># get the mean and standard deviation of A/F Ratio

> mean(df$A.F.Ratio)

[1] 1.0425

> sd(df$A.F.Ratio)

[1] 0.2646721

We will adjust the initial forecast according to the A/F ratio mean as follows.

Adjusted forecast = initial forecast * A/F ratio mean = 3500 * 1.0425 = 3649; we can also call this the statistically expected actual demand.

The standard deviation of the actual demand is 3500 * 0.2647 = 926.
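In R, the adjustment is a one-liner using the statistics computed above:

> initial.forecast <- 3500
> initial.forecast * mean(df$A.F.Ratio)   # adjusted forecast, about 3649
> initial.forecast * sd(df$A.F.Ratio)     # standard deviation of demand, about 926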

Inventory Optimizing

Fashion is a short-life-cycle product, and a company can usually order only once before the sales season. To avoid ordering too much or too little, we can use the Newsvendor model to calculate how many units ABC Company should order from the fashion producer, so that inventory is optimized considering both the overage cost and the underage cost.

Assume the demand has a normal distribution, the retail price is $150 per unit, purchasing price is $100, and the discounted price of leftover products is $70.  We can calculate the Overage cost (Co) and the Underage cost (Cu) respectively:-

Co = 100 – 70 = 30

Cu = 150 – 100 = 50

According to the Newsvendor model, the optimal service level for Cool-7 is:

Cu / (Cu + Co ) = 50 / 80 = 0.625

Given the demand has a mean of 3649 and a standard deviation of 926, what is the order quantity Q* that achieves the optimal service level of 0.625? 1

Q* = mean demand + Z * demand standard deviation,

Where,

Mean demand = 3649;

Demand standard deviation = 926;

Z is the inverse of the standard normal cumulative distribution of the service level.

# get ‘z’

> qnorm(0.625)

[1] 0.3186394

So, Q* = 3649 + 926 * 0.31864 = 3944.  ABC Company should order 3944 units from the producer.
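The whole Newsvendor calculation fits in a few lines of R (a sketch using the numbers above):

> Co <- 100 - 70; Cu <- 150 - 100           # overage and underage costs
> service.level <- Cu / (Cu + Co)           # 0.625
> round(3649 + 926 * qnorm(service.level))  # optimal order quantity, about 3944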

References:

  1. HKUST Prof. Albert Ha’s lecture slides of the Operation Management course;
  2. http://en.wikipedia.org/wiki/Newsvendor_model
  3. www.utdallas.edu/~metin/Or6302/Folios/omnewsvendor.ppt

Author: Jack Han. All rights reserved. Reposting requires a hyperlink indicating the article's original source and author.


Investment Portfolio Analysis with R Language


[Please note that this article is for studying R programming purpose only rather than for advices of investment]

R is widely applied in financial analysis areas such as time series analysis, portfolio management, and risk management, through its base functions and many specialized finance packages. In this article, we will demonstrate with a simple example how to help investors optimize an investment portfolio with R.

Scenario Introduction

Peter is interested in investing in the US stock market and has selected four stocks for consideration: 3M (MMM), Johnson & Johnson (JNJ), P&G (PG), and Wal-Mart (WMT). To avoid “putting all eggs in one basket”, his strategy is to invest in all four stocks rather than in only one.

In this case, what are the optimal asset allocation percentages among the four stocks, assuming no short selling? How can we use R to conduct the portfolio analysis for Peter?

Basic Concepts

Before we dive into the R programming, let’s review some of the basic concepts about the financial portfolio management.

Expected Return

Expected Return is the weighted-average return of a portfolio. The return of each asset in the portfolio can be the mean of the asset's returns over a past period; it can also be calculated by the capital asset pricing model (CAPM), which is used to determine a theoretically appropriate required rate of return of an asset1, or by other factor models. We will use CAPM to calculate the expected return of each stock in this article.

Risk

Risk is the variation of the return of an asset. It can be calculated from the standard deviation of the historical return data, or by factor models. We will use a one-factor model to calculate the variance of each asset and the covariance of each pair of assets in the portfolio.

Sharpe Ratio

“The Sharpe ratio characterizes how well the return of an asset compensates the investor for the risk taken”2. A higher Sharpe Ratio means a better investment option. For a set of risky assets, we can find the optimal portfolio allocations that maximize the Sharpe Ratio.

Capital Asset Pricing Model (CAPM) 1

The CAPM is a model for pricing an individual security or portfolio. The formula of CAPM is in the following:-

E(Ri)  =  Rf + βi * (E(Rm) – Rf)

Where:

E(Ri) is the expect return of an asset

Rf is the risk-free return; we usually use the T-Bill interest rate as Rf

E(Rm) is the expected return of the market; we usually use the SP500 index return as the US stock market return

βi is the sensitivity of the expected excess asset return to the expected excess market return. We can get the value of βi by regressing the asset's historical excess returns on the market's excess returns.

Explore the data

The historical monthly return data of SP500, MMM, JNJ, PG, and WMT from April 2010 through April 2013 can be downloaded from finance.yahoo.com. The one-month T-Bill interest rate, which we use as the risk-free return, can be downloaded from here. The SP500 returns are used as the market return. Here is the csv file of the returns.

#read the historical return data from .csv file

> return <- read.csv(file.choose(),header=T)

#list each data column names and the first six lines of the dataset

> head(return)

Month T.Bills    SP500       MMM       JNJ         PG         WMT

1 2010/4/1  0.0015 -0.08197592 -0.100012182 -0.084992210 -0.017177262 -0.05195068

2 2010/5/1  0.0015 -0.05388238 -0.004060639  0.013053348 -0.018198198 -0.04925373

3 2010/6/1  0.0008  0.06877783  0.083038869 -0.016433240  0.027711507  0.06503700

4 2010/7/1  0.0016 -0.04744917 -0.076044673 -0.009303209 -0.024464286 -0.01495052

5 2010/8/1  0.0015  0.08755110  0.103897868  0.086814872  0.005125389  0.06755024

6 2010/9/1  0.0012  0.03685594 -0.028666339  0.028566390  0.068111455  0.01201442

We also calculated the excess return data set for further use in this excess return csv file. The excess return is equal to the monthly return minus the monthly T-Bill interest rate.

#read the excess return from .csv file

>exReturn <- read.csv(file.choose(),header=T)

#list each data column name and the first five lines of the dataset

>head(exReturn)

Month   SP500_ExcessR  MMM_ExcessR JNJ_ExcessR  PG_ExcessR WMT_ExcessR

1 2010/4/1   -0.08347592 -0.101512182 -0.08649221 -0.018677262 -0.05345068

2 2010/5/1   -0.05538238 -0.005560639  0.01155335 -0.019698198 -0.05075373

3 2010/6/1    0.06797783  0.082238869 -0.01723324  0.026911507  0.06423700

4 2010/7/1   -0.04904916 -0.077644673 -0.01090321 -0.026064286 -0.01655052

5 2010/8/1    0.08605110  0.102397868  0.08531487  0.003625389  0.06605023

6 2010/9/1    0.03565594 -0.029866339  0.02736639  0.066911455  0.01081442

Let’s further explore the means and standard deviations of the excess returns.

# remove the date column

exReturn <- exReturn[,-1]

# mean and standard deviation

Mean=sapply(exReturn,mean)

SD=sapply(exReturn,sd)

cbind(Mean,SD)

                        Mean          SD
SP500_ExcessR    0.008837961  0.04298597
MMM_ExcessR      0.008758935  0.05381220
JNJ_ExcessR      0.010740620  0.03862065
PG_ExcessR       0.008919860  0.03794766
WMT_ExcessR      0.012765898  0.04456828

We will use the mean and standard deviation of SP500 excess returns as the market excess return and risk (standard deviation) in the following parts.

CAPM Analysis

According to the CAPM formula, we will first get the beta of each stock by regressions; then further calculate the expected return of each stock and the covariance matrix of the four stocks; finally we can calculate the optimal asset allocations (weights) of the portfolio consisting of the four stocks.

Beta

We regress the excess returns of each stock on the excess returns of the market (SP500). Take the MMM stock as an example.

# conduct the regression to get the beta of MMM

> lm.MMM <- lm(MMM_ExcessR ~ SP500_ExcessR, data = exReturn)

>summary(lm.MMM)

Call:

lm(formula = MMM_ExcessR ~ SP500_ExcessR, data = exReturn)

Residuals:

Min        1Q    Median        3Q       Max

-0.066906 -0.015134  0.006473  0.019116  0.053404

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)   -0.0005611  0.0049388  -0.114     0.91

SP500_ExcessR  1.0545439  0.1140273   9.248 6.29e-11 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02941 on 35 degrees of freedom

Multiple R-squared: 0.7096,  Adjusted R-squared: 0.7013

F-statistic: 85.53 on 1 and 35 DF,  p-value: 6.291e-11

 

The beta of MMM is 1.0545439, as shown above. The CAPM relationship for MMM is:-

E(R_MMM) - R_f = β_MMM * (E(R_m) - R_f), that is,

Expected excess return of MMM = β_MMM * Expected excess return of SP500,

where the expected excess return of SP500 is the mean of the SP500 excess returns over the past three years, 0.008837961.

So, the expected excess return of MMM = 1.0545439 * 0.008837961 = 0.009320017.

According to the Single-Factor Model, the variance of MMM is:-

σ_MMM² = β_MMM² * σ_M² + σ²(e)

Where,

σ_M² is the variance of SP500, which is 0.042985975²

σ²(e) is the squared residual standard error, 0.02941², as shown in the output above.

So, σ_MMM² = 1.0545439² * 0.042985975² + 0.02941² = 0.054035277²

We can follow the same procedures to get the betas, variances, and expected excess returns of the stocks of JNJ, PG, and WMT as follows.

                         MMM            JNJ            PG             WMT
Variance                 0.054035277²   0.03903649²    0.038404998²   0.045106401²
Beta                     1.0545439      0.438671       0.347552       0.39777
Expected excess return   0.009320017    0.003876957    0.003071651    0.003515476

Based on the above results, we can further calculate the pairwise covariances of the four stocks with the formula σ_ij = β_i * β_j * σ_M². For example, the covariance between MMM and JNJ is:

σ_MMM,JNJ = β_MMM * β_JNJ * σ_M² = 1.0545439 * 0.438671 * 0.042985975² = 0.000854786.

We can calculate the other covariances in the same way and build the covariance matrix as follows.

        MMM           JNJ           PG            WMT
MMM     0.002919811   0.000854786   0.000677233   0.000775087
JNJ     0.000854786   0.001523848   0.000281716   0.000322422
PG      0.000677233   0.000281716   0.001474944   0.00025545
WMT     0.000775087   0.000322422   0.00025545    0.002034587
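
For convenience, the per-stock regressions and the single-factor covariance matrix can also be produced in one pass; below is a minimal sketch assuming the exReturn data frame from above (with the date column already removed). The results should come out close to the figures in the tables above.

# estimate each stock's beta and residual variance against the SP500 excess return
market <- exReturn$SP500_ExcessR
stocks <- exReturn[, c("MMM_ExcessR", "JNJ_ExcessR", "PG_ExcessR", "WMT_ExcessR")]

fits      <- lapply(stocks, function(y) lm(y ~ market))
betas     <- sapply(fits, function(f) unname(coef(f)[2]))   # slope = beta
resid.var <- sapply(fits, function(f) summary(f)$sigma^2)   # residual variance

mkt.mean <- mean(market)   # expected excess market return
mkt.var  <- var(market)    # market variance

exp.excess <- betas * mkt.mean                   # E(Ri) - Rf = beta_i * (E(Rm) - Rf)
cov.mat    <- outer(betas, betas) * mkt.var      # off-diagonal terms: beta_i * beta_j * var(M)
diag(cov.mat) <- betas^2 * mkt.var + resid.var   # diagonal adds each stock's residual variance

exp.excess
cov.mat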

Mean-Variance Portfolio Optimization

Now with the expected excess returns and the covariance matrix ready, we can conduct the Mean-Variance Portfolio Optimization to get the optimal allocation (weight) of each stock so that the Sharpe Ratio of the portfolio is maximized.

Here we will use Guy Yollin’s “effFrontier” and “maxSharpe” functions3, which use the core function of “portfolio.optim” in the “tseries” R package, for the calculations of efficient frontier and maximum Sharpe Ratio.
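
Yollin's functions need to be sourced from his slides before use. In case they are not at hand, here is a minimal grid-search stand-in for the maximum-Sharpe calculation, a simplified sketch of my own built on "portfolio.optim" rather than Yollin's exact code (the name maxSharpeGrid and its arguments are made up here): it scans a grid of target returns, solves the minimum-variance portfolio for each target, and keeps the weights with the largest ratio of expected excess return to risk.

library(tseries)

# simplified stand-in for a maximum-Sharpe search (not Yollin's implementation);
# averet is a 1 x n matrix of expected excess returns, rcov the n x n covariance matrix
maxSharpeGrid <- function(averet, rcov, shorts = FALSE, nports = 200) {
  targets <- seq(min(averet), max(averet), length.out = nports)
  best <- list(sharpe = -Inf, weights = NULL)
  for (pm in targets) {
    sol <- try(portfolio.optim(x = averet, pm = pm, covmat = rcov, shorts = shorts),
               silent = TRUE)
    if (inherits(sol, "try-error")) next
    ret  <- as.numeric(averet %*% sol$pw)                      # expected excess return
    risk <- sqrt(as.numeric(t(sol$pw) %*% rcov %*% sol$pw))    # portfolio risk
    if (ret / risk > best$sharpe) best <- list(sharpe = ret / risk, weights = sol$pw)
  }
  best
}

With this stand-in, maxSharpeGrid(averet, rcov, shorts = F)$weights would play the role of the maxSharpe call below.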

#load tseries package

library(tseries)

# build the expected excess return matrix and the covariance matrix as inputs for the "maxSharpe" function; the numbers are from the calculations in the previous parts

> averet <- matrix(c(0.009320017,0.003876957,0.003071651,0.003515476), nrow=1)

>rcov<-matrix(c(0.002919811,0.000854786,0.000677233,0.000775087,0.000854786,0.001523848,0.000281716,0.000322422,0.000677233,0.000281716,0.001474944,0.00025545,0.000775087,0.000322422,0.00025545,0.002034587),nrow=4)

# get the optimal weights

> weights <- maxSharpe(averet,rcov,shorts=F)

> weights

[1] 0.5779719 0.1794985 0.1332271 0.1093025

According to the above output, Peter can allocate about 58% of his capital to MMM, 18% to JNJ, 13% to PG, and 11% to WMT to approximately maximize the portfolio's Sharpe Ratio, that is, the expected excess return per unit of risk.
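
As a quick check, the optimized portfolio's expected excess return, risk, and Sharpe Ratio can be computed from these weights; a small sketch assuming the averet, rcov, and weights objects from above.

# portfolio expected excess return, risk, and Sharpe ratio at the optimal weights
w <- as.numeric(weights)
port.ret  <- as.numeric(averet %*% w)
port.risk <- sqrt(as.numeric(t(w) %*% rcov %*% w))
c(return = port.ret, risk = port.risk, Sharpe = port.ret / port.risk)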

Summary

In this article, we went through some related basic finance concepts and demonstrated how to use R to conduct portfolio optimization.

References

1. http://en.wikipedia.org/wiki/Capital_asset_pricing_model

2. http://en.wikipedia.org/wiki/Sharpe_ratio

3. http://www.rinfinance.com/RinFinance2009/presentations/yollin_slides.pdf

4. HKUST Prof. Fei Ding’s lecture slides.

Data Sources

Stock:  http://finance.yahoo.com/

T-Bills:  http://research.stlouisfed.org/fred2/series/TB4WK/downloaddata?cid=116

Author: Jack Han. All rights reserved. Reposting must credit the original source and author with a hyperlink.


Analyzing the 2012 London Olympics Women's 10m Air Rifle Final with R


In the Women's 10m Air Rifle event at the 2012 London Olympics, Yi Siling (China), Sylwia Bogacka (Poland), and Yu Dan (China) won the gold, silver, and bronze medals respectively. The other five finalists were EMMONS Katerina, GRAY J Lynn, AHMADI Elaheh, SCHERER Sarah, and VDOVINA Daria.

Each athlete's score for every shot of the final can be found on the official London 2012 website1 and can also be downloaded here (airrifledata). With a simple statistical analysis of the eight finalists' scores in R, we can get a rough picture of their average level, consistency, and psychological trend during the final2.

1. Average Level

Judging from the athletes' average scores in the final (Figure 1), Yi Siling had the best average and Yu Dan the second best, while Bogacka, who eventually won the silver medal, ranked only fourth. We can infer that her qualification score was better than Yu Dan's and EMMONS Katerina's, which is why she finished with the silver medal.

meanstd

Figure 1: Means and standard deviations

2. Consistency

Since the athletes' averages differ little, we can compute the standard deviation of each athlete's scores (Figure 1) to assess how consistently they performed in the final. From the R output, Yi Siling and AHMADI Elaheh were relatively consistent, while Bogacka was relatively inconsistent.

The boxplots below (Figure 2) also give an intuitive view of the eight athletes' consistency in the final. We can also see outliers (beyond 1.5 IQR) in the scores of Yi Siling, AHMADI Elaheh, and SCHERER Sarah, which shows that unusually good or bad shots do happen: a small change in any of the many factors affecting shooting, such as the athlete's mental state, can swing the score of a single shot considerably, a case of "a tiny slip leading to a big miss."

boxplot

Figure 2: Boxplots

3. Psychological Trend During the Final

During the final, does a shooter's state go from tense to relaxed, or from relaxed to tense? Fitting a simple linear regression of each athlete's scores against the round number (Figure 3), we can see that six of the eight athletes show a downward trend in score as the rounds progress. It appears that most shooters become more tense towards the end of the final.

reg

Figure 3: Linear regression of score against round

We can also see that Yi Siling's trend was stable, with a slight upward slope. In a post-match interview she explained that she felt little pressure because she does not understand English: "I couldn't understand the referee's score announcements or know how my opponents were doing, so there was no pressure."

A steady mindset together with a high average level is the main reason she won the gold medal.

References:

1. Data source: http://www.london2012.com/shooting/event/women-10m-air-rifle/phase=shw101100/index.html

2. Part of the analysis follows "Case 1.3: Analyzing the consistency of athletes' performance" in Statistics: Cases and Analysis by Jia Junping, Hao Jing, et al. However, I believe the reasoning about the athletes' nervousness at the bottom of page 19 and the top of page 20 of that book is flawed: the skewness of the score distribution cannot tell us how an athlete's emotions changed over the course of the final, because the shape of the distribution has nothing to do with the order (round) in which the scores occurred.

Appendix: R source code

#load the data into data frame from .csv file

dataDir <- "D:\\R\\Women's 10m Air Rifle.csv"

df <- read.csv(dataDir,header=TRUE)

#draw boxplots

bp <- boxplot(df, boxwex = 0.5, col = "yellow",
              main = "Women's 10m Air Rifle - Final",
              xlab = "Athlete", ylim = c(9, 11),
              ylab = "Score")

#label the outliers in the boxplots
x <- bp$group + 0.15
y <- bp$out + 0.07
text(x, y, bp$out)

#mean

sapply(df,mean)

 

#Std Dev

sapply(df,sd)

#plot & simple linear regression

x <- c(1:10)

par(mfrow=c(3,3))

for ( i in 1:8 ){

plot(x, df[,i], xlab="Round", ylab="Score", main=names(df)[i], col="red", pch=20)

lm.reg <- lm(df[,i]~x)

abline(lm.reg)

}

Author: Jack Han. All rights reserved. Reposting must credit the original source and author with a hyperlink.


Data Analysis for Marketing Research with R Language (1)


Data analysis techniques such as the t-test, ANOVA, regression, conjoint analysis, and factor analysis are widely used in marketing research areas such as A/B testing, consumer preference analysis, market segmentation, product pricing, sales driver analysis, and sales forecasting. Traditionally the main analysis tools have been SPSS and SAS; however, the open source R language is catching up.

In this article, we will introduce how to use R to conduct some basic marketing research with a sample case, and then show how to implement the same analysis in Oracle R Enterprise, which integrates R with Oracle Database.

Scenario Introduction and Marketing Research Objectives

A fictitious company, the ABC store chain, is selling a new type of grape juice in some of its stores as a pilot. The marketing team of ABC wants to analyze:-

  • Which type of in-store advertisement is more effective? They have placed two types of ads in stores for testing, one theme is natural production of the juice, the other theme is family health caring;
  • The Price Elasticity – the reactions of sales quantity of the grape juice to its price change;
  • The Cross-price Elasticity – the reactions of sales quantity of the grape juice to the price changes of other products such as apple juice and cookies in the same store;
  • How to find the best unit price of the grape juice which can maximize the profit and the forecast of sales with that price.

The marketing team has randomly sampled 30 observations and constructed the following dataset for the analysis. There are 5 variables (data columns) in the dataset.

Variable        Description
sales           Total unit sales of the grape juice in one week in a store
price           Average unit price of the grape juice in the week
ad_type         The type of in-store advertisement used to promote the grape juice:
                ad_type = 0, the theme of the ad is natural production of the juice;
                ad_type = 1, the theme of the ad is family health caring
price_apple     Average unit price of the apple juice in the same store in the week
price_cookies   Average unit price of the cookies in the same store in the week

The dataset can be downloaded here. Please note that the dataset was made up by the author for illustration purposes only, so it may look different from real-world data.

Data Exploration

Let’s have some basic exploration to know more about the dataset.

#load the libraries needed in the following codes
> library(s20x)
> library(car)

> #read the dataset from an existing .csv file
> df <- read.csv(file.choose(),header=T)

> #list the name of each variable (data column) and the first six rows of the dataset
> head(df)
sales price ad_type price_apple price_cookies
1   222  9.83       0        7.36          8.80
2   201  9.72       1        7.43          9.62
3   247 10.15       1        7.66          8.90
4   169 10.04       0        7.57         10.26
5   317  8.38       1        7.33          9.54
6   227  9.74       0        7.51          9.49

> # basic statistics of the variables
> summary(df)
     sales           price          ad_type     price_apple    price_cookies
 Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300   Min.   : 8.790
 1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438   1st Qu.: 9.190
 Median :204.5   Median : 9.855   Median :0.5   Median :7.580   Median : 9.515
 Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659   Mean   : 9.622
 3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805   3rd Qu.:10.140
 Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290   Max.   :10.580

From the above summary table, we can roughly know the basic statistics of each numeric variable. For example, the mean value of sales is 216.7 units, the min value is 131, and the max value is 335. Please ignore the statistics of the “ad_type” there since it is a categorical variable.

We can further explore the distribution of the data of sales by visualizing the data in graphical form as follows.

> #set the 1 by 2 layout plot window
> par(mfrow = c(1,2))
>
> # boxplot to check if there are outliers
> boxplot(df$sales, horizontal = TRUE, xlab="sales")
>
> # histogram to explore the data distribution shape
> hist(df$sales, main="", xlab="sales", prob=T)
> lines(density(df$sales), lty="dashed", lwd=2.5, col="red")

dataexplore

We find no outliers in the box plot above, and the sales distribution is roughly normal, so no further data cleaning or treatment is necessary.

Analysis of Ad Effectiveness

The marketing team wants to find out which of the two ad types is more effective for sales, the natural production theme or the family health caring theme, so that the better one can be placed in all of ABC's stores after the pilot period.

To find out the better ad, we can calculate and compare the mean of sales with the two different ad types at the first step.

> #divide the dataset into two sub dataset by ad_type
> sales_ad_nature = subset(df,ad_type==0)
> sales_ad_family = subset(df,ad_type==1)
>
> #calculate the mean of sales with different ad_type
> mean(sales_ad_nature$sales)
[1] 186.6667
> mean(sales_ad_family$sales)
[1] 246.6667

The mean of sales with the natural production theme is about 187, while the mean with the family health caring theme is about 247. The latter looks better. However, this conclusion is based only on a sample of 30 randomly selected observations. To find out how likely it is that the conclusion holds for the whole population, we need a statistical test: the two-sample t-test.

Before conducting the t-test, it is important to check its assumptions, namely that the observations are normally distributed and independent; otherwise the test results are not valid. The observations are independent since they were randomly sampled. Let's check normality by plotting the distribution shapes of the two groups of sales data.

> #set the 1 by 2 layout plot window
> par(mfrow = c(1,2))
>
> # histograms to explore the data distribution shapes
> hist(sales_ad_nature$sales, main="", xlab="sales with nature production theme ad", prob=T)
> lines(density(sales_ad_nature$sales), lty="dashed", lwd=2.5, col="red")
>
> hist(sales_ad_family$sales, main="", xlab="sales with family health caring theme ad", prob=T)
> lines(density(sales_ad_family$sales), lty="dashed", lwd=2.5, col="red")

hist

We can see that the shapes are roughly normally distributed. We can also check the normality by Shapiro-Wilk test as follows.

> shapiro.test(sales_ad_nature$sales)

Shapiro-Wilk normality test
data:  sales_ad_nature$sales
W = 0.9426, p-value = 0.4155

> shapiro.test(sales_ad_family$sales)

Shapiro-Wilk normality test
data:  sales_ad_family$sales
W = 0.8974, p-value = 0.08695

The p-values of the Shapiro-Wilk tests are larger than 0.05, so there is no strong evidence to reject the null hypothesis that the two groups of sales data are normally distributed.

Now we can conduct the t-test since the t-test assumptions are met.

> t.test(sales_ad_nature$sales,sales_ad_family$sales)

Welch Two Sample t-test
data:  sales_ad_nature$sales and sales_ad_family$sales
t = -3.7515, df = 25.257, p-value = 0.0009233
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -92.92234 -27.07766
sample estimates:
mean of x mean of y
186.6667  246.6667

From the output of t-test above, we can say that:-

We have strong evidence to say that the population means of the sales with the two different ad types are different because the p-value of the t-test is very small;

With 95% confidence, we can estimate that the mean of the sales with the natural production theme ad is somewhere between 27 and 93 units less than that of the sales with the family health caring theme ad.

So the conclusion is that the ad with the theme of family health caring is BETTER.

Sales Driver Analysis and Price Elasticity Analysis

With the information given in the data set, we can explore how grape juice price, ad type, apple juice price, cookies price influence the sales of grape juice in a store by multiple linear regression analysis. Here, “sales” is the dependent variable and the others are independent variables.

Let’s investigate the correlation between the sales and other variables by displaying the correlation coefficients in pairs.

> pairs(df,col=”blue”,pch=20)
> pairs20x(df)

paircorreletion
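
If exact figures are preferred to reading them off the scatter-plot matrix, the correlation matrix can also be printed directly; a one-line sketch assuming the df data frame loaded above (cor() keeps the signs, while the text below quotes magnitudes).

# signed correlation matrix of all variables
round(cor(df), 2)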

The correlation coefficients (in absolute value) between sales and price, ad_type, price_apple, and price_cookies are 0.85, 0.58, 0.37, and 0.37 respectively, which means each of them might have some influence on the sales, so we can try to add all of the independent variables into the regression model as follows.

> sales.reg<-lm(sales~price+ad_type+price_apple+price_cookies,df)
> summary(sales.reg)

Call:
lm(formula = sales ~ price + ad_type + price_apple + price_cookies,
data = df)

Residuals:
Min      1Q  Median      3Q     Max
-36.290 -10.488   0.884  10.483  29.471

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    774.813    145.349   5.331 1.59e-05 ***
price          -51.239      5.321  -9.630 6.83e-10 ***
ad_type         29.742      7.249   4.103 0.000380 ***
price_apple     22.089     12.512   1.765 0.089710 .
price_cookies  -25.277      6.296  -4.015 0.000477 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.2 on 25 degrees of freedom
Multiple R-squared: 0.8974,    Adjusted R-squared: 0.881
F-statistic: 54.67 on 4 and 25 DF,  p-value: 5.318e-12

The p-values for price, ad_type, and price_cookies in the last column of the above output are much less than 0.05. These variables are significant in explaining the sales, so we are confident to include them in the model.

The p-value of price_apple is a bit larger than 0.05, so there seems to be no strong evidence that the apple juice price explains the sales. However, real-life experience tells us that when the apple juice price is lower, consumers are likely to buy more apple juice, and the sales of other fruit juices will then decrease. So we can still add it to the model to explain the grape juice sales.

The Adjusted R-squared is 0.881, which indicates a reasonable goodness of fit: 88% of the variation in sales can be explained by the four variables, and the remaining 12% can be attributed to other factors or inherent variability. Please note the R-squared is very high here because the dataset was made up rather than taken from a real-world data source.

The assumptions for the regression to be true are that data are random and independent; residuals are normally distributed and have constant variance. Let’s check the residuals assumptions visually.

# plotting the residuals vs. other key model metrics
> par(mfrow=c(2,2))
> plot(sales.reg)

residuals

The Residuals vs Fitted graph above shows that the residuals scatter around the fitted line with no obvious pattern, and the Normal Q-Q graph shows that basically the residuals are normally distributed. The assumptions are met.

For multiple regression, it is also important to check the multicollinearity among the variables, because high multicollinearity makes the coefficients of the independent variables less precise and introduces large errors into the predictions of the dependent variable. We can investigate the multicollinearity by displaying the correlation coefficients of the independent variables in pairs, as we did at the beginning of this part. We can also check it with the following command in R.

#check multicollinearity
> vif(sales.reg)
price       ad_type   price_apple price_cookies
1.246084      1.189685      1.149248      1.099255

The VIF test value for each variable is close to 1, which means the multicollinearity is very low among these variables.

Based on the above analysis, we can accept the regression result and construct the multi-linear model of sales as follows.

Sales = 774.81 – 51.24 * price + 29.74 * ad_type + 22.1 * price_apple – 25.28 * price_cookies

With the model established, we can analyze the Price Elasticity (PE) and Cross-price Elasticity (CPE) to predict the reactions of sales quantity to price. "Price elasticity is defined as %ΔQ/%ΔP, which indicates the percent change in quantity divided by the percent change in price; Cross-price Elasticity is the percent change in quantity divided by the change in the price of some other product."1

PE = (ΔQ/Q) / (ΔP/P) = (ΔQ/ΔP) * (P/Q) = -51.24 * 0.045 = -2.3

P is price, Q is sales quantity

ΔQ/ΔP = -51.24 , the parameter before the variable “price” in the above model

P/Q = 9.738 / 216.7 = 0.045, where P is the mean price in the dataset and Q is the mean sales quantity

The PE indicates that a 10% decrease in price will increase the sales by 23%, and vice versa.

Let's further calculate the CPE on apple juice and cookies to analyze how changes in the apple juice price and the cookies price influence the sales of grape juice.

CPEapple = (ΔQ/ΔPapple) * (Papple/Q) = 22.1 * ( 7.659 / 216.7) = 0.78

CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q) = -25.28 * ( 9.622 / 216.7) = -1.12

The CPEapple indicates that a 10% decrease in the apple juice price will DECREASE the sales by 7.8%, and vice versa. So grape juice and apple juice are substitutes.

The CPEcookies indicates that a 10% decrease in the cookies price will INCREASE the sales by 11.2%, and vice versa. So grape juice and cookies are complements. Placing the two products together will likely increase the sales of both.

We can also see that the sales increase by 29.74 units when the ad with the family health caring theme is used (ad_type = 1).
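
The same elasticities can be computed directly from the fitted coefficients and the sample means; a small sketch assuming the sales.reg model and the df data frame from above.

# price elasticity and cross-price elasticities evaluated at the sample means
b <- coef(sales.reg)
Q <- mean(df$sales)
PE          <- unname(b["price"])         * mean(df$price)         / Q
CPE_apple   <- unname(b["price_apple"])   * mean(df$price_apple)   / Q
CPE_cookies <- unname(b["price_cookies"]) * mean(df$price_cookies) / Q
round(c(PE = PE, CPE_apple = CPE_apple, CPE_cookies = CPE_cookies), 2)
# should come out near -2.3, 0.78, and -1.12, matching the rounded figures above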

Optimal Pricing and Sales Prediction

Usually companies want higher profit rather than just higher sales quantity. So how should we set the price of the new grape juice to maximize profit, based on the dataset collected in the pilot period and the regression model above?

To simplify the question, we can let the ad_type = 1, the price_apple = 7.659 (mean value), and the price_cookies = 9.738 (mean value).

The model is simplified as follows:-

Sales = 774.81 – 51.24 * price + 29.74 * 1 + 22.1 * 7.659 – 25.28 * 9.738

Sales = 772.64 – 51.24*price

Assume the marginal cost(C) per unit of grape juice is 5. We can calculate the profit (Y) by the following formula.

Y = (price – C) * Sales Quantity = (price – 5) * (772.64 – 51.24*price)

Y = – 51.24 * price² + 1028.84 * price – 3863.2

To get the optimal price to maximize Y, we can use the following R function.

> f = function(x) -51.24*x^2 + 1028.84*x - 3863.2
> optimize(f,lower=0,upper=20,maximum=TRUE)
$maximum
[1] 10.03942
$objective
[1] 1301.28

The optimal price is 10.04; the maximum profit will be 1301 according to the above output. In reality, we can reasonably set the price to be 10 or 9.99.

We can further use the model to predict the sales while the price is 10.

> # predict the sales
> inputData <- data.frame(price=10,ad_type=1,price_apple=7.659,price_cookies=9.738)
> predict(sales.reg, inputData, interval="p")
fit           lwr           upr
1 215.1978 176.0138 254.3817

The sales forecast is 215 units on average per store per week, with a 95% prediction interval of 176 to 254 units. Based on the forecast and other factors, ABC Company can prepare the inventory for all of its stores after the pilot period.

Implementation in Oracle R Enterprise

"Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment."3 With ORE it is also not necessary to load all of the data from the database into the R environment, which usually runs on a desktop or laptop with limited RAM and CPU. This is helpful when we have millions of data records to analyze. Let's implement the regression model with ORE.

# load the Oracle R Enterprise library and connect to Oracle Database
> library(ORE)
> ore.connect(user = "<DBUser>", sid = "orcl", host = "<host name>", password = "<password>", port = 1521, all = TRUE)

# regression by ore.lm
> sales.reg <- ore.lm(SALES ~ PRICE+AD_TYPE+PRICE_APPLE+PRICE_COOKIES, data = GRAPEJUICE)
> summary(sales.reg)

The output is the same as that produced by the "lm" function.

Summary

In this article, using the open source R language, we showed how to test the difference in effectiveness between ad types, how to analyze the price elasticity and cross-price elasticity of a product, and how to set the optimal price to maximize profit and then forecast sales at that price.

 

References:

  1. Prof. Ronald T. Wilcox’s lecture slides of “Estimating Demand from Scanner Data”
  2. http://xccds1977.blogspot.jp
  3. https://blogs.oracle.com/R/entry/analyzing_big_data_using_the

 

Author: Jack Han. All rights reserved. Reposting must credit the original source and author with a hyperlink.


An Introduction to Conjoint Analysis


Conjoint Analysis is a market research method that combines experimental design with multiple regression analysis and is mainly used in new product development. A product (or model) can be defined by its main attributes and the corresponding attribute levels; for example, a certain model of soybean-milk maker can be described by attributes and levels such as "heating time: 60 seconds", "capacity: 5 cups", and "price: 500 RMB". By collecting and analyzing users' ratings of their purchase intention for different product profiles in the same category, Conjoint Analysis can measure how strongly each attribute influences purchase intention, how strongly each level within an attribute influences it, and which combination of levels is optimal; it can further be used to estimate the market share of each product profile.

Suppose we need to develop a soybean-milk maker. Its main attributes and the common levels of each attribute are as follows:

Pressing/heating time (seconds)   Capacity (cups)   Price (RMB)
60                                3                 300
100                               5                 450
130                               8                 550

Combining all the attributes and levels gives 3³ (27) potential product profiles; each profile can be thought of as a product on the market, or a particular model within the category. A user rates each profile on a scale of 1 to 7, where a higher rating indicates a stronger relative purchase intention, as shown in the table below.

          Time (60 s)               Time (100 s)              Time (130 s)
Price     3 cups  5 cups  8 cups    3 cups  5 cups  8 cups    3 cups  5 cups  8 cups
300 RMB   6       7       5         5       6       4         5       6       3
450 RMB   5       6       4         4       5       3         4       5       3
550 RMB   3       4       2         2       3       1         1       2       1

 

Assume the following multiple linear regression model between the user's relative purchase intention (U) and the product attribute levels:

U = C + β11 * X11 + β12 * X12 + β21 * X21 + β22 * X22 + β31 * X31 + β32 * X32

Where:

X11 indicates whether the price is 450 RMB

X12 indicates whether the price is 550 RMB

X21 indicates whether the time is 100 seconds

X22 indicates whether the time is 130 seconds

X31 indicates whether the capacity is 5 cups

X32 indicates whether the capacity is 8 cups

C is a constant, and X11, X12, X21, X22, X31, X32 are dummy variables.

The prepared dataset is attached. We use R to run the multiple linear regression; the R code is as follows:

> dataset <- read.csv(file.choose(),header=T)
> lm.reg<-lm(dataset[,1]~dataset[,"X11"]+dataset[,"X12"]+dataset[,"X21"]+dataset[,"X22"]+dataset[,"X31"]+dataset[,"X32"])
> summary(lm.reg)

The output is as follows:

 Coefficients:

                 Estimate Std. Error t value Pr(>|t|)

(Intercept)        6.0000     0.1518  39.524  < 2e-16 ***

dataset[, "X11"]  -0.8889     0.1405  -6.325 3.58e-06 ***

dataset[, "X12"]  -3.1111     0.1405 -22.136 1.54e-15 ***

dataset[, "X21"]  -1.0000     0.1405  -7.115 6.78e-07 ***

dataset[, "X22"]  -1.3333     0.1405  -9.487 7.61e-09 ***

dataset[, "X31"]   1.0000     0.1405   7.115 6.78e-07 ***

dataset[, "X32"]  -1.0000     0.1405  -7.115 6.78e-07 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Residual standard error: 0.2981 on 20 degrees of freedom

Multiple R-squared: 0.9762,     Adjusted R-squared: 0.969

F-statistic: 136.7 on 6 and 20 DF,  p-value: 3.699e-15

From the above results, we can calculate the relative influence of each product attribute and each attribute level on the user's purchase intention. The influence of an attribute level is called its utility, and the influence of an attribute is called its importance (weight).

The utilities of the attribute levels are calculated as follows:

First, construct the system of equations:

U_price450 - U_price300 = -0.89  (note: -0.89 is the regression coefficient of X11; see the regression output above)

U_price550 - U_price300 = -3.11  (note: -3.11 is the regression coefficient of X12; see the regression output above)

U_price300 + U_price450 + U_price550 = 0

Solving the system gives: U_price300 = 1.33; U_price450 = 0.44; U_price550 = -1.78.
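
The same three equations can be solved in R; a small sketch using the coefficients from the regression output above.

# solve the three equations for the part-worth utilities of the price levels
A <- rbind(c(-1, 1, 0),   # U_price450 - U_price300 = -0.89
           c(-1, 0, 1),   # U_price550 - U_price300 = -3.11
           c( 1, 1, 1))   # U_price300 + U_price450 + U_price550 = 0
b <- c(-0.89, -3.11, 0)
utils <- solve(A, b)
names(utils) <- c("price300", "price450", "price550")
round(utils, 2)   # should give 1.33, 0.44, -1.78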

In the same way, we can calculate the utilities of each time level and each capacity level, as shown in the table below.

Attribute   Level      Utility   Utility range (max - min)   Importance
Price       300 RMB     1.33     3.11 (1.33 - (-1.78))       48.29%
            450 RMB     0.44
            550 RMB    -1.78
Time        60 s        0.78     1.33 (0.78 - (-0.55))       20.65%
            100 s      -0.22
            130 s      -0.55
Capacity    3 cups      0        2 (1 - (-1))                31.06%
            5 cups      1
            8 cups     -1

Note: the importance of each attribute in the table above is calculated as that attribute's utility range divided by the sum of the utility ranges of all attributes. Taking price as an example: 3.11 / (3.11 + 1.33 + 2) = 48.29%.
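
The importance weights can be computed the same way in R; a small sketch using the utility ranges from the table above.

# importance = each attribute's utility range / sum of all ranges
ranges <- c(price = 1.33 - (-1.78), time = 0.78 - (-0.55), capacity = 1 - (-1))
round(ranges / sum(ranges), 4)   # should give about 0.4829, 0.2065, 0.3106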

From the level utilities, we can see that a soybean-milk maker priced at 300 RMB, with a 60-second pressing/heating time and a 5-cup capacity, has the highest total utility and is the one users like most. Generally speaking (luxury goods aside), the lower the price, the more users like the product; but the company also needs to weigh the corresponding profit.

Comparing the attribute importances, price has the largest influence on the purchase decision, followed by capacity and then heating time. So if we want to refine the product design, focusing on capacity rather than heating time is likely the better choice.

Suppose there are already two competing products on the market: Product A (450 RMB, 100 s, 8 cups) and Product B (300 RMB, 130 s, 5 cups). If we launch Product C (450 RMB, 60 s, 8 cups), how can we estimate Product C's market share?

Based on the regression results from R, the fitted regression equation is:

U = 6 - 0.89 * X11 - 3.11 * X12 - 1 * X21 - 1.33 * X22 + 1 * X31 - 1 * X32

First, calculate the total utility of each product from the equation above:

Product A = 6 - 0.89 * 1 - 3.11 * 0 - 1 * 1 - 1.33 * 0 + 1 * 0 - 1 * 1 = 3.11

Product B = 6 - 0.89 * 0 - 3.11 * 0 - 1 * 0 - 1.33 * 1 + 1 * 1 - 1 * 0 = 5.67

Product C = 6 - 0.89 * 1 - 3.11 * 0 - 1 * 0 - 1.33 * 0 + 1 * 0 - 1 * 1 = 4.11

Based on the total utilities, the market share of each product is estimated as follows:

Product A = 3.11 / (3.11 + 5.67 + 4.11) = 24%

Product B = 5.67 / (3.11 + 5.67 + 4.11) = 44%

Product C = 4.11 / (3.11 + 5.67 + 4.11) = 32%
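
The same calculation can be scripted in R; a small sketch assuming the fitted coefficients above (the helper function U and the dummy codings of products A, B, and C are written out here only for illustration).

# total utility of a profile given its dummy coding
U <- function(x11, x12, x21, x22, x31, x32)
  6 - 0.89*x11 - 3.11*x12 - 1*x21 - 1.33*x22 + 1*x31 - 1*x32

u <- c(A = U(1, 0, 1, 0, 0, 1),   # 450 RMB, 100 s, 8 cups
       B = U(0, 0, 0, 1, 1, 0),   # 300 RMB, 130 s, 5 cups
       C = U(1, 0, 0, 0, 0, 1))   # 450 RMB, 60 s, 8 cups

round(u / sum(u), 2)   # utility-share estimates: about 0.24, 0.44, 0.32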

Note: this estimate assumes that the overall market environment and the competitors' products will not change much in response; market-share estimates of this kind are therefore more suitable for products with long development cycles.

 


 

Author: Jack Han. All rights reserved. Reposting must credit the original source and author with a hyperlink.
