
RFM Customer Analysis with R Language


Database marketing and direct marketing people are always concerned about two questions before they send out mails or make calls to their customers:-

  1. How can they segment the customers in the database to find out who is more likely to respond to their mails or buy their products?
  2. Which types of customers should they send the mails to so that they can break even and make a profit?

The RFM method is a simple but effective customer analysis technique that addresses the above questions.

R, F, and M stand for:

         Recency – How recently did the customer purchase?

         Frequency – How often do they purchase?

         Monetary Value – How much do they spend (each time on average)?

         (The above definitions are from Wikipedia.)

You can get more details on RFM from the following paper and case study besides Wikipedia:-

Segmentation and RFM Basics (Bill Ruppert)

Case Study: La Quinta narrows target to boost guest loyalty

There seems to be no out-of-the-box package for RFM analysis in R, while other data analysis tools such as IBM SPSS have easy-to-use RFM functionality. So in this article, we will demonstrate how to implement RFM analysis in R.

Explore Data

We will use the CDNOW sample data in this demonstration; you can download the data here. For more details about the dataset, please read the paper “Creating an RFM Summary Using Excel” (Peter S. Fader, Bruce G. S. Hardie).

Let’s explore the data.

> # read CDNOW_SAMPLE.txt

> df <- read.table(file.choose(),header=F)

> # construct a data frame with the necessary columns of customer ID, transaction date, and money amount paid by a customer per transaction

> df <- as.data.frame(cbind(df[,1],df[,3],df[,5]))

> # add appropriate column names for the above three columns

> names <- c("ID","Date","Amount")

> names(df) <- names

> # transform the text column into the Date type

> df[,2] <- as.Date(as.character(df[,2]),"%Y%m%d")

> head(df)

   ID   Date         Amount
1   4   1997-01-01    29.33
2   4   1997-01-18    29.73
3   4   1997-08-02    14.96
4   4   1997-12-12    26.48
5  21   1997-01-01    63.34
6  21   1997-01-13    11.77

> dim(df)

[1] 6919    3

> # remove the rows with duplicated IDs to see how many customers there are in total

> uid <- df[!duplicated(df[,"ID"]),]

> dim(uid)

[1] 2357    3

As shown in the output above, the data frame now contains the three columns of customer ID, transaction date, and amount paid per transaction. There are 6919 transaction records and 2357 unique customers in the sample dataset.

Segment the Customers into RFM Cells

Now that the data is ready, we can segment the customers into RFM cells along the three dimensions of R, F, and M. Usually we score the customers from 1 to 5 points in each dimension; the higher the score, the better.

Experience shows that Recency is the most important factor (x100) for a high response rate, Frequency the second (x10), and Monetary the last (x1). Customers segmented into the same RFM cell share the same score. A score of “542” means that the customer gets 5 points in Recency, 4 points in Frequency, and 2 points in Monetary.

There are usually two ways to segment the customers in each dimension, one is Nested, the other is Independent.

The nested method divides the customers into aliquots in the Recency dimension first; then, within each Recency aliquot, it divides the customers into aliquots in the Frequency dimension, and likewise for Monetary. The advantage of the nested method is that the number of customers in each RFM cell is roughly the same across all cells. The disadvantage is that the same Frequency or Monetary score can mean different things in different cells: for one customer with an RFM score of “543” and another with “443”, the underlying Frequency values might be quite different even though both have the same Frequency score of “4”.

The independent method rates the customers into aliquots in the R, F, and M dimensions independently. The advantage is that the same Frequency or Monetary score always means the same thing; the disadvantage is that the numbers of customers in the RFM cells can be quite different.
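As an illustration of the independent method, each dimension can be scored on its own with quantile-based breaks. The sketch below scores a made-up Recency vector (hypothetical data; the article's actual implementation comes in the sections below):

```r
# Independent scoring sketch on made-up Recency values (hypothetical data).
# Each dimension is cut into 5 aliquots by its own quantiles, ignoring the others.
recency <- c(10, 40, 90, 200, 360, 500, 30, 150, 420, 60)  # days since last purchase

# 20%/40%/60%/80% quantiles give 4 inner break points for 5 aliquots
q <- quantile(recency, probs = seq(0.2, 0.8, by = 0.2))

# Lower Recency is better, so reverse the direction: most recent -> 5 points
r_score <- 5 - findInterval(recency, q)

print(data.frame(recency, r_score))
```

Frequency and Monetary would be scored the same way from their own quantiles (without reversing the direction), and the three digits then combined into the cell score.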

You can read the discussion in Quora about the pros and cons of the above two methods.

In this article, we will implement the independent method in R first, and then implement another method that lets users input the break parameters for each dimension according to their own business requirements.

R Source Codes

The complete R source code can be downloaded from RFM_Analysis_R_Source_Codes. Please read the descriptions in the source code before you continue.

Calculate the Recency, Frequency, and Monetary

To implement the RFM analysis, we need to further process the CDNOW sample data set by the following steps:-

  1. Remove the duplicate records with the same customer ID
  2. Find the most recent purchase date for each ID and calculate the number of days from it to now (or some other reference date), to get the Recency data
  3. Count the transactions of each customer, to get the Frequency data
  4. Sum the amount of money each customer spent and divide it by the Frequency, to get the average amount per transaction, that is, the Monetary data.

We have already implemented the above procedure in the function “getDataFrame” in the source code.
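The four steps can be sketched in a few lines of R. This is a simplified illustration on a tiny made-up data frame with the ID/Date/Amount columns built earlier; the actual getDataFrame in the download also handles the date-range filtering:

```r
# R/F/M calculation sketch (simplified; made-up transactions)
df <- data.frame(ID     = c(4, 4, 21, 21),
                 Date   = as.Date(c("1997-01-01", "1997-12-12",
                                    "1997-01-01", "1997-01-13")),
                 Amount = c(29.33, 26.48, 63.34, 11.77))
endDate <- as.Date("1998-07-01")   # reference date for Recency

# For each customer: days since the last purchase (Recency), number of
# transactions (Frequency), and average amount per transaction (Monetary)
rfm <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  data.frame(ID        = x$ID[1],
             Recency   = as.numeric(endDate - max(x$Date)),
             Frequency = nrow(x),
             Monetary  = sum(x$Amount) / nrow(x))
}))

print(rfm)
```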

# set the startDate and endDate; we will only analyze the records in this date range

> startDate <- as.Date("19970101","%Y%m%d")

> endDate <- as.Date("19980701","%Y%m%d")

> df <- getDataFrame(df,startDate,endDate)

> head(df)

   ID   Date         Amount   Recency   Frequency   Monetary
    4   1997-12-12    26.48       201           4     25.125
   18   1997-01-04    14.96       543           1     14.960
   21   1997-01-13    11.77       534           2     37.555
   50   1997-01-01     6.79       546           1      6.790
   60   1997-02-01    21.75       515           1     21.750
   71   1997-01-01    13.97       546           1     13.970

After we calculate the Recency, Frequency, and Monetary, we can implement the RFM scoring.

Independent RFM Scoring

We have implemented the independent scoring in the function “getIndependentScore”. Below is the description of the function. For details, please read the source code.

# Function

#     getIndependentScore(df,r=5,f=5,m=5)

# Description

#       Score the Recency, Frequency, and Monetary into r, f, and m aliquots independently

# Arguments

#       df – A data frame returned by the function getDataFrame

#       r –  The highest score of Recency; 5 by default

#       f –  The highest score of Frequency; 5 by default

#       m –  The highest score of Monetary; 5 by default

# Return Value

#       Returns a new data frame with four new columns: "R_Score", "F_Score", "M_Score", and "Total_Score".

Let’s execute the function and take a look at the output.

> df1 <- getIndependentScore(df)

> head(df1[-(2:3)])

   ID     Recency   Frequency   Monetary   R_Score   F_Score   M_Score   Total_Score
11462         52           4   191.6425         5         5         5           555
 8736         55           9   148.3944         5         5         5           555
 7856         54           5   130.0980         5         5         5           555
15105         58           9   129.5256         5         5         5           555
22356        106           8   127.3650         5         5         5           555
 8481         57          13   117.3492         5         5         5           555

Let’s explore the customer distribution in each RFM cell to see if the customers are distributed evenly.

#Draw the histograms in the R, F, and M dimensions so that we can see the distribution of customers in each RFM cell.

> drawHistograms(df1)


The distribution is not very even because the Frequency dimension has no scores of 1 or 2. Let's further find out how many customers have a total score above 500 or above 400.

> S500<-df1[df1$Total_Score>500,]

> dim(S500)

[1] 471  10

> S400<-df1[df1$Total_Score>400,]

> dim(S400)

[1] 945  10

We can consider these customers more important and expect a higher response rate from them.

RFM Scoring with Breaks

Sometimes users want to determine the breaks for each dimension themselves according to their own business requirements. For example, a user can set 0-30 days, 30-90 days, 90-180 days, 180-360 days, and more than 360 days as the 5 breaks for Recency, rather than letting the computer segment the customers into 5 aliquots regardless of the specific business requirement.

We have implemented this scoring method in the function “getScoreWithBreaks”. Below is the description of the function. For details, please read the source code.

# Function

#     getScoreWithBreaks(df,r,f,m)

# Description

#       Score the Recency, Frequency, and Monetary with r, f, and m, which are vectors containing a series of breaks

# Arguments

#       df – A data frame returned by the function getDataFrame

#       r –  A vector of Recency breaks

#       f –  A vector of Frequency breaks

#       m –  A vector of Monetary breaks

# Return Value

#       Returns a new data frame with four new columns: "R_Score", "F_Score", "M_Score", and "Total_Score".
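The core of break-based scoring can be sketched with findInterval (an illustration on made-up values, not the actual getScoreWithBreaks implementation):

```r
# Scoring with user-defined breaks (sketch; made-up values)
recency  <- c(30, 200, 460, 520)     # days
r_breaks <- c(120, 240, 450, 500)    # 4 breaks -> 5 score bands

# For Recency, smaller is better: below the first break scores 5 points
r_score <- 5 - findInterval(recency, r_breaks)

frequency <- c(1, 4, 9, 12)
f_breaks  <- c(2, 5, 8, 10)
# For Frequency, larger is better: above the last break scores 5 points
f_score <- findInterval(frequency, f_breaks) + 1

print(r_score)
print(f_score)
```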

Before we execute the function, we can take a look at the distributions of Recency, Frequency, and Monetary.

> par(mfrow = c(1,3))

> hist(df$Recency)

> hist(df$Frequency)

> hist(df$Monetary)


We find that the distributions are not even: most customers have a Recency of more than 450 days, a Frequency of less than 5 times, and a Monetary of less than 50 dollars. That is why the customers were not distributed evenly over the RFM cells by the independent RFM scoring method in the previous part.

Based on the above observation and relevant business sense, we can construct the breaks for each RFM dimension as follows, so that we can get segmented RFM cells more evenly and reasonably.

# set the Recency ranges to 0-120 days, 120-240 days, 240-450 days, 450-500 days, and more than 500 days

> r <- c(120,240,450,500)

# set the Frequency ranges to 0-2 times, 2-5 times, 5-8 times, 8-10 times, and more than 10 times

> f <- c(2,5,8,10)

# set the Monetary ranges to 0-10 dollars, 10-20 dollars, and so on

> m <- c(10,20,30,100)

Then we can execute the function “getScoreWithBreaks” and see the customer distribution over the RFM cells.

> df2<-getScoreWithBreaks(df,r,f,m)

> drawHistograms(df2)


We can also calculate how many customers have a total score of more than 500 or 400.

> S500<-df2[df2$Total_Score>500,]

> dim(S500)

[1] 399  10

> S400<-df2[df2$Total_Score>400,]

> dim(S400)

[1] 641  10

There are 641 customers with an RFM score of more than 400.

Estimate the Response Rate for Each RFM Cell

After we segment the customers into RFM cells, we can assign a response rate to each RFM cell according to historical response data. If RFM analysis is being used for the first time and there is no historical data, we can randomly select some customers, say 10%, from each RFM cell, send mails to the selected customers as a trial, and count the response rate for each cell. Below is an example of the response rate table.

Response Rate Table

RFM cell (Total Score) Response Rate
555 8.5%
441 5.6%
435 4.8%

Calculate the Breakeven Point to Select Valuable Customers

To break even,

P - C/R = 0

where

P is the price or revenue per deal or per response,

C is the cost per mail sent out, including production cost, mailing cost, etc., and

R is the response rate.

Suppose P = 100 dollars and C = 5 dollars; to break even:-

R = C/P = 5/100 = 5%

That means we should choose the customers in the RFM cells that have a response rate equal to or greater than 5% to send the direct marketing mails to, in order to make money.
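In R, the breakeven selection can be sketched directly against the hypothetical response rate table above:

```r
# Breakeven response rate and cell selection (sketch; rates from the
# hypothetical response rate table above)
P <- 100                  # revenue per response, dollars
C <- 5                    # cost per mail, dollars
breakeven_rate <- C / P   # = 0.05

rates <- data.frame(cell     = c("555", "441", "435"),
                    response = c(0.085, 0.056, 0.048),
                    stringsAsFactors = FALSE)

profitable <- rates[rates$response >= breakeven_rate, ]
print(profitable$cell)
```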

Select the Target Customers

According to the response rate table and the breakeven point, we need to select the customers in the cells with a response rate of more than 5%; that means we need to select the customers with a total score equal to or greater than 441.

> target <- df2[df2$Total_Score>=441,]

> dim(target)

[1] 416  10

There are 416 customers we will send the direct marketing mails to.


We introduced the basic concepts of RFM customer analysis and implemented the analysis procedure in R, providing the complete R source code.


  4. Recency, Frequency and Monetary (RFM) Analysis (Charlotte Mason)
  5. Case Study: La Quinta narrows target to boost guest loyalty

Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


Uncertain Demand Forecasting and Inventory Optimizing for Short-life-cycle Products


For short-life-cycle products such as newspapers and fashion, it is important to match supply with demand. However, because demand is uncertain, sometimes we order too little from the supplier and sometimes too much. If we order too little, we lose sales and leave customers unsatisfied; if we order too much, we are left with excess inventory at the end of the season. To reduce the mismatch between demand and supply, on the one hand we need to improve demand forecasting; on the other hand, we need to optimize inventory management with an appropriate service level.

In this article, a fashion company, ABC, is preparing the inventory of a newly designed product, Cool-7, for the upcoming season. We will use expert judgment in conjunction with the A/F ratio for demand forecasting, and the Newsvendor model for optimizing inventory management. R is used for the statistical computing where needed.

Demand Forecasting

Since Cool-7 is a new product, there is no direct historical data for reference. ABC Company formed a committee, consisting of experts from Marketing, Sales, Channels, etc., to forecast the demand for Cool-7 in the coming summer season.

The initial demand forecast by the committee is 3500 units. Is this number reliable?

We need to further account for the committee's bias (optimistic, pessimistic, etc.) and estimation error, using the forecast and actual demand of other products in the past. Click here to download the fabricated forecast and actual demand data, and then explore the data.

> #read the dataset from .csv file

> df <- read.csv(file.choose(),header=T)


> #list each data column name and the first six lines of the dataset

> head(df)

  Product   Forecast   Actual.Demand   A.F.Ratio (A/F Ratio)
1      P1       1600            1261                    0.79
2      P2       2200            2890                    1.31
3      P3       3500            3172                    0.91
4      P4       1200             895                    0.75
5      P5       1800            2500                    1.39
6      P6       3000            2330                    0.78

Where, the A/F ratio = Actual Demand / Forecast.

> # get the mean and standard deviation of the A/F Ratio

> mean(df$A.F.Ratio)

[1] 1.0425

> sd(df$A.F.Ratio)

[1] 0.2646721

We will adjust the initial forecast using the A/F ratio mean as follows.

Adjusted forecast = initial forecast * A/F ratio mean

Adjusted forecast = 3500 * 1.0425 = 3649; we can also call this the statistically expected actual demand.

The standard deviation of the actual demand is 3500 * 0.2647 = 926.
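The adjustment can be reproduced in R with the numbers computed above:

```r
# Adjust the committee's forecast by the historical A/F ratios (numbers from above)
af_mean <- 1.0425        # mean of the A/F ratios
af_sd   <- 0.2646721     # standard deviation of the A/F ratios
initial_forecast <- 3500

adjusted_forecast <- initial_forecast * af_mean   # expected actual demand
demand_sd         <- initial_forecast * af_sd     # sd of the actual demand

cat(round(adjusted_forecast), round(demand_sd), "\n")
```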

Inventory Optimizing

Fashion is a short-life-cycle product and usually can be ordered only once before the sales season. To avoid ordering too much or too little, we can use the Newsvendor model to calculate how many units ABC Company should order from the fashion producer, so that the inventory is optimized considering both the overage cost and the underage cost.

Assume the demand has a normal distribution, the retail price is $150 per unit, purchasing price is $100, and the discounted price of leftover products is $70.  We can calculate the Overage cost (Co) and the Underage cost (Cu) respectively:-

Co = 100 – 70 = 30

Cu = 150 – 100 = 50

According to the Newsvendor model, the optimal service level for Cool-7 is

Cu / (Cu + Co) = 50 / 80 = 0.625

Given that the demand has a mean of 3649 and a standard deviation of 926, what is the order quantity Q* that achieves the optimal service level of 0.625?

Q* = mean demand + Z * demand standard deviation

where

Mean demand = 3649;

Demand standard deviation = 926;

Z is the inverse of the standard normal cumulative distribution at the service level.

# get Z

> qnorm(0.625)

[1] 0.3186394

So, Q* = 3649 + 926 * 0.31864 = 3944.  ABC Company should order 3944 units from the producer.
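The whole Newsvendor calculation can be reproduced in a few lines of R:

```r
# Newsvendor order quantity for Cool-7 (numbers from above)
price <- 150; cost <- 100; salvage <- 70
Co <- cost - salvage               # overage cost per leftover unit
Cu <- price - cost                 # underage cost per lost sale
service_level <- Cu / (Cu + Co)    # critical ratio = 0.625

mu <- 3649; sigma <- 926                  # adjusted demand forecast
Q <- mu + qnorm(service_level) * sigma    # optimal order quantity

cat(service_level, round(Q), "\n")
```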


  1. HKUST Prof. Albert Ha’s lecture slides for the Operations Management course

Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


Investment Portfolio Analysis with R Language


[Please note that this article is for the purpose of studying R programming only, not for investment advice]

R is widely applied in finance areas such as time series analysis, portfolio management, and risk management, through its basic functions and many professional finance packages. In this article, we will demonstrate, with a simple example, how to help investors optimize an investment portfolio with R.

Scenario Introduction

Peter is interested in investing in the US stock market and has selected four stocks for consideration: 3M (MMM), Johnson & Johnson (JNJ), P&G (PG), and Wal-Mart (WMT). To avoid “putting all his eggs in one basket”, his strategy is to invest in all four stocks rather than in only one.

In this case, what are the optimal asset allocation percentages among the four stocks, assuming no short selling? How can we use R to conduct the portfolio analysis for Peter?

Basic Concepts

Before we dive into the R programming, let’s review some of the basic concepts about the financial portfolio management.

Expected Return

Expected return is the weighted-average return of a portfolio. The return of each asset in the portfolio can be the mean of the asset's returns over a past period; it can also be calculated by the capital asset pricing model (CAPM), which is used to determine a theoretically appropriate required rate of return of an asset, or by other factor models. We will use the CAPM to calculate the expected return of each stock in this article.


Risk

Risk is the variation of the return of an asset. It can be calculated from the standard deviation of the historical return data, or by factor models. We will use a one-factor model to calculate the variance of an asset and the covariance of two assets in a portfolio.

Sharpe Ratio

“The Sharpe ratio characterizes how well the return of an asset compensates the investor for the risk taken”. A higher Sharpe ratio means a better investment option. Among a set of risky assets, we can find the optimal portfolio asset allocation, the one whose Sharpe ratio is the largest.

Capital Asset Pricing Model (CAPM)

The CAPM is a model for pricing an individual security or portfolio. The CAPM formula is the following:-

E(Ri) = Rf + βi * (E(Rm) - Rf)

where

E(Ri) is the expected return of an asset,

Rf is the risk-free return; we usually use the interest rate of T-Bills as Rf,

E(Rm) is the expected return of the market; we usually use the SP500 index return as the US stock market return, and

βi is the sensitivity of the expected excess asset return to the expected excess market return. We can get the value of βi by regressing the historical excess returns of the asset on the historical excess returns of the market.

Explore the data

The historical monthly return data of the SP500, MMM, JNJ, PG, and WMT from April 2010 through April 2013 can be downloaded here. The one-month T-Bill interest rate, which we use as the risk-free return, can also be downloaded here. The SP500 returns are used as the market return. Here is the csv file of the returns.

#read the historical return data from .csv file

> return <- read.csv(file.choose(),header=T)

#list each data column names and the first six lines of the dataset

> head(return)

Month T.Bills    SP500       MMM       JNJ         PG         WMT

1 2010/4/1  0.0015 -0.08197592 -0.100012182 -0.084992210 -0.017177262 -0.05195068

2 2010/5/1  0.0015 -0.05388238 -0.004060639  0.013053348 -0.018198198 -0.04925373

3 2010/6/1  0.0008  0.06877783  0.083038869 -0.016433240  0.027711507  0.06503700

4 2010/7/1  0.0016 -0.04744917 -0.076044673 -0.009303209 -0.024464286 -0.01495052

5 2010/8/1  0.0015  0.08755110  0.103897868  0.086814872  0.005125389  0.06755024

6 2010/9/1  0.0012  0.03685594 -0.028666339  0.028566390  0.068111455  0.01201442

We also calculated the excess return data set for further use, available in this excess return csv file. The excess return equals the monthly return minus the monthly T-Bill interest rate.

> # read the excess returns from the .csv file

> exReturn <- read.csv(file.choose(),header=T)

> # list each data column name and the first six lines of the dataset

> head(exReturn)

Month   SP500_ExcessR  MMM_ExcessR JNJ_ExcessR  PG_ExcessR WMT_ExcessR

1 2010/4/1   -0.08347592 -0.101512182 -0.08649221 -0.018677262 -0.05345068

2 2010/5/1   -0.05538238 -0.005560639  0.01155335 -0.019698198 -0.05075373

3 2010/6/1    0.06797783  0.082238869 -0.01723324  0.026911507  0.06423700

4 2010/7/1   -0.04904916 -0.077644673 -0.01090321 -0.026064286 -0.01655052

5 2010/8/1    0.08605110  0.102397868  0.08531487  0.003625389  0.06605023

6 2010/9/1    0.03565594 -0.029866339  0.02736639  0.066911455  0.01081442

Let’s further explore the means and standard deviations of the excess returns.

> # remove the date column

> exReturn <- exReturn[,-1]

> # mean and standard deviation of each excess return column

> data.frame(Mean = sapply(exReturn, mean), SD = sapply(exReturn, sd))

                       Mean         SD
SP500_ExcessR   0.008837961 0.04298597
MMM_ExcessR     0.008758935 0.05381220
JNJ_ExcessR     0.010740620 0.03862065
PG_ExcessR      0.008919860 0.03794766
WMT_ExcessR     0.012765898 0.04456828

We will use the mean and standard deviation of SP500 excess returns as the market excess return and risk (standard deviation) in the following parts.

CAPM Analysis

According to the CAPM formula, we will first get the beta of each stock by regression; then calculate the expected return of each stock and the covariance matrix of the four stocks; finally, we can calculate the optimal asset allocation (weights) of the portfolio consisting of the four stocks.


We can regress the excess returns of each stock on the excess returns of the market (SP500). Take the MMM stock as an example.

> # conduct the regression to get the beta of MMM

> lm.MMM <- lm(MMM_ExcessR ~ SP500_ExcessR, data = exReturn)

> summary(lm.MMM)

Call:
lm(formula = MMM_ExcessR ~ SP500_ExcessR, data = exReturn)

Residuals:
      Min        1Q    Median        3Q       Max
-0.066906 -0.015134  0.006473  0.019116  0.053404

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.0005611  0.0049388  -0.114     0.91
SP500_ExcessR  1.0545439  0.1140273   9.248 6.29e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02941 on 35 degrees of freedom
Multiple R-squared: 0.7096,  Adjusted R-squared: 0.7013
F-statistic: 85.53 on 1 and 35 DF,  p-value: 6.291e-11


The beta of MMM is 1.0545439, as shown above. The CAPM formula for MMM is:-

E(RMMM) - Rf = βMMM * (E(Rm) - Rf), that is,

Expected excess return of MMM = βMMM * Expected excess return of SP500

The expected excess return of SP500 is the mean of the SP500 excess returns over the past three years, which is 0.008837961.

So, the expected excess return of MMM = 1.0545439 * 0.008837961 = 0.009320017.

According to the single-factor model, the variance of MMM is:-

σMMM² = βMMM² * σM² + σ²(e)

where

σM² is the variance of SP500, which is 0.04298597², and

σ²(e) is the variance of the residual, which is 0.02941², as shown in the regression output above.

So, σMMM² = 1.0545439² * 0.04298597² + 0.02941² = 0.05403527²

We can follow the same procedure to get the betas, variances, and expected excess returns of JNJ, PG, and WMT as follows.

                         MMM           JNJ           PG            WMT
Variance                 0.05403527²   0.03903649²   0.03840500²   0.04510640²
Beta                     1.0545439     0.438671      0.347552      0.39777
Expected Excess Return   0.009320017   0.003876957   0.003071651   0.003515476

Based on the above results, we can further calculate the pairwise covariances of the four stocks by the formula σij = βi * βj * σM². So, for the covariance between MMM and JNJ,

σMMM,JNJ = βMMM * βJNJ * σM² = 1.0545439 * 0.438671 * 0.04298597² = 0.000854786.
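All pairwise covariances can be computed at once with outer, as a sketch using the betas above; note that the diagonal of this matrix would additionally need each stock's residual variance σ²(e):

```r
# Single-factor covariances: sigma_ij = beta_i * beta_j * sigmaM^2 (i != j).
# Sketch with the betas computed above; diagonal entries still need + sigma^2(e).
betas   <- c(MMM = 1.0545439, JNJ = 0.438671, PG = 0.347552, WMT = 0.39777)
sigmaM2 <- 0.04298597^2            # variance of the SP500 excess returns

cov_mat <- outer(betas, betas) * sigmaM2
print(cov_mat["MMM", "JNJ"])       # essentially the 0.000854786 computed above
```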

We can calculate the other covariances in the same way and build the covariance matrix as follows.

         MMM           JNJ           PG            WMT
MMM      0.002919811   0.000854786   0.000677233   0.000775087
JNJ      0.000854786   0.001523848   0.000281716   0.000322422
PG       0.000677233   0.000281716   0.001474944   0.000255450
WMT      0.000775087   0.000322422   0.000255450   0.002034587

Mean-Variance Portfolio Optimization

Now with the expected excess returns and the covariance matrix ready, we can conduct the Mean-Variance Portfolio Optimization to get the optimal allocation (weight) of each stock so that the Sharpe Ratio of the portfolio is maximized.

Here we will use Guy Yollin’s “effFrontier” and “maxSharpe” functions, which build on the core function “portfolio.optim” in the “tseries” R package, to calculate the efficient frontier and the maximum Sharpe ratio.
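If those functions are not at hand, a crude random-weight search gives an idea of what a maximum-Sharpe optimization does (an illustration only; maxSharpe itself relies on the quadratic-programming solver behind portfolio.optim):

```r
# Monte Carlo sketch of a maximum-Sharpe search (illustrative, not the
# actual maxSharpe / portfolio.optim optimizer)
averet <- c(0.009320017, 0.003876957, 0.003071651, 0.003515476)
rcov <- matrix(c(0.002919811, 0.000854786, 0.000677233, 0.000775087,
                 0.000854786, 0.001523848, 0.000281716, 0.000322422,
                 0.000677233, 0.000281716, 0.001474944, 0.000255450,
                 0.000775087, 0.000322422, 0.000255450, 0.002034587), nrow = 4)

set.seed(42)
best <- list(sharpe = -Inf, w = NULL)
for (i in 1:20000) {
  w <- runif(4); w <- w / sum(w)   # random long-only weights summing to 1
  sharpe <- sum(w * averet) / sqrt(drop(t(w) %*% rcov %*% w))
  if (sharpe > best$sharpe) best <- list(sharpe = sharpe, w = w)
}

print(round(best$w, 2))   # roughly comparable to the maxSharpe weights
```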

> # load the tseries package

> library(tseries)

> # build the expected excess return matrix and the covariance matrix for the parameters of the “maxSharpe” function. The numbers are from the calculations in the previous parts.

> averet <- matrix(c(0.009320017,0.003876957,0.003071651,0.003515476), nrow=1)

> rcov <- matrix(c(0.002919811,0.000854786,0.000677233,0.000775087,
+                  0.000854786,0.001523848,0.000281716,0.000322422,
+                  0.000677233,0.000281716,0.001474944,0.000255450,
+                  0.000775087,0.000322422,0.000255450,0.002034587), nrow=4)

> # get the optimal weights

> weights <- maxSharpe(averet,rcov,shorts=F)

> weights

[1] 0.5779719 0.1794985 0.1332271 0.1093025

According to the above output, Peter can allocate about 58% of his capital to MMM, 18% to JNJ, 13% to PG, and 11% to WMT, to roughly maximize the Sharpe ratio, i.e., the expected excess return per unit of risk.


In this article, we reviewed some related basic finance concepts and demonstrated how to use R to conduct portfolio optimization.





4. HKUST Prof. Fei Ding’s lecture slides.




Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.




In the women’s 10m air rifle event at the 2012 London Olympics, Yi Siling (China), Sylwia Bogacka (Poland), and Yu Dan (China) won the gold, silver, and bronze medals respectively. The other five finalists were EMMONS Katerina, GRAY J Lynn, AHMADI Elaheh, SCHERER Sarah, and VDOVINA Daria.



Judging from each athlete’s mean score in the final (Figure 1), Yi Siling had the best mean score and Yu Dan the second best, while Bogacka, who won the silver medal, ranked only fourth on average. We can therefore infer that her score in the qualification round was better than those of Yu Dan and EMMONS Katerina, which is why she finally took the silver medal.


Figure 1: Means and standard deviations


Since the athletes’ mean scores differ little, we can compute the standard deviation of each athlete’s scores (Figure 1) to analyze how stably they performed in the final. From the R output below, we can see that Yi Siling and AHMADI Elaheh performed relatively stably, while Bogacka’s performance was relatively unstable.

The boxplots below (Figure 2) also show the stability of the 8 finalists intuitively. We can also observe outliers (>1.5 IQR) in the scores of Yi Siling, AHMADI Elaheh, and SCHERER Sarah, which shows that exceptionally good or bad shots do happen from time to time, and that small changes in the factors affecting shooting, such as the athlete’s mental state, can lead to large changes in the score of a single shot; a tiny slip can make a huge difference.


Figure 2: Boxplots




Figure 3: Linear regression of score against round




1. Data source

2. Part of the analysis in this article follows “Case 1.3: Analyzing the Stability of Athletes’ Performance” in the book Statistics: Cases and Analysis by Jia Junping, Hao Jing, et al. However, I believe the reasoning about the athletes’ nervousness at the bottom of page 19 and the top of page 20 is wrong: the left or right skewness of the distribution of an athlete’s scores cannot indicate an emotional trend over the stages of the final, because the shape of the score distribution has nothing to do with the order of the rounds.

Appendix: R Source Code

# load the data into a data frame from the .csv file

dataDir <- "D:\\R\\Women's 10m Air Rifle.csv"

df <- read.csv(dataDir,header=TRUE)

# draw boxplots

bp <- boxplot(df,boxwex = 0.5,col = "yellow",

main = "Women's 10m Air Rifle - Final",

xlab = "Athlete",ylim = c(9, 11),

ylab = "Score")

# label the outliers in the boxplots

x <- bp$group + 0.15

y <- bp$out + 0.07

text(x, y, labels = bp$out)

# standard deviation of each athlete's scores

sapply(df, sd)

# plot & simple linear regression of score against round for each athlete

x <- 1:10

par(mfrow = c(2,4))

for ( i in 1:8 ){

plot(x, df[,i], xlab = "Round", ylab = "Score", main = names(df)[i])

lm.reg <- lm(df[,i]~x)

abline(lm.reg)

}

Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


Data Analysis for Marketing Research with R Language (1)


Data analysis technologies such as the t-test, ANOVA, regression, conjoint analysis, and factor analysis are widely used in marketing research areas such as A/B testing, consumer preference analysis, market segmentation, product pricing, sales driver analysis, and sales forecasting. Traditionally the analysis tools have mainly been SPSS and SAS; however, the open-source R language is catching up now.

In this article, we will introduce how to use R to conduct some basic marketing research through an example, and then further implement the analysis in Oracle R Enterprise, which integrates R with Oracle Database.

Scenario Introduction and Marketing Research Objectives

A fictitious company, the ABC store chain, is selling a new type of grape juice in some of its stores as a pilot. The marketing team of ABC wants to analyze:-

  • Which type of in-store advertisement is more effective? They have placed two types of ads in stores for testing; one theme is the natural production of the juice, the other is family health care;
  • The price elasticity – the reaction of the grape juice’s sales quantity to changes in its own price;
  • The cross-price elasticity – the reaction of the grape juice’s sales quantity to the price changes of other products, such as apple juice and cookies, in the same store;
  • How to find the unit price of the grape juice that maximizes profit, and the sales forecast at that price.

The marketing team has randomly sampled 30 observations and constructed the following dataset for the analysis. There are 5 variables (data columns) in the dataset.

Variable Description
Sales Total unit sales of the grape juice in one week in a store
Price Average unit price of the grape juice in the week
ad_type The in-store advertisement type used to promote the grape juice.

ad_type = 0,  the theme of the ad is the natural production of the juice

ad_type = 1,  the theme of the ad is family health care

price_apple Average unit price of the apple juice in the same store in the week
price_cookies Average unit price of the cookies in the same store in the week

The dataset can be downloaded here. Please note that the dataset was made up by the author for illustration purposes only, so it may look different from real-world data.

Data Exploration

Let’s do some basic exploration to get to know the dataset.

#load the libraries needed in the following codes
> library(s20x)
> library(car)

> #read the dataset from an existing .csv file
> df <- read.csv(file.choose(),header=T)

> #list the name of each variable (data column) and the first six rows of the dataset
> head(df)
sales price ad_type price_apple price_cookies
1   222  9.83       0        7.36          8.80
2   201  9.72       1        7.43          9.62
3   247 10.15       1        7.66          8.90
4   169 10.04       0        7.57         10.26
5   317  8.38       1        7.33          9.54
6   227  9.74       0        7.51          9.49

> # basic statistics of the variables
> summary(df)
     sales           price          ad_type     price_apple     price_cookies
 Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300   Min.   : 8.790
 1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438   1st Qu.: 9.190
 Median :204.5   Median : 9.855   Median :0.5   Median :7.580   Median : 9.515
 Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659   Mean   : 9.622
 3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805   3rd Qu.:10.140
 Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290   Max.   :10.580

The above summary table gives a rough picture of the basic statistics for each numeric variable. For example, the mean of sales is 216.7 units, the minimum is 131, and the maximum is 335. Please ignore the statistics for “ad_type” since it is a categorical variable.

We can further explore the distribution of the sales data by visualizing it in graphical form as follows.

> #set the 1 by 2 layout plot window
> par(mfrow = c(1,2))
> # boxplot to check if there are outliers
> boxplot(df$sales, horizontal = TRUE, xlab="sales")
> # histogram to explore the data distribution shape
> hist(df$sales, main="", xlab="sales", prob=T)
> lines(density(df$sales), lty="dashed", lwd=2.5, col="red")


We find no outliers in the above box plot, and the distribution of the sales data is roughly normal, so no further data cleaning or treatment is necessary.

Analysis of Ad Effectiveness

The marketing team wants to find out which of the two ad types is more effective for sales: one with the natural production theme, the other with the family health care theme. They can then place the better one in all of ABC's stores after the pilot period.

To find the better ad, as a first step we can calculate and compare the mean sales under the two ad types.

> #divide the dataset into two sub dataset by ad_type
> sales_ad_nature = subset(df,ad_type==0)
> sales_ad_family = subset(df,ad_type==1)
> #calculate the mean of sales with different ad_type
> mean(sales_ad_nature$sales)
[1] 186.6667
> mean(sales_ad_family$sales)
[1] 246.6667

The mean of sales with the natural production theme is about 187; the mean of sales with the family health care theme is about 247. It looks like the latter is better. However, this conclusion is based only on a random sample of 30 observations. To find out how likely it is that the conclusion holds for the whole population, we need to run a statistical test, the two-sample t-test.

Before conducting a t-test, it is important to check its assumptions: the observations should be normally distributed and independent; otherwise the test results are not valid. The observations here are independent since they were randomly sampled. Let's check normality by plotting the distribution shapes of the two groups of sales data.

> #set the 1 by 2 layout plot window
> par(mfrow = c(1,2))
> # histogram to explore the data distribution shapes
> hist(sales_ad_nature$sales, main="", xlab="sales with natural production theme ad", prob=T)
> lines(density(sales_ad_nature$sales), lty="dashed", lwd=2.5, col="red")
> hist(sales_ad_family$sales, main="", xlab="sales with family health care theme ad", prob=T)
> lines(density(sales_ad_family$sales), lty="dashed", lwd=2.5, col="red")


We can see that both shapes are roughly normal. We can also check normality with the Shapiro-Wilk test as follows.

> shapiro.test(sales_ad_nature$sales)

Shapiro-Wilk normality test
data:  sales_ad_nature$sales
W = 0.9426, p-value = 0.4155

> shapiro.test(sales_ad_family$sales)

Shapiro-Wilk normality test
data:  sales_ad_family$sales
W = 0.8974, p-value = 0.08695

The p-values of the Shapiro-Wilk tests are larger than 0.05, so there is no strong evidence to reject the null hypothesis that the two groups of sales data are normally distributed.

Now we can conduct the t-test since the t-test assumptions are met.

> t.test(sales_ad_nature$sales,sales_ad_family$sales)

Welch Two Sample t-test
data:  sales_ad_nature$sales and sales_ad_family$sales
t = -3.7515, df = 25.257, p-value = 0.0009233
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -92.92234 -27.07766
sample estimates:
mean of x mean of y
186.6667  246.6667

From the output of the t-test above, we can say that:

We have strong evidence that the population means of sales under the two ad types are different, because the p-value of the t-test is very small;

With 95% confidence, we estimate that the mean sales with the natural production theme ad are somewhere between 27 and 93 units lower than the mean sales with the family health care theme ad.

So the conclusion is that the ad with the family health care theme is BETTER.
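The figures quoted in the conclusions above can also be pulled out of the t-test result programmatically rather than read off the printed output; a minimal sketch, assuming the two sub-datasets created earlier:

```r
# keep the test result in a variable instead of only printing it
res <- t.test(sales_ad_nature$sales, sales_ad_family$sales)

res$p.value    # p-value of the Welch two-sample t-test
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # the two sample means
```

Storing the result this way is handy when the p-value or confidence interval feeds into a report or further computation.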

Sales Driver Analysis and Price Elasticity Analysis

With the information in the dataset, we can explore how grape juice price, ad type, apple juice price, and cookies price influence the sales of grape juice in a store by multiple linear regression analysis. Here, “sales” is the dependent variable and the others are independent variables.

Let’s investigate the correlation between the sales and other variables by displaying the correlation coefficients in pairs.

> pairs(df, col="blue", pch=20)
> pairs20x(df)


The correlation coefficients between sales and price, ad_type, price_apple, and price_cookies are 0.85, 0.58, 0.37, and 0.37 respectively, which means they all might have some influence on sales, so we can try adding all of the independent variables into the regression model as follows.
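The quoted coefficients come from the correlation matrix, which can also be printed numerically instead of read off the pairs plot; a minimal sketch, assuming df is the data frame loaded earlier:

```r
# correlation matrix of all variables, rounded to two decimals;
# the "sales" row holds the correlations quoted above
round(cor(df), 2)
```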

> sales.reg<-lm(sales~price+ad_type+price_apple+price_cookies,df)
> summary(sales.reg)

Call:
lm(formula = sales ~ price + ad_type + price_apple + price_cookies,
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-36.290 -10.488   0.884  10.483  29.471

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    774.813    145.349   5.331 1.59e-05 ***
price          -51.239      5.321  -9.630 6.83e-10 ***
ad_type         29.742      7.249   4.103 0.000380 ***
price_apple     22.089     12.512   1.765 0.089710 .
price_cookies  -25.277      6.296  -4.015 0.000477 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.2 on 25 degrees of freedom
Multiple R-squared: 0.8974,    Adjusted R-squared: 0.881
F-statistic: 54.67 on 4 and 25 DF,  p-value: 5.318e-12

The p-values for price, ad_type, and price_cookies in the last column of the above output are much less than 0.05. These variables are significant in explaining sales, so we are confident to include them in the model.

The p-value of price_apple is a bit larger than 0.05, so there seems to be no strong evidence that the apple juice price explains grape juice sales. However, from real-life experience we know that when the apple juice price is lower, consumers are likely to buy more apple juice, and the sales of other fruit juices then decrease. So we can keep it in the model to help explain the grape juice sales.

The Adjusted R-squared is 0.881, which indicates a reasonable goodness of fit: 88% of the variation in sales can be explained by the four variables, with the remaining 12% attributable to other factors or inherent variability. Please note the R-squared is very high here because the dataset was made up rather than taken from a real-world data source.

For the regression to be valid, we assume that the data are random and independent and that the residuals are normally distributed with constant variance. Let's check the residual assumptions visually.

# plotting the residuals vs. other key model metrics
> par(mfrow=c(2,2))
> plot(sales.reg)


The Residuals vs Fitted graph above shows that the residuals scatter around the fitted line with no obvious pattern, and the Normal Q-Q graph shows that the residuals are basically normally distributed. The assumptions are met.

For multiple regression, it is also important to check multicollinearity among the independent variables, because high multicollinearity makes the coefficient estimates less precise and introduces large errors in predictions of the dependent variable. We can investigate multicollinearity by displaying the correlation coefficients of the independent variables in pairs, as we did at the beginning of this part. We can also check it with the following command in R.

#check multicollinearity
> vif(sales.reg)
price       ad_type   price_apple price_cookies
1.246084      1.189685      1.149248      1.099255

The VIF value for each variable is close to 1, which means multicollinearity among these variables is very low.

Based on the above analysis, we can accept the regression result and write the multiple linear model of sales as follows.

Sales = 774.81 – 51.24 * price + 29.74 * ad_type + 22.1 * price_apple – 25.28 * price_cookies

With the model established, we can analyze Price Elasticity (PE) and Cross-price Elasticity (CPE) to predict how sales quantity reacts to price. “Price elasticity is defined as %ΔQ/%ΔP, which indicates the percent change in quantity divided by the percent change in price; Cross-price Elasticity is the percent change in quantity divided by the change in the price of some other product.”1

PE = (ΔQ/Q) / (ΔP/P) = (ΔQ/ΔP) * (P/Q) = -51.24 * 0.045 = -2.3

P is price, Q is sales quantity

ΔQ/ΔP = -51.24 , the parameter before the variable “price” in the above model

P/Q = 9.738 / 216.7 = 0.045, where P is taken as the mean price in the dataset and Q as the mean sales quantity

The PE indicates that a 10% decrease in price will increase sales by 23%, and vice versa.

Let’s further calculate the CPEs with respect to apple juice and cookies to analyze how changes in the apple juice price and the cookies price influence the grape juice sales.

CPEapple = (ΔQ/ΔPapple) * (Papple/Q) = 22.1 * ( 7.659 / 216.7) = 0.78

CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q) = -25.28 * ( 9.622 / 216.7) = – 1.12

The CPEapple indicates that a 10% decrease in the apple juice price will DECREASE the grape juice sales by 7.8%, and vice versa. So grape juice and apple juice are substitutes.

The CPEcookies indicates that a 10% decrease in the cookies price will INCREASE the grape juice sales by 11.2%, and vice versa. So grape juice and cookies are complements; placing the two products together will likely increase the sales of both.
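The elasticity arithmetic above can be scripted so it stays consistent with the model; a self-contained sketch that hard-codes the rounded coefficients and sample means quoted in this section (in practice they could be taken from coef(sales.reg) and colMeans(df)):

```r
# rounded slope coefficients (dQ/dP) from the regression output
b_price   <- -51.24
b_apple   <-  22.1
b_cookies <- -25.28

# sample means from the summary table
q_mean    <- 216.7   # mean sales
p_price   <- 9.738   # mean grape juice price
p_apple   <- 7.659   # mean apple juice price
p_cookies <- 9.622   # mean cookies price

pe          <- b_price   * p_price   / q_mean  # own-price elasticity
cpe_apple   <- b_apple   * p_apple   / q_mean  # cross-price elasticity, apple juice
cpe_cookies <- b_cookies * p_cookies / q_mean  # cross-price elasticity, cookies

# PE = -2.30, CPE_apple = 0.78, CPE_cookies = -1.12
round(c(PE = pe, CPE_apple = cpe_apple, CPE_cookies = cpe_cookies), 2)
```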

We can also see that sales increase by 29.74 units when the ad with the family health care theme is used (ad_type = 1).

Optimal Pricing and Sales Prediction

Usually companies want higher profit rather than just higher sales quantity. So how should we set the optimal price for the new grape juice to maximize profit, based on the dataset collected in the pilot period and the regression model above?

To simplify the question, we can let ad_type = 1, price_apple = 7.659 (its mean value), and price_cookies = 9.738.

The model is simplified as follows:-

Sales = 774.81 – 51.24 * price + 29.74 * 1 + 22.1 * 7.659 – 25.28 * 9.738

Sales = 727.64 – 51.24 * price

Assume the marginal cost (C) per unit of grape juice is 5. We can calculate the profit (Y) by the following formula.

Y = (price – C) * Sales Quantity = (price – 5) * (727.64 – 51.24 * price)

Y = – 51.24 * price² + 983.84 * price – 3638.2

To get the optimal price that maximizes Y, we can use the following R function.

> f = function(x) -51.24*x^2 + 983.84*x - 3638.2
> optimize(f, lower = 0, upper = 20, maximum = TRUE)
$maximum
[1] 9.600312

$objective
[1] 1084.386

The optimal price is about 9.6 and the corresponding maximum profit about 1084 according to the above output. In reality, we can reasonably set the price to be 9.6 or 9.59.
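The hand calculation above rounds the regression coefficients; as a cross-check, we can build the profit function directly on predict() and optimize it. A minimal sketch, assuming the fitted sales.reg from the regression section and a unit cost of 5:

```r
# profit at price p, holding ad_type = 1 and the other prices fixed
# at the values used in the simplified model above
profit <- function(p) {
  sales <- predict(sales.reg,
                   data.frame(price = p, ad_type = 1,
                              price_apple = 7.659, price_cookies = 9.738))
  (p - 5) * sales
}

# search a plausible price range for the profit-maximizing price
optimize(profit, lower = 0, upper = 20, maximum = TRUE)
```

Because predict() uses the unrounded coefficients, the result may differ slightly from the hand-rounded quadratic.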

We can further use the model to predict the sales at a given price, say 10.

> # predict the sales
> inputData <- data.frame(price=10, ad_type=1, price_apple=7.659, price_cookies=9.738)
> predict(sales.reg, inputData, interval="p")
fit           lwr           upr
1 215.1978 176.0138 254.3817

The sales forecast is 215 units, with a 95% prediction interval of roughly 176 to 254, for one store in one week. Based on this forecast and other factors, ABC Company can prepare the inventory for all of its stores after the pilot period.

Implementation in Oracle R Enterprise

“Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment.”3 With ORE it is also not necessary to load the whole dataset from the database into the R environment, which usually runs on a desktop or laptop with limited RAM and CPU. This is helpful when we have millions of data records to analyze. Let’s try to implement the regression model with ORE.

# load the Oracle R Enterprise library and connect to Oracle Database
> library(ORE)
> ore.connect(user = "<DBUser>", sid = "orcl", host = "<host name>", password = "<password>", port = 1521, all = TRUE)

# regression by ore.lm (assuming the data have been pushed to the database as an ore.frame, e.g. DF <- ore.push(df))
> sales.reg <- ore.lm(sales ~ price + ad_type + price_apple + price_cookies, data = DF)
> summary(sales.reg)

The output is the same as we use the function of “lm” for regression.


In this article, using the open source R language, we introduced how to test the differences in effectiveness between ad types; how to analyze the price elasticity and cross-price elasticity of a product; and how to set the optimal price to maximize profit and then forecast the sales at that price.



  1. Prof. Ronald T. Wilcox’s lecture slides of “Estimating Demand from Scanner Data”


Author: Jack Han. All rights reserved. Reproduction must credit the original source and author with a hyperlink.


An Introduction to Conjoint Analysis


Conjoint Analysis is a market research method that combines experimental design with multiple regression analysis and is mainly used in new product development. A product model can be defined by its main product attributes and the corresponding attribute levels; for example, a particular soy milk maker model can be described by attributes and levels such as "grinding/heating time: 60 seconds", "capacity: 5 cups", and "price: 500 yuan". By collecting and processing users' evaluations of their purchase intention for different models of the same kind of product, Conjoint Analysis can reveal how strongly each product attribute influences users' purchase intention, how different levels within the same attribute influence purchase intention, and the optimal combination of attribute levels; it can further be used to estimate the market share of various product models.


Grinding/heating time (seconds)   Capacity (cups)   Price (yuan)
60                                3                 300
100                               5                 450
130                               8                 550


            Time 60 s               Time 100 s              Time 130 s
Price       3 cups  5 cups  8 cups  3 cups  5 cups  8 cups  3 cups  5 cups  8 cups
300 yuan      6       7       5       5       6       4       5       6       3
450 yuan      5       6       4       4       5       3       4       5       3
550 yuan      3       4       2       2       3       1       1       2       1



U = C + β11 * X11 + β12 * X12 + β21 * X21 + β22 * X22 + β31 * X31 + β32 * X32


The variable X11 indicates whether the price is 450 yuan

The variable X12 indicates whether the price is 550 yuan

The variable X21 indicates whether the time is 100 seconds

The variable X22 indicates whether the time is 130 seconds

The variable X31 indicates whether the capacity is 5 cups

The variable X32 indicates whether the capacity is 8 cups

C is a constant; X11, X12, X21, X22, X31, and X32 are dummy variables.


> dataset <- read.csv(file.choose(),header=T)
> lm.reg<-lm(dataset[,1]~dataset[,"X11"]+dataset[,"X12"]+dataset[,"X21"]+dataset[,"X22"]+dataset[,"X31"]+dataset[,"X32"])
> summary(lm.reg)



Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.0000     0.1518  39.524  < 2e-16 ***
dataset[, "X11"]  -0.8889     0.1405  -6.325 3.58e-06 ***
dataset[, "X12"]  -3.1111     0.1405 -22.136 1.54e-15 ***
dataset[, "X21"]  -1.0000     0.1405  -7.115 6.78e-07 ***
dataset[, "X22"]  -1.3333     0.1405  -9.487 7.61e-09 ***
dataset[, "X31"]   1.0000     0.1405   7.115 6.78e-07 ***
dataset[, "X32"]  -1.0000     0.1405  -7.115 6.78e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2981 on 20 degrees of freedom
Multiple R-squared: 0.9762,     Adjusted R-squared: 0.969
F-statistic: 136.7 on 6 and 20 DF,  p-value: 3.699e-15




U(price=450) - U(price=300) = -0.89   (note: -0.89 is the regression coefficient of the variable X11; see the regression results above)

U(price=550) - U(price=300) = -3.11   (note: -3.11 is the regression coefficient of the variable X12; see the regression results above)

U(price=300) + U(price=450) + U(price=550) = 0

Solving this system of equations gives: U(price=300) = 1.33; U(price=450) = 0.44; U(price=550) = -1.78.
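The three equations above form a small linear system that R can solve directly; a self-contained sketch using solve():

```r
# unknowns: u300, u450, u550 (part-worth utilities of the price levels)
# equations:
#   u450 - u300 = -0.89            (coefficient of X11)
#   u550 - u300 = -3.11            (coefficient of X12)
#   u300 + u450 + u550 = 0         (utilities sum to zero)
A <- rbind(c(-1, 1, 0),
           c(-1, 0, 1),
           c( 1, 1, 1))
b <- c(-0.89, -3.11, 0)

u <- solve(A, b)
names(u) <- c("u300", "u450", "u550")
round(u, 2)   # u300 = 1.33, u450 = 0.44, u550 = -1.78
```

The same pattern solves the time and capacity part-worths from their coefficients.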


Attribute   Level      Utility   Utility range (max - min)   Weight
Price       300 yuan   1.33      3.11 (1.33 - (-1.78))       48.29%
            450 yuan   0.44
            550 yuan   -1.78
Time        60 s       0.78      1.33 (0.78 - (-0.55))       20.65%
            100 s      -0.22
            130 s      -0.55
Capacity    3 cups     0         2 (1 - (-1))                31.06%
            5 cups     1
            8 cups     -1

Note: the weight of each attribute in the table above is its utility range divided by the sum of the utility ranges of all attributes. Taking the price attribute as an example: 3.11 / (3.11 + 1.33 + 2) = 48.29%.

From the utility values of the attribute levels, we can see that a soy milk maker priced at 300 yuan with a 60-second grinding/heating time and a 3-cup capacity has the highest total utility and is the most preferred by users. Generally speaking (luxury goods aside), the lower the price, the more users like the product; but the company also needs to consider the corresponding sales profit.


Suppose there are already two competing products on the market: product A (450 yuan, 100 seconds, 8 cups) and product B (300 yuan, 130 seconds, 5 cups). If we launch product C (450 yuan, 60 seconds, 8 cups), how can we estimate product C's market share?


Substituting the estimated coefficients into the utility model gives:

U = 6 - 0.89 * X11 - 3.11 * X12 - 1 * X21 - 1.33 * X22 + 1 * X31 - 1 * X32


Product A: U = 6 - 0.89 * 1 - 3.11 * 0 - 1 * 1 - 1.33 * 0 + 1 * 0 - 1 * 1 = 3.11

Product B: U = 6 - 0.89 * 0 - 3.11 * 0 - 1 * 0 - 1.33 * 1 + 1 * 1 - 1 * 0 = 5.67

Product C: U = 6 - 0.89 * 1 - 3.11 * 0 - 1 * 0 - 1.33 * 0 + 1 * 0 - 1 * 1 = 4.11


Estimating each product's market share as its share of total utility:

Product A: 3.11 / (3.11 + 5.67 + 4.11) = 24%

Product B: 5.67 / (3.11 + 5.67 + 4.11) = 44%

Product C: 4.11 / (3.11 + 5.67 + 4.11) = 32%
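The product utilities and market shares above can be reproduced in a few lines; a self-contained sketch using the estimated part-worth coefficients and the dummy coding of each product:

```r
# coefficients for (X11, X12, X21, X22, X31, X32), from the regression
coefs <- c(-0.89, -3.11, -1, -1.33, 1, -1)

# dummy coding of each product profile
products <- rbind(A = c(1, 0, 1, 0, 0, 1),  # 450 yuan, 100 s, 8 cups
                  B = c(0, 0, 0, 1, 1, 0),  # 300 yuan, 130 s, 5 cups
                  C = c(1, 0, 0, 0, 0, 1))  # 450 yuan, 60 s, 8 cups

utility <- drop(6 + products %*% coefs)  # intercept (constant C) is 6
share   <- utility / sum(utility)        # utility-share rule used above

round(utility, 2)  # A = 3.11, B = 5.67, C = 4.11
round(share, 2)    # A = 0.24, B = 0.44, C = 0.32
```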




Prof. Ralf van der Lans’ lecture slides for Marketing Research course


