Ravshan S.K.

Simulating Likert scale data in R

In my last project I had to find the theoretical limits of a psychometric index computed from Likert scale data (that is, ordered categorical data). After finding them, I decided to test the results with a simple Monte Carlo simulation.

I was surprised that I could not find an obvious built-in categorical data generator in R. What I was looking for was something like runif(100): a single call that returns a vector of length 100 in which every element is drawn from a categorical distribution (a multinomial distribution with a single trial).

The first idea was to use the sample function with given probabilities: sample(c(0,1,2),1,prob=c(0.33, 0.33, 0.34)). As far as I could see, though, you couldn’t repeat this for N participants without a loop, which is inefficient in R, and wrapping the call in rep would just repeat the same random pick N times.
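To make the problem concrete, here is a minimal sketch of the two dead ends described above. The seed and the value N = 5 are arbitrary choices for illustration, not values from my project.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
N <- 5       # hypothetical number of participants

# One weighted draw copied N times: every participant gets the same value.
one_pick <- sample(c(0, 1, 2), 1, prob = c(0.33, 0.33, 0.34))
repeated <- rep(one_pick, N)

# An explicit loop does draw independently, but needs N separate calls.
looped <- numeric(N)
for (i in seq_len(N)) {
  looped[i] <- sample(c(0, 1, 2), 1, prob = c(0.33, 0.33, 0.34))
}

print(repeated)
print(looped)
```

The repeated vector is constant by construction, which is useless for a simulation, while the loop gives valid draws at the cost of one call per participant.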

I didn’t want to use any third-party libraries just for this small application either, so I came up with this simple trick.

Algorithm

Suppose you want to generate five-category data (x1, x2, x3, x4, x5) for N participants with probabilities (1/10, 2/10, 4/10, 2/10, 1/10). The following code will work:

distribution <- c(rep(x1,1),rep(x2,2),rep(x3,4),rep(x4,2),rep(x5,1))
potential <- rep(distribution, M)
likert_data <- sample(potential, N)

or, as a one-liner:

likert_data <- sample(rep(c(rep(x1,1),rep(x2,2),rep(x3,4),rep(x4,2),rep(x5,1)), M), N)

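Notice that distribution encodes the probabilities as a block of 10 values, potential repeats that block M times, and likert_data picks N elements from potential uniformly at random, without replacement, and returns the required vector. Because the draw is without replacement, the pool must hold at least N elements, so 10 × M must be at least N; any M greater than or equal to N is more than enough (I personally used M = N). Note that with a finite pool the draws are not fully independent, so the category counts vary slightly less than they would under a true categorical distribution; the larger M is relative to N, the smaller that difference.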
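For a version that runs as-is, here is a minimal sketch of the same trick with concrete values. The category labels 1 to 5 (standing in for x1 to x5), N = 1000, and the seed are assumptions made for illustration.

```r
set.seed(42)                 # arbitrary seed for reproducibility
N <- 1000                    # hypothetical number of participants
M <- N                       # number of times the block is repeated
categories <- 1:5            # stand-ins for x1, ..., x5
weights <- c(1, 2, 4, 2, 1)  # probabilities expressed in tenths

# Block of 10 values in which each category appears in proportion to its weight.
distribution <- rep(categories, times = weights)

# Pool of 10 * M values to draw from.
potential <- rep(distribution, M)

# N values drawn uniformly at random from the pool, without replacement.
likert_data <- sample(potential, N)

head(likert_data)
```

Passing the weights through the times argument of rep is equivalent to writing c(rep(x1,1), rep(x2,2), ...) by hand, and it scales more easily to other category counts.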

[Screenshot: a frequency table of likert_data]

Notice how, in the screenshot above, we obtain almost exactly the probabilities we wanted: (1, 2, 4, 2, 1)/10. Since every run is a random draw, there are some deviations, but repeating the procedure and averaging the frequencies gives the desired values.
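The check can be reproduced roughly as follows. This is a sketch under assumptions: the labels 1 to 5, N = 1000, the 200 repetitions, and the seed are all illustrative choices, and the exact frequencies will differ from run to run.

```r
set.seed(123)  # arbitrary seed for reproducibility
N <- 1000      # hypothetical number of participants
reps <- 200    # hypothetical number of repetitions to average over

# Pool built as in the algorithm, with M = N.
pool <- rep(rep(1:5, times = c(1, 2, 4, 2, 1)), N)

# Relative frequencies from a single draw.
one_draw <- sample(pool, N)
single_freq <- prop.table(table(factor(one_draw, levels = 1:5)))
print(single_freq)

# Relative frequencies from many draws: one column per repetition.
freqs <- replicate(reps, tabulate(sample(pool, N), nbins = 5) / N)

# Averaging across repetitions should land close to (1, 2, 4, 2, 1)/10.
avg_freq <- rowMeans(freqs)
print(round(avg_freq, 3))
```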

UPDATE: A StackExchange user suggested a better hack: sampling randomly with replacement. This makes my solution obsolete, but it’s brilliant:

likert_data <- sample(c(x1,x2,x3,x4,x5), N, replace = TRUE, prob=c(1/10, 2/10, 4/10, 2/10, 1/10))
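A runnable sketch of this version, again with assumed values (labels 1 to 5 for x1 to x5, N = 1000, and an arbitrary seed):

```r
set.seed(2024)                  # arbitrary seed for reproducibility
N <- 1000                       # hypothetical number of participants
categories <- 1:5               # stand-ins for x1, ..., x5
probs <- c(1, 2, 4, 2, 1) / 10  # target probabilities

# N independent draws from the categorical distribution in a single call.
likert_data <- sample(categories, N, replace = TRUE, prob = probs)

# Observed relative frequencies, to compare against probs.
observed <- tabulate(likert_data, nbins = 5) / N
print(observed)
```

Because the sampling is with replacement, the draws are independent and no pool or M is needed.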