How to one hot encode several categorical variables in R

May 2024 · 4 minute read

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my training and testing sets. I managed to do it on my training data with:

    temps <- X_train
    tt <- subset(temps, select = -output)
    oh <- data.frame(model.matrix(~ . -1, tt), CLASS = temps$output)

But I can't find a way to apply the same encoding to my testing set. How can I do that?


5 Answers

I recommend using the dummyVars function in the caret package:

    library(caret)

    customers <- data.frame(
      id=c(10, 20, 30, 40, 50),
      gender=c('male', 'female', 'female', 'male', 'female'),
      mood=c('happy', 'sad', 'happy', 'sad','happy'),
      outcome=c(1, 1, 0, 0, 0))

    customers
      id gender  mood outcome
    1 10   male happy       1
    2 20 female   sad       1
    3 30 female happy       0
    4 40   male   sad       0
    5 50 female happy       0

    # dummify the data
    dmy <- dummyVars(" ~ .", data = customers)
    trsf <- data.frame(predict(dmy, newdata = customers))

    trsf
      id gender.female gender.male mood.happy mood.sad outcome
    1 10             0           1          1        0       1
    2 20             1           0          0        1       1
    3 30             1           0          1        0       0
    4 40             0           1          0        1       0
    5 50             1           0          1        0       0


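You apply the same procedure to both the training and validation sets: fit dummyVars once on the training data, then call predict() with that fitted object on each set so that both end up with the same columns.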
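A minimal sketch of what that could look like, assuming a small made-up split of the customers example (the `train` and `test` data frames are hypothetical and only there for illustration). The encoder is fitted on the training predictors and then reused for the test set, even though the test rows contain only some of the levels:

```r
library(caret)

# Hypothetical training and test sets (not from the original question)
train <- data.frame(
  gender  = factor(c("male", "female", "female", "male")),
  mood    = factor(c("happy", "sad", "happy", "sad")),
  outcome = c(1, 1, 0, 0)
)
test <- data.frame(
  gender = factor(c("female", "female"), levels = levels(train$gender)),
  mood   = factor(c("happy", "happy"),   levels = levels(train$mood))
)

# Fit the encoder on the training predictors only
dmy <- dummyVars(~ gender + mood, data = train)

# Reuse the same fitted object for both sets
train_oh <- data.frame(predict(dmy, newdata = train))
test_oh  <- data.frame(predict(dmy, newdata = test))

# Both encodings expose the same columns, in the same order
identical(colnames(train_oh), colnames(test_oh))
```

Leaving the outcome out of the formula keeps the encoder usable on test data that has no outcome column.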


Here's a simple solution to one-hot-encode your category using no packages.



It needs your categorical variable to be a factor. The factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
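If the levels differ, one way to line them up is to rebuild the test column with the training levels. This is only a sketch on made-up data (`train`, `test` and the `category` values are hypothetical), and it assumes that values never seen in training should become NA:

```r
# Hypothetical data: the test set has a value ("d") never seen in training
train <- data.frame(category = factor(c("a", "b", "c", "a")))
test  <- data.frame(category = c("a", "c", "d"), stringsAsFactors = FALSE)

# Rebuild the test column using the levels learned from the training data
test$category <- factor(test$category, levels = levels(train$category))

# The two columns now share the same levels
identical(levels(train$category), levels(test$category))

# Values that were not in the training levels are turned into NA
sum(is.na(test$category))
```

How to treat those NA rows (drop them, impute, or add an explicit "other" level) depends on the model, so that part is left open here.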


Here's an example using the iris dataset.

    data(iris)

    #Split into train and test sets.
    train <- sample(1:nrow(iris),100)
    test <- -1*train

    iris[test,]
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    34           5.5         4.2          1.4         0.2    setosa
    106          7.6         3.0          6.6         2.1 virginica
    112          6.4         2.7          5.3         1.9 virginica
    127          6.2         2.8          4.8         1.8 virginica
    132          7.9         3.8          6.4         2.0 virginica

model.matrix() creates a column for each level of the factor, even if that level is not present in the data. A 0 in a column means the row is not that level; a 1 means it is. The 0 + in the formula specifies that you do not want an intercept or reference level, and is equivalent to - 1.

    oh_train <- model.matrix(~0+iris[train,'Species'])
    oh_test <- model.matrix(~0+iris[test,'Species'])

    #Renaming the columns to be more concise.
    attr(oh_test, "dimnames")[[2]] <- levels(iris$Species)

    oh_test
      setosa versicolor virginica
    1      1          0         0
    2      0          0         1
    3      0          0         1
    4      0          0         1
    5      0          0         1
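The two claims about model.matrix() can be checked on a tiny example. This is a sketch on a made-up factor (`species` is a hypothetical vector, not the iris split used above) in which one level never occurs:

```r
# Toy factor with three levels; "versicolor" is declared but never occurs
species <- factor(c("setosa", "virginica", "virginica"),
                  levels = c("setosa", "versicolor", "virginica"))

m_zero  <- model.matrix(~ 0 + species)   # "0 +" removes the intercept
m_minus <- model.matrix(~ species - 1)   # "- 1" removes it as well

# Both formulas produce the same indicator matrix
all(m_zero == m_minus)

# One column per level, including the level that is absent from the data
colnames(m_zero)

# The column for the absent level contains only zeros
colSums(m_zero)
```

Because the columns follow the factor levels rather than the observed values, train and test matrices line up as long as both factors carry the same levels.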

P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.



    library(data.table)
    library(mltools)
    customers_1h <- one_hot(


    > customers_1h
       id gender_female gender_male mood_happy mood_sad outcome
    1: 10             0           1          1        0       1
    2: 20             1           0          0        1       1
    3: 30             1           0          1        0       0
    4: 40             0           1          0        1       0
    5: 50             1           0          1        0       0


    customers <- data.frame(
      id=c(10, 20, 30, 40, 50),
      gender=c('male', 'female', 'female', 'male', 'female'),
      mood=c('happy', 'sad', 'happy', 'sad','happy'),
      outcome=c(1, 1, 0, 0, 0))

In case you don't want to use any external package I have my own function:

    one_hot_encoding = function(df, columns="season"){
      # create a copy of the original data.frame for not modifying the original
      df = cbind(df)
      # convert the columns to vector in case it is a string
      columns = c(columns)
      # for each variable perform the One hot encoding
      for (column in columns){
        unique_values = sort(unique(df[column])[,column])
        # the first element is going to be the reference by default
        non_reference_values = unique_values[c(-1)]
        for (value in non_reference_values){
          # the new dummy column name
          new_col_name = paste0(column,'.',value)
          # create new dummy column for each value of the non_reference_values
          df[new_col_name] <- with(df, ifelse(df[,column] == value, 1, 0))
        }
        # delete the one hot encoded column
        df[column] = NULL
      }
      return(df)
    }

And you use it like this:

    df = one_hot_encoding(df, c("season"))
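One thing to watch when the goal is a matching train/test encoding: a helper like this derives the dummy columns from the values present in the data frame it receives, so a test set that lacks some value would come back with fewer columns. A rough sketch of one way around that is to learn the values from the training set and pass them in explicitly. `encode_with_levels` and the toy `season` data below are hypothetical, introduced only for illustration:

```r
# Hypothetical helper: encode `column` using a fixed vector of known values.
# The first value acts as the reference and gets no dummy column.
encode_with_levels <- function(df, column, values) {
  for (value in values[-1]) {
    df[[paste0(column, ".", value)]] <- as.integer(df[[column]] == value)
  }
  df[[column]] <- NULL   # drop the original categorical column
  df
}

# Toy data: the test set only contains "summer"
train <- data.frame(season = c("winter", "spring", "summer", "fall"),
                    y = 1:4, stringsAsFactors = FALSE)
test  <- data.frame(season = c("summer", "summer"),
                    y = 5:6, stringsAsFactors = FALSE)

# Learn the set of values from the training data only
season_values <- sort(unique(train$season))

train_oh <- encode_with_levels(train, "season", season_values)
test_oh  <- encode_with_levels(test,  "season", season_values)

# Same columns in both sets, even though the test set has a single season
identical(colnames(train_oh), colnames(test_oh))
```

The sketch assumes the test set contains no value that is missing from the training set; such rows would simply get zeros in every dummy column.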

Hi, here is my version of the same. This function encodes all categorical variables that are factors, removes one dummy variable per factor to avoid the dummy variable trap, and returns a new data frame with the encoding:

    onehotencoder <- function(df_orig) {
      # work on a copy so the input data frame is left untouched
      df <- df_orig
      # names of all columns stored as factors
      factor_cols <- names(df)[sapply(df, is.factor)]
      for (col in factor_cols) {
        # one dummy column per level, no intercept
        dummy_matx <- data.frame(model.matrix(~ . - 1, data = df[col]))
        # drop the first dummy column, which becomes the reference level;
        # drop = FALSE keeps a data frame even for two-level factors
        dummy_matx <- dummy_matx[, -1, drop = FALSE]
        # replace the factor column with its dummy columns
        df[col] <- NULL
        df <- cbind(df, dummy_matx)
      }
      return(df)
    }
