Title: | Category Variable Encodings |
---|---|
Description: | Simple, fast, and automatic encodings for category data using a data.table backend. Most of the methods are an implementation of Johannemann, Hadad, Athey, Wager (2019) <arXiv:1908.09874>, particularly their 'means', 'sPCA', 'low-rank' and 'multinomial logit' encodings. |
Authors: | Juraj Szitas [aut, cre] |
Maintainer: | Juraj Szitas <[email protected]> |
License: | GPL-3 |
Version: | 1.5.0 |
Built: | 2025-02-21 03:29:10 UTC |
Source: | https://github.com/jszitas/categoryencodings |
**[deprecated: use encoder()]** Transforms the original design matrix automatically, using the appropriate encoding.
encode_categories(X, Y = NULL, fact = NULL, method = NULL, keep = FALSE)
X |
The data.frame/data.table to transform. |
Y |
Optional: The dependent variable to ignore in the transformation. |
fact |
Optional: The factor variable(s) to encode by - either positive integer(s) specifying the column number, or the name(s) of the column. If left empty a heuristic is used to determine the factor variable(s), and a warning is written with the names of the variables converted. |
method |
Optional: A character string indicating which encoding method to use, one of the following: * "mean" * "median" * "deviation" * "lowrank" * "spca" * "mnl" * "dummy" * "difference" * "helmert" * "simple_effect" * "repeated_effect" If only a single method is specified, it is used to encode either all of the variables supplied through *fact*, or all variables that have been flagged as factors automatically. If multiple methods are specified, the number of methods must match the number of factor variables in *fact*, and the methods are applied in the order in which they were supplied. If a mismatch occurs, an error is raised. If left empty, the appropriate method is selected on a case-by-case basis (and the selected methods are written out to the console). |
keep |
Whether to keep the original factor column(s), defaults to **FALSE**. |
Automatically selects the appropriate method given the number of anticipated newly created variables, based on the results in Johannemann et al.(2019) 'Sufficient Representations for Categorical Variables', and a simple heuristic - where
A new data.table X which contains the new columns and optionally the old factor(s).
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_categories(design_mat, method = "mean")
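The rule for pairing entries of *method* with entries of *fact* can be sketched in base R. This is only an illustration of the documented behaviour, not the package's internal code: `resolve_methods` is a hypothetical helper, and the automatic selection used when *method* is left empty is not covered.

```r
# Hypothetical helper (not part of the package): pair each factor with a method.
resolve_methods <- function(fact, method) {
  valid <- c("mean", "median", "deviation", "lowrank", "spca", "mnl",
             "dummy", "difference", "helmert", "simple_effect",
             "repeated_effect")
  if (!all(method %in% valid)) {
    stop("Unknown method(s): ",
         paste(setdiff(method, valid), collapse = ", "))
  }
  # A single method is recycled across all factors.
  if (length(method) == 1L) {
    method <- rep(method, length(fact))
  }
  # Otherwise the counts must agree; methods apply in the order supplied.
  if (length(method) != length(fact)) {
    stop("Number of methods must match the number of factor variables.")
  }
  stats::setNames(method, fact)
}

# One method for every factor
resolve_methods(c("region", "product"), "mean")

# One method per factor, matched by position
resolve_methods(c("region", "product"), c("dummy", "helmert"))
```

Supplying two methods for three factors stops with an error in this sketch, mirroring the mismatch error described for the *method* argument.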
Transforms the original design matrix using a deviation dummy encoding.
encode_deviation(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
The deviation dummy variable encoding, with reference class level set to -1. The reference class is always the last class observed.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
# encode_deviation(X = design_mat, fact = "factor_var", keep_factor = FALSE)
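To make the coding scheme concrete, the following sketch builds a deviation (sum-to-zero) coding with base R's `contr.sum()`, where the last level is coded -1 in every column. It is not the package's implementation: the toy factor `f` is made up, and the package's column names and its choice of "last class observed" (as opposed to the last factor level used here) may differ.

```r
# Base R only; the package's own column names and ordering may differ.
f <- factor(c("a", "b", "c", "a", "c"))

# contr.sum() gives the deviation (sum-to-zero) contrasts:
# each non-reference level gets its own indicator column,
# and the last level is coded -1 in every column.
codes <- contr.sum(nlevels(f))
rownames(codes) <- levels(f)
codes

# Map every observation to its row of the contrast matrix.
encoded <- codes[as.character(f), , drop = FALSE]
encoded
```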
Transforms the original design matrix using a difference encoding.
encode_difference(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_difference(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Transforms the original design matrix using a dummy variable encoding.
encode_dummy( X, fact, keep_factor = FALSE, encoding_only = FALSE, use_reference = TRUE, reference_value = 0 )
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
use_reference |
Whether to include a reference level (i.e. whether the new encoding contains an **intercept-like** constant term). Defaults to **TRUE**. |
reference_value |
What the reference value should be if **use_reference** is set to **TRUE**. Defaults to 0. |
The basic dummy variable encoding, with reference class level set to 0. The reference class is always the first class observed.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_dummy(X = design_mat, fact = "factor_var", keep_factor = FALSE)
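As a rough illustration of a dummy coding whose reference class is the first level and is coded 0, the sketch below uses base R's `contr.treatment()`. It assumes the default settings described above (`use_reference = TRUE`, `reference_value = 0`); the toy factor `f` is made up, and the package's output column names are not reproduced here.

```r
# Base R only; an illustration of the coding scheme, not the package's code.
f <- factor(c("a", "b", "c", "a", "c"))

# contr.treatment() drops the first level: it becomes the reference
# and is coded 0 in every column.
codes <- contr.treatment(levels(f))
codes

# One row of indicators per observation.
encoded <- codes[as.character(f), , drop = FALSE]
encoded
```

With k levels this yields k - 1 indicator columns, so a factor with 10 levels, as in the example above, would add 9 columns under this scheme.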
Transforms the original design matrix using a Helmert (reverse difference) encoding.
encode_helmert(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_helmert(X = design_mat, fact = "factor_var", keep_factor = FALSE)
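For orientation, base R ships a Helmert contrast matrix via `contr.helmert()`, in which each column contrasts one level with the levels that precede it. The sketch below only shows that base R matrix on a made-up factor `f`; whether the package uses the same direction and scaling is an assumption that should be checked against its actual output.

```r
# Base R only; direction and scaling of the package's encoding may differ.
f <- factor(c("a", "b", "c", "a", "c"))

# Column j contrasts level j + 1 with the levels before it.
codes <- contr.helmert(nlevels(f))
rownames(codes) <- levels(f)
codes

# Map every observation to its row of the contrast matrix.
encoded <- codes[as.character(f), , drop = FALSE]
encoded
```

The columns of this matrix sum to zero and are mutually orthogonal, which is the property that usually motivates Helmert-style codings.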
Transforms the original design matrix using a low rank encoding.
encode_lowrank(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
Uses the low-rank method from Johannemann et al. (2019), 'Sufficient Representations for Categorical Variables'.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_lowrank(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Transforms the original design matrix using a means encoding.
encode_mean(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
Uses the means encoding method from Johannemann et al. (2019), 'Sufficient Representations for Categorical Variables'.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_mean(X = design_mat, fact = "factor_var", keep_factor = FALSE)
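The idea behind a means encoding, as commonly understood, is to represent each category by the within-category means of the numeric covariates. The base R sketch below illustrates that idea only; it is not the package's data.table implementation, and the toy data, the `g_mean_` column prefix, and the choice to keep the original numeric columns are all assumptions made for the example.

```r
# Base R only; a sketch of the idea, not the package's implementation.
set.seed(1)
X <- data.frame(
  x1 = rnorm(8),
  x2 = rnorm(8),
  g  = c("a", "b", "a", "c", "b", "a", "c", "b")
)

num_cols <- c("x1", "x2")

# For each numeric column, replace every row's value by the mean of that
# column within the row's category.
means <- lapply(X[num_cols], function(col) ave(col, X$g, FUN = mean))
names(means) <- paste0("g_mean_", num_cols)  # hypothetical column names

# Numeric covariates plus one group-mean column per covariate.
encoded <- cbind(X[num_cols], as.data.frame(means))
encoded
```

Swapping `mean` for `median` in the `ave()` call gives the analogous median-based variant.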
Transforms the original design matrix using a median encoding.
encode_median(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
This might be somewhat lacking in theory (to the author's best knowledge), but feel free to try it and publish the results if they turn out interesting on some particular problem.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_median(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Transforms the original design matrix using a multinomial logit (MNL) encoding.
encode_mnl(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
Uses the multinomial logit (MNL) method from Johannemann et al. (2019), 'Sufficient Representations for Categorical Variables'.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_mnl(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Transforms the original design matrix using a repeated effect encoding.
encode_repeated_effect(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_repeated_effect(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Transforms the original design matrix using a simple effect encoding.
encode_simple_effect(X, fact, keep_factor = FALSE, encoding_only = FALSE)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_simple_effect(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Transforms the original design matrix using an sPCA encoding.
encode_spca(X, fact, keep_factor = FALSE, encoding_only = FALSE, ...)
X |
The data.frame/data.table to transform. |
fact |
The factor variable to encode by - either a positive integer specifying the column number, or the name of the column. |
keep_factor |
Whether to keep the original factor column (defaults to **FALSE**). |
encoding_only |
Whether to return the full transformed dataset or only the new columns. Defaults to FALSE and returns the full dataset. |
... |
Additional parameters to pass to |
Uses the sPCA method from Johannemann et al. (2019), 'Sufficient Representations for Categorical Variables'.
A new data.table X which contains the new columns and optionally the old factor.
design_mat <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
encode_spca(X = design_mat, fact = "factor_var", keep_factor = FALSE)
Make your own encoder to be used in a pipeline
encoder( X, Y = NULL, fact = NULL, method = NULL, custom_encoding_assignment = NULL, ... )
X |
The data.frame/data.table to transform. |
Y |
Optional: The dependent variable to ignore in the transformation. |
fact |
Optional: The factor variable(s) to encode by - either positive integer(s) specifying the column number, or the name(s) of the column. If left empty a heuristic is used to determine the factor variable(s), and a warning is written with the names of the variables converted. |
method |
Optional: A character string indicating which encoding method to use, one of the following: * "mean" * "median" * "deviation" * "lowrank" * "spca" * "mnl" * "dummy" * "difference" * "helmert" * "simple_effect" * "repeated_effect" If only a single method is specified, it is used to encode either all of the variables supplied through *fact*, or all variables that have been flagged as factors automatically. If multiple methods are specified, the number of methods must match the number of factor variables in *fact*, and the methods are applied in the order in which they were supplied. If a mismatch occurs, an error is raised. If left empty, the appropriate method is selected on a case-by-case basis (and the selected methods are written out to the console). |
custom_encoding_assignment |
**experimental** A function which takes two arguments (**X** and **fact**), denoting the data and the factors, respectively, and assigns a valid encoding **method** to each factor in **fact**. |
... |
Not implemented. |
Automatically selects the appropriate method given the number of anticipated newly created variables, based on the results in Johannemann et al.(2019) 'Sufficient Representations for Categorical Variables', and a simple heuristic - where
A new data.table X which contains the new columns and optionally the old factor(s).
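A function suitable for **custom_encoding_assignment** might look roughly like the sketch below. Only the two-argument signature (**X**, **fact**) comes from the description above; the return format (a character vector with one method per factor), the helper name `assign_by_cardinality`, and the threshold of 5 levels are assumptions made for illustration, and the argument is marked experimental.

```r
# Hypothetical assignment function: the return format (a character vector
# with one method per factor) is an assumption, not documented behaviour.
assign_by_cardinality <- function(X, fact) {
  vapply(fact, function(f) {
    n_levels <- length(unique(X[[f]]))
    # Arbitrary illustrative rule: few levels -> dummy, many levels -> mean.
    if (n_levels <= 5) "dummy" else "mean"
  }, character(1))
}

# Toy data with one low-cardinality and one high-cardinality factor.
toy <- data.frame(
  x     = rnorm(20),
  small = rep(c("a", "b"), 10),
  large = rep(letters[1:10], 2)
)

assign_by_cardinality(toy, c("small", "large"))
```

Both values returned here are taken from the list of valid **method** strings, which is the one requirement the argument description states explicitly.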