This function filter variables base on specified conditions, such as information value, missing rate, identical value rate.

var_filter(dt, y, x = NULL, iv_limit = 0.02, missing_limit = 0.95,
  identical_limit = 0.95, var_rm = NULL, var_kp = NULL,
  return_rm_reason = FALSE, positive = "bad|1")

Arguments

dt

A data frame with both x (predictor/feature) and y (response/label) variables.

y

Name of y variable.

x

Name of x variables. Defaults to NULL. If x is NULL, then all columns except y are counted as x variables.

iv_limit

The information value of kept variables should >= iv_limit. The Defaults to 0.02.

missing_limit

The missing rate of kept variables should <= missing_limit. The Defaults to 0.95.

identical_limit

The identical value rate (excluding NAs) of kept variables should <= identical_limit. The Defaults to 0.95.

var_rm

Name of force removed variables, Defaults to NULL.

var_kp

Name of force kept variables, Defaults to NULL.

return_rm_reason

Logical, Defaults to FALSE.

positive

Value of positive class, Defaults to "bad|1".

Value

A data frame with columns for y and selected x variables, and a data frame with columns for remove reason if return_rm_reason == TRUE.

Examples

# Load German credit data
data(germancredit)

# variable filter
dt_sel = var_filter(germancredit, y = "creditability")
#> [INFO] filtering variables ... 
dim(dt_sel)
#> [1] 1000   14

# return the reason of varaible removed
dt_sel2 = var_filter(germancredit, y = "creditability", return_rm_reason = TRUE)
#> [INFO] filtering variables ... 
lapply(dt_sel2, dim)
#> $dt
#> [1] 1000   14
#> 
#> $rm
#> [1] 20  5
#> 

str(dt_sel2$dt)
#> Classes ‘data.table’ and 'data.frame':	1000 obs. of  14 variables:
#>  $ status.of.existing.checking.account                : Factor w/ 4 levels "... < 0 DM","0 <= ... < 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
#>  $ duration.in.month                                  : num  6 48 12 42 24 36 24 36 12 30 ...
#>  $ credit.history                                     : Factor w/ 5 levels "no credits taken/ all credits paid back duly",..: 5 3 5 3 4 3 3 3 3 5 ...
#>  $ purpose                                            : chr  "radio/television" "radio/television" "education" "furniture/equipment" ...
#>  $ credit.amount                                      : num  1169 5951 2096 7882 4870 ...
#>  $ savings.account.and.bonds                          : Factor w/ 5 levels "... < 100 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
#>  $ present.employment.since                           : Factor w/ 5 levels "unemployed","... < 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
#>  $ installment.rate.in.percentage.of.disposable.income: num  4 2 2 2 3 2 3 2 2 4 ...
#>  $ other.debtors.or.guarantors                        : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
#>  $ property                                           : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
#>  $ age.in.years                                       : num  67 22 49 45 53 35 53 35 61 28 ...
#>  $ other.installment.plans                            : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ housing                                            : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
#>  $ creditability                                      : int  0 1 0 0 1 0 0 0 0 1 ...
#>  - attr(*, ".internal.selfref")=<externalptr> 
str(dt_sel2$rm)
#> Classes ‘data.table’ and 'data.frame':	20 obs. of  5 variables:
#>  $ variable      : chr  "foreign.worker" "job" "number.of.existing.credits.at.this.bank" "number.of.people.being.liable.to.provide.maintenance.for" ...
#>  $ rm_reason     : chr  "identical rate > 0.95" "iv < 0.02" "iv < 0.02" "iv < 0.02" ...
#>  $ info_value    : num  4.39e-02 8.76e-03 1.33e-02 4.34e-05 8.84e-03 ...
#>  $ missing_rate  : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ identical_rate: num  0.963 0.63 0.633 0.845 0.548 0.413 0.596 0.051 0.003 0.53 ...
#>  - attr(*, ".internal.selfref")=<externalptr> 

# keep columns manually, such as rowid
germancredit$rowid = row.names(germancredit)
dt_sel3 = var_filter(germancredit, y = "creditability", var_kp = 'rowid')
#> [INFO] filtering variables ... 

# remove columns manually
dt_sel4 = var_filter(germancredit, y = "creditability", var_rm = 'rowid')
#> [INFO] filtering variables ...