This function filters variables based on specified conditions, such as information value, missing rate, and identical value rate.

var_filter(dt, y, x = NULL, iv_limit = 0.02, missing_limit = 0.95,
  identical_limit = 0.95, var_rm = NULL, var_kp = NULL,
  return_rm_reason = FALSE, positive = "bad|1")

Arguments

dt

A data frame with both x (predictor/feature) and y (response/label) variables.

y

Name of y variable.

x

Names of x variables. Default is NULL. If x is NULL, all columns except y are used as x variables.

iv_limit

A variable is kept only if its information value is >= iv_limit. The default is 0.02.

missing_limit

A variable is kept only if its missing rate is <= missing_limit. The default is 0.95.

identical_limit

A variable is kept only if its identical value rate (excluding NAs) is <= identical_limit. The default is 0.95.

var_rm

Names of variables that are always removed, regardless of the filter conditions. Default is NULL.

var_kp

Names of variables that are always kept, regardless of the filter conditions. Default is NULL.

return_rm_reason

Logical. If TRUE, the reason each variable was removed is returned as well. Default is FALSE.

positive

Value of positive class, default is "bad|1".
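
The three thresholds above (iv_limit, missing_limit, identical_limit) can be illustrated with a small base R sketch. The helpers missing_rate_sketch, identical_rate_sketch and iv_sketch are hypothetical names introduced here for illustration and are not part of the package. The sketch assumes a categorical predictor and a 0/1 response, and it reads the identical value rate as the share of the most frequent non-missing value. How var_filter computes these quantities internally (for example, how numeric variables are binned) may differ.

```r
# Hypothetical helpers for illustration only; not the package's implementation.

# Share of missing values in x.
missing_rate_sketch <- function(x) mean(is.na(x))

# Share of the most frequent value among the non-missing values of x
# (an assumed reading of "identical value rate (excluding NAs)").
identical_rate_sketch <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)  # nothing left to count
  max(table(x)) / length(x)
}

# Information value of a categorical x against a binary 0/1 response y.
# Assumes every level of x contains both classes (no zero counts).
iv_sketch <- function(x, y) {
  tab <- table(x, y)                        # counts per level and class; NA rows dropped
  dist_neg <- tab[, "0"] / sum(tab[, "0"])  # share of class 0 falling in each level
  dist_pos <- tab[, "1"] / sum(tab[, "1"])  # share of class 1 falling in each level
  sum((dist_pos - dist_neg) * log(dist_pos / dist_neg))
}

# Toy data: two levels plus two missing values.
x <- c("a", "a", "a", "a", "b", "b", "b", "b", NA, NA)
y <- c(0, 0, 0, 1, 0, 1, 1, 1, 0, 1)

mr <- missing_rate_sketch(x)
ir <- identical_rate_sketch(x)
iv <- iv_sketch(x, y)

# Apply the default limits from the usage above.
keep <- iv >= 0.02 && mr <= 0.95 && ir <= 0.95
```

On this toy data the variable passes all three default limits, so keep is TRUE. Raising iv_limit or lowering identical_limit tightens the filter.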

Value

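A data frame with columns for y and the selected x variables. If return_rm_reason is TRUE, a list is returned instead: element dt holds the filtered data frame, and element rm holds a data frame listing each removed variable and the reason it was removed.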

Examples

# Load German credit data
data(germancredit)

# variable filter
dt_sel = var_filter(germancredit, y = "creditability")
#> [INFO] filtering variables ...
dim(dt_sel)
#> [1] 1000 15

# return the reasons why variables were removed
dt_sel2 = var_filter(germancredit, y = "creditability", return_rm_reason = TRUE)
#> [INFO] filtering variables ...
lapply(dt_sel2, dim)
#> $dt
#> [1] 1000 15
#>
#> $rm
#> [1] 20 5
#>
str(dt_sel2$dt)
#> Classes ‘data.table’ and 'data.frame': 1000 obs. of 15 variables:
#> $ status.of.existing.checking.account : Factor w/ 4 levels "... < 0 DM","0 <= ... < 200 DM",..: 1 2 4 1 1 4 4 2 4 2 ...
#> $ duration.in.month : num 6 48 12 42 24 36 24 36 12 30 ...
#> $ credit.history : Factor w/ 5 levels "no credits taken/ all credits paid back duly",..: 5 3 5 3 4 3 3 3 3 5 ...
#> $ purpose : chr "radio/television" "radio/television" "education" "furniture/equipment" ...
#> $ credit.amount : num 1169 5951 2096 7882 4870 ...
#> $ savings.account.and.bonds : Factor w/ 5 levels "... < 100 DM",..: 5 1 1 1 1 5 3 1 4 1 ...
#> $ present.employment.since : Factor w/ 5 levels "unemployed","... < 1 year",..: 5 3 4 4 3 3 5 3 4 1 ...
#> $ installment.rate.in.percentage.of.disposable.income: num 4 2 2 2 3 2 3 2 2 4 ...
#> $ personal.status.and.sex : Factor w/ 5 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 2 4 ...
#> $ other.debtors.or.guarantors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
#> $ property : Factor w/ 4 levels "real estate",..: 1 1 1 2 4 4 2 3 1 3 ...
#> $ age.in.years : num 67 22 49 45 53 35 53 35 61 28 ...
#> $ other.installment.plans : Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
#> $ housing : Factor w/ 3 levels "rent","own","for free": 2 2 2 3 3 3 2 1 2 2 ...
#> $ creditability : int 0 1 0 0 1 0 0 0 0 1 ...
#> - attr(*, ".internal.selfref")=<externalptr>
str(dt_sel2$rm)
#> Classes ‘data.table’ and 'data.frame': 20 obs. of 5 variables:
#> $ variable : chr "foreign.worker" "job" "number.of.existing.credits.at.this.bank" "number.of.people.being.liable.to.provide.maintenance.for" ...
#> $ rm_reason : chr "identical rate > 0.95" "iv < 0.02" "iv < 0.02" "iv < 0.02" ...
#> $ info_value : num 4.39e-02 8.76e-03 1.33e-02 4.34e-05 3.59e-03 ...
#> $ missing_rate : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ identical_rate: num 0.963 0.63 0.633 0.845 0.413 0.596 0.051 0.003 0.53 0.184 ...
#> - attr(*, ".internal.selfref")=<externalptr>

# keep columns manually, such as rowid
germancredit$rowid = row.names(germancredit)
dt_sel3 = var_filter(germancredit, y = "creditability", var_kp = 'rowid')
#> [INFO] filtering variables ...

# remove columns manually
dt_sel4 = var_filter(germancredit, y = "creditability", var_rm = 'rowid')
#> [INFO] filtering variables ...