This function calculates the information value (IV) for multiple x variables. It treats each unique value of an x variable as a group. If a group contains zero observations of either y class, the zero count is replaced by 0.99 so that WOE and IV remain calculable.
iv(dt, y, x = NULL, positive = "bad|1", order = TRUE)
A data frame with both x (predictor/feature) and y (response/label) variables.
Name of y variable.
Name of x variables. Defaults to NULL. If x is NULL, then all columns except y are counted as x variables.
Value of the positive class. Defaults to "bad|1".
Logical. Defaults to TRUE. If TRUE, the output is sorted in descending order of IV.
A data frame with the columns variable and info_value.
IV is a widely used measure for variable selection when developing credit scorecards. For each group i of a variable, DistributionPositive_i is the share of all positive cases that fall in group i, and DistributionNegative_i is the corresponding share of all negative cases. The formula for information value is shown below: $$IV = \sum_{i} (DistributionPositive_{i} - DistributionNegative_{i}) \times \ln\left(\frac{DistributionPositive_{i}}{DistributionNegative_{i}}\right).$$ The log component in information value is defined as the weight of evidence (WOE) of group i, which is shown as $$WeightofEvidence_{i} = \ln\left(\frac{DistributionPositive_{i}}{DistributionNegative_{i}}\right).$$ The relationship between information value and predictive power is as follows:
Information Value | Predictive Power |
----------------- | ---------------- |
< 0.02 | Useless for prediction |
0.02 to 0.1 | Weak predictor |
0.1 to 0.3 | Medium predictor |
> 0.3 | Strong predictor |
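The formulas above can be traced by hand for a single predictor. The following is a minimal base R sketch, not the package's implementation: `iv_sketch` is a hypothetical helper, the toy data are made up for illustration, and it assumes that the 0.99 replacement is applied to the raw counts before the distributions are computed.

```r
# Hypothetical helper (not part of the documented package):
# information value of one predictor, treating each unique value of x as a group.
iv_sketch <- function(x, y, positive = 1) {
  is_pos <- y == positive
  pos <- tapply(is_pos, x, sum)    # number of positive cases per group
  neg <- tapply(!is_pos, x, sum)   # number of negative cases per group

  # Assumption: zero counts are replaced by 0.99 before normalising,
  # so that the logarithm below stays finite.
  pos[pos == 0] <- 0.99
  neg[neg == 0] <- 0.99

  dist_pos <- pos / sum(pos)       # DistributionPositive_i
  dist_neg <- neg / sum(neg)       # DistributionNegative_i
  woe <- log(dist_pos / dist_neg)  # WeightofEvidence_i

  sum((dist_pos - dist_neg) * woe) # IV
}

# Toy data, made up for illustration only
x <- c("a", "a", "a", "a", "b", "b", "b", "b")
y <- c(1, 1, 1, 0, 1, 0, 0, 0)

iv_value <- iv_sketch(x, y)
print(iv_value)
```

In this toy case group a holds 3/4 of the positives and 1/4 of the negatives, and group b the reverse, so each group contributes 0.5 * ln(3) and the total equals ln(3).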
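# Load the package (assumed here to be scorecard, which provides iv() and germancredit)
library(scorecard)
# Load German credit data
data(germancredit)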
# information values
info_value = iv(germancredit, y = "creditability")
str(info_value)
#> Classes ‘data.table’ and 'data.frame': 20 obs. of 2 variables:
#> $ variable : chr "status.of.existing.checking.account" "duration.in.month" "credit.history" "age.in.years" ...
#> $ info_value: num 0.666 0.335 0.293 0.26 0.196 ...
#> - attr(*, ".internal.selfref")=<externalptr>
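To relate such output to the predictive power table, the values can be binned with base R's `cut()`. This is a sketch only: the vector below copies the first five values printed above, and treating each band as closed on the left (so that exactly 0.3 counts as strong) is an assumption, since the table leaves the boundaries open.

```r
# First five information values from the example output above
iv_values <- c(0.666, 0.335, 0.293, 0.26, 0.196)

# Map each value to a predictive power band.
# Assumption: intervals are closed on the left, e.g. [0.1, 0.3) is "Medium predictor".
power <- cut(
  iv_values,
  breaks = c(-Inf, 0.02, 0.1, 0.3, Inf),
  labels = c("Useless for prediction", "Weak predictor",
             "Medium predictor", "Strong predictor"),
  right = FALSE
)

print(data.frame(info_value = iv_values, predictive_power = power))
```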