
ml_logistic_regression does not work with categorical features #96

Closed

szilard opened this issue Jul 10, 2016 · 20 comments


szilard commented Jul 10, 2016

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

d <- data.frame(y = letters[1:2], x1 = letters[1:2], x2 = 1:4)
dx <- copy_to(sc, d, overwrite = TRUE)

dx %>%  ml_logistic_regression(response = "y", features = c("x1","x2"))

gets:
Error: org.apache.spark.SparkException: VectorAssembler does not support the StringType type

while

d <- data.frame(y = letters[1:2], x1 = 1:2, x2 = 1:4)
dx <- copy_to(sc, d, overwrite = TRUE)

dx %>%  ml_logistic_regression(response = "y", features = c("x1","x2"))

is OK (the difference is that x1 is categorical in (1) and numeric in (2)).

In SparkR both work:

d <- data.frame(y = letters[1:2], x1 = letters[1:2], x2 = 1:4)
dx <- as.DataFrame(sqlContext, d)

md <- glm(y ~ ., family = "binomial", data = dx)
summary(md)

slopp commented Jul 10, 2016

Did you try using ft_string_indexer? You should be able to do:

d <- data.frame(y = letters[1:2], x1 = letters[1:2], x2 = 1:4)

dx <- copy_to(sc, d)

dx %>%
  sdf_mutate(x1_cat = ft_string_indexer(x1)) %>%
  ml_logistic_regression(response = "y", features = c("x1_cat", "x2"))

jjallaire (Contributor) commented:

@kevinushey What accounts for this difference in behavior from SparkR? It looks to me like y and x1 are both character vectors (not factors), so our handling seems correct (i.e. the user is required to use ft_string_indexer). If y and x1 were in fact factors, would this then work without the sdf_mutate?

kevinushey (Contributor) commented Jul 10, 2016

Behind the scenes, sparklyr uses ft_string_indexer for categorical response variables; it seems we may need to do the same for features as well. I'll take a look on Monday and confirm that SparkR is doing the same thing here (I'm guessing they're transparently converting character vectors / factors into 0:n numeric vectors as well).

szilard (Author) commented Jul 10, 2016

@slopp I'm just a spoiled R user expecting ml_logistic_regression to do all that for me :) Btw that would be pretty ugly with 100 categorical features.

@kevinushey that would be great.

kevinushey (Contributor) commented:

It looks like this is going to take some more work to implement. IIUC, Spark uses the OneHotEncoder transformer to turn categorical columns into separate binary vectors, which are then used in the model fit (basically as dummy variables, if I understand correctly?).
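As a base-R point of reference (my illustration, not Spark code): this is essentially the encoding that model.matrix() produces for a factor, with one reference level dropped:

x <- factor(letters[1:3])
model.matrix(~ x)
#   (Intercept) xb xc
# 1           1  0  0
# 2           1  1  0
# 3           1  0  1
# one 0/1 column per non-reference level; 'a' is the dropped reference level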

JosiahParry commented:

@szilard I had the same issue / response. Glad I'm not the only one.

szilard (Author) commented Jul 11, 2016

@kevinushey Re dummy vars: I guess so too, I'm not really a Spark user ;) In most stats/ML libs it's either the user (extra code) or the system (behind the scenes) that does the encoding; I prefer the latter (less user code to write/maintain, nicer API). Looks like SparkR does it now (it was broken a few months ago), so it would be nice to match it at some point.

kevinushey (Contributor) commented Jul 11, 2016

Definitely agree that we should be doing this ourselves behind the scenes!

I think this code (roughly speaking) is what Spark (and hence SparkR) is using to handle categorical variables:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L128-L174

Basically, given a column of strings (intended to be treated as a categorical variable), we need to:

  1. Use the StringIndexer to transform the column into a numeric column, indexed from 0:n, and then
  2. Use the OneHotEncoder to transform the indexed column into binary vectors, where each binary vector is effectively a dummy-variable encoding of one level of the categorical variable.

The tricky part in doing this all transparently is keeping track of variable labels and so on.
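A minimal sketch of those two steps done by hand, using the feature transformers sparklyr already exposes (chaining ft_one_hot_encoder after ft_string_indexer like this is my assumption):

dx %>%
  sdf_mutate(x1_idx = ft_string_indexer(x1)) %>%       # step 1: strings -> 0:n index
  sdf_mutate(x1_vec = ft_one_hot_encoder(x1_idx)) %>%  # step 2: index -> binary dummy vectors
  ml_logistic_regression(response = "y", features = c("x1_vec", "x2"))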

kevinushey added a commit that referenced this issue Jul 11, 2016
kevinushey (Contributor) commented:

We should now handle categorical variables using a similar approach to R / SparkR. For a dummy dataset, I see:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

set.seed(123)
data <- data.frame(
  y = rnorm(100),
  x = rep(letters[1:10], each = 10),
  stringsAsFactors = FALSE
)

data_tbl <- copy_to(sc, data)

ml_linear_regression(data_tbl, y ~ x)

producing

> ml_linear_regression(data_tbl, y ~ x)
Call:
y ~ xb + xc + xd + xe + xf + xg + xh + xi + xj 

Coefficients:
(Intercept)          xb          xc          xd          xe          xf          xg          xh          xi 
 0.07462564  0.13399632 -0.49918452  0.24741891 -0.08334118  0.14706035  0.04845804 -0.43754346  0.23846966 
         xj 
 0.36246853 

Of course, we need to preserve the original call appropriately, but this is definitely a step in the right direction.

kevinushey (Contributor) commented:

Closing this as implemented now; please let us know if you encounter any bugs / have other feature requests!

szilard (Author) commented Jul 13, 2016

Re-opening this because of performance issues:

Take this data: https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
(airline dataset, I used it in this benchmark: https://github.com/szilard/benchm-ml/ )

d_train <- spark_read_csv(sc, name = "d_train", path = "train-10m.csv")

system.time({
md <- d_train %>% 
       sdf_mutate(Month_cat = ft_string_indexer(Month)) %>%
       sdf_mutate(DayofMonth_cat = ft_string_indexer(DayofMonth)) %>%
       sdf_mutate(DayOfWeek_cat = ft_string_indexer(DayOfWeek)) %>%
       sdf_mutate(UniqueCarrier_cat = ft_string_indexer(UniqueCarrier)) %>%
       sdf_mutate(Origin_cat = ft_string_indexer(Origin)) %>%
       sdf_mutate(Dest_cat = ft_string_indexer(Dest)) %>%
       ml_logistic_regression(response = "dep_delayed_15min",
          features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
             "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})

runs in 58 sec on a 16-core box
(SparkR takes 54 sec)

system.time({
md <- d_train %>% ml_logistic_regression(response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})

runs forever (I stopped it after ~10 mins) and uses only 1 core.

logs show things like:

SELECT *
FROM `sparklyr_tmp_202f65621d37

jjallaire (Contributor) commented:

On an unrelated note, I think the sdf_mutate sequence can be re-written as:

md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))

i.e. a single call to sdf_mutate with multiple parameters. @kevinushey is that correct?

szilard (Author) commented Jul 13, 2016

Yeah, just checked: that works and has the same runtime (expected, as the calls are lazily evaluated). Nicer code, though.

kevinushey (Contributor) commented Jul 13, 2016

Note that the models you are fitting are not identical -- the first converts the character vectors into integer vectors and fits those as numeric features, rather than as categorical variables.

The second version actually does fit those as categorical variables -- with the current implementation, this implies introducing a new dummy variable for each level in each categorical variable (minus one reference level).

The problem appears to be that generating the dummy variables takes much longer than expected.

FWIW, I think you would get identical behavior by one-hot encoding the columns, then regressing against those -- that may indeed be more performant.
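For instance, a sketch of that one-hot approach on the airline data above (shown for two of the categorical columns to keep it short; the rest follow the same pattern, and the exact ft_one_hot_encoder behavior here is an assumption on my part):

md <- d_train %>%
   sdf_mutate(
      Month_idx  = ft_string_indexer(Month),
      Month_oh   = ft_one_hot_encoder(Month_idx),
      Origin_idx = ft_string_indexer(Origin),
      Origin_oh  = ft_one_hot_encoder(Origin_idx)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_oh", "Origin_oh", "DepTime", "Distance"))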

kevinushey (Contributor) commented:

It looks like the main culprit is our use of withColumn, which is a very slow mechanism for building DataFrames:

https://issues.apache.org/jira/browse/SPARK-7276

I'll look into speeding this up. Thanks for reporting!

szilard (Author) commented Jul 13, 2016

Oh, I see. I think in most cases people would want to fit these as categorical vars (that's the default of most R functions, e.g. glm, and it's also what R users would generally expect).

It looks like SparkR is doing that (dummies), yet it's fast (so you can speed up sparklyr :))

Back to the original example, slightly modified:

d <- data.frame(y = letters[1:2], x1 = letters[1:3], x2 = 1:6)
dx <- copy_to(sc, d)

As categorical:

dx %>%  ml_logistic_regression(response = "y", features = c("x1","x2"))

Call: y ~ x1b + x1c + x2
Coefficients:
(Intercept)         x1b         x1c          x2 
  1.1552454   0.4620983   0.9241965  -0.4620982 

As integer encoding:

dx %>% sdf_mutate(x1_cat = ft_string_indexer(x1)) %>%
  ml_logistic_regression(response = "y", features = c("x1_cat", "x2"))

Call: y ~ x1_cat + x2

Coefficients:
(Intercept)      x1_cat          x2 
  1.1497085   0.1914684  -0.3829368 

SparkR:

d <- data.frame(y = letters[1:2], x1 = letters[1:3], x2 = 1:6)
dx <- as.DataFrame(sqlContext, d)

md <- glm(y ~ ., family = "binomial", data = dx)
summary(md)

$coefficients
              Estimate
(Intercept) -2.0794417
x1_a         0.9241967
x1_b         0.4620987
x2           0.4620979

kevinushey (Contributor) commented Jul 13, 2016

I just pushed a commit to sparklyr that should speed up the generation of dummy variables -- I think this should put us close to / on par with SparkR, as we are now mostly bottlenecked by the speed of Spark's own logistic regression routines. (It's possible that SparkR's use of one-hot encoding is much more performant than the naive dummy-variable version we're using, however.)

I'm also curious what the performance is like between ml_logistic_regression and ml_generalized_linear_regression, since these farm out to different Spark (Scala) APIs.

szilard (Author) commented Jul 15, 2016

I can't see the speedup (with a freshly installed sparklyr).

You can use the same data as above for testing (10M rows), or this smaller one:
https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv (100K rows).

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

system.time({
d_train <- spark_read_csv(sc, name = "d_train", path = "train-0.1m.csv")
})

system.time({
md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})
#   user  system elapsed 
#  0.508   0.045   7.904 

system.time({
md <- d_train %>% ml_logistic_regression(response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})
#   user  system elapsed 
#  9.156   0.828  56.034 

Btw I have a config.yml with:

  sparklyr.shell.executor-memory: 20G
  sparklyr.shell.driver-memory: 30G

szilard (Author) commented Jul 15, 2016

To use ml_generalized_linear_regression it seems one needs Spark 2.0.0.

With the 2.0.0-preview:


library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
d_train <- spark_read_csv(sc, name = "d_train", path = "train-0.1m.csv")


system.time({
md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})
#   user  system elapsed 
#  0.492   0.027   9.752


system.time({
md <- d_train %>% ml_logistic_regression(response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})
#Error: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: 
#org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 53, Column 26: 
#Expression "inputadapter_isNull5" is not an rvalue


system.time({
md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_generalized_linear_regression(
      family = binomial(),
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})
#   user  system elapsed 
#  0.523   0.034   6.593 


system.time({
md <- d_train %>% ml_generalized_linear_regression(
            family = binomial(),
            response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})
#Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
#Exchange SinglePartition, None
#+- WholeStageCodegen

ElianoMarques commented:

Conscious this issue is closed, but it still seems the relevant forum.

If we have a numerical field that we want to bucketize and then use in an ML model, for example:

tbl4 <- sdf_mutate(tbl3,
                   comp_factor = ft_string_indexer("comp"),
                   avg_temp_factor = ft_bucketizer("avg_temp", splits = c(-Inf, 0, 5, 10, 20, 30, Inf)),
                   avg_temp_factor2 = ft_one_hot_encoder("avg_temp_factor"),
                   stdev_temp_factor = ft_bucketizer("sdev_temp", splits = c(-Inf, 0, 4.4, 6.3, 8.3, 16, Inf)),
                   stdev_temp_factor2 = ft_string_indexer("stdev_temp_factor"),
                   stdev_temp_factor3 = ft_one_hot_encoder("stdev_temp_factor2"))

model_parameters <- c('avg_temp_factor2',
                      'stdev_temp_factor3',
                      'comp_factor')

model <- ml_survival_regression(tbl4, response = "time", features = model_parameters,
                                censor = "status", intercept = TRUE)

model$coefficients prints the values as if they were numerical parameters, not categories.

model$coefficients
            (Intercept)        avg_temp_factor2      stdev_temp_factor3             comp_factor 
             0.67959997              0.00000000              0.39217879             -0.42696859 

Is this the right way to transform numerical values into categories via the package?

Eliano
