
ml_logistic_regression does not work with categorical features #96

Closed

szilard opened this issue Jul 10, 2016 · 20 comments


szilard commented Jul 10, 2016

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

d <- data.frame(y = letters[1:2], x1 = letters[1:2], x2 = 1:4)
dx <- copy_to(sc, d, overwrite = TRUE)

dx %>%  ml_logistic_regression(response = "y", features = c("x1","x2"))

gets:
Error: org.apache.spark.SparkException: VectorAssembler does not support the StringType type

while

d <- data.frame(y = letters[1:2], x1 = 1:2, x2 = 1:4)
dx <- copy_to(sc, d, overwrite = TRUE)

dx %>%  ml_logistic_regression(response = "y", features = c("x1","x2"))

is OK (the difference is that x1 is categorical in (1) and numeric in (2)).

In SparkR both work:

d <- data.frame(y = letters[1:2], x1 = letters[1:2], x2 = 1:4)
dx <- as.DataFrame(sqlContext, d)

md <- glm(y ~ ., family = "binomial", data = dx)
summary(md)

slopp commented Jul 10, 2016

Did you try using ft_string_indexer? You should be able to do:

d <- data.frame(y = letters[1:2], x1 = letters[1:2], x2 = 1:4)

dx <- copy_to(sc, d)

dx %>%
  sdf_mutate(x1_cat = ft_string_indexer(x1)) %>%
  ml_logistic_regression(response = "y", features = c("x1_cat", "x2"))

jjallaire (Contributor) commented:

@kevinushey What accounts for this difference in behavior from SparkR? It looks to me like y and x1 are both character vectors (not factors), so our handling seems correct (i.e. the user is required to use ft_string_indexer). If y and x1 were in fact factors, would this then work without the sdf_mutate?

kevinushey (Contributor) commented Jul 10, 2016

Behind the scenes, sparklyr uses ft_string_indexer for categorical response variables; it seems we may need to do the same for features as well. I'll take a look on Monday and confirm that SparkR is doing the same thing here (I'm guessing they're transparently converting character vectors / factors into 0:n numeric vectors as well).

szilard (Author) commented Jul 10, 2016

@slopp I'm just a spoiled R user expecting ml_logistic_regression to do all that for me :) Btw that would be pretty ugly with 100 categorical features.

@kevinushey that would be great.

kevinushey (Contributor) commented:

It looks like this is going to take some more work to implement. IIUC, Spark uses the OneHotEncoder transformer to turn categorical columns into separate binary vectors, which are then used in the model fit (basically as dummy variables, if I understand correctly?).
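As a base-R point of reference (my illustration, not Spark code): this is essentially the encoding that model.matrix() produces for a factor, with one reference level dropped:

x <- factor(letters[1:3])
model.matrix(~ x)
#   (Intercept) xb xc
# 1           1  0  0
# 2           1  1  0
# 3           1  0  1
# one 0/1 column per non-reference level; 'a' is the dropped reference level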

JosiahParry commented:

@szilard I had the same issue / response. Glad I'm not the only one.

szilard (Author) commented Jul 11, 2016

@kevinushey Re dummy vars: I guess so too, I'm not really a Spark user ;) In most stats/ML libs it's either the user (extra code) or the system (behind the scenes) that does the encoding; I prefer the latter (less user code to write/maintain, nicer API). Looks like SparkR does it now (it was broken a few months ago), so it would be nice to match it at some point.

kevinushey (Contributor) commented Jul 11, 2016

Definitely agree that we should be doing this ourselves behind the scenes!

I think this code (roughly speaking) is what Spark (and hence SparkR) is using to handle categorical variables:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala#L128-L174

Basically, given a column of strings (intended to be treated as a categorical variable), we need to:

  1. Use the StringIndexer to transform the column into a numeric column, indexed from 0:n, and then
  2. Use the OneHotEncoder to transform the indexed column into binary vectors, where each binary vector is effectively a dummy-variable encoding of one level of the categorical variable.

The tricky part in doing this all transparently is keeping track of variable labels and so on.
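A minimal sketch of those two steps done by hand, using the feature transformers sparklyr already exposes (chaining ft_one_hot_encoder after ft_string_indexer like this is my assumption):

dx %>%
  sdf_mutate(x1_idx = ft_string_indexer(x1)) %>%       # step 1: strings -> 0:n index
  sdf_mutate(x1_vec = ft_one_hot_encoder(x1_idx)) %>%  # step 2: index -> binary dummy vectors
  ml_logistic_regression(response = "y", features = c("x1_vec", "x2"))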

kevinushey added a commit that referenced this issue Jul 11, 2016
kevinushey (Contributor) commented:

We should now handle categorical variables using a similar approach to R / SparkR. For a dummy dataset, I see:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

set.seed(123)
data <- data.frame(
  y = rnorm(100),
  x = rep(letters[1:10], each = 10),
  stringsAsFactors = FALSE
)

data_tbl <- copy_to(sc, data)

ml_linear_regression(data_tbl, y ~ x)

producing

> ml_linear_regression(data_tbl, y ~ x)
Call:
y ~ xb + xc + xd + xe + xf + xg + xh + xi + xj 

Coefficients:
(Intercept)          xb          xc          xd          xe          xf          xg          xh          xi 
 0.07462564  0.13399632 -0.49918452  0.24741891 -0.08334118  0.14706035  0.04845804 -0.43754346  0.23846966 
         xj 
 0.36246853 

Of course, we need to preserve the original call appropriately, but this is definitely a step in the right direction.

kevinushey (Contributor) commented:

Closing this as implemented now; please let us know if you encounter any bugs / have other feature requests!

szilard (Author) commented Jul 13, 2016

Re-opening this because of performance issues:

Take this data: https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
(airline dataset, I used it in this benchmark: https://github.com/szilard/benchm-ml/ )

d_train <- spark_read_csv(sc, name = "d_train", path = "train-10m.csv")

system.time({
md <- d_train %>% 
       sdf_mutate(Month_cat = ft_string_indexer(Month)) %>%
       sdf_mutate(DayofMonth_cat = ft_string_indexer(DayofMonth)) %>%
       sdf_mutate(DayOfWeek_cat = ft_string_indexer(DayOfWeek)) %>%
       sdf_mutate(UniqueCarrier_cat = ft_string_indexer(UniqueCarrier)) %>%
       sdf_mutate(Origin_cat = ft_string_indexer(Origin)) %>%
       sdf_mutate(Dest_cat = ft_string_indexer(Dest)) %>%
       ml_logistic_regression(response = "dep_delayed_15min",
          features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
             "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})

runs in 58 sec on a 16-core box
(SparkR takes 54 sec)

system.time({
md <- d_train %>% ml_logistic_regression(response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})

runs forever (I stopped it after ~10 mins) and uses only 1 core.

logs show things like:

SELECT *
FROM `sparklyr_tmp_202f65621d37

jjallaire (Contributor) commented:

On an unrelated note, I think the sdf_mutate sequence can be re-written as:

md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))

i.e. a single call to sdf_mutate with multiple parameters. @kevinushey is that correct?

szilard (Author) commented Jul 13, 2016

Yeah, just checked: that works and has the same runtime (expected, as the calls are lazily evaluated). Nicer code, though.

kevinushey (Contributor) commented Jul 13, 2016

Note that the models you are fitting are not identical -- the first converts the character vectors into integer vectors and fits those as numeric features, rather than as categorical variables.

The second version actually does fit those as categorical variables -- with the current implementation, this implies introducing a new dummy variable for each level in each categorical variable (minus one reference level).

The problem appears to be that generating the dummy variables takes much longer than expected.

FWIW, I think you would get identical behavior by one-hot encoding the columns, then regressing against those -- that may indeed be more performant.
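For instance, a sketch of that one-hot approach on the airline data above (shown for two of the categorical columns to keep it short; the rest follow the same pattern, and the exact ft_one_hot_encoder behavior here is an assumption on my part):

md <- d_train %>%
   sdf_mutate(
      Month_idx  = ft_string_indexer(Month),
      Month_oh   = ft_one_hot_encoder(Month_idx),
      Origin_idx = ft_string_indexer(Origin),
      Origin_oh  = ft_one_hot_encoder(Origin_idx)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_oh", "Origin_oh", "DepTime", "Distance"))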

kevinushey (Contributor) commented:

It looks like the main culprit is our use of withColumn, which is a very slow mechanism for building DataFrames:

https://issues.apache.org/jira/browse/SPARK-7276

I'll look into speeding this up. Thanks for reporting!

szilard (Author) commented Jul 13, 2016

Oh, I see. I think in most cases people would want to fit these as categorical vars (that's the default of most R functions, e.g. glm, and it's also what R users would generally expect).

It looks like SparkR is doing that (dummies), yet it's fast (so you can speed up sparklyr :))

Back to the original example, slightly modified:

d <- data.frame(y = letters[1:2], x1 = letters[1:3], x2 = 1:6)
dx <- copy_to(sc, d)

As categorical:

dx %>%  ml_logistic_regression(response = "y", features = c("x1","x2"))

Call: y ~ x1b + x1c + x2
Coefficients:
(Intercept)         x1b         x1c          x2 
  1.1552454   0.4620983   0.9241965  -0.4620982 

As integer encoding:

dx %>% sdf_mutate(x1_cat = ft_string_indexer(x1)) %>%
  ml_logistic_regression(response = "y", features = c("x1_cat", "x2"))

Call: y ~ x1_cat + x2

Coefficients:
(Intercept)      x1_cat          x2 
  1.1497085   0.1914684  -0.3829368 

SparkR:

d <- data.frame(y = letters[1:2], x1 = letters[1:3], x2 = 1:6)
dx <- as.DataFrame(sqlContext, d)

md <- glm(y ~ ., family = "binomial", data = dx)
summary(md)

$coefficients
              Estimate
(Intercept) -2.0794417
x1_a         0.9241967
x1_b         0.4620987
x2           0.4620979

kevinushey (Contributor) commented Jul 13, 2016

I just pushed a commit to sparklyr that should speed up the generation of dummy variables -- I think this should put us close to / on par with SparkR, as we are now mostly bottlenecked by the speed of Spark's own logistic regression routines. (It's possible that SparkR's use of one-hot encoding is much more performant than the naive dummy-variable version we're using, however.)

I'm also curious what the performance is like between ml_logistic_regression and ml_generalized_linear_regression, since these farm out to different Spark (Scala) APIs.

szilard (Author) commented Jul 15, 2016

I can't see the speedup (with a freshly installed sparklyr).

You can use the same data as above for testing (10M rows), or this smaller one:
https://s3.amazonaws.com/benchm-ml--main/train-0.1m.csv (100K rows).

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

system.time({
d_train <- spark_read_csv(sc, name = "d_train", path = "train-0.1m.csv")
})

system.time({
md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})
#   user  system elapsed 
#  0.508   0.045   7.904 

system.time({
md <- d_train %>% ml_logistic_regression(response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})
#   user  system elapsed 
#  9.156   0.828  56.034 

Btw I have a config.yml with:

  sparklyr.shell.executor-memory: 20G
  sparklyr.shell.driver-memory: 30G

szilard (Author) commented Jul 15, 2016

To use ml_generalized_linear_regression it seems one needs Spark 2.0.0.

With the 2.0.0-preview:


library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
d_train <- spark_read_csv(sc, name = "d_train", path = "train-0.1m.csv")


system.time({
md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_logistic_regression(
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})
#   user  system elapsed 
#  0.492   0.027   9.752


system.time({
md <- d_train %>% ml_logistic_regression(response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})
#Error: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: 
#org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 53, Column 26: 
#Expression "inputadapter_isNull5" is not an rvalue


system.time({
md <- d_train %>% 
   sdf_mutate(
      Month_cat = ft_string_indexer(Month),
      DayofMonth_cat = ft_string_indexer(DayofMonth),
      DayOfWeek_cat = ft_string_indexer(DayOfWeek),
      UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
      Origin_cat = ft_string_indexer(Origin),
      Dest_cat = ft_string_indexer(Dest)) %>%
   ml_generalized_linear_regression(
      family = binomial(),
      response = "dep_delayed_15min",
      features = c("Month_cat","DayofMonth_cat","DayOfWeek_cat","DepTime",
                   "UniqueCarrier_cat","Origin_cat","Dest_cat","Distance"))
})
#   user  system elapsed 
#  0.523   0.034   6.593 


system.time({
md <- d_train %>% ml_generalized_linear_regression(
            family = binomial(),
            response = "dep_delayed_15min",
            features = setdiff(colnames(d_train),"dep_delayed_15min"))
})
#Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
#Exchange SinglePartition, None
#+- WholeStageCodegen

ElianoMarques commented:

Conscious this issue is closed, but it still seems the relevant forum.

If we have a numerical field that we want to bucketize and then use in an ML model, for example:

tbl4 <- sdf_mutate(tbl3,
                   comp_factor = ft_string_indexer("comp"),
                   avg_temp_factor = ft_bucketizer("avg_temp", splits = c(-Inf, 0, 5, 10, 20, 30, Inf)),
                   avg_temp_factor2 = ft_one_hot_encoder("avg_temp_factor"),
                   stdev_temp_factor = ft_bucketizer("sdev_temp", splits = c(-Inf, 0, 4.4, 6.3, 8.3, 16, Inf)),
                   stdev_temp_factor2 = ft_string_indexer("stdev_temp_factor"),
                   stdev_temp_factor3 = ft_one_hot_encoder("stdev_temp_factor2"))

model_parameters <- c('avg_temp_factor2',
                      'stdev_temp_factor3',
                      'comp_factor')

model <- ml_survival_regression(tbl4, response = "time", features = model_parameters,
                                censor = "status", intercept = TRUE)

model$coefficients prints the values as if they were numerical parameters, not categories.

model$coefficients
            (Intercept)        avg_temp_factor2      stdev_temp_factor3             comp_factor 
             0.67959997              0.00000000              0.39217879             -0.42696859 

Is this the right way to transform numerical values into categories via the package?

Eliano
