ml_logistic_regression does not work with categorical features #96
Did you try using the
@kevinushey What accounts for this difference in behavior from SparkR? It looks to me like y and x1 are both character vectors (not factors), so our handling seems correct (i.e. the user is required to encode these columns explicitly).
@slopp I'm just a spoiled R user expecting it to just work.

@kevinushey That would be great.
It looks like this is going to take some more work to implement. IIUC, Spark is using the
@szilard I had the same issue / response. Glad I'm not the only one.
@kevinushey Re dummy vars: I guess so too; I'm not really a Spark user ;) In most stats/ML libraries, either the user (extra code) or the system (behind the scenes) does the encoding; I prefer the latter (less user code to write/maintain, nicer API). It looks like SparkR does it now (it was broken a few months ago), so it would be nice to match it at some point.
Definitely agree that we should be doing this ourselves behind the scenes! I think this code (roughly speaking) is what Spark (and hence SparkR) uses to handle categorical variables. Basically, given a column of strings (intended to be treated as a categorical variable), we need to:

1. Index the distinct string values as numeric labels (Spark's StringIndexer), and
2. Expand the indexed column into dummy variables, dropping one reference level (Spark's OneHotEncoder).

The tricky part in doing this all transparently is keeping track of variable labels and so on.
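For reference, that index-then-encode pipeline might be sketched with sparklyr's feature transformers roughly as follows. This is a sketch, not the implementation: the iris data, column names, and the use of ft_one_hot_encoder inside sdf_mutate are assumptions, signatures vary across sparklyr versions, and a live Spark connection is required.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)

encoded_tbl <- iris_tbl %>%
  sdf_mutate(
    # Step 1: map each distinct string to a numeric index (StringIndexer)
    species_idx = ft_string_indexer(Species),
    # Step 2: expand the index into dummy/one-hot vectors (OneHotEncoder)
    species_ohe = ft_one_hot_encoder(species_idx)
  )
```

A model can then be fit against the encoded column, which is what the formula interface would otherwise do behind the scenes.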
We should now handle categorical variables using an approach similar to R / SparkR. For a dummy dataset, I see:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

set.seed(123)
data <- data.frame(
  y = rnorm(100),
  x = rep(letters[1:10], each = 10),
  stringsAsFactors = FALSE
)

data_tbl <- copy_to(sc, data)
ml_linear_regression(data_tbl, y ~ x)
```

producing:
Of course, we need to preserve the original call appropriately, but this is definitely a step in the right direction.
Closing this as implemented now; please let us know if you encounter any bugs or have other feature requests!
Re-opening this because of performance issues. Take this data: https://s3.amazonaws.com/benchm-ml--main/train-10m.csv

One version runs in 58 sec on a 16-core box; the other runs forever (I stopped it after ~10 mins), and the logs show things like:
On an unrelated note: I think this can all be chained into a single pipeline:

```r
md <- d_train %>%
  sdf_mutate(
    Month_cat = ft_string_indexer(Month),
    DayofMonth_cat = ft_string_indexer(DayofMonth),
    DayOfWeek_cat = ft_string_indexer(DayOfWeek),
    UniqueCarrier_cat = ft_string_indexer(UniqueCarrier),
    Origin_cat = ft_string_indexer(Origin),
    Dest_cat = ft_string_indexer(Dest)) %>%
  ml_logistic_regression(
    response = "dep_delayed_15min",
    features = c("Month_cat", "DayofMonth_cat", "DayOfWeek_cat", "DepTime",
                 "UniqueCarrier_cat", "Origin_cat", "Dest_cat", "Distance"))
```

i.e. a single piped call rather than separate steps.
Yeah, just checked: that works and has the same runtime (expected, as the calls should just be lazily evaluated). Nicer code, though.
Note that the models you are fitting are not identical -- the first converts the character vectors into integer vectors, and fits those as a numeric response, rather than a categorical response. The second version actually does fit those as categorical variables -- with the current implementation, this implies introducing a new dummy variable for each level in each categorical variable (minus one reference level). The problem appears to be that generating the dummy variables takes much longer than expected. FWIW, I think you would get identical behavior by one-hot encoding the columns, then regressing against those -- that may indeed be more performant.
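To make the distinction concrete, the two fits might look like this against the airline data above (a sketch with a shortened feature list; the formula-interface call assumes it treats raw string columns as categorical, as discussed in this thread):

```r
# (1) Integer encoding: Month_cat is an ordinary numeric column, so the
# model fits a single slope for it rather than one dummy per level.
m_int <- d_train %>%
  sdf_mutate(Month_cat = ft_string_indexer(Month)) %>%
  ml_logistic_regression(
    response = "dep_delayed_15min",
    features = c("Month_cat", "DepTime", "Distance"))

# (2) Categorical: passing the raw string column through the formula
# interface expands it into one dummy per level, minus a reference level.
m_cat <- ml_logistic_regression(
  d_train, dep_delayed_15min ~ Month + DepTime + Distance)
```

The two models can produce very different coefficients and predictions, since (1) imposes an arbitrary ordering on the levels.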
It looks like the main culprit is our use of the pattern described in https://issues.apache.org/jira/browse/SPARK-7276. I'll look into speeding this up. Thanks for reporting!
Oh, I see. I think in most cases people would want to fit these as categorical variables (that's the default of most R modeling functions). It looks like SparkR is doing that (dummies), yet it's fast, so you can speed up sparklyr :) Back to the original example, slightly modified:

As categorical:

As integer encoding:

SparkR:
I just pushed a commit addressing this. I'm also curious what the performance is like between the two approaches.
I can't see the speedup (with a newly installed sparklyr). You can use the same data as above for testing (10M rows) or this smaller one:
Btw I have a config.yml with:
To use the 2.0.0-preview:
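For what it's worth, connecting against a specific Spark build in sparklyr goes through the version argument of spark_install() / spark_connect(); whether "2.0.0-preview" is an accepted version string depends on the sparklyr release, so treat this as a sketch:

```r
library(sparklyr)

# Install and connect against a specific Spark build; the version string
# "2.0.0-preview" is taken from the comment above and may need adjusting.
spark_install(version = "2.0.0-preview")
sc <- spark_connect(master = "local", version = "2.0.0-preview")
```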
Conscious this issue is closed, but it still seems the right forum. Suppose we have a numerical field which we want to bucketize and then use in an ML model, for example:

model$coefficients prints the value as if it were a numerical parameter, not a category.

Is this the right way, via the package, to transform numerical values into categories? Eliano
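One possible way to do this with the transformers discussed in this thread; a sketch only, since the data_tbl, the age column, and the split points are invented, and casting the bucket index to character to force categorical treatment is an assumption rather than a documented idiom:

```r
library(sparklyr)
library(dplyr)

# Bucketize a numeric column, then cast the bucket index to character so
# the formula interface treats it as categorical rather than numeric.
bucketed_tbl <- data_tbl %>%
  sdf_mutate(age_bucket = ft_bucketizer(age, splits = c(0, 18, 35, 65, 120))) %>%
  mutate(age_bucket = as.character(age_bucket))

model <- ml_linear_regression(bucketed_tbl, y ~ age_bucket)
```

If this works, model$coefficients should then show one entry per bucket level (minus a reference level) instead of a single numeric slope.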
gets:

```
Error: org.apache.spark.SparkException: VectorAssembler does not support the StringType type
```

while the second call is OK (the difference is that x1 is categorical in (1) and numeric in (2)). In SparkR both work: