Description
It seems that the performance of LinearRegression is sub-optimal when the number of samples is very large.
sklearn_benchmarks measures a speedup of 48 for the optimized implementation from scikit-learn-intelex on a 1000000x100 (n_samples x n_features) dataset. For a given set of parameters and a given dataset, we compute the speed-up as time_scikit-learn / time_sklearnex. A speed-up of 48 therefore means that sklearnex is 48 times faster than scikit-learn on the given dataset.
Profiling allows a more detailed analysis of the execution of the algorithm. We observe that most of the execution time is spent in SciPy's lstsq solver.
The profiling reports of sklearn_benchmarks can be viewed with Perfetto UI.
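For a quick standalone check outside sklearn_benchmarks, a cProfile run on a single fit should show the same hotspot. This is a minimal sketch; the smaller dataset size here is an assumption chosen only to keep the run short:

import cProfile
import pstats

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Smaller than the benchmark dataset so the profile finishes quickly.
X, y = make_regression(n_samples=100_000, n_features=100, n_informative=10)

cProfile.run("LinearRegression().fit(X, y)", "fit.prof")
# The cumulative view should show scipy.linalg.lstsq near the top.
pstats.Stats("fit.prof").sort_stats("cumulative").print_stats(10)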
It seems that a better solver could be chosen when the number of samples is very large. Perhaps Ridge's solver with a zero penalty could be used in this case: on the same dimensions, it shows better performance.
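A minimal sketch of that idea, assuming the "cholesky" solver (the exact solver choice is an assumption, not something the profiling pins down):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

# With alpha=0, Ridge solves the same ordinary least-squares problem.
# The "cholesky" solver works on the 100x100 normal-equations matrix
# X.T @ X instead of running an SVD-based lstsq on the full 1000000x100 X,
# which should be much cheaper when n_samples >> n_features.
ridge = Ridge(alpha=0.0, solver="cholesky")
ridge.fit(X, y)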
The speedup can be reproduced with the following code. First, set up the environment:

conda create -n lr_perf -c conda-forge scikit-learn scikit-learn-intelex numpy jupyter
conda activate lr_perf

Then run the benchmark:
from sklearn.linear_model import LinearRegression as LinearRegressionSklearn
from sklearnex.linear_model import LinearRegression as LinearRegressionSklearnex
from sklearn.datasets import make_regression
import time
import numpy as np

X, y = make_regression(n_samples=1_000_000, n_features=100, n_informative=10)

def measure(estimator, X, y, n_executions=10):
    # Fit n_executions times and return the mean wall-clock fit time.
    times = []
    while len(times) < n_executions:
        t0 = time.perf_counter()
        estimator.fit(X, y)
        t1 = time.perf_counter()
        times.append(t1 - t0)
    return np.mean(times)
mean_time_sklearn = measure(
    estimator=LinearRegressionSklearn(),
    X=X,
    y=y,
)

mean_time_sklearnex = measure(
    estimator=LinearRegressionSklearnex(),
    X=X,
    y=y,
)

speedup = mean_time_sklearn / mean_time_sklearnex
speedup
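If it helps, the same measure helper can also time the Ridge-with-zero-penalty alternative sketched above against both results (the "cholesky" solver choice remains an assumption):

from sklearn.linear_model import Ridge

mean_time_ridge = measure(
    estimator=Ridge(alpha=0.0, solver="cholesky"),
    X=X,
    y=y,
)

# A ratio > 1 means Ridge(alpha=0) is faster than the current lstsq-based fit.
mean_time_sklearn / mean_time_ridge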