MAINT: einsum: Optimize the sub function two-operands by using SIMD. #18194

Merged Jan 21, 2021 (1 commit)

@Qiyu8 Qiyu8 commented Jan 20, 2021

Introduction

This is the final part of #17049. The code is reduced by 69%. There is no performance impact on x86, and performance on ARM improves by about 14% to 49%.

Benchmark

Here is the ASV benchmark result.

SSE2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building e4402bd8  for virtualenv-py3.7-Cython
·· Installing e4402bd8  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 2908338b  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit e4402bd8  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit e4402bd8  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                                                   ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   139±2μs
               numpy.float64   240±5μs
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 130±1μs
numpy.float64 229±9μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.40±0.05ms
numpy.float64 2.74±0.03ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 65.5±0.7μs
numpy.float64 78.9±3μs
=============== ============

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 42.1±1μs
numpy.float64 43.4±0.5μs
=============== ============

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 32.0±0.7μs
numpy.float64 33.0±2μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 39.6±2μs
numpy.float64 46.3±3μs
=============== ==========

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.9±3μs
numpy.float64 78.3±4μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 10.7±0.1ms
numpy.float64 21.8±0.3ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 65.3±1μs
numpy.float64 65.9±6μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 57.7±5μs
numpy.float64 63.5±2μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 24.7±0.3ms
numpy.float64 49.9±0.7ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 55.7±2μs
numpy.float64 59.5±3μs
=============== ==========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 55.2±0.9μs
numpy.float64 57.7±2μs
=============== ============

[ 75.00%] · For numpy commit 2908338b (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 149±7μs
numpy.float64 237±4μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 133±8μs
numpy.float64 226±20μs
=============== ==========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.31±0.01ms
numpy.float64 2.66±0.06ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 65.6±4μs
numpy.float64 79.2±4μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 42.2±1μs
numpy.float64 44.8±3μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 34.3±5μs
numpy.float64 33.0±0.5μs
=============== ============

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 38.8±0.7μs
numpy.float64 41.3±2μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.8±3μs
numpy.float64 78.2±4μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 10.8±0.2ms
numpy.float64 22.1±0.2ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 59.9±2μs
numpy.float64 66.2±3μs
=============== ==========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 56.6±0.9μs
numpy.float64 70.6±7μs
=============== ============

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 24.5±0.6ms
numpy.float64 49.2±0.9ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 56.5±1μs
numpy.float64 62.9±3μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 55.1±0.8μs
numpy.float64 62.3±5μs
=============== ============

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

AVX2 enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython
·· Building e4402bd8  for virtualenv-py3.7-Cython
·· Installing e4402bd8  into virtualenv-py3.7-Cython
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit 2908338b  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit e4402bd8  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit e4402bd8  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                                                                                                                   ok
[ 51.79%] ··· =============== =========
                   dtype
              --------------- ---------
               numpy.float32   100±2μs
               numpy.float64   152±6μs
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 146±10μs
numpy.float64 244±2μs
=============== ==========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.41±0.02ms
numpy.float64 2.66±0.03ms
=============== =============

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 79.9±2μs
numpy.float64 87.4±2μs
=============== ==========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 42.6±2μs
numpy.float64 41.6±0.6μs
=============== ============

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 41.0±0.9μs
numpy.float64 40.2±0.5μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 49.2±0.4μs
numpy.float64 48.2±1μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 76.7±2μs
numpy.float64 83.3±2μs
=============== ==========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 10.4±0.1ms
numpy.float64 21.9±0.4ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.6±2μs
numpy.float64 70.7±1μs
=============== ==========

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 75.4±6μs
numpy.float64 71.7±1μs
=============== ==========

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 24.1±0.3ms
numpy.float64 50.2±0.6ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 67.5±2μs
numpy.float64 71.0±10μs
=============== ===========

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 69.8±4μs
numpy.float64 69.6±2μs
=============== ==========

[ 75.00%] · For numpy commit 2908338b (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 101±5μs
numpy.float64 155±10μs
=============== ==========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 146±7μs
numpy.float64 252±7μs
=============== =========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 1.39±0.09ms
numpy.float64 2.63±0.04ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 78.6±2μs
numpy.float64 87.2±3μs
=============== ==========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 47.4±5μs
numpy.float64 53.9±9μs
=============== ==========

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 40.2±2μs
numpy.float64 41.6±4μs
=============== ==========

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 41.3±3μs
numpy.float64 39.6±2μs
=============== ==========

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 77.5±4μs
numpy.float64 86.9±1μs
=============== ==========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 10.7±0.2ms
numpy.float64 21.9±0.1ms
=============== ============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 67.9±3μs
numpy.float64 70.6±2μs
=============== ==========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 68.3±1μs
numpy.float64 69.7±1μs
=============== ==========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 25.0±0.4ms
numpy.float64 49.1±0.5ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 70.0±3μs
numpy.float64 69.2±7μs
=============== ==========

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 69.9±1μs
numpy.float64 69.1±1μs
=============== ==========

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

NEON enabled

· Creating environments
· Discovering benchmarks
·· Uninstalling from virtualenv-py3.7-Cython.
·· Building ebfed05a  for virtualenv-py3.7-Cython................................................
·· Installing ebfed05a  into virtualenv-py3.7-Cython.
· Running 28 total benchmarks (2 commits * 1 environments * 14 benchmarks)
[  0.00%] · For numpy commit efaf210f  (round 1/2):
[  0.00%] ·· Building for virtualenv-py3.7-Cython..................................................
[  0.00%] ·· Benchmarking virtualenv-py3.7-Cython
[  1.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 25.00%] · For numpy commit ebfed05a  (round 1/2):
[ 25.00%] ·· Building for virtualenv-py3.7-Cython..
[ 25.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 26.79%] ··· Running (bench_linalg.Einsum.time_einsum_contig_contig--)..............
[ 50.00%] · For numpy commit ebfed05a  (round 2/2):
[ 50.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 51.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig                                                       ok
[ 51.79%] ··· =============== =========
                   dtype               
              --------------- ---------
               numpy.float32   198±2μs 
               numpy.float64   353±5μs 
              =============== =========

[ 53.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 53.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 224±2μs
numpy.float64 356±6μs
=============== =========

[ 55.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 55.36%] ··· =============== ==========
dtype
--------------- ----------
numpy.float32 487±10μs
numpy.float64 863±20μs
=============== ==========

[ 57.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 57.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 115±0.5μs
numpy.float64 141±2μs
=============== ===========

[ 58.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 58.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 73.5±0.5μs
numpy.float64 74.0±0.7μs
=============== ============

[ 60.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 60.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 59.6±0.2μs
numpy.float64 60.5±0.6μs
=============== ============

[ 62.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 62.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 70.1±0.4μs
numpy.float64 72.0±0.5μs
=============== ============

[ 64.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 64.29%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 115±2μs
numpy.float64 140±2μs
=============== =========

[ 66.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 66.07%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 2.73±0.1ms
numpy.float64 6.22±0.2ms
=============== ============

[ 67.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 67.86%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 99.3±0.4μs
numpy.float64 107±0.8μs
=============== ============

[ 69.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 69.64%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 99.0±0.3μs
numpy.float64 107±1μs
=============== ============

[ 71.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 71.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 10.8±0.4ms
numpy.float64 22.1±0.3ms
=============== ============

[ 73.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 73.21%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 94.5±0.4μs
numpy.float64 98.1±0.3μs
=============== ============

[ 75.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[ 75.00%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 94.3±0.7μs
numpy.float64 98.7±0.4μs
=============== ============

[ 75.00%] · For numpy commit efaf210f (round 2/2):
[ 75.00%] ·· Building for virtualenv-py3.7-Cython..
[ 75.00%] ·· Benchmarking virtualenv-py3.7-Cython
[ 76.79%] ··· bench_linalg.Einsum.time_einsum_contig_contig ok
[ 76.79%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 204±6μs
numpy.float64 355±5μs
=============== =========

[ 78.57%] ··· bench_linalg.Einsum.time_einsum_contig_outstride0 ok
[ 78.57%] ··· =============== =========
dtype
--------------- ---------
numpy.float32 226±6μs
numpy.float64 356±4μs
=============== =========

[ 80.36%] ··· bench_linalg.Einsum.time_einsum_mul ok
[ 80.36%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 789±30μs
numpy.float64 1.00±0.01ms
=============== =============

[ 82.14%] ··· bench_linalg.Einsum.time_einsum_multiply ok
[ 82.14%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 117±0.2μs
numpy.float64 145±3μs
=============== ===========

[ 83.93%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_contig ok
[ 83.93%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 74.2±0.4μs
numpy.float64 76.0±0.6μs
=============== ============

[ 85.71%] ··· bench_linalg.Einsum.time_einsum_noncon_contig_outstride0 ok
[ 85.71%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 59.4±0.3μs
numpy.float64 60.5±0.3μs
=============== ============

[ 87.50%] ··· bench_linalg.Einsum.time_einsum_noncon_mul ok
[ 87.50%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 72.0±0.2μs
numpy.float64 72.9±0.2μs
=============== ============

[ 89.29%] ··· bench_linalg.Einsum.time_einsum_noncon_multiply ok
[ 89.29%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 115±0.5μs
numpy.float64 140±0.8μs
=============== ===========

[ 91.07%] ··· bench_linalg.Einsum.time_einsum_noncon_outer ok
[ 91.07%] ··· =============== =============
dtype
--------------- -------------
numpy.float32 5.35±0.07ms
numpy.float64 7.07±0.2ms
=============== =============

[ 92.86%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul ok
[ 92.86%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 101±0.6μs
numpy.float64 109±0.5μs
=============== ===========

[ 94.64%] ··· bench_linalg.Einsum.time_einsum_noncon_sum_mul2 ok
[ 94.64%] ··· =============== ===========
dtype
--------------- -----------
numpy.float32 101±1μs
numpy.float64 109±0.7μs
=============== ===========

[ 96.43%] ··· bench_linalg.Einsum.time_einsum_outer ok
[ 96.43%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 17.7±0.3ms
numpy.float64 25.8±0.5ms
=============== ============

[ 98.21%] ··· bench_linalg.Einsum.time_einsum_sum_mul ok
[ 98.21%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 95.5±0.5μs
numpy.float64 101±0.8μs
=============== ============

[100.00%] ··· bench_linalg.Einsum.time_einsum_sum_mul2 ok
[100.00%] ··· =============== ============
dtype
--------------- ------------
numpy.float32 95.5±0.8μs
numpy.float64 98.8±0.5μs
=============== ============

       before           after         ratio
     [efaf210f]       [ebfed05a]
     <master>         <einsum-twooperands>
-     1.00±0.01ms         863±20μs     0.86  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float64'>)
-      25.8±0.5ms       22.1±0.3ms     0.86  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
-        789±30μs         487±10μs     0.62  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
-      17.7±0.3ms       10.8±0.4ms     0.61  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
-     5.35±0.07ms       2.73±0.1ms     0.51  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
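
The ratio column is the "after" time divided by the "before" time, so the fraction of time saved is one minus that ratio. Here is a minimal sketch that recomputes both from the NEON timings in the comparison above. The helper name `summarize` and the tuple layout are made up for illustration and are not part of asv; the timings are copied from the table and converted to microseconds.

```python
# Timings copied from the NEON comparison above, converted to microseconds.
# The tuple layout and the helper name are illustrative, not part of asv.
NEON_RESULTS = [
    ("time_einsum_mul(float64)", 1000.0, 863.0),
    ("time_einsum_outer(float64)", 25800.0, 22100.0),
    ("time_einsum_mul(float32)", 789.0, 487.0),
    ("time_einsum_outer(float32)", 17700.0, 10800.0),
    ("time_einsum_noncon_outer(float32)", 5350.0, 2730.0),
]


def summarize(results):
    """Return (name, ratio, percent of time saved) for each benchmark."""
    rows = []
    for name, before, after in results:
        ratio = after / before            # same definition as the ratio column
        saved = (1.0 - ratio) * 100.0     # percent of the original time saved
        rows.append((name, round(ratio, 2), round(saved)))
    return rows


for name, ratio, saved in summarize(NEON_RESULTS):
    print(f"{name:<36} ratio={ratio:.2f}  time saved={saved}%")
```

The computed savings run from 14% to 49%, which is the range quoted in the introduction.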

System Info

| | Arm | x86 |
|---|---|---|
| Hardware | KunPeng | |
| Processor | ARMv8 2.6GHz, 8 processors | Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz |
| OS | Linux ecs-9d50 4.19.36-vhulk1905.1.0.h276.eulerosv2r8.aarch64 | Windows Server 2008 R2 Enterprise |
| Compiler | gcc (GCC) 7.3.0 | MSVC14.06 |

@Qiyu8 Qiyu8 added the 01 - Enhancement and component: SIMD labels Jan 20, 2021
@eric-wieser eric-wieser changed the title from "Optimize the sub function two-operands by using SIMD." to "MAINT: einsum: Optimize the sub function two-operands by using SIMD." Jan 20, 2021
@seiko2plus seiko2plus self-assigned this Jan 20, 2021
@seiko2plus seiko2plus self-requested a review January 20, 2021 22:47

@seiko2plus seiko2plus left a comment

Well done, Thank you Chunlin!

@seiko2plus

The new code also improves accuracy because it uses FMA. We will have to dispatch AVX2 & FMA3 and AVX512F at runtime.
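
For readers wondering why FMA changes accuracy: a fused multiply-add rounds once, after the add, while a separate multiply and add round twice. The sketch below is not NumPy code. It emulates the single rounding with exact rationals from the standard library, and the helper name `fma_reference` and the input values are chosen purely for illustration.

```python
from fractions import Fraction


def fma_reference(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Inputs chosen so that the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c             # product is rounded to 1.0 first, so the sum is 0.0
fused = fma_reference(a, b, c)  # a single rounding keeps the -2**-60 term

print(unfused)
print(fused)
```

The unfused result loses the small term entirely, while the single-rounding result keeps it. This is the kind of difference that can show up when a sum-of-products loop switches to FMA.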

@charris charris merged commit b91f3c0 into numpy:master Jan 21, 2021

charris commented Jan 21, 2021

Thanks Chunlin.

@Qiyu8 Qiyu8 deleted the einsum-twooperands branch January 21, 2021 01:11

Qiyu8 commented Jan 21, 2021

Runtime dispatching for non-ufunc code was not recommended in the past. If it is acceptable now, we can discuss it on the mailing list.
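
As a rough illustration of what runtime dispatch means here: the library would carry several builds of the same kernel and pick one when it runs, based on what the CPU supports. The sketch below shows only the general idea and is not NumPy's dispatch mechanism. The feature names follow the comment above, while the kernel names and the `select_kernel` helper are hypothetical.

```python
# Hypothetical kernel table, ordered from most to least specialized.
# Feature names follow the discussion above; the kernel names are made up.
KERNELS = [
    ("avx512f_kernel", {"AVX512F"}),
    ("avx2_fma3_kernel", {"AVX2", "FMA3"}),
    ("baseline_kernel", set()),  # needs no extra features, always usable
]


def select_kernel(cpu_features):
    """Pick the first kernel whose required features are all supported."""
    supported = set(cpu_features)
    for name, required in KERNELS:
        if required <= supported:  # subset test: every required feature present
            return name
    raise RuntimeError("no usable kernel")  # unreachable while a baseline exists


print(select_kernel({"SSE2", "AVX2", "FMA3"}))
print(select_kernel({"SSE2"}))
```

A CPU with AVX2 but without FMA3 falls back to the baseline in this sketch, because the second entry requires both features together.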

ZiqiChai commented Mar 2, 2021

This is exactly what I'm looking for right now! And I just found it here! Thanks, Chunlin.

Labels: 01 - Enhancement, 03 - Maintenance, component: SIMD