
Releases: pytorch/pytorch

PyTorch 2.6.0 Release

29 Jan 17:18
1eba9b3
  • Highlights
  • Tracked Regressions
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation
  • Developers

Highlights

We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; a new performance-related knob, torch.compiler.set_stance; and several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.

NOTE: Starting with this release, we will no longer publish packages on Conda; please see [Announcement] Deprecating PyTorch’s official Anaconda channel for the details.

For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and are using the Manylinux 2.28 build platform. If you build PyTorch extensions with custom C++ or CUDA code, please update those builds to use CXX11_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1; please see [RFC] PyTorch next wheel build platform: manylinux-2.28 for the details and discussion.

Also in this release, as an important security improvement, we have changed the default value of the weights_only parameter of torch.load. This is a backward-compatibility-breaking change; please see this forum post for more details.
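
A minimal sketch of what the new default means in practice (the file name below is illustrative):

import torch

torch.save({"weights": torch.randn(2, 2)}, "checkpoint.pt")

state = torch.load("checkpoint.pt")  # weights_only=True is now the default; unpickling is restricted to a safe allowlist
legacy = torch.load("checkpoint.pt", weights_only=False)  # explicit opt-out, only for fully trusted files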

This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta:
  • torch.compiler.set_stance
  • torch.library.triton_op
  • torch.compile support for Python 3.13
  • New packaging APIs for AOTInductor
  • AOTInductor: minifier
  • AOTInductor: ABI-compatible mode code generation
  • FP16 support for X86 CPUs

Prototype:
  • Improved PyTorch user experience on Intel GPUs
  • FlexAttention support on X86 CPU for LLMs
  • Dim.AUTO
  • CUTLASS and CK GEMM/CONV Backends for AOTInductor

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] torch.compiler.set_stance

This feature enables the user to specify different behaviors (“stances”) that torch.compile can take between different invocations of compiled functions. One of the stances, for example, is “eager_on_recompile”, which instructs PyTorch to run code eagerly when a recompile is necessary, reusing cached compiled code when possible.

For more information please refer to the set_stance documentation and the Dynamic Compilation Control with torch.compiler.set_stance tutorial.
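
A minimal sketch of the stance API described above (the toy function and shapes are illustrative):

import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))  # first call compiles as usual

# With "eager_on_recompile", calls that would need a recompile run eagerly,
# while cached compiled code is still reused when it applies.
torch.compiler.set_stance("eager_on_recompile")
f(torch.randn(4))     # reuses the cached compiled code
f(torch.randn(8, 8))  # would trigger a recompile, so it runs eagerly instead
torch.compiler.set_stance("default")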

[Beta] torch.library.triton_op

torch.library.triton_op offers a standard way of creating custom operators that are backed by user-defined Triton kernels.

When users turn user-defined Triton kernels into custom operators, torch.library.triton_op allows torch.compile to peek into the implementation, enabling torch.compile to optimize the Triton kernel inside it.

For more information please refer to the triton_op documentation and the Using User-Defined Triton Kernels with torch.compile tutorial.
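
A minimal sketch of the pattern, adapted from the triton_op documentation; the "mylib::add" name is illustrative, and a CUDA device plus the triton package are assumed:

import torch
import triton
from triton import language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(in_ptr0, in_ptr1, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(in_ptr0 + offsets, mask=mask)
    y = tl.load(in_ptr1 + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

@triton_op("mylib::add", mutates_args={})
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    # wrap_triton lets torch.compile trace into the kernel instead of treating it opaquely
    wrap_triton(add_kernel)[grid](x, y, out, n_elements, 16)
    return out

x = y = torch.randn(32, device="cuda")
print(torch.compile(add)(x, y))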

[Beta] torch.compile support for Python 3.13

torch.compile previously only supported Python up to version 3.12. Users can now optimize models with torch.compile in Python 3.13.

[Beta] New packaging APIs for AOTInductor

A new package format, “PT2 archive”, has been introduced. It is essentially a zipfile of all the files that AOTInductor needs, and it allows users to ship everything required to other environments. There is also functionality to package multiple models into one artifact and to store additional metadata inside the package.

For more details please see the updated torch.export AOTInductor Tutorial for Python runtime.
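
A hedged sketch of the flow, assuming the aoti_compile_and_package / aoti_load_package entry points and an illustrative "model.pt2" path:

import torch
from torch._inductor import aoti_compile_and_package, aoti_load_package

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(4),))
path = aoti_compile_and_package(ep, package_path="model.pt2")  # writes the PT2 archive
compiled = aoti_load_package(path)                             # loads it back, possibly in another environment
print(compiled(torch.randn(4)))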

[Beta] AOTInductor: minifier

If a user encounters an error while using AOTInductor APIs, AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error.

For more information please see the AOTInductor Minifier documentation.

[Beta] AOTInductor: ABI-compatible mode code generation

AOTInductor-generated model code depends on PyTorch C++ libraries. As PyTorch evolves quickly, it’s important to make sure that previously AOTInductor-compiled models can continue to run on newer PyTorch versions, i.e. that AOTInductor is backward compatible.

In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and made sure AOTInductor generates code that only refers to this specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models.

[Beta] FP16 support for X86 CPUs (both eager and Inductor modes)

Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched Intel® Xeon® 6 with P-Cores support Float16 datatype with the native accelerator AMX. Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature, and it has now been further improved for both eager mode and torch.compile + Inductor mode, making it a Beta-level feature with both functionality and performance verified with a broad scope of workloads.
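
A minimal sketch of float16 on an x86 CPU via autocast, covering both eager mode and torch.compile + Inductor (the toy model and shapes are illustrative):

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
x = torch.randn(8, 64)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.float16):
    y_eager = model(x)          # eager-mode float16 inference

compiled = torch.compile(model)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.float16):
    y_compiled = compiled(x)    # Inductor-compiled float16 inference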

PROTOTYPE FEATURES

[Prototype] Improved PyTorch user experience on Intel GPUs

PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution, and expanded coverage of supported GPU models including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, run inference, and develop with PyTorch models on Intel® Core™ Ultra AI PCs and Intel® Arc™ discrete graphics can now directly install PyTorch with binary releases for Windows, Linux, and Windows Subsystem for Linux 2.

  • Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads out of the box, eliminating the complexity of installing and activating Intel GPU development software bundles.
  • Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, and the supported GPU models have been expanded from Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics and Intel® Arc™ A-Series Graphics to the latest GPU hardware Intel® Arc™ B-Series graphics.
  • Further enhanced coverage of ATen operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

[Prototype] FlexAttention support on X86 CPU for LLMs

FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations for Attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through TorchInductor CPP backend. This new feature leverages and extends current CPP template abilities to support...


PyTorch 2.5.1: bug fix release

29 Oct 17:58
a8d6afb

This release is meant to fix the following regressions:

  • Wheels from PyPI are unusable out of the box on RPM-based Linux distributions: #138324
  • PyPI arm64 distribution logs cpuinfo error on import: #138333
  • Crash When Using torch.compile with Math scaled_dot_product_attention in AMP Mode: #133974
  • [MPS] Internal crash due to the invalid buffer size computation if sliced API is used: #137800
  • Several issues related to CuDNN Attention: #138522

Besides the regression fixes, the release includes several documentation updates.

See release tracker #132400 for additional information.

PyTorch 2.5.0 Release, SDPA CuDNN backend, Flex Attention

17 Oct 16:26
32f585d

PyTorch 2.5 Release Notes

  • Highlights
  • Backwards Incompatible Change
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation
  • Developers
  • Security

Highlights

We are excited to announce the release of PyTorch® 2.5! This release features a new CuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100s or newer GPUs. As well, regional compilation of torch.compile offers a way to reduce the cold start-up time for torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Finally, TorchInductor CPP backend offers solid performance speedup with numerous enhancements like FP16 support, CPP wrapper, AOT-Inductor mode, and max-autotune mode.
This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
As well, please check out our new ecosystem projects releases with TorchRec and TorchFix.

Beta:
  • CuDNN backend for SDPA
  • torch.compile regional compilation without recompilations
  • TorchDynamo added support for exception handling & MutableMapping types
  • TorchInductor CPU backend optimization

Prototype:
  • FlexAttention
  • Compiled Autograd
  • Flight Recorder
  • Max-autotune Support on CPU with GEMM Template
  • TorchInductor on Windows
  • FP16 support on CPU path for both eager mode and TorchInductor CPP backend
  • Autoload Device Extension
  • Enhanced Intel GPU support

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] CuDNN backend for SDPA

The cuDNN "Fused Flash Attention" backend was landed for torch.nn.functional.scaled_dot_product_attention. On NVIDIA H100 GPUs this can provide up to 75% speed-up over FlashAttentionV2. This speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
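
The backend is picked automatically on supported hardware, but it can also be pinned explicitly; a minimal sketch using the sdpa_kernel context manager, assuming an H100-class GPU:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16) for _ in range(3))
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)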

[Beta] torch.compile regional compilation without recompilations

Regional compilation without recompilations, via torch._dynamo.config.inline_inbuilt_nn_modules, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies, with 1%-5% performance degradation.

See the tutorial for more information.
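
A minimal sketch of regional compilation with a toy transformer-like stack: the repeated block is compiled once, and with inline_inbuilt_nn_modules enabled (the default in 2.5+) the compiled code is reused across layer instances instead of recompiling each one:

import torch
import torch.nn as nn

class Layer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))

class Model(nn.Module):
    def __init__(self, dim, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(Layer(dim) for _ in range(n_layers))
        for layer in self.layers:
            layer.compile()  # compile the repeated region, not the whole model

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = Model(128, 8)
out = model(torch.randn(4, 128))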

[Beta] TorchInductor CPU backend optimization

This feature advances Inductor’s CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with the static and symbolic shapes. It is compatible with both Linux and Windows OS and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.

Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.

PROTOTYPE FEATURES

[Prototype] FlexAttention

We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.

For more information and examples, please refer to the official blog post and Attention Gym.
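
A minimal sketch of a causal mask expressed with FlexAttention (shapes are illustrative and a CUDA device is assumed):

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 2, 4, 256, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = torch.compile(flex_attention)(q, k, v, block_mask=block_mask)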

[Prototype] Compiled Autograd

Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.

Please refer to the tutorial for more information.
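
A minimal sketch, following the pattern in the tutorial: turn on the config flag and compile a step function whose body includes the backward call:

import torch

torch._dynamo.config.compiled_autograd = True

model = torch.nn.Linear(10, 10)
x = torch.randn(10)

@torch.compile
def train_step(model, x):
    loss = model(x).sum()
    loss.backward()  # the backward pass is captured and compiled as well

train_step(model, x)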

[Prototype] Flight Recorder

Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.

For more information please refer to the following tutorial.

[Prototype] Max-autotune Support on CPU with GEMM Template

Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.

For more information please refer to the tutorial.
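
A minimal sketch: requesting max-autotune lets Inductor profile the C++ GEMM template against the ATen/oneDNN path and keep the faster kernel (the toy function and shapes are illustrative):

import torch

@torch.compile(mode="max-autotune")
def mlp(x, w1, w2):
    return torch.relu(x @ w1) @ w2

x, w1, w2 = (torch.randn(64, 64) for _ in range(3))
y = mlp(x, w1, w2)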

[Prototype] TorchInductor CPU on Windows

Inductor CPU backend in torch.compile now works on Windows. We currently support the MSVC (cl), Clang (clang-cl), and Intel (icx-cl) compilers for Inductor on Windows.

See the tutorial for more details.

[Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend

Float16 is a commonly used reduced-precision floating point type for performance improvement in neural network inference/training. Starting with this release, float16 is supported on the CPU path for both eager mode and TorchInductor.

[Prototype] Autoload Device Extension

PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.

See the tutorial for more information.
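
A hedged sketch of how an out-of-tree backend might advertise itself through the torch.backends entry point mentioned above; the package and hook names are hypothetical:

# setup.py of a hypothetical out-of-tree backend package
from setuptools import setup

setup(
    name="torch_foo_backend",            # hypothetical package name
    version="0.1",
    packages=["torch_foo_backend"],
    entry_points={
        "torch.backends": [
            # PyTorch discovers this entry point and calls the hook at import time
            "torch_foo_backend = torch_foo_backend:_autoload",  # hypothetical hook
        ],
    },
)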

[Prototype] Enhanced Intel GPU support

Intel GPU support enhancements are now available for both Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your machine learning workflows on Intel GPUs in the PyTorch 2.5 release. We also enabled the initial support of PyTorch on Windows for Intel® Client GPUs in this release.

  • Expanded PyTorch hardware backend support matrix to include both Intel Data Center and Client GPUs.  
  • Implemented SYCL* kernels to enhance coverage and execution of ATen operators on Intel GPUs and boost performance in PyTorch eager mode.
  • Enhanced Intel GPU backend of torch.compile to improve inference and training performance for a wide range of deep learning workloads.

These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to documentation.

Backwards Incompatible changes

Distributed

  • [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)

    • We released dispatchable collectives in 2.0; Backend options are now used for backend initialization, and the ProcessGroup options are no longer needed.
    • In 2.4 and before, users could do:
    # Users can pass in a basic option when creating an instance of ProcessGroup
    base_pg_options = ProcessGroup.Options(backend=str(backend))
    base_pg_options._timeout = timeout
    
    pg: ProcessGroup = ProcessGroup(
      store, rank, group_size, base_pg_options
    )
    
    # Users then need to create a backend option to create the comm backend (e.g., ProcessGroupNCCL)
    pg_options = ProcessGroupNCCL.Options()
    backend = ProcessGroupNCCL(
      store, rank, group_size, pg_options
    )
    • But from 2.5 onwards, users don’t need to pass in an option to create an instance of ProcessGroup, and they can still set a default backend for the process group since code may still query the default backend:
    # No basic option is passed in when creating an instance of ProcessGroup
    pg: ProcessGroup = ProcessGroup(store, rank, group_size)
    pg._set_default_backend(...

PyTorch 2.4.1 Release, bug fix release

04 Sep 19:59
ee1b680

This release is meant to fix the following issues (regressions / silent correctness):

Breaking Changes:

  • The pytorch/pytorch docker image now installs the PyTorch package through pip and has switched its conda installation from miniconda to miniforge (#134274)

Windows:

  • Fix performance regression on Windows related to MKL static linking (#130619) (#130697)
  • Fix error during loading on Windows: [WinError 126] The specified module could not be found. (#131662) (#130697)

MPS:

  • Fix tensor.clamp produces wrong values (#130226)
  • Fix Incorrect result from batch norm with sliced inputs (#133610)

ROCM:

  • Fix for launching kernel invalid config error when calling embedding with large index (#130994)
  • Added a check and a warning when attempting to use hipBLASLt on an unsupported architecture (#128753)
  • Fix image corruption with Memory Efficient Attention when running HuggingFace Diffusers Stable Diffusion 3 pipeline (#133331)

Distributed:

  • Fix FutureWarning when using torch.load internally (#130663)
  • Fix FutureWarning when using torch.cuda.amp.autocast internally (#130660)

Torch.compile:

  • Fix exception with torch compile when onnxruntime-training and deepspeed packages are installed. (#131194)
  • Fix silent incorrectness with torch.library.custom_op with mutable inputs and torch.compile (#133452)
  • Fix SIMD detection on Linux ARM (#129075)
  • Do not use C++20 features in cpu_inductor code (#130816)

Packaging:

  • Fix for exposing statically linked libstdc++ CXX11 ABI symbols (#134494)
  • Fix error while building PyTorch from source due to the missing QNNPACK module (#131864)
  • Make PyTorch buildable from source on PowerPC (#129736)
  • Fix XPU extension building (#132847)

Other:

  • Fix warning when using pickle on a nn.Module that contains tensor attributes (#130246)
  • Fix NaNs return in MultiheadAttention when need_weights=False (#130014)
  • Fix nested tensor MHA produces incorrect results (#130196)
  • Fix error when using torch.utils.flop_counter.FlopCounterMode (#134467)

Tracked Regressions:

  • The experimental remote caching feature for Inductor's autotuner (enabled via TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE) is known to still be broken in this release and is actively being worked on in main. The following error is generated: redis.exceptions.DataError: Invalid input of type: 'dict'. Please use nightlies if you need this feature (reported and fixed by PR #134032).

Release tracker #132400 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.4: Python 3.12, AOTInductor freezing, libuv backend for TCPStore

24 Jul 18:39
d990dad

PyTorch 2.4 Release Notes

  • Highlights
  • Tracked Regressions
  • Backward incompatible changes
  • Deprecations
  • New features
  • Improvements
  • Bug Fixes
  • Performance
  • Documentation
  • Developers
  • Security

Highlights

We are excited to announce the release of PyTorch® 2.4!
PyTorch 2.4 adds support for the latest version of Python (3.12) for torch.compile.
AOTInductor freezing gives developers running AOTInductor more performance-based optimizations by allowing the
serialization of MKLDNN weights. As well, a new default TCPStore server backend utilizing libuv has been introduced,
which should significantly reduce initialization times for users running large-scale jobs.
Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels
into PyTorch, especially for torch.compile.
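
As an illustration of the new Python Custom Operator API, a minimal sketch wrapping a NumPy function (the "mylib::numpy_sin" name is illustrative):

import numpy as np
import torch

@torch.library.custom_op("mylib::numpy_sin", mutates_args=())
def numpy_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.from_numpy(np.sin(x.numpy()))

@numpy_sin.register_fake
def _(x):
    return torch.empty_like(x)  # shape/dtype rule used while tracing with torch.compile

@torch.compile
def f(x):
    return numpy_sin(x)

print(f(torch.randn(3)))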

This release is composed of 3661 commits and 475 contributors since PyTorch 2.3. We want to sincerely thank our
dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we
improve 2.4. More information about how to get started with the PyTorch 2-series can be found at our
Getting Started page.

Beta:
  • Python 3.12 support for torch.compile
  • AOTInductor Freezing for CPU
  • New Higher-level Python Custom Operator API
  • Switching TCPStore’s default server backend to libuv

Prototype:
  • FSDP2: DTensor-based per-parameter-sharding FSDP
  • torch.distributed.pipelining, simplified pipeline parallelism
  • Intel GPU is available through source build

Performance Improvements:
  • torch.compile optimizations for AWS Graviton (aarch64-linux) processors
  • BF16 symbolic shape optimization in TorchInductor
  • Performance optimizations for GenAI projects utilizing CPU devices

*To see a full list of public feature submissions click here.

Tracked Regressions

Subproc exception with torch.compile and onnxruntime-training

There is a reported issue (#131070) when using torch.compile if onnxruntime-training lib is
installed. The issue will be fixed (#131194) in v2.4.1. It can be solved locally by setting the environment variable
TORCHINDUCTOR_WORKER_START=fork before executing the script.

cu118 wheels will not work with pre-cuda12 drivers

It was also reported (#130684) that the new version of Triton uses CUDA features that are not compatible with pre-CUDA 12 drivers.
In this case, the workaround is to set
TRITON_PTXAS_PATH manually as follows (adapt the code according to the local installation path):

TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas  python script.py

Backwards Incompatible Change

Python frontend

Default ThreadPool size to number of physical cores (#125963)

Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of
physical cores. This should reduce core oversubscription when running CPU workloads and improve performance.
Previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value.
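
A minimal sketch of that recovery path, using the logical-core count reported by the OS:

import os
import torch

torch.set_num_threads(os.cpu_count())  # restore one intra-op thread per logical core
print(torch.get_num_threads())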

Fix torch.quasirandom.SobolEngine.draw default dtype handling (#126781)

The default dtype value has been changed from torch.float32 to the current default dtype as given by
torch.get_default_dtype() to be consistent with other APIs.

Forbid subclassing torch._C._TensorBase directly (#125558)

This is an internal subclass that users could previously use to create an object that is almost a Tensor in Python, and it was
advertised as such in some tutorials. Subclassing it is no longer allowed, to improve consistency; all users should
subclass torch.Tensor directly.

Composability

Non-compositional usages of as_strided + mutation under torch.compile will raise an error (#122502)

The torch.compile flow involves functionalizing any mutations inside the region being compiled. torch.as_strided is
an existing view op that can be used non-compositionally: meaning when you call x.as_strided(...), as_strided will only
consider the underlying storage size of x, and ignore its current size/stride/storage_offset when creating a new view.
This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally,
so we ban them rather than risking silent correctness issues under torch.compile.

An example of a non-compositional usage of as_strided followed by mutation that we will error on is below. You can avoid
this issue by re-writing your usage of as_strided so that it is compositional (for example: either use a different set
of view ops instead of as_strided, or call as_strided directly on the base tensor instead of an existing view of it).

@torch.compile
def foo(a):
    e = a.diagonal()
    # as_strided is being called on an existing view (e),
    # making it non-compositional. mutations to f under torch.compile
    # are not allowed, as we cannot easily functionalize them safely
    f = e.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a
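
A hedged sketch of the compositional rewrite suggested above: the strided view is created directly from the base tensor rather than from another view (this illustrates the pattern, not an exact equivalent of the diagonal example):

@torch.compile
def foo_fixed(a):
    f = a.as_strided((2,), (1,), 0)  # view built on the base tensor itself
    f.add_(1.0)
    return a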

We now verify schemas of custom ops at registration time (#124520)

Previously, you could register a custom op through the operator registration APIs, but give it a schema that contained
types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where “unknown” types were implicitly
treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular PyTorch
would result in an error later. As of 2.4, we will raise an error at registration time, when you first register the
custom operator. You can get the old behavior by constructing the schema with allow_typevars=true.

TORCH_LIBRARY(my_ns, m) {
  // this now raises an error at registration time: bar/baz are unknown types
  m.def("my_ns::foo(bar t) -> baz");
  // you can get back the old behavior with the below flag
  m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true));
}

Autograd frontend

Delete torch.autograd.function.traceable APIs (#122817)

The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute
on a torch.autograd.Function class was deprecated in 2.3 and is now being deleted.
This API does not do anything and was only meant for internal purposes.
The following raised a warning in 2.3, and now errors because the API has been deleted:

@torch.autograd.function.traceable
class Func(torch.autograd.Function):
    ...

Release engineering

  • Remove caffe2 db and distributed from build system (#125092)

Optim

  • Remove SparseAdam weird allowance of raw Tensor input (#127081).

Distributed

DeviceMesh

Update get_group and add get_all_groups (#128097)
In 2.3 and before, users could do:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will return all sub-pgs within the mesh
assert mesh_2d.get_group()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_group()[1] == mesh_2d.get_group(1)

But from 2.4 forward, if users call get_group without passing in the dim, they will get a RuntimeError.
Instead, they should use get_all_groups:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will throw a RuntimeError
assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1)

Pipelining

Retire torch.distributed.pipeline (#127354)
In 2.3 and before, users could do:

import torch.distributed.pipeline # warning saying that this will be removed and users need to migrate to torch.distributed.pipelining

But from 2.4 forward, if users write the code above, they will get a ModuleNotFoundError.
Instead, they should use torch.distributed.pipelining:

import torch.distributed.pipeline # -> ModuleNotFoundError
import torch.distributed.pipelining

jit

  • Fix serialization/deepcopy behavior for tensors that are aliasing but not equal (#126126)

Fx

Complete revamp of float/promotion sympy handling (#126905)

ONNX

  • Remove caffe2 contrib and experiments (#125038)

Deprecations

Python frontend

  • User warning when using torch.load with default weights_only=False value (#129239, #129396, #129509).
    A warning is now raised if the weights_only value is not specified during a call to torch.load, encouraging users to
    adopt the safest practice when loading weights.
  • Deprecate device-specific autocast API (#126062)
    All the autocast APIs are unified under torch.amp and it can be used as a drop-in replacement for torch.{device}.amp APIs
    (passing a device argument where applicable)..
  • Export torch.newaxis=None for Python Array API/Numpy consistency (#125026)

Composability

  • Deprecate calling FakeTensor.data_ptr in eager-mode. FakeTensors are tensors without a valid data pointer, so in
    general their data pointer is not safe to access. This makes it easier for torch.compile to provide a nice error
    message when tracing custom ops into a graph that are not written in a PT2-friendly way (bec...

PyTorch 2.3.1 Release, bug fix release

05 Jun 19:16
63d5e92

This release is meant to fix the following issues (regressions / silent correctness):

Torch.compile:

  • Remove runtime dependency on JAX/XLA, when importing torch._dynamo (#124634)
  • Hide Plan failed with a cudnnException warning (#125790)
  • Fix CUDA memory leak (#124238) (#120756)

Distributed:

  • Fix format_utils executable, which was causing it to run as a no-op (#123407)
  • Fix regression with device_mesh in 2.3.0 during initialization causing memory spikes (#124780)
  • Fix crash of FSDP + DTensor with ShardingStrategy.SHARD_GRAD_OP (#123617)
  • Fix failure with distributed checkpointing + FSDP if at least 1 forward/backward pass has not been run. (#121544) (#127069)
  • Fix error with distributed checkpointing + FSDP, and with use_orig_params = False and activation checkpointing (#124698) (#126935)
  • Fix set_model_state_dict errors on compiled module with non-persistent buffer with distributed checkpointing (#125336) (#125337)

MPS:

  • Fix data corruption when copying large (>4GiB) tensors (#124635)
  • Fix Tensor.abs() for complex (#125662)

Packaging:

Other:

  • Fix DeepSpeed transformer extension build on ROCm (#121030)
  • Fix kernel crash on tensor.dtype.to_complex() after ~100 calls in ipython kernel (#125154)

Release tracker #125425 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.3: User-Defined Triton Kernels in torch.compile, Tensor Parallelism in Distributed

24 Apr 16:12
97ff6cf

PyTorch 2.3 Release notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch® 2.3! PyTorch 2.3 offers support for user-defined Triton kernels in torch.compile, allowing users to migrate their own Triton kernels from eager without experiencing performance complications or graph breaks. As well, Tensor Parallelism improves the experience for training Large Language Models using native PyTorch functions, which has been validated on training runs for 100B parameter models.

This release is composed of 3393 commits and 426 contributors since PyTorch 2.2. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.3. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta:
  • User-defined Triton kernels in torch.compile
  • Tensor parallelism within PyTorch Distributed
  • Support for semi-structured sparsity

Prototype:
  • torch.export adds new API to specify dynamic_shapes
  • Asynchronous checkpoint generation

Performance Improvements:
  • Weight-Only-Quantization introduced into Inductor CPU backend

*To see a full list of public feature submissions click here.

Tracked Regressions

torch.compile on MacOS is considered unstable for 2.3 as there are known cases where it will hang (#124497)

torch.compile imports many unrelated packages when it is invoked (#123954)

This can cause significant first-time slowdown and instability when these packages are not fully compatible with PyTorch within a single process.

torch.compile is not supported on Python 3.12 (#120233)

PyTorch support for Python 3.12 in general is considered experimental. Please use a Python version between 3.8 and 3.11 instead. This is an existing issue since PyTorch 2.2.

Backwards Incompatible Changes

Change default __torch_function__ behavior to be disabled when __torch_dispatch__ is defined (#120632)

Defining a subclass with a __torch_dispatch__ entry will now automatically set __torch_function__ to be disabled. This aligns better with all the use cases we’ve observed for subclasses. The main change of behavior is that the result of the __torch_dispatch__ handler will not go through the default __torch_function__ handler anymore, wrapping it into the current subclass. This allows in particular for your subclass to return a plain Tensor or another subclass from any op.

The original behavior can be recovered by adding the following to your Tensor subclass:

@classmethod
def __torch_function__(cls, func, types, args=(), kwargs=None):
    return super().__torch_function__(func, types, args, kwargs)

ProcessGroupNCCL removes multi-device-per-thread support from C++ level (#119099, #118674)

  • Python level support was removed in 2.2.
  • To simplify ProcessGroupNCCL’s code, we remove support for multiple cuda devices per thread. To our knowledge, this is not an active use case, but it adds a large burden to our codebase. If you are relying on this, there is no workaround other than rewriting your PyTorch program to use one device per process or one device per thread (multiple threads per process are still supported).

Removes no_dist and coordinator_rank from public DCP APIs (#121317)

As part of an overall effort to simplify our public-facing APIs for Distributed Checkpointing, we've decided to deprecate usage of the coordinator_rank and no_dist parameters under torch.distributed.checkpoint. In our opinion, these parameters can lead to confusion around the intended effect during API usage, and have limited value to begin with. One concrete example is here, #118337, where there is ambiguity in which Process Group is referenced by the coordinator rank (additional context: #118337). In the case of the no_dist parameter, we consider this an implementation detail which should be hidden from the user. Starting in this release, no_dist is inferred from the initialized state of the process group, assuming the intention is to use collectives if a process group is initialized, and assuming the opposite in the case it is not.

2.2:
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
    no_dist=True,
    coordinator_rank=0,
)

2.3:
# no_dist is inferred from the process group state, and rank 0 is always the coordinator.
import torch.distributed.checkpoint as dcp

dcp.save(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)
# ...
dcp.load(
    state_dict={"model": model.state_dict()},
    checkpoint_id="path_to_model_checkpoint",
)

Remove deprecated tp_mesh_dim arg (#121432)

Starting from PyTorch 2.3, the parallelize_module API only accepts a DeviceMesh (the tp_mesh_dim argument has been removed). If you have an N-D DeviceMesh for multi-dimensional parallelism, you can use mesh_nd["tp"] to obtain a 1-D DeviceMesh for tensor parallelism, as sketched below.
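
A hedged sketch of that usage; the mesh shape, dim names, toy module, and submodule name are illustrative, and launching with torchrun across 8 ranks with CUDA is assumed:

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, parallelize_module

model = torch.nn.Sequential(torch.nn.Linear(16, 16))  # toy module; "0" is its submodule name
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh_2d["tp"]  # 1-D DeviceMesh for tensor parallelism
parallelize_module(model, tp_mesh, {"0": ColwiseParallel()})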

torch.export

  • Users must pass in an nn.Module to torch.export.export. The reason is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function, such as how we guarantee that every call_function node has an nn_module_stack populated, and we offer ways to access the state_dict/parameters/buffers of the exported program. We'd like torch.export to offer strong invariants: the value proposition of export is that you can trade flexibility for stronger guarantees about your model. (#117528)
  • Removed constraints in favor of dynamic_shapes (#117573, #117917, #117916, #120981, #120979)
  • ExportedProgram is no longer callable. Instead, users will need to use .module() to call the ExportedProgram (see the sketch after this list). This is to prevent users from treating ExportedPrograms as torch.nn.Modules, as we do not plan to support all features that torch.nn.Modules have, like hooks. Instead, users can create a proper torch.nn.Module through exported_program.module() and use that as a callable. (#120019, #118425, #119105)
  • Remove equality_constraints from ExportedProgram as it is not used or useful anymore. Dimensions with equal constraints will now have the same symbol. (#116979)
  • Remove torch._export.export in favor of torch.export.export (#119095)
  • Remove CallSpec (#117671)
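
A minimal sketch of the 2.3 calling convention described in the list above:

import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
out = ep.module()(torch.randn(2))  # call through .module(), not ep(...)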

Enable fold_quantize by default in PT2 Export Quantization (#118701, #118605, #119425, #117797)

Previously, the PT2 Export Quantization flow did not generate quantized weights by default, but instead kept fp32 weights in the quantized model in this pattern: fp32 weight -> q -> dq -> linear. With fold_quantize=True now the default, convert_pt2e produces a graph with quantized weights in this pattern: int8 weight -> dq -> linear, and users will see a reduction in the model size.

2.2:
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)

2.3:
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)

Remove deprecated torch.jit.quantized APIs (#118406)

All functions and classes under torch.jit.quantized will now raise an error if called/instantiated. This API has long been deprecated in favor of torch.ao.nn.quantized.

2.2 → 2.3:
# torch.jit.quantized APIs

torch.jit.quantized.quantize_rnn_cell_modules

torch.jit.quantized.quantize_rnn_modules
torch.jit.quantized.quantize_linear_modules

torch.jit.quantized.QuantizedLinear
torch.jit.QuantizedLinearFP16

torch.jit.quantized.QuantizedGRU
torch.jit.quantized.QuantizedGRUCell
torch.jit.quantized.QuantizedLSTM
torch.jit.quantized.QuantizedLSTMCell
# Corresponding torch.ao.quantization APIs

torch.ao.nn.quantized.dynamic.RNNCell

torch.ao.quantization.quantize_dynamic APIs

torch.ao.nn.quantized.dynamic.Linear

torch.ao.nn.quantized.dynamic.GRU
torch.ao.nn.quantized.dynamic.GRUCell
torch.ao.nn.quantized.dynamic.LSTM

...


PyTorch 2.2.2 Release, bug fix release

27 Mar 22:27
39901f2

This release is meant to fix the following issues (regressions / silent correctness):

  • Properly raise an error when trying to use inductor backend on non-supported platforms such as Windows (#115969)
  • Fix mkldnn performance issue on Windows platform (#121618)
  • Fix RuntimeError: cannot create std::vector larger than max_size() in torch.nn.functional.conv1d on non-contiguous cpu inputs by patching OneDNN (pytorch/builder#1742) (pytorch/builder#1744)
  • Add support for torch.distributed.fsdp.StateDictType.FULL_STATE_DICT for when using torch.distributed.fsdp.FullyShardedDataParallel with the device_mesh argument (#120837)
  • Fix make triton command on release branch for users building the release branch from source (#121169)
  • Ensure gcc>=9.0 for build from source and cpp_extensions (#120126)
  • Fix cxx11-abi build in release branch (pytorch/builder#1709)
  • Fix building from source on Windows source MSVC 14.38 - VS 2022 (#122120)

Release tracker #120999 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.2.1 Release, bug fix release

22 Feb 21:15
6c8c5ad

This release is meant to fix the following issues (regressions / silent correctness):

  • Fix missing OpenMP support on Apple Silicon binaries (pytorch/builder#1697)
  • Fix crash when mixing lazy and non-lazy tensors in one operation (#117653)
  • Fix PyTorch performance regression on Linux aarch64 (pytorch/builder#1696)
  • Fix silent correctness in DTensor _to_copy operation (#116426)
  • Fix properly assigning param.grad_fn for next forward (#116792)
  • Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
  • Fix processing unflatten tensor on compute stream in FSDP Extension (#116559)
  • Fix FSDP AssertionError on tensor subclass when setting sync_module_states=True (#117336)
  • Fix DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
  • Fix OOM when returning an AsyncCollectiveTensor by forcing _gather_state_dict() to be synchronous with respect to the main stream. (#118197) (#119716)
  • Fix Windows runtime torch.distributed.DistNetworkError: [WinError 32] The process cannot access the file because it is being used by another process (#118860)
  • Update supported python versions in package description (#119743)
  • Fix SIGILL crash during import torch on CPUs that do not support SSE4.1 (#116623)
  • Fix DCP RuntimeError in get_state_dict and set_state_dict (#119573)
  • Fixes for HSDP + TP integration with device_mesh (#112435) (#118620) (#119064) (#118638) (#119481)
  • Fix numerical error with mixedmm on NVIDIA V100 (#118591)
  • Fix RuntimeError when using SymInt input invariant when splitting graphs (#117406)
  • Fix compile DTensor.from_local in trace_rule_look up (#119659)
  • Improve torch.compile integration with CUDA-11.8 binaries (#119750)

Release tracker #119295 contains all relevant pull requests related to this release as well as links to related issues.

PyTorch 2.2: FlashAttention-v2, AOTInductor

30 Jan 17:58
8ac9b20

PyTorch 2.2 Release Notes

  • Highlights
  • Backwards Incompatible Changes
  • Deprecations
  • New Features
  • Improvements
  • Bug fixes
  • Performance
  • Documentation

Highlights

We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.

Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

  • scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
  • PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-Python server-side deployments.
  • torch.distributed supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
  • PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
  • A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
  • Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.
  • torch.ao.quantization now offers a prototype torch.export based flow
Beta:
  • FlashAttentionV2 backend for scaled dot product attention
  • AOTInductor
  • TORCH_LOGS
  • torch.distributed.device_mesh
  • torch.compile + Optimizers

Prototype:
  • PT 2 Quantization
  • Scaled dot product attention support for jagged layout NestedTensors

Performance Improvements:
  • Inductor optimizations
  • aarch64-linux optimizations (AWS Graviton)

*To see a full list of public 2.2 - 1.12 feature submissions click here.

Tracked Regressions

Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)

We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.

Poor numeric stability of loss when training with FSDP + DTensor (#117471)

We observe the loss will flatline randomly while training with FSDP + DTensor in some instances.

Backwards Incompatible Changes

Building PyTorch from source now requires GCC 9.4 or newer (#112858)

GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.

Updated flash attention kernel in scaled_dot_product_attention to use Flash Attention v2 (#105602)

Previously, the v1 Flash Attention kernel had a Windows implementation. So if a user on Windows had explicitly forced the flash attention kernel to be run by using sdp_kernel context manager with only flash attention enabled, it would work. In 2.2, if the sdp_kernel context manager must be used, use the memory efficient or math kernel if on Windows.

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
  torch.nn.functional.scaled_dot_product_attention(q,k,v)
# Don't force flash attention to be used if using sdp_kernel on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
  torch.nn.functional.scaled_dot_product_attention(q,k,v)

Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)

In PyTorch 2.1 or before, users could use ParallelStyles like PairwiseParallel and specify input/output layouts with functions like make_input_replicate_1d or make_output_replicate_1d, and default values were provided for _prepare_input and _prepare_output. The UX of Tensor Parallel looked like:

from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    make_input_replicate_1d,
    make_input_reshard_replicate,
    make_input_shard_1d,
    make_input_shard_1d_last_dim,
    make_sharded_output_tensor,
    make_output_replicate_1d,
    make_output_reshard_tensor,
    make_output_shard_1d,
    make_output_tensor,
    PairwiseParallel,
    parallelize_module,
)
from torch.distributed.tensor import DeviceMesh

module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...

Starting from PyTorch 2.2, we simplified parallel styles to only contain ColwiseParallel and RowwiseParallel because other ParallelStyles can be composed from these two. We also deleted the input/output functions, and started using input_layouts and output_layouts as kwargs instead to specify the sharding layout of both input/output tensors. Finally, we added the PrepareModuleInput/PrepareModuleOutput styles; there are no default arguments for layouts in these two styles, and users need to specify them so that they think about the sharding layouts.

from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._tensor import init_device_mesh

module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
   module,
   device_mesh,
   {
      "fqn": PrepareModuleInput(
                input_layouts=Shard(0),
                desired_input_layouts=Replicate()
             ),
      "fqn.net1": ColwiseParallel(),
      "fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
   }
)
...

UntypedStorage.resize_ now uses the original device instead of the current device context (#113386)

Before this PR, UntypedStorage.resize_ would move data to the current CUDA device index (given by torch.cuda.current_device()).
Now, UntypedStorage.resize_() keeps the data on the same device index that it was on before, regardless of the current device index.

2.1:
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:0

2.2:
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:1

Wrapping a function with set_grad_enabled will consume its global mutation (#113359)

This bc-breaking change fixes some unexpected behavior when set_grad_enabled is used as a decorator.

2.1:
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
    def inner_func(x):
        return x.sin()

>>> torch.is_grad_enabled()
True

2.2:
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the grad mode!
    def inner_func(x):
        return x.sin()

>>> torch.is_grad_enabled()
False

Deprecated verbose parameter in LRScheduler constructors (#111302)

As part of our decision to move towards a consolidated logging system, we are deprecating the verbose flag in LRScheduler.

If you would like to print the learning rate during execution, please use get_last_lr()

2.1:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)

2.2:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
    print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}")

