Scheduler not breaking optimization loop as expected when linear solver failed #308
Comments
@MarkChenYutian Hi Yutian, can you provide the complete test code, including the model and input?
Hi Xukuan, the code is part of a large codebase for an ongoing project, and the reported bug does not occur deterministically (the matrix is positive-definite most of the time). Therefore, it is hard to post the code, model, and input here directly. I will try to build a minimal reproducible demo for this issue as soon as possible after the final exam : )
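For context, a failing Cholesky factorization usually means the linear system's matrix is not numerically positive-definite. Below is a minimal sketch (plain PyTorch, not PyPose internals; the matrix `A` is a stand-in) of how one could test a candidate matrix without triggering an exception:

```python
import torch

def is_positive_definite(A: torch.Tensor) -> bool:
    """Return True if the symmetric matrix A admits a Cholesky factorization."""
    # cholesky_ex does not raise on failure; a nonzero `info` reports the
    # leading minor at which the factorization broke down.
    _, info = torch.linalg.cholesky_ex(A)
    return bool((info == 0).all())

# Example: a damped normal-equation matrix vs. a rank-deficient one.
J = torch.randn(100, 6)
print(is_positive_definite(J.T @ J + 1e-6 * torch.eye(6)))  # True
print(is_positive_definite(torch.zeros(6, 6)))              # False
```

Checking the system matrix this way at each iteration can help pinpoint the step at which it becomes indefinite.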
Hi, I have tried to dump the model state when the solver failure occurs. However, when I tried to reproduce the problem by loading the model weights and re-running the optimization loop, the "Cholesky decomposition failed" error does not occur deterministically. When optimizing the model, the scheduler generates this log:
The model's weights after quitting the optimization loop are stored in a weight file (loaded as ./Results/step49.pkl in the code below). However, when I load the model's state dict and rerun the optimization loop under exactly the same configuration, the error mysteriously disappears.
Below is the code to load the weight file and run exactly the same optimization:
import torch
import pypose as pp
from pypose.optim import LM
from pypose.optim.kernel import Huber
from pypose.optim.solver import Cholesky
from pypose.optim.strategy import TrustRegion
from pypose.optim.corrector import FastTriggs
from pypose.optim.scheduler import StopOnPlateau
EDN2NED = pp.from_matrix(torch.tensor(
[[0., 0., 1., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.]], dtype=torch.float32),
pp.SE3_type)
NED2EDN = EDN2NED.Inv()
class PoseGraph(torch.nn.Module):
    def __init__(self, **kwargs) -> None:
        super().__init__()
        self.register_buffer("K", torch.tensor([
            [320., 0.  , 320.],
            [0.  , 320., 240.],
            [0.  , 0.  , 1.  ]
        ]))
        self.register_buffer("pts1", torch.empty(1000, 2))  # N x 2, uv coordinates
        self.register_buffer("pts2", torch.empty(1000, 2))  # N x 2, uv coordinates
        self.register_buffer("NED2EDN", NED2EDN)
        self.register_buffer("EDN2NED", EDN2NED)
        self.fx, self.fy = self.K[0, 0], self.K[1, 1]
        self.cx, self.cy = self.K[0, 2], self.K[1, 2]
        self.T = pp.Parameter(pp.identity_SE3())
        self.depth = torch.nn.Parameter(torch.empty(1000,))

    @property
    def point3d(self) -> torch.Tensor:
        return pp.geometry.pixel2point(self.pts1, self.depth, self.K)

    def forward(self) -> torch.Tensor:
        return pp.reprojerr(
            self.point3d, self.pts2, self.K,
            (self.NED2EDN @ self.T @ self.EDN2NED).Inv(), reduction='none'
        )
graph = PoseGraph()
graph.load_state_dict(torch.load("./Results/step49.pkl"))
graph = graph.cuda()
kernel = Huber(delta=0.1)
corrector = FastTriggs(kernel)
optimizer = LM(graph, solver=Cholesky(),
strategy=TrustRegion(radius=1e3),
kernel=kernel,
corrector=corrector,
min=1e-8,
vectorize=True)
scheduler = StopOnPlateau(optimizer, steps=10,
patience=4,
decreasing=1e-6,
verbose=True)
while scheduler.continual():
    loss = optimizer.step(input=())
    scheduler.step(loss)
@xukuanHIT Hope this helps! I suspect this is due to some kind of numerical instability in the optimizer/model, which is hard to reproduce and debug : (
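For completeness, here is a minimal sketch of how the failing state could have been captured in the first place, since the snippet above only loads it back. The per-step torch.save and the non-finite-loss guard are assumptions about how the step-49 dump was produced, not the author's actual code:

```python
import os
import torch

os.makedirs("./Results", exist_ok=True)

step = 0
while scheduler.continual():
    loss = optimizer.step(input=())
    # Snapshot the model every step so the state around a
    # "Cholesky decomposition failed" event can be inspected offline.
    torch.save(graph.state_dict(), f"./Results/step{step}.pkl")
    # Workaround for the scheduler not stopping on solver failure:
    # bail out manually once the reported loss is no longer finite.
    if not torch.isfinite(torch.as_tensor(loss)).all():
        break
    scheduler.step(loss)
    step += 1
```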
Also, I noticed that the loss reported by the scheduler differs when the graph is placed on different devices. Moving the model to the CPU seems to decrease the probability of this error significantly (I have not seen the "Solver Failed" error when running purely on the CPU).
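One way to quantify that device-dependent discrepancy is to evaluate the residual of the same loaded model on both devices and compare the results. This is a generic PyTorch check (reusing the PoseGraph class and weight file from the snippet above), not something from the original report:

```python
import torch

state = torch.load("./Results/step49.pkl")

graph_cpu = PoseGraph()
graph_cpu.load_state_dict(state)

graph_gpu = PoseGraph()
graph_gpu.load_state_dict(state)
graph_gpu = graph_gpu.cuda()

with torch.no_grad():
    err_cpu = graph_cpu()        # per-point reprojection error on CPU
    err_gpu = graph_gpu().cpu()  # same computation on CUDA, moved back

diff = (err_cpu - err_gpu).abs()
print(f"max |cpu - cuda| = {diff.max().item():.3e}")
print("allclose:", torch.allclose(err_cpu, err_gpu, rtol=1e-5, atol=1e-6))
```

A large gap here would point at the forward model itself rather than the linear solver.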
@MarkChenYutian Thanks for the detailed information. However, I have run the code 200 times but still cannot reproduce the error. Could it be due to differences in our development environments?
It might be caused by the CUDA/cuDNN version or some other subtle difference in the development environment, since this problem only occurs occasionally when I'm using CUDA acceleration.
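To rule that out, both sides could compare the exact CUDA/cuDNN stack they are running on. A minimal sketch using only standard PyTorch introspection:

```python
import torch

print("torch       :", torch.__version__)
print("CUDA (torch):", torch.version.cuda)              # CUDA toolkit torch was built against
print("cuDNN       :", torch.backends.cudnn.version())  # e.g. 8901 for 8.9.1
if torch.cuda.is_available():
    print("GPU         :", torch.cuda.get_device_name(0))
    print("capability  :", torch.cuda.get_device_capability(0))
```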
🐛 Describe the bug
The optimization scheduler StopOnPlateau does not break the optimization loop correctly when the linear solver fails.
Code
Output:
Versions
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.35
Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-372.32.1.el8_6.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
GPU models and configuration: GPU 0: NVIDIA TITAN X (Pascal)
Nvidia driver version: 535.86.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.0.0
[pip3] torch-tensorrt==1.4.0.dev0
[pip3] torchdata==0.6.0
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] Could not collect