
Scheduler not breaking optimization loop as expected when linear solver failed #308

Open
MarkChenYutian opened this issue Nov 21, 2023 · 6 comments
Labels: bug (Something isn't working)

@MarkChenYutian (Member)

🐛 Describe the bug

The optimization scheduler StopOnPlateau does not break the optimization loop as expected when the linear solver fails.

Code

    kernel = Huber(delta=.1)
    corrector = FastTriggs(kernel)
    optimizer = LM(graph, solver=Cholesky(),
                   strategy=TrustRegion(radius=1e5),
                   kernel=kernel,
                   corrector=corrector,
                   min=1e-5,
                   vectorize=vectorize)
    scheduler = StopOnPlateau(optimizer, steps=10,
                              patience=4,
                              decreasing=1e-6,
                              verbose=verbose)
    while scheduler.continual():
        loss = optimizer.step(input=())
        scheduler.step(loss)

Output:

Processing frame 32
StopOnPlateau on step 0 Loss 5.662884e+03 --> Loss 2.004884e+03 (reduction/loss: 6.4596e-01).
StopOnPlateau on step 1 Loss 2.004884e+03 --> Loss 4.706959e+02 (reduction/loss: 7.6523e-01).
StopOnPlateau on step 2 Loss 4.706959e+02 --> Loss 1.361766e+02 (reduction/loss: 7.1069e-01).
StopOnPlateau on step 3 Loss 1.361766e+02 --> Loss 2.811403e+01 (reduction/loss: 7.9355e-01).
StopOnPlateau on step 4 Loss 2.811403e+01 --> Loss 6.645568e+00 (reduction/loss: 7.6362e-01).
StopOnPlateau on step 5 Loss 6.645568e+00 --> Loss 4.929847e-01 (reduction/loss: 9.2582e-01).
StopOnPlateau on step 6 Loss 4.929847e-01 --> Loss 4.090670e-03 (reduction/loss: 9.9170e-01).
StopOnPlateau on step 7 Loss 4.090670e-03 --> Loss 2.397438e-04 (reduction/loss: 9.4139e-01).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 8 Loss 2.397438e-04 --> Loss 2.397438e-04 (reduction/loss: 0.0000e+00).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 9 Loss 2.397438e-04 --> Loss 2.397438e-04 (reduction/loss: 0.0000e+00).
StopOnPlateau: Maximum steps reached, Quiting..
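
For now, the only stopgap I see on the caller side is to track the loss manually and break when it stops decreasing (a minimal sketch, not a PyPose API; it just duplicates what I would expect the scheduler to do on solver failure):

# Stopgap sketch: break the loop manually when the loss stops decreasing,
# e.g. because the linear solver failed and returned the same loss.
prev_loss = float('inf')
while scheduler.continual():
    loss = optimizer.step(input=())
    scheduler.step(loss)
    if float(loss) >= prev_loss:
        print("Loss did not decrease (solver may have failed); breaking manually.")
        break
    prev_loss = float(loss)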

Versions

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-372.32.1.el8_6.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
GPU models and configuration: GPU 0: NVIDIA TITAN X (Pascal)
Nvidia driver version: 535.86.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.0.0
[pip3] torch-tensorrt==1.4.0.dev0
[pip3] torchdata==0.6.0
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] Could not collect

@xukuanHIT (Contributor)

@MarkChenYutian Hi Yutian, can you provide the complete test code, including the model and input?

@wang-chen added the bug (Something isn't working) label on Dec 18, 2023
@MarkChenYutian (Member, Author)

Hi Xukuan, the code is part of a large codebase for an ongoing project, and the reported bug does not occur deterministically (the matrix is positive-definite most of the time). Therefore, it is hard to put the code, model, and input here directly.

I will try to build a minimal reproducible demo for this issue as soon as possible after the final exams : )

@MarkChenYutian (Member, Author) commented Jan 15, 2024

Hi, I have tried to dump the model state at the moment the "Cholesky decomposition failed. Check your matrix (may not be positive-definite)" error occurs. One of the model weight files captured when the problem occurred is attached below (step49.pkl).
step49.zip (decompress it to get the .pkl file)
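
(For reference, the dump itself is just the module's state dict, saved roughly like this; the path matches the loading code further below:)

# Sketch of how the weight file was produced; graph is the PoseGraph defined below.
torch.save(graph.state_dict(), "./Results/step49.pkl")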

However, when I try to reproduce the problem by loading the model weights and re-running the optimization loop, the "Cholesky decomposition failed" error does not occur deterministically.

When optimizing the model, the scheduler generates this log:

StopOnPlateau on step 0 Loss 4.499314e+06 --> Loss 2.250402e+06 (reduction/loss: 4.9983e-01).
StopOnPlateau on step 1 Loss 2.250402e+06 --> Loss 1.127921e+06 (reduction/loss: 4.9879e-01).
StopOnPlateau on step 2 Loss 1.127921e+06 --> Loss 5.645721e+05 (reduction/loss: 4.9946e-01).
StopOnPlateau on step 3 Loss 5.645721e+05 --> Loss 2.826061e+05 (reduction/loss: 4.9943e-01).
StopOnPlateau on step 4 Loss 2.826061e+05 --> Loss 1.522469e+05 (reduction/loss: 4.6128e-01).
StopOnPlateau on step 5 Loss 1.522469e+05 --> Loss 6.831369e+04 (reduction/loss: 5.5130e-01).
StopOnPlateau on step 6 Loss 6.831369e+04 --> Loss 2.861829e+04 (reduction/loss: 5.8108e-01).
StopOnPlateau on step 7 Loss 2.861829e+04 --> Loss 2.066058e+04 (reduction/loss: 2.7806e-01).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 8 Loss 2.066058e+04 --> Loss 2.066058e+04 (reduction/loss: 0.0000e+00).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 9 Loss 2.066058e+04 --> Loss 2.066058e+04 (reduction/loss: 0.0000e+00).
StopOnPlateau: Maximum steps reached, Quiting..

The model's weights after quitting the optimization loop are stored in step49.pkl.

However, when I load the model's state dict and rerun the optimization loop under exactly the same configuration, the error mysteriously disappears.

StopOnPlateau on step 0 Loss 2.066058e+04 --> Loss 8.814132e+03 (reduction/loss: 5.7338e-01).
StopOnPlateau on step 1 Loss 8.814132e+03 --> Loss 3.961462e+03 (reduction/loss: 5.5056e-01).
StopOnPlateau on step 2 Loss 3.961462e+03 --> Loss 1.379486e+03 (reduction/loss: 6.5177e-01).
StopOnPlateau on step 3 Loss 1.379486e+03 --> Loss 2.655012e+02 (reduction/loss: 8.0754e-01).
StopOnPlateau on step 4 Loss 2.655012e+02 --> Loss 2.335889e+02 (reduction/loss: 1.2020e-01).
StopOnPlateau on step 5 Loss 2.335889e+02 --> Loss 1.668978e+02 (reduction/loss: 2.8551e-01).
StopOnPlateau on step 6 Loss 1.668978e+02 --> Loss 1.625448e+02 (reduction/loss: 2.6082e-02).
StopOnPlateau on step 7 Loss 1.625448e+02 --> Loss 1.305630e+02 (reduction/loss: 1.9676e-01).
StopOnPlateau on step 8 Loss 1.305630e+02 --> Loss 1.282390e+02 (reduction/loss: 1.7800e-02).
StopOnPlateau: Maximum rejected steps reached, Quiting..

Below is the code to load the weight file and run exactly the same optimization:

import torch
import pypose as pp
from pypose.optim import LM
from pypose.optim.kernel import Huber
from pypose.optim.solver import Cholesky
from pypose.optim.strategy import TrustRegion
from pypose.optim.corrector import FastTriggs
from pypose.optim.scheduler import StopOnPlateau

EDN2NED = pp.from_matrix(torch.tensor(
       [[0., 0., 1., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.]], dtype=torch.float32),
    pp.SE3_type)

NED2EDN = EDN2NED.Inv()

class PoseGraph(torch.nn.Module):
    def __init__(self, **kwargs) -> None:
        super().__init__()
        self.register_buffer("K", torch.tensor([
                [320., 0.  , 320.],
                [0.  , 320., 240.],
                [0.  , 0.  , 1.  ]
            ]))
        self.register_buffer("pts1", torch.empty(1000,2))        # N x 2, uv coordinate
        self.register_buffer("pts2", torch.empty(1000,2))        # N x 2, uv coordinate
        self.register_buffer("NED2EDN", NED2EDN)
        self.register_buffer("EDN2NED", EDN2NED)
        self.fx, self.fy = self.K[0, 0], self.K[1, 1]
        self.cx, self.cy = self.K[0, 2], self.K[1, 2]

        self.T = pp.Parameter(pp.identity_SE3())
        self.depth = torch.nn.Parameter(torch.empty(1000,))

    @property
    def point3d(self) -> torch.Tensor:
        return pp.geometry.pixel2point(self.pts1, self.depth, self.K)

    def forward(self) -> torch.Tensor:
        return pp.reprojerr(
            self.point3d, self.pts2, self.K, (self.NED2EDN @ self.T @ self.EDN2NED).Inv() , reduction='none'
        )

graph = PoseGraph()
graph.load_state_dict(torch.load("./Results/step49.pkl"))
graph = graph.cuda()

kernel = Huber(delta=0.1)
corrector = FastTriggs(kernel)
optimizer = LM(graph, solver=Cholesky(),
               strategy=TrustRegion(radius=1e3),
               kernel=kernel,
               corrector=corrector,
               min=1e-8,
               vectorize=True)
scheduler = StopOnPlateau(optimizer, steps=10,
                          patience=4,
                          decreasing=1e-6,
                          verbose=True)
while scheduler.continual():
    loss = optimizer.step(input=())
    scheduler.step(loss)

@xukuanHIT Hope this helps!

I suspect this is due to some kind of numerical instability in the optimizer / model, which is hard to reproduce and debug : (
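
In case it helps with reproduction, forcing deterministic kernels and fixed seeds might narrow things down (a sketch using standard PyTorch switches; I have not verified that this makes the failure reproducible):

import torch

torch.manual_seed(0)                      # fix the CPU RNG state
torch.cuda.manual_seed_all(0)             # fix the RNG state on all CUDA devices
torch.backends.cudnn.benchmark = False    # disable autotuned (potentially nondeterministic) kernel selection
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True, warn_only=True)  # warn instead of erroring on ops without deterministic kernels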

@MarkChenYutian (Member, Author)

Also, I noticed that the loss reported by the scheduler differs when the graph is on different devices. Moving the model to the CPU seems to significantly decrease the probability of this error (I have not seen the "Solver Failed" error when running purely on the CPU).
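
A quick way to quantify that device difference (a minimal sketch reusing PoseGraph and step49.pkl from the previous comment) is to evaluate the same residuals on both devices and compare:

# Compare the reprojection residuals of the same weights on CPU vs. CUDA.
graph_cpu = PoseGraph()
graph_cpu.load_state_dict(torch.load("./Results/step49.pkl", map_location="cpu"))

graph_gpu = PoseGraph()
graph_gpu.load_state_dict(torch.load("./Results/step49.pkl", map_location="cpu"))
graph_gpu = graph_gpu.cuda()

with torch.no_grad():
    err_cpu = graph_cpu()          # residuals computed on CPU
    err_gpu = graph_gpu().cpu()    # residuals computed on CUDA, moved back for comparison
print("max |cpu - cuda| residual difference:", (err_cpu - err_gpu).abs().max().item())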

@xukuanHIT (Contributor)

@MarkChenYutian Thanks for the detailed information. However, I have run the code 200 times and still cannot reproduce the error. Could it be due to differences in our development environments?
output.txt

@MarkChenYutian (Member, Author)

It might be caused by the CUDA / cuDNN version or some other subtle difference in the development environment, since this problem only occurs occasionally when I'm using CUDA acceleration.
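
To rule that out, we could compare the exact toolkit versions each of us is running (a small sketch using standard PyTorch attributes):

import torch

print("torch:", torch.__version__)                 # PyTorch build
print("CUDA (build):", torch.version.cuda)         # CUDA toolkit PyTorch was built with
print("cuDNN:", torch.backends.cudnn.version())    # cuDNN version in use
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "n/a")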
