
Scheduler not breaking optimization loop as expected when linear solver failed #308

Open
MarkChenYutian opened this issue Nov 21, 2023 · 6 comments
Labels: bug (Something isn't working)

@MarkChenYutian (Member)

🐛 Describe the bug

The optimization scheduler StopOnPlateau does not break the optimization loop as expected when the linear solver fails.

Code

    kernel = Huber(delta=.1)
    corrector = FastTriggs(kernel)
    optimizer = LM(graph, solver=Cholesky(),
                   strategy=TrustRegion(radius=1e5),
                   kernel=kernel,
                   corrector=corrector,
                   min=1e-5,
                   vectorize=vectorize)
    scheduler = StopOnPlateau(optimizer, steps=10,
                              patience=4,
                              decreasing=1e-6,
                              verbose=verbose)
    while scheduler.continual():
        loss = optimizer.step(input=())
        scheduler.step(loss)

Output:

Processing frame 32
StopOnPlateau on step 0 Loss 5.662884e+03 --> Loss 2.004884e+03 (reduction/loss: 6.4596e-01).
StopOnPlateau on step 1 Loss 2.004884e+03 --> Loss 4.706959e+02 (reduction/loss: 7.6523e-01).
StopOnPlateau on step 2 Loss 4.706959e+02 --> Loss 1.361766e+02 (reduction/loss: 7.1069e-01).
StopOnPlateau on step 3 Loss 1.361766e+02 --> Loss 2.811403e+01 (reduction/loss: 7.9355e-01).
StopOnPlateau on step 4 Loss 2.811403e+01 --> Loss 6.645568e+00 (reduction/loss: 7.6362e-01).
StopOnPlateau on step 5 Loss 6.645568e+00 --> Loss 4.929847e-01 (reduction/loss: 9.2582e-01).
StopOnPlateau on step 6 Loss 4.929847e-01 --> Loss 4.090670e-03 (reduction/loss: 9.9170e-01).
StopOnPlateau on step 7 Loss 4.090670e-03 --> Loss 2.397438e-04 (reduction/loss: 9.4139e-01).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 8 Loss 2.397438e-04 --> Loss 2.397438e-04 (reduction/loss: 0.0000e+00).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 9 Loss 2.397438e-04 --> Loss 2.397438e-04 (reduction/loss: 0.0000e+00).
StopOnPlateau: Maximum steps reached, Quiting..
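
For now, the only stopgap I see on the caller side is to track the loss manually and break when it stops decreasing (a minimal sketch, not a PyPose API; it just duplicates what I would expect the scheduler to do on solver failure):

# Stopgap sketch: break the loop manually when the loss stops decreasing,
# e.g. because the linear solver failed and returned the same loss.
prev_loss = float('inf')
while scheduler.continual():
    loss = optimizer.step(input=())
    scheduler.step(loss)
    if float(loss) >= prev_loss:
        print("Loss did not decrease (solver may have failed); breaking manually.")
        break
    prev_loss = float(loss)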

Versions

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-4.18.0-372.32.1.el8_6.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
GPU models and configuration: GPU 0: NVIDIA TITAN X (Pascal)
Nvidia driver version: 535.86.10
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.26.0
[pip3] pytorch-quantization==2.1.2
[pip3] torch==2.0.0
[pip3] torch-tensorrt==1.4.0.dev0
[pip3] torchdata==0.6.0
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] Could not collect

@xukuanHIT (Contributor)

@MarkChenYutian Hi Yutian, can you provide the complete test code, including the model and input?

@wang-chen added the bug (Something isn't working) label on Dec 18, 2023
@MarkChenYutian (Member, Author)

Hi Xukuan, the code is part of a large codebase for an ongoing project, and the reported bug does not occur deterministically (the matrix is positive-definite most of the time). Therefore, it is hard to put the code, model, and input here directly.

I will try to build a minimal reproducible demo for this issue as soon as possible after the final exams : )

@MarkChenYutian (Member, Author) commented Jan 15, 2024

Hi, I have tried to dump the model state at the moment the "Cholesky decomposition failed. Check your matrix (may not be positive-definite)" error occurs. One of the model weight files captured when the problem occurred is attached below (step49.pkl).
step49.zip (decompress it to get the .pkl file)
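
(For reference, the dump itself is just the module's state dict, saved roughly like this; the path matches the loading code further below:)

# Sketch of how the weight file was produced; graph is the PoseGraph defined below.
torch.save(graph.state_dict(), "./Results/step49.pkl")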

However, when I try to reproduce the problem by loading the model weights and re-running the optimization loop, the "Cholesky decomposition failed" error does not occur deterministically.

When optimizing the model, the scheduler generates this log:

StopOnPlateau on step 0 Loss 4.499314e+06 --> Loss 2.250402e+06 (reduction/loss: 4.9983e-01).
StopOnPlateau on step 1 Loss 2.250402e+06 --> Loss 1.127921e+06 (reduction/loss: 4.9879e-01).
StopOnPlateau on step 2 Loss 1.127921e+06 --> Loss 5.645721e+05 (reduction/loss: 4.9946e-01).
StopOnPlateau on step 3 Loss 5.645721e+05 --> Loss 2.826061e+05 (reduction/loss: 4.9943e-01).
StopOnPlateau on step 4 Loss 2.826061e+05 --> Loss 1.522469e+05 (reduction/loss: 4.6128e-01).
StopOnPlateau on step 5 Loss 1.522469e+05 --> Loss 6.831369e+04 (reduction/loss: 5.5130e-01).
StopOnPlateau on step 6 Loss 6.831369e+04 --> Loss 2.861829e+04 (reduction/loss: 5.8108e-01).
StopOnPlateau on step 7 Loss 2.861829e+04 --> Loss 2.066058e+04 (reduction/loss: 2.7806e-01).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 8 Loss 2.066058e+04 --> Loss 2.066058e+04 (reduction/loss: 0.0000e+00).
Cholesky decomposition failed. Check your matrix (may not be positive-definite) 
Linear solver failed. Breaking optimization step...
StopOnPlateau on step 9 Loss 2.066058e+04 --> Loss 2.066058e+04 (reduction/loss: 0.0000e+00).
StopOnPlateau: Maximum steps reached, Quiting..

The model's weights after quitting the optimization loop are stored in step49.pkl.

However, when I load the model's state dict and rerun the optimization loop under exactly the same configuration, the error mysteriously disappears.

StopOnPlateau on step 0 Loss 2.066058e+04 --> Loss 8.814132e+03 (reduction/loss: 5.7338e-01).
StopOnPlateau on step 1 Loss 8.814132e+03 --> Loss 3.961462e+03 (reduction/loss: 5.5056e-01).
StopOnPlateau on step 2 Loss 3.961462e+03 --> Loss 1.379486e+03 (reduction/loss: 6.5177e-01).
StopOnPlateau on step 3 Loss 1.379486e+03 --> Loss 2.655012e+02 (reduction/loss: 8.0754e-01).
StopOnPlateau on step 4 Loss 2.655012e+02 --> Loss 2.335889e+02 (reduction/loss: 1.2020e-01).
StopOnPlateau on step 5 Loss 2.335889e+02 --> Loss 1.668978e+02 (reduction/loss: 2.8551e-01).
StopOnPlateau on step 6 Loss 1.668978e+02 --> Loss 1.625448e+02 (reduction/loss: 2.6082e-02).
StopOnPlateau on step 7 Loss 1.625448e+02 --> Loss 1.305630e+02 (reduction/loss: 1.9676e-01).
StopOnPlateau on step 8 Loss 1.305630e+02 --> Loss 1.282390e+02 (reduction/loss: 1.7800e-02).
StopOnPlateau: Maximum rejected steps reached, Quiting..

Below is the code to load the weight file and run exactly the same optimization:

import torch
import pypose as pp
from pypose.optim import LM
from pypose.optim.kernel import Huber
from pypose.optim.solver import Cholesky
from pypose.optim.strategy import TrustRegion
from pypose.optim.corrector import FastTriggs
from pypose.optim.scheduler import StopOnPlateau

EDN2NED = pp.from_matrix(torch.tensor(
       [[0., 0., 1., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 0., 1.]], dtype=torch.float32),
    pp.SE3_type)

NED2EDN = EDN2NED.Inv()

class PoseGraph(torch.nn.Module):
    def __init__(self, **kwargs) -> None:
        super().__init__()
        self.register_buffer("K", torch.tensor([
                [320., 0.  , 320.],
                [0.  , 320., 240.],
                [0.  , 0.  , 1.  ]
            ]))
        self.register_buffer("pts1", torch.empty(1000,2))        # N x 2, uv coordinate
        self.register_buffer("pts2", torch.empty(1000,2))        # N x 2, uv coordinate
        self.register_buffer("NED2EDN", NED2EDN)
        self.register_buffer("EDN2NED", EDN2NED)
        self.fx, self.fy = self.K[0, 0], self.K[1, 1]
        self.cx, self.cy = self.K[0, 2], self.K[1, 2]

        self.T = pp.Parameter(pp.identity_SE3())
        self.depth = torch.nn.Parameter(torch.empty(1000,))

    @property
    def point3d(self) -> torch.Tensor:
        return pp.geometry.pixel2point(self.pts1, self.depth, self.K)

    def forward(self) -> torch.Tensor:
        return pp.reprojerr(
            self.point3d, self.pts2, self.K, (self.NED2EDN @ self.T @ self.EDN2NED).Inv() , reduction='none'
        )

graph = PoseGraph()
graph.load_state_dict(torch.load("./Results/step49.pkl"))
graph = graph.cuda()

kernel = Huber(delta=0.1)
corrector = FastTriggs(kernel)
optimizer = LM(graph, solver=Cholesky(),
               strategy=TrustRegion(radius=1e3),
               kernel=kernel,
               corrector=corrector,
               min=1e-8,
               vectorize=True)
scheduler = StopOnPlateau(optimizer, steps=10,
                          patience=4,
                          decreasing=1e-6,
                          verbose=True)
while scheduler.continual():
    loss = optimizer.step(input=())
    scheduler.step(loss)

@xukuanHIT Hope this helps!

I suspect this is due to some kind of numerical instability in the optimizer / model, which is hard to reproduce and debug : (
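
In case it helps with reproduction, forcing deterministic kernels and fixed seeds might narrow things down (a sketch using standard PyTorch switches; I have not verified that this makes the failure reproducible):

import torch

torch.manual_seed(0)                      # fix the CPU RNG state
torch.cuda.manual_seed_all(0)             # fix the RNG state on all CUDA devices
torch.backends.cudnn.benchmark = False    # disable autotuned (potentially nondeterministic) kernel selection
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True, warn_only=True)  # warn instead of erroring on ops without deterministic kernels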

@MarkChenYutian (Member, Author)

Also, I noticed that the loss reported by the scheduler differs when the graph is on different devices. Moving the model to the CPU seems to significantly decrease the probability of this error (I have not seen the "Solver Failed" error when running purely on the CPU).
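
A quick way to quantify that device difference (a minimal sketch reusing PoseGraph and step49.pkl from the previous comment) is to evaluate the same residuals on both devices and compare:

# Compare the reprojection residuals of the same weights on CPU vs. CUDA.
graph_cpu = PoseGraph()
graph_cpu.load_state_dict(torch.load("./Results/step49.pkl", map_location="cpu"))

graph_gpu = PoseGraph()
graph_gpu.load_state_dict(torch.load("./Results/step49.pkl", map_location="cpu"))
graph_gpu = graph_gpu.cuda()

with torch.no_grad():
    err_cpu = graph_cpu()          # residuals computed on CPU
    err_gpu = graph_gpu().cpu()    # residuals computed on CUDA, moved back for comparison
print("max |cpu - cuda| residual difference:", (err_cpu - err_gpu).abs().max().item())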

@xukuanHIT (Contributor)

@MarkChenYutian Thanks for the detailed information. However, I have run the code 200 times and still cannot reproduce the error. Could it be due to differences in our development environments?
output.txt

@MarkChenYutian (Member, Author)

It might be caused by the CUDA / cuDNN version or some other subtle difference in the development environment, since this problem only occurs occasionally when I'm using CUDA acceleration.
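
To rule that out, we could compare the exact toolkit versions each of us is running (a small sketch using standard PyTorch attributes):

import torch

print("torch:", torch.__version__)                 # PyTorch build
print("CUDA (build):", torch.version.cuda)         # CUDA toolkit PyTorch was built with
print("cuDNN:", torch.backends.cudnn.version())    # cuDNN version in use
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "n/a")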
