Workflows

New

Add-Ons

Troubleshooting

👁️ Overview

Introducing Workflows in Komodor, a powerful new feature designed to enhance visibility and troubleshooting for AI/ML workloads on Kubernetes. Accessible under the Kubernetes AddOns section, Workflows streamline monitoring and improve troubleshooting efficiency across various engines like ArgoWF and Airflow.

[Image: Workflows list for Airflow]

⭐️ Key Features

Out-of-the-Box Monitoring: Automatically tracks workflows from ArgoWF and Airflow without additional configuration.
Support for Custom Engines: Identify workflows from other engines, such as MLflow or Kubeflow, by adding specific labels.
Workflow Pod Monitor: Runs on each cluster by default, automatically detecting and tracking workflow-related pods.
Organized Workflows Page: Navigate workflows via engine-specific tabs, with aggregation by DAG/Template showing the latest run status.
Detailed Run Information: Gain insights into pod phases, issues, and correlated infrastructure events like node terminations.

Please note: Workflow data is retained for 3 days. Previous runs of each template can be viewed from a dropdown.

[Image: Workflow run with a correlated node termination]

🗣️ Motivation

Managing AI/ML workflows on Kubernetes can be challenging, particularly in complex environments where orphan pods are quickly deleted, leading to the loss of logs and data. The Workflows feature addresses these pain points by:

Providing full visibility into workflows, including orphaned pods.
Simplifying troubleshooting with automated issue detection and infrastructure correlation.
Streamlining the operation of AI/ML workloads for data teams.

🚀 Getting Started

Explore the new Workflows feature under the Kubernetes AddOns section in your Komodor dashboard. Monitor and troubleshoot workflow issues effortlessly.

For more details, check out our documentation.

Chen

Cert-manager Add-on

New

Add-Ons

Reliability

We're excited to share that Komodor now includes cert-manager as a first-class citizen in the platform to improve cluster reliability and security.

This integration provides real-time visibility across clusters, ensuring certificates stay up to date and minimizing service disruptions due to expired or misconfigured certificates.

👁️ Overview

With this update, users can track, manage, and troubleshoot certificates easily using a unified cert-manager dashboard within Komodor.

Key Features

Cert-Manager Dashboard: View cross-cluster certificate statuses, with filtering options by cluster, namespace, issuer, and certificate status.
Controllers View: A separate tab showing all cert-manager controllers and their health, so you can troubleshoot quickly when issues arise.
Reliability Violations: Alerts are automatically generated for certificates that are expired, expiring soon, or failing renewal, allowing prompt resolution to prevent disruptions.
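
The three violation conditions listed above (expired, expiring soon, failing renewal) can be illustrated with a small sketch. This is not Komodor's implementation: the function name, the `renewal_failed` flag, and the 30-day "expiring soon" window are all hypothetical, since the announcement does not state the actual threshold.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the announcement does not define "expiring soon".
EXPIRING_SOON_WINDOW = timedelta(days=30)


def classify_certificate(not_after, renewal_failed, now):
    """Return a violation label for a certificate, or None if it looks healthy."""
    if not_after <= now:
        return "expired"
    if renewal_failed:
        return "failing renewal"
    if not_after - now <= EXPIRING_SOON_WINDOW:
        return "expiring soon"
    return None


now = datetime(2024, 11, 1, tzinfo=timezone.utc)
print(classify_certificate(now - timedelta(days=1), False, now))
print(classify_certificate(now + timedelta(days=10), False, now))
print(classify_certificate(now + timedelta(days=90), True, now))
print(classify_certificate(now + timedelta(days=90), False, now))
```

The order of the checks is a design choice in this sketch: an expired certificate is reported as expired even if its renewal is also failing.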

🚀 Getting Started

Head over to Kubernetes AddOns -> Cert-manager in the sidebar to start exploring.

If you have any issues, feedback, or questions, please reach out to us through Chat Support in the top bar—we’re here to help.

Danielle

Health Management

New

Health Management

👁️ Overview

Introducing Komodor Health Management, a comprehensive solution designed to give you a clear view of your Kubernetes environment's health. This feature includes two new pages dedicated to Workload Health and Infrastructure Health, consolidating real-time issues and reliability risks into a single, cohesive experience.

As part of this update, the existing reliability page has been removed to streamline health monitoring and improve focus.

[Image: Health overview]

⭐️ Key Features

Workload Health Page: Provides detailed insights into the health and stability of your services, highlighting real-time issues, reliability risks, and standards for optimization.
Infrastructure Health Page: Focuses on the health of your underlying Kubernetes infrastructure, including nodes and PVCs, with visibility into real-time issues and potential reliability risks affecting stability.
Integrated Monitoring: See both real-time issues and long-term risks side-by-side, helping you identify and resolve issues quickly and efficiently.
Health Policies: Real-Time Monitors configuration, Reliability Risk Policies, and Ignored Checks configuration pages have been relocated to the new Organization Settings area for easier access and management.
Cluster Overview Page: Adjusted to incorporate the above insights, providing a consolidated health view for quick assessment of workload and infrastructure issues, allowing you to track health trends effectively.

🗣️ Motivation

Komodor Health Management unifies real-time issues and long-term reliability insights in one view. This solution helps you track and resolve issues, optimize performance, and maintain a resilient Kubernetes environment by providing a clear perspective on workloads and infrastructure health for efficient troubleshooting and proactive risk management.

Benefits:

Comprehensive Health View: Gain a unified perspective on workloads and infrastructure health.
Efficient Troubleshooting: Quickly identify and resolve both real-time and long-term issues.
Proactive Risk Management: Address configuration and resource risks before they become problems.
Streamlined Experience: Focus on what matters most with a single, intuitive view.

🚀 Getting Started

Explore the new Workload Health and Infrastructure Health pages from the left menu. Dive into real-time issues, review reliability risks, and take control of your Kubernetes environment. For more details, check out our documentation.

Chen

New Navigation and Settings Updates

New

Managing complex Kubernetes environments requires intuitive navigation and quick access to essential tools. To improve user experience and accessibility, we’ve redesigned the Settings and Platform Navigation in Komodor.

👁️ Overview

This update introduces a restructured navigation layout that simplifies access to critical features while preparing the platform for future enhancements.

Left Sidebar Redesign: Organized into distinct sections, and includes our new AddOns features: Cert Manager and Workflows.
Enhanced Top Bar: Improved User Settings and Organization Settings provide direct access to profile details, API keys, access management, and account configurations.

Each enhancement makes it easier to locate high-priority information and manage key settings, creating a smoother, more consistent experience.

🔍 What’s New

Left Sidebar: Now more intuitive with sections aligned by purpose.
Kubernetes AddOns section now includes our new Cert Manager and Workflows features.
Enhanced User & Organization Settings: Access RBAC settings, Usage and Audit pages, and our new Agents page directly from the top bar.
Monitors: Now located in the Configurations section for improved access to monitor settings.
Info Tab: Your go-to for Documentation, What’s New announcements, and real-time Chat Support.

Get started today by exploring the new layout, and see our updated Documentation for details.

If you have any issues, feedback, or questions, please reach out to us through Chat Support in the top bar—we’re here to help.

Chen

Komodor API for Events and Resource Statuses

New

Reliability

KubeX

Komodor is now exposing data (events and resource statuses) through our API, allowing seamless integration with internal systems and development workflows, including popular IDP solutions like Backstage or Port.io.

👁️ Overview

We support APIs for retrieving services, jobs, node terminations, and issues, all filterable by clusters, namespaces, time ranges, and statuses. This gives you greater visibility and flexibility to consume Komodor data directly from your environments.

💡 The API also provides direct URL links to the relevant event or resource page in Komodor, making it easier for users to dive deeper into the data.

New APIs:

Services:

POST /api/v2/services/search Retrieve and filter services across clusters and namespaces. For example - get all unhealthy services in a specific scope.

Jobs:

POST /api/v2/jobs/search Fetch Jobs and CronJobs with filtering options. For example - get all failed jobs within a specific scope.

Events - Supports node termination/node creation/deploys:

POST /api/v2/services/k8s-events/search Search for Kubernetes events in the service scope.

POST /api/v2/clusters/k8s-events/search Search for Kubernetes events in the cluster scope.

Issues - Supports Availability issues/Failed deploys/Node issues/PVC issues:

POST /api/v2/services/issues/search Retrieve issues in a service scope.

POST /api/v2/clusters/issues/search Retrieve issues in a cluster scope.
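
As a rough sketch of how one of these endpoints might be called, the snippet below builds (without sending) a POST request to the services search endpoint. Only the path and the `api.komodor.com` host come from this announcement. The `X-API-KEY` header name and every field in the filter body are assumptions for illustration; check the Swagger documentation for the real request schema.

```python
import json
import urllib.request

BASE_URL = "https://api.komodor.com"  # host taken from the Swagger link in Getting Started


def build_search_request(path, filters, api_key):
    """Build (but do not send) a POST request for one of the search endpoints."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(filters).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-API-KEY": api_key,  # assumed header name; confirm in the Swagger docs
        },
        method="POST",
    )


# Hypothetical filter body: these field names are placeholders, not the documented schema.
request = build_search_request(
    "/api/v2/services/search",
    {"clusters": ["prod-us-east"], "namespaces": ["payments"], "status": "unhealthy"},
    api_key="YOUR_API_KEY",
)
print(request.get_method(), request.full_url)

# To actually send the request:
# with urllib.request.urlopen(request) as response:
#     services = json.load(response)
```

The same helper would work for the jobs, events, and issues endpoints by changing the path, assuming they accept a JSON filter body in the same way.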

🗣️ Motivation

Streamline user workflows by integrating Komodor data directly into your existing tools, improving visibility and reaction time for critical issues.

We aim to drive adoption, engagement, and customer satisfaction by making Komodor an even more integral part of our customers' daily operations.

🚀 Getting Started

Simply go to our public API library on Swagger and copy the relevant keys: https://api.komodor.com/api/docs/index.html#/Services

Udi, DevRel

Unveiling KlaudiaAI: Your Personal Virtual SRE Companion

New

Troubleshooting

Klaudia is Komodor's advanced GenAI agent designed to revolutionize Kubernetes troubleshooting. By leveraging artificial intelligence, Klaudia simplifies and accelerates root-cause analysis in Kubernetes environments.

👁️ Overview

Klaudia leverages Komodor’s comprehensive dataset of past investigation flows, historical changes, events, and metrics to power precise diagnostics and actionable insights, with AI enhancing the ability to scale across the entire Kubernetes stack.

This data was gathered from hundreds of companies of diverse sizes, then analyzed and engineered. It represents hundreds of developer years of work, all aimed at optimizing real-world Kubernetes operations at modern speed and scale.

Key Features

Rapid Root Cause Analysis: Quickly identifies the source of different issues (currently, only pods are supported)
Context-Aware Recommendations: Provides tailored troubleshooting suggestions
Seamless Integration: Works within Komodor's existing inspection flow
User-Friendly Explanations: Breaks down complex Kubernetes concepts
Log Analysis: Analyzes logs to surface critical information and pinpoint potential root causes of issues

🗣️ Motivation

We envisioned a tool that could move AIOps beyond its current track record: one that not only detects issues in K8s environments but also provides precise, actionable steps to resolve them.

Benefits

Time Savings: Dramatically reduces the time required for manual investigations
Expertise Augmentation: Bridges knowledge gaps in Kubernetes troubleshooting
Actionable Insights: Provides clear, step-by-step remediation instructions
Continuous Learning: Improves over time by leveraging Komodor's comprehensive dataset

🚀 Getting Started

Klaudia is seamlessly integrated into your Komodor experience. When investigating pod issues, you'll automatically see Klaudia's insights and recommendations alongside traditional metrics and logs.

Note that an account admin must first activate Klaudia from the account settings page.

Udi, DevRel

Update to Workspace Management Permissions

New

Workspaces

Operations

Access Control

We've made changes to how workspace management works in Komodor. Previously, any user could create, edit, or delete workspaces within an account. With this update, these actions will now be controlled by Role-Based Access Control (RBAC) permissions.

We're introducing a new permission called manage:workspaces. This permission will automatically be granted to users with the "all actions" policy. For other users, account admins will need to assign the manage:workspaces permission to those who need to manage workspaces.

We have implemented this change to provide more granular control over workspace management within your organization.

We recommend that account admins review their team members' permissions and assign the new permission as needed.
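
The rule described above can be expressed as a short sketch. The data model here is hypothetical (plain sets of policy and permission names); only the "all actions" policy name and the manage:workspaces permission come from this announcement, and the other names in the examples are made up.

```python
ALL_ACTIONS_POLICY = "all actions"       # policy name quoted in this announcement
MANAGE_WORKSPACES = "manage:workspaces"  # the new permission


def can_manage_workspaces(policies, permissions):
    """Return True if a user may create, edit, or delete workspaces.

    `policies` and `permissions` are sets of names; this shape is an
    assumption made for illustration, not Komodor's actual RBAC model.
    """
    return ALL_ACTIONS_POLICY in policies or MANAGE_WORKSPACES in permissions


# A user with the "all actions" policy gets the permission automatically.
print(can_manage_workspaces({"all actions"}, set()))
# Other users need an explicit assignment from an account admin.
print(can_manage_workspaces({"example-viewer"}, {"manage:workspaces"}))
print(can_manage_workspaces({"example-viewer"}, set()))
```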

Chen

Best Practice Violation - Single Point of Failure

New

Reliability

Misconfiguring Kubernetes workloads can lead to application downtime. Fortunately, Kubernetes offers configuration recommendations and best practices that help applications remain fault-tolerant and keep users up and running.

👁️ Overview

As part of our proactive Reliability Management offering, Komodor can now surface some of those misconfigurations to help you align with best practices, and to improve your reliability posture over time. Additionally, the detected violations are prioritized based on the runtime impact they had.

We also expanded the best practice checks to include an additional violation: SPoF (Single Point of Failure). When we detect relevant misconfigurations together with a runtime issue (a node issue or termination that led to an availability issue), a SPoF violation is created.

🗣️ Motivation

This feature uses Komodor's troves of context to tie best practice violations and misconfigurations to their effect on your application's performance. With the correct configuration, single points of failure can be avoided, making your application much more reliable.

🚀 Getting Started

The feature is now GA and automatically activated for all accounts!

Simply navigate to the Reliability Management screen to browse all the best practice violations that Komodor detected, assess their impact, and remediate them.
To configure the policy's thresholds, navigate to the Reliability Policies tab and edit the policy directly in Komodor.

Danielle

Cluster Groups

New

Improvement

Workspaces

Managing and monitoring multiple clusters across environments or regions can be complex and time-consuming.

To make this process more efficient, we’re excited to introduce Cluster Groups in Komodor — a powerful new feature that allows you to group clusters based on custom definitions, providing a centralized view and management capabilities for each group.

👁️ Overview

With Cluster Groups, you can create dynamic workspaces that represent environments, regions, or any logical grouping of clusters. This feature offers an aggregated view, enabling you to monitor status, troubleshoot issues, improve reliability, and optimize performance across your groups.

Each Cluster Group workspace is automatically populated with relevant data, and new clusters matching the defined criteria are added dynamically.

🗣️ Motivation

As organizations scale, managing multiple clusters across various environments becomes increasingly challenging.

Cluster Groups simplifies monitoring and management and improves operational efficiency, especially for DevOps engineers and Kubernetes admins.

🚀 Getting Started

Navigate to the Workspace Switcher.
Go to the "Cluster Groups" tab and click "Add cluster group."
Define Your Cluster Group: Use wildcards to create a dynamic group based on naming patterns (e.g., *prod*, *np, dev-*).
View and Manage Your Cluster Group: Select the group from the Workspace Switcher, then click the "Overview" tab in the left-side menu to access the aggregated overview.
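
To get a feel for how the example patterns in the steps above select clusters, here is a small sketch using Python's standard `fnmatch` module. It assumes the wildcards behave like shell-style globs where `*` matches any run of characters; Komodor's exact matching rules may differ, and the cluster names are made up.

```python
from fnmatch import fnmatchcase


def clusters_in_group(cluster_names, pattern):
    """Return the cluster names matching a glob-style wildcard pattern."""
    return [name for name in cluster_names if fnmatchcase(name, pattern)]


# Hypothetical cluster names for illustration.
clusters = ["eu-prod-1", "us-prod-2", "team-a-np", "dev-sandbox", "staging"]

print(clusters_in_group(clusters, "*prod*"))  # ['eu-prod-1', 'us-prod-2']
print(clusters_in_group(clusters, "*np"))     # ['team-a-np']
print(clusters_in_group(clusters, "dev-*"))   # ['dev-sandbox']
```

Because groups are dynamic, a newly connected cluster named `dev-ci` would fall into the `dev-*` group under this interpretation without any further configuration.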

This new feature is designed to streamline the management of clusters, making it easier to oversee environments or regions with a unified and intuitive approach.

Start grouping your clusters today and take your Kubernetes management to the next level!

Udi, DevRel

RBAC Cluster Sync

New

KubeX

Managing kubectl permissions has never been so easy!

🗣️ Motivation

Some users prefer interacting with their clusters using terminal and CLI tools, but managing Kubernetes RBAC for multiple clusters can be a tedious task for administrators.

The new RBAC Cluster Sync feature by Komodor simplifies this process, allowing cluster administrators to set permissions for multiple resources across multiple clusters from a single, easy-to-use interface. The permissions apply to both the Komodor UI and CLI access for their users.

👁️ Overview

RBAC Cluster Sync is an opt-in feature that lets all users in the account request a kubeconfig and interact with their clusters directly using kubectl or other tools.

Admins can grant a default-allow-get-kubeconfig policy, enabling users to download a kubeconfig file for the clusters they have access to.

Users can go to their settings page and fetch a kubeconfig containing all clusters they have access to. Komodor automatically syncs cluster permissions and Komodor permissions in near-real-time.

🚀 Getting Started

❗ Ensure that the komodor-agent chart version is updated to at least 2.5.4

Go to Komodor's settings page

Under Features → RBAC Cluster Sync, click 'Enable'
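
If you want to script the chart version prerequisite noted above, a minimal sketch follows. It assumes the installed version is a plain MAJOR.MINOR.PATCH string with no suffixes; how you obtain that string (for example, from your Helm tooling) is left out, and the function names are illustrative.

```python
MINIMUM_CHART_VERSION = (2, 5, 4)  # minimum komodor-agent chart version from this announcement


def parse_version(version):
    """Turn a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed):
    """Return True if the installed chart version is at least 2.5.4."""
    return parse_version(installed) >= MINIMUM_CHART_VERSION


print(meets_minimum("2.5.3"))   # False
print(meets_minimum("2.5.4"))   # True
print(meets_minimum("2.10.0"))  # True
```

Comparing integer tuples rather than raw strings matters for versions like 2.10.0, which sorts before 2.5.4 as text.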

For more information, FAQ, and how-tos, check out the docs.

Udi, DevRel